Physical Design Process
The physical design process is a critical phase in database management that follows logical design. It translates abstract models and relationships into a concrete database structure that can handle real-world operations effectively. This section describes the various components and considerations in physical design. At this stage the focus shifts: instead of understanding the nature of the data, the goal is to optimize how that data is stored, accessed, and retrieved.
Because organizations gather vast amounts of data, a robust physical design is essential for optimal performance and efficient data management. Each segment of the process plays a vital role in creating a well-structured database system, from establishing naming conventions to optimizing performance through aggregation and partitioning of large datasets. By carefully applying strategies such as indexing and clustering, database architects can significantly improve query speed and overall system responsiveness.
The main objective is to develop a physical schema that meets the requirements of the logical design while aligning with the operational demands and workload patterns the system will encounter in practice. Understanding these elements is crucial for DBAs and system architects who need to build resilient, scalable, and efficient database systems.
- Definition: Physical design converts the logical data model into an actual database structure optimized for performance and storage.
- Focus shifts from what data is → how data is stored and accessed.
- Standards (foundation): Establish naming conventions, design rules, and consistency across the system.
- Example: Standard naming for tables and columns across all warehouse modules.
- Aggregates plan (performance optimization): Precompute summary data to avoid expensive queries on raw data.
- Example: Store daily sales totals instead of recalculating from millions of transactions each time.
- Partitioning (manage large data): Split large tables into smaller parts for better performance and maintenance.
- Types: Horizontal (rows), Vertical (columns)
- Example: Partition sales data by date (e.g., monthly tables).
- Clustering (data locality): Store related data physically close together to reduce I/O.
- Example: Store records of the same customer together for faster retrieval.
- Indexing strategy (query speed): Create indexes on frequently queried columns.
- Example: Index on customer_id to speed up lookups.
- Storage design (physical layout): Decide how data is stored on disk, including files, blocks, and staging areas.
- Example: Separate storage for staging data vs production warehouse tables.
- Final output: A physical schema that defines how the database operates efficiently in real systems.
- Key point: Logical design defines structure; physical design ensures performance by organizing, indexing, and storing data in a way that matches real query and workload patterns.
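The aggregates plan above can be sketched with a tiny SQLite example. The table and column names (sales, daily_sales_totals) are illustrative, not from the original text:

```python
import sqlite3

# In-memory database standing in for a warehouse (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01-01", 100.0), ("2024-01-01", 50.0), ("2024-01-02", 75.0)],
)

# Aggregates plan: precompute daily totals once, instead of
# re-aggregating the raw transaction rows on every query.
conn.execute(
    "CREATE TABLE daily_sales_totals AS "
    "SELECT sale_date, SUM(amount) AS total FROM sales GROUP BY sale_date"
)

# Queries now read the small summary table, not millions of raw rows.
total = conn.execute(
    "SELECT total FROM daily_sales_totals WHERE sale_date = '2024-01-01'"
).fetchone()[0]
print(total)  # 150.0
```

In a real warehouse the summary table would be refreshed as part of the load process, which is exactly the maintenance trade-off the notes describe.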
Objectives of Physical Design
- Performance optimization (primary goal): Ensure fast query response by reducing scan volume and improving I/O efficiency.
- Example: Using indexes and aggregates to avoid scanning millions of rows.
- Scalability: Maintain performance as data size and user load increase.
- Example: Designing partitions so new data can be added without slowing queries.
- Storage management: Organize data efficiently across storage to improve usage and manageability.
- Example: Splitting large fact tables into partitions across disks.
- Efficient space usage: Tune DBMS parameters (blocks, files) to avoid wasted storage.
- Ease of administration: Simplify backup, recovery, and data loading operations.
- Example: Backing up partitions instead of entire massive tables.
- Operational manageability: Well-designed structures make maintenance tasks easier and faster.
- Hardware-level detail: Define file sizes, data types, keys, and storage layout.
- Key point: Physical design is about making the data warehouse fast, scalable, and manageable in real-world conditions, not just structurally correct.
Storage Optimization
- Goal of storage optimization: Improve performance by efficiently moving data between disk and memory.
- Data block (I/O unit): Data is read/written in blocks, not individual rows.
- Impact: Entire block is loaded even if only one record is needed.
- Block size matters: Larger blocks store more rows → fewer disk reads → better performance for large scans.
- Example: Reading 1 large block instead of multiple small blocks speeds up warehouse queries.
- Warehouse vs staging area:
- Staging area: Used for data loading and transformation, not user query performance.
- Warehouse storage: Directly affects query speed and user experience.
- RAID (storage reliability + performance): Uses multiple disks to improve fault tolerance and throughput.
- Key RAID techniques:
- Mirroring: Same data written to two disks for reliability.
- Duplexing: Similar to mirroring but with separate controllers.
- Parity: Extra data for error detection and recovery.
- Striping: Data spread across disks for faster access.
- RAID levels (0–5): Balance between performance, storage efficiency, and fault tolerance.
- Example: RAID 0 improves speed, RAID 5 adds fault tolerance with parity.
- Key point: Storage optimization improves performance by reducing disk I/O. Techniques like larger blocks and RAID make data access faster and more reliable.
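The effect of block size on disk reads can be sketched with simple arithmetic; the row and block sizes below are made-up numbers:

```python
import math

def blocks_to_read(num_rows, row_bytes, block_bytes):
    """Blocks a full scan must read: rows that fit per block,
    then total rows divided by that, rounded up."""
    rows_per_block = block_bytes // row_bytes
    return math.ceil(num_rows / rows_per_block)

# One million 200-byte rows: larger blocks mean fewer disk reads.
small = blocks_to_read(1_000_000, 200, 2_048)    # 2 KB blocks
large = blocks_to_read(1_000_000, 200, 32_768)   # 32 KB blocks
print(small, large)  # 100000 6135
```

Fewer, larger reads favor the sequential scans typical of warehouse queries, which is why block size matters more here than in OLTP systems.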
Indexing in the Data Warehouse
- Purpose of indexing: Speeds up data retrieval in large warehouse tables.
- More indexes than OLTP: Analytical queries use filters and aggregations across large datasets.
- Trade-off:
- More indexes → faster queries
- But → slower data loading (indexes must be updated)
- Example: Bulk loading data takes longer if many indexes exist.
- B-Tree index (default type):
- Hierarchical structure with fast lookups
- Works best for high-selectivity columns (many unique values)
- Example: Index on customer_id for quick record lookup
- Bitmap index (warehouse-friendly):
- Best for low-selectivity columns (few distinct values)
- Uses bit patterns to represent values
- Example: Gender column (Male/Female) or region category
- Why bitmap works well in warehouses: Many dimension attributes have repeated values → efficient storage and fast filtering
- Fact table indexing:
- Usually uses composite keys (combination of foreign keys)
- May require explicit B-Tree index creation
- Order of composite keys matters: Affects how efficiently queries can access data
- Example: Index on (date, product) vs (product, date) changes query performance
- Key point: Indexing improves query speed but must be carefully balanced against load performance, and tailored to data characteristics such as high vs low selectivity.
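A bitmap index for a low-selectivity column can be sketched as one bitmask per distinct value; combining filters then becomes a cheap bitwise AND. The rows and column values here are made up:

```python
# Rows of a small dimension-like table (illustrative data).
rows = [
    {"region": "East", "status": "active"},
    {"region": "West", "status": "active"},
    {"region": "East", "status": "closed"},
    {"region": "East", "status": "active"},
]

def build_bitmap(rows, column):
    """One integer bitmask per distinct value; bit i is set if row i matches."""
    bitmaps = {}
    for i, row in enumerate(rows):
        value = row[column]
        bitmaps[value] = bitmaps.get(value, 0) | (1 << i)
    return bitmaps

region_idx = build_bitmap(rows, "region")
status_idx = build_bitmap(rows, "status")

# "region = East AND status = active" is a single bitwise AND.
match = region_idx["East"] & status_idx["active"]
matching_rows = [i for i in range(len(rows)) if match & (1 << i)]
print(matching_rows)  # [0, 3]
```

With few distinct values, each bitmask stays small and dense, which is why bitmap indexes suit warehouse dimension attributes but not high-cardinality columns like customer_id.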
Partitioning: Purpose and Importance
- Purpose: Partitioning splits very large tables into smaller pieces so that queries scan only the relevant data instead of the whole table. This reduces query processing time and improves performance, especially when only part of a fact table is actively needed.
- Manageability benefit: Smaller partitions make loading, backup, recovery, archiving, and purging much easier to handle.
- Efficiency gain: Query redirection or partition pruning can skip irrelevant partitions, which saves I/O and speeds up response time.
- Key point: Partitioning improves both performance and administration by letting the warehouse work on just the data that matters, instead of treating every table as one huge block.
Horizontal Partitioning
- Horizontal partitioning: Splits a table by rows.
- In a fact table: Different groups of transactions are stored in different physical partitions. Fact tables are usually very large, and many queries only need a subset of the rows.
- Key point: Instead of treating one huge table as a single block, horizontal partitioning divides it into row-based slices. This way, queries can work on less data.
Partitioning by Time into Equal Segments
- Partitioning by time splits a fact table into fixed periods such as months, quarters, or years. This strategy makes the data easier to manage and query.
- It is intuitive because many business questions are time-based, such as monthly sales, quarterly profit, or yearly reports.
- It also helps with housekeeping, because old partitions can be archived or deleted when they are no longer needed.
- The drawback is that business activity is not always evenly spread across time.
- For example, a retailer may have huge peaks during Christmas or Easter, while quieter months hold much less data.
- If every time partition is the same size, some partitions may waste space while others become too large or heavily loaded.
- So time-based partitioning is simple and natural, but it is not always the most storage-efficient choice.
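Partitioning by month and pruning irrelevant partitions can be sketched as a dictionary of row lists keyed by month; the sales data is illustrative:

```python
from collections import defaultdict

# Fact rows as (date, amount) pairs (illustrative data).
sales = [
    ("2024-01-15", 100), ("2024-01-20", 40),
    ("2024-02-03", 70), ("2024-03-09", 55),
]

# Horizontal partitioning by time: route each row to its monthly partition.
partitions = defaultdict(list)
for sale_date, amount in sales:
    partitions[sale_date[:7]].append((sale_date, amount))  # key like "2024-01"

def total_for_month(partitions, month):
    """Partition pruning: only the one relevant partition is scanned."""
    return sum(amount for _, amount in partitions.get(month, []))

print(sorted(partitions))                      # ['2024-01', '2024-02', '2024-03']
print(total_for_month(partitions, "2024-01"))  # 140
```

A January query here touches one partition out of three; housekeeping is equally simple, since archiving a month means dropping one key.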
Partitioning by Time into Different-Sized Segments
- Partitioning by time into different-sized segments keeps recent data in smaller, more detailed partitions and older data in larger partitions.
- This is intuitive because recent business activity is usually queried more often, so it makes sense to keep the latest data easiest to access.
- For example, the last three months might be stored separately by month. The previous quarter might be stored as one partition. Older data could be stored as one larger half-year partition.
- The advantage is that current detailed data stays online without needing extra aggregation. The total number of physical partitions stays manageable.
- The disadvantage is that the partition structure must be adjusted regularly as data ages.
- Repartitioning large amounts of data takes time and effort, so the performance benefits must be worth the maintenance cost.
Partitioning on a Different Dimension
- Partitioning on a different dimension means splitting the fact table by a stable business category, such as region, product group, or supplier, instead of by time.
- This is useful when most queries focus on that one dimension, because those queries can skip irrelevant rows and run faster.
- For example, if each regional team mostly queries its own region, partitioning by region makes those lookups much more efficient.
- The advantage is that queries tied to that dimension are faster across all time periods.
- The drawback is that cross-region or cross-category analysis may need to read many partitions, which reduces the benefit.
- Another risk is that if the business definition of the dimension changes, the whole table may need to be reorganized.
- This method is a good choice when the dimension is very stable and unlikely to change over the life of the warehouse.
Partitioning by Size of Table
- Partitioning by table size creates a new partition whenever the current one hits a fixed size limit, regardless of time or business category.
- This is useful when no natural dimension such as time or region fits the access patterns, particularly in 24/7 continuous operations with a steady transaction flow.
- The downside is high management complexity, because the system needs detailed metadata to track which data lives in which partition.
- Unlike time or region partitioning, size-based partitions have no intuitive business meaning, making them harder to understand and query against.
- This approach is a last resort when no stable, meaningful partitioning dimension exists.
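Size-based partitioning can be sketched as rollover whenever the current partition hits a row limit, with metadata recording which rows landed where. The limit and data below are illustrative:

```python
def load_rows(rows, max_rows_per_partition):
    """Start a new partition whenever the current one hits the size limit."""
    partitions = [[]]
    for row in rows:
        if len(partitions[-1]) >= max_rows_per_partition:
            partitions.append([])  # rollover: open a fresh partition
        partitions[-1].append(row)
    # Metadata is essential here: without it nobody knows where a row lives,
    # since the boundaries have no business meaning.
    metadata = {i: len(p) for i, p in enumerate(partitions)}
    return partitions, metadata

partitions, metadata = load_rows(list(range(10)), max_rows_per_partition=4)
print(metadata)  # {0: 4, 1: 4, 2: 2}
```

Note that answering even a simple query means consulting the metadata to find candidate partitions, which is exactly the management complexity the notes warn about.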
Partitioning Dimension Tables
- Dimension tables are usually small and don’t need partitioning, unlike massive fact tables.
- They only require partitioning when they grow very large due to many members or historical changes over time.
- For dimension partitioning, use a business grouping within the dimension itself, not time.
- Example: A huge product dimension with millions of records could be split by product group or department.
- Choose a hierarchy level that creates roughly the right number of partitions.
- Rare cases only: First check if the dimension has accidentally embedded fact data that’s making it grow too large.
- Key point: Dimensions stay small by design, so partitioning is unusual and usually signals a design issue elsewhere.
Round-Robin Partitions
- Round-robin partitioning recycles physical partitions by archiving the oldest data and reusing that space for new incoming data.
- This happens when the warehouse hits its full historical capacity and needs to keep adding fresh data.
- Metadata makes this work: users see logical names like “current month” or “last week,” while the physical storage underneath is reused.
- The big advantage is automation: The same partition structure gets recycled repeatedly without creating new ones constantly.
- The main problem arises when partition sizes vary widely, because overflow handling (what happens when new data doesn’t fit) becomes tricky to manage.
- Key intuition: It’s like a circular conveyor belt for storage. It is efficient for space but requires careful size management to avoid complications.
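Round-robin recycling can be sketched as a fixed ring of physical slots whose logical labels advance as new periods arrive; the slot count and month labels are illustrative:

```python
class RoundRobinPartitions:
    """Fixed ring of physical slots; the oldest is archived and reused."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots  # physical partitions
        self.labels = {}                 # metadata: logical name -> slot index
        self.next_slot = 0

    def load_period(self, label, rows):
        slot = self.next_slot % len(self.slots)
        old = self.slots[slot]           # would be archived in a real system
        self.slots[slot] = rows
        # Drop the stale logical name, then map the new one to this slot.
        self.labels = {k: v for k, v in self.labels.items() if v != slot}
        self.labels[label] = slot
        self.next_slot += 1
        return old

ring = RoundRobinPartitions(num_slots=3)
for month in ["Jan", "Feb", "Mar"]:
    ring.load_period(month, [month])
archived = ring.load_period("Apr", ["Apr"])  # reuses Jan's physical slot
print(archived, sorted(ring.labels))  # ['Jan'] ['Apr', 'Feb', 'Mar']
```

The same physical structure is recycled indefinitely; the weak spot is the fixed slot size, which is the overflow problem mentioned above.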
Vertical Partitioning
- Vertical partitioning splits a table by columns rather than rows, effectively dividing each row into separate tables.
- This is useful when rows have too many fields or when different applications need different subsets of columns.
- Example: Keep frequently used customer fields (name, ID, status) in one table and rarely used fields (detailed history) in another.
- There are two main approaches: normalization, which separates tables to avoid update anomalies, and row splitting, which moves rarely used columns to a secondary table.
- The goal is to make the main table smaller and faster to read, so that queries load only the columns they actually need.
- Key limitation: Works best when queries rarely need to join the split tables back together.
- If joins are frequent, the performance gain disappears because of join overhead.
- Best use case: Drill-down scenarios where detailed data is accessed only occasionally.
- Key point: Vertical partitioning makes “wide” tables narrower and faster by separating hot (frequently used) from cold (rarely used) columns.
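Splitting hot and cold columns into two tables joined by the key can be sketched in SQLite; the customer schema here is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hot columns: narrow, frequently scanned table.
conn.execute(
    "CREATE TABLE customer_hot (id INTEGER PRIMARY KEY, name TEXT, status TEXT)")
# Cold columns: wide, rarely needed detail, joined back only on demand.
conn.execute(
    "CREATE TABLE customer_cold (id INTEGER PRIMARY KEY, history TEXT)")
conn.execute("INSERT INTO customer_hot VALUES (1, 'Ada', 'active')")
conn.execute("INSERT INTO customer_cold VALUES (1, 'long detailed history...')")

# Most queries touch only the narrow hot table.
status = conn.execute(
    "SELECT status FROM customer_hot WHERE id = 1").fetchone()[0]

# Drill-down joins the cold table back, paying the join cost only when needed.
name, history = conn.execute(
    "SELECT h.name, c.history FROM customer_hot h "
    "JOIN customer_cold c ON h.id = c.id WHERE h.id = 1"
).fetchone()
print(status, name)  # active Ada
```

If most queries ended up needing both tables, the join on every access would erase the benefit, which is the key limitation stated above.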
Hardware Partitioning and MPP
- Hardware partitioning: Design the warehouse to fully use available processors, disks, and I/O paths to avoid bottlenecks.
- Core goals: Maximize CPU power, disk throughput, and I/O efficiency simultaneously.
- Parallel query execution: Split large queries into multiple threads that run simultaneously across processors.
- Example: One query processor handles customer data while another processes product data.
- Shared architecture: Multiple threads share processing power efficiently without needing one CPU per thread.
- MPP (Massively Parallel Processing): Distribute data across multiple nodes, with each node handling its portion.
- Data striping: Split large tables horizontally across nodes so each owns a segment.
- Queries run in parallel across relevant nodes, reducing total processing time.
- Best for large volumes: Small datasets don’t benefit due to coordination overhead.
- Key point: Modern warehouses treat big queries like an orchestra: the work is split across many players (processors or nodes) so it finishes faster, rather than relying on a single performer.
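Parallel aggregation over striped segments can be sketched with a thread pool standing in for MPP nodes: each "node" aggregates only its own slice, and the coordinator merges the partial results. The data and node count are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Fact rows striped horizontally across three "nodes" (illustrative data).
node_segments = [
    [10, 20, 30],   # node 0's slice of the table
    [5, 15],        # node 1's slice
    [40, 10, 25],   # node 2's slice
]

def node_partial_sum(segment):
    """Each node aggregates only the rows it owns."""
    return sum(segment)

# The coordinator runs the per-node work in parallel, then merges partials.
with ThreadPoolExecutor(max_workers=len(node_segments)) as pool:
    partials = list(pool.map(node_partial_sum, node_segments))

total = sum(partials)
print(partials, total)  # [60, 20, 75] 155
```

The merge step is cheap relative to the per-node scans, but on a tiny dataset the coordination overhead would exceed the savings, which is why MPP only pays off at large volumes.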
Aggregation and Summary Tables
- Aggregation creates summary tables to improve query performance by precomputing common analytical calculations instead of recalculating them every time.
- Most warehouse queries analyze subsets of data across multiple dimensions rather than scanning all detailed transaction records.
- Example: Join millions of sales records to find monthly totals by region and product category, then store those totals in a summary table.
- This saves processing time for repeated analytical questions that combine time, geography, customer segments, and product groupings.
- The trade-off is maintenance cost: Summary tables must be refreshed when underlying data changes.
- Example: A marketing summary table of meat-eating customers by brand preference, spending level, and region avoids repeated complex joins.
- Key point: Aggregation shifts computation cost from query time to maintenance time, making common analyses much faster.
Designing Summary Tables
- The summary table should keep any dimensions that are not collapsed by the aggregation, so that useful analytical detail is not lost.
- Related aggregated values that are often queried together, such as weekly sales, highest daily sale, and lowest daily sale, can be combined in one table.
- Facts often analyzed as a set, such as sales amount, cost, profit, and week-over-week variation, should be stored together for efficient analysis.
- The aggregation level must be chosen carefully because once detail is removed, it cannot be recovered from that summary table.
- Too many summary tables should be avoided because each one adds storage and maintenance overhead.
- A practical design limit is about 250 to 300 summary tables.
- Summary tables can sometimes be built at a slightly lower level than the requested analysis level to preserve some flexibility.
- Natural or intelligent keys should be used where they make sense. For example, use a customer number for a customer-based summary.
- Time should be included explicitly because most warehouse analysis depends on time-based reporting.
- Summary tables should also be indexed so that query performance is not reduced by slow access paths.
- Key point: A good summary-table design retains the useful dimensions, combines related measures, avoids unnecessary tables, and builds time and indexing into the design.
Responsibility and Design Discipline in Physical Design
- Physical design must match the actual query patterns, storage limits, and available hardware capabilities.
- Every choice, such as aggregation, partitioning, indexing, or storage layout, affects storage space, loading time, and maintenance or administration effort.
- The goal is to improve query performance without making operations too expensive or complex to manage.
- Prefer stable partitioning dimensions that won’t need frequent changes over volatile ones that cause repartitioning problems.
- Indexes should speed up queries enough to justify the extra loading and maintenance time they require.
- Aggregations should accelerate common analytical queries without creating too many summary tables that need constant updates.
- Use hardware parallelism and distributed storage only when data volumes are large enough to benefit from it.
- Key point: Physical design is about finding the right balance between faster queries and manageable operational costs, based on real workload patterns and expected growth.
Identifying the Partition Key
- Choosing the partition key is critical in physical design. A wrong choice requires expensive repartitioning later.
- The partition key must match actual user query patterns and business usage, not just technical convenience.
- A fact table may have multiple possible partition keys, such as region, date, or product category, but only one should be chosen.
- Example: If users mostly query their own region, partition by region so regional queries scan only relevant data.
- Wrong choice example: Partitioning by date when users need regional data forces queries to scan multiple date partitions unnecessarily.
- The key principle is to minimize irrelevant partition scans for the most common and important business queries.
- Partitioning works effectively only when it mirrors how users actually access and analyze the data.
- Key point: The partition key must align with dominant business access paths to deliver performance gains without wasted scans.
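The effect of the partition key choice can be sketched by counting how many partitions a typical regional query must scan under each scheme; the fact rows are made up:

```python
from collections import defaultdict

# Fact rows: (region, month, amount) — illustrative data.
facts = [
    ("East", "2024-01", 10), ("West", "2024-01", 20),
    ("East", "2024-02", 30), ("West", "2024-02", 40),
    ("East", "2024-03", 50), ("West", "2024-03", 60),
]

def partition_by(facts, key_index):
    parts = defaultdict(list)
    for row in facts:
        parts[row[key_index]].append(row)
    return parts

def partitions_scanned(parts, predicate):
    """Count partitions a query with this row predicate must read."""
    return sum(1 for rows in parts.values() if any(predicate(r) for r in rows))

by_region = partition_by(facts, 0)  # partition key: region
by_month = partition_by(facts, 1)   # partition key: date

def wants_east(row):
    return row[0] == "East"

# Dominant workload: "my region only". A region key prunes to 1 partition;
# a date key forces the same query to touch every monthly partition.
print(partitions_scanned(by_region, wants_east))  # 1
print(partitions_scanned(by_month, wants_east))   # 3
```

The same table, the same query, but the wrong key triples the partitions scanned; with real data volumes that difference dominates query cost.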
Materialized Views
- Conventional (virtual) view: Only the definition is stored; results are computed live during every query execution.
- Problem: Complex views cause slow response times in data warehouses due to large data volumes and expensive analytical queries.
- Materialized view: Precomputed results are physically stored in the database, acting like a high-performance cache.
- Performance benefit: Avoids repeating expensive joins and aggregations for frequently used analytical results.
- Trade-off: Materialized views must be refreshed whenever underlying data changes, adding maintenance cost.
- Design principle: Create materialized views only for the most important, expensive, and frequently queried results.
- OLAP value: Dramatically reduces response time for common analytical workloads by precomputing what users need most.
- Key point: Materialized views trade storage and maintenance cost for much faster query performance on repeated analytical patterns.
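SQLite has no materialized views, but the idea can be simulated with a stored table plus an explicit refresh step; the table and function names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 100.0), ("West", 60.0)])

def refresh_mv(conn):
    """Full refresh: rebuild the stored result from the base table."""
    conn.execute("DROP TABLE IF EXISTS mv_region_totals")
    conn.execute(
        "CREATE TABLE mv_region_totals AS "
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )

refresh_mv(conn)

# Base table changes; the stored result is now stale until refreshed.
conn.execute("INSERT INTO sales VALUES ('East', 50.0)")
stale = conn.execute(
    "SELECT total FROM mv_region_totals WHERE region = 'East'").fetchone()[0]

refresh_mv(conn)  # the maintenance cost the notes describe
fresh = conn.execute(
    "SELECT total FROM mv_region_totals WHERE region = 'East'").fetchone()[0]
print(stale, fresh)  # 100.0 150.0
```

Reads against the stored table are fast because the aggregation already happened; the price is that every base-table change leaves the result stale until the next refresh.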
Materialized Views with Aggregates
- Materialized views with aggregates precompute summarized results like totals, averages, and counts to avoid expensive query-time calculations.
- Analytical queries typically need grouped summaries across dimensions rather than individual transaction details.
- For efficient updates, the SELECT must include all GROUP BY columns plus COUNT(*) and COUNT(column) functions.
- Materialized view logs on base tables enable fast refresh when underlying data changes.
- Supported aggregate functions include SUM, COUNT(*), COUNT(x), AVG, VARIANCE, STDDEV, MIN, and MAX.
- This approach eliminates repeated full-table scans and joins for common warehouse summarisation patterns.
- Key point: Aggregate materialized views store the grouped results users query most often, trading storage for much faster analytical performance.
Materialized Views Containing Only Joins
- Materialized views with only joins precompute expensive table joins without any aggregation functions.
- In data warehouses, joins between large fact tables and dimension tables are costly and frequently repeated.
- Storing the joined result eliminates repeated join processing for common analytical queries.
- These views refresh after any DML operation (INSERT, UPDATE, DELETE) on base tables.
- Refresh options: ON COMMIT (automatic at transaction commit) or ON DEMAND (manual trigger).
- Requirements for efficient refresh: Include ROWID columns from all tables in the SELECT list.
- Materialized view logs must exist on all base tables.
- Remote tables limitation: All must be at the same site; ON COMMIT refresh not supported for remote tables.
- Key point: Join-only materialized views save time on repeated expensive joins while supporting fast refresh through ROWID tracking and logs.
Nested Materialized Views
- Nested materialized views are built on top of other materialized views (and possibly base tables).
- They enable multi-level precomputation where higher-level summaries depend on lower-level ones.
- Primary use: Rollups across multiple dimensions without repeating expensive base joins.
- Example: First materialize fact-dimension joins, then create monthly/quarterly/yearly aggregates on top.
- Benefit: Avoids recomputing the same base joins multiple times for different aggregation levels.
- Maintenance efficiency: Changes propagate incrementally through the hierarchy rather than rebuilding everything.
- Key point: Nested views create efficient aggregation hierarchies by reusing shared lower-level computations.
View Materialization Example
- Two analytical queries both join Sales with dimension tables (Product, Location) then aggregate differently.
- First query: total sales by product category after Product-Sales join.
- Second query: total sales by state after Location-Sales join.
- Both repeat expensive Sales joins and low-level grouping.
- Solution: Materialize TOTALSALES view grouping Sales by pid (product id), locid (location id), and total sales amount.
- Higher-level queries now join TOTALSALES with Product/Location instead of raw Sales.
- Benefit: Single precomputed aggregation serves multiple downstream summaries.
- Key point: Intermediate materialized views accelerate query families sharing common subcomputations, not just single queries.
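The TOTALSALES example can be sketched in SQLite: materialize the Sales grouping once, then serve both rollups from it. The table contents are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (pid INTEGER, locid INTEGER, amount REAL)")
conn.execute("CREATE TABLE Product (pid INTEGER, category TEXT)")
conn.execute("CREATE TABLE Location (locid INTEGER, state TEXT)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (1, 2, 20.0), (2, 1, 30.0), (1, 1, 5.0)])
conn.executemany("INSERT INTO Product VALUES (?, ?)",
                 [(1, "Food"), (2, "Toys")])
conn.executemany("INSERT INTO Location VALUES (?, ?)",
                 [(1, "WA"), (2, "OR")])

# Materialize the shared subcomputation once: Sales grouped by pid, locid.
conn.execute("CREATE TABLE TOTALSALES AS "
             "SELECT pid, locid, SUM(amount) AS total "
             "FROM Sales GROUP BY pid, locid")

# Both rollups now join the small TOTALSALES table, not raw Sales.
by_category = dict(conn.execute(
    "SELECT p.category, SUM(t.total) FROM TOTALSALES t "
    "JOIN Product p ON t.pid = p.pid GROUP BY p.category"))
by_state = dict(conn.execute(
    "SELECT l.state, SUM(t.total) FROM TOTALSALES t "
    "JOIN Location l ON t.locid = l.locid GROUP BY l.state"))
print(by_category, by_state)
```

Both downstream summaries reuse one precomputed grouping, which is the point of the example: the intermediate view serves a family of queries, not just one.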
Role of Materialized Views in Warehouse Performance
Materialised views shift expensive computations from query time to precomputation, drastically cutting response times for repetitive OLAP workloads.
Core Benefits
They excel in data warehouses where analytical queries repeatedly perform heavy joins and aggregations on predictable patterns. Pre-storing results enables instant access without recalculating. This improves usability for multidimensional analysis like sales rollups.
Key Trade-offs
Maintenance overhead requires refresh mechanisms and materialized view logs to track base table changes. Storage costs grow with materialized data volume. Poor choices amplify these costs without proportional query speed gains.
Selective Design Principle
Materialize only high-reuse, expensive subqueries—like shared Sales aggregations serving category and location summaries. This minimizes redundancy while maximizing acceleration across query families. Avoid over-materialization that burdens storage and refresh cycles.
Q&A
Part A: Rapid Recall Q and A
1. What is physical design in a data warehouse?
Physical design is the process of converting the logical model into an implementation-ready structure that works efficiently on actual hardware, storage media, operating systems, and DBMS software.
2. What is the main concern of physical design?
The main concern of physical design is operational efficiency, not just structural correctness.
3. What are the major goals of physical design?
The major goals are performance improvement, scalability, storage efficiency, and ease of administration.
4. Why are standards required in physical design?
Standards ensure consistency in naming, design practices, and implementation across the warehouse.
5. Why is an aggregates plan necessary?
An aggregates plan is necessary because many warehouse queries ask for summary information, and precomputed summaries reduce repeated large-scale computation.
6. What is clustering?
Clustering is the storage of related data together in the same physical block so that they can be fetched in fewer I/O operations.
7. What is the basic purpose of indexing?
The basic purpose of indexing is to speed up data retrieval.
8. What is the purpose of partitioning?
Partitioning improves performance, simplifies management, and supports easier backup and recovery.
9. What are the two main types of partitioning?
The two main types are horizontal partitioning and vertical partitioning.
10. What is horizontal partitioning?
Horizontal partitioning divides a table by rows.
11. What is vertical partitioning?
Vertical partitioning divides a table by columns.
12. What is the safest basis for fact table partitioning?
Time is generally the safest basis for fact table partitioning.
13. Why is time-based partitioning considered safe?
It is considered safe because calendar groupings are unlikely to change during the life of the warehouse.
14. What is partitioning by size of table?
It is a method in which a new partition is created whenever a table reaches a predefined size threshold.
15. What is row splitting?
Row splitting is vertical partitioning in which less frequently used columns are moved to a separate table.
16. What is a B-Tree index best suited for?
A B-Tree index is best suited for highly selective columns.
17. What is a bitmap index best suited for?
A bitmap index is best suited for low-selectivity columns with few distinct values.
18. Why are bitmap indexes useful in data warehouses?
They are useful because many warehouse attributes have low cardinality and are frequently used in multidimensional queries.
19. What is aggregation?
Aggregation is the precomputation and storage of summarized data to speed up analytical queries.
20. What is a materialized view?
A materialized view is a stored result of a query that is physically maintained in the database for faster access.
Part B: Conceptual Q and A
21. Why is physical design different from logical design?
Logical design focuses on what data exists and how data elements are related. Physical design focuses on how that data will be stored, accessed, indexed, partitioned, and processed efficiently in the real system. Logical design is structure-oriented, whereas physical design is performance-oriented.
22. Why can a logically correct warehouse still perform poorly?
A logically correct warehouse may still perform poorly if the physical structures are weak. Large fact tables may be stored without partitioning, summary queries may run on raw detailed data, indexes may be missing or badly chosen, and storage may be arranged inefficiently. Therefore, logical correctness alone does not guarantee acceptable query performance.
23. Why is partitioning so important in a data warehouse but less critical in smaller transactional systems?
A data warehouse stores very large fact tables and supports analytical queries that often scan broad ranges of data. Without partitioning, every major query, load, backup, and recovery operation may involve the full table. Partitioning breaks the table into manageable segments so that only relevant data is scanned or maintained. This becomes critical when data size is very large.
24. Why is choosing the partition key a strategic decision?
The partition key determines how data is distributed and which partitions must be searched during query execution. If the chosen key matches the dominant access pattern, most queries touch only a small portion of the table. If the key is poorly chosen, queries may need to scan many partitions, reducing the benefit of partitioning. A wrong key may later force expensive reorganisation of the fact table.
25. Why might partitioning by region be better than partitioning by transaction date?
Partitioning by region is better when most users query only data from their own region. In that case, each user accesses only the relevant partition. If the table were partitioned by transaction date instead, data from all regions for a given time period would be mixed together, and regional users would need to query across multiple partitions to isolate their own data.
26. Why is partitioning by size treated as a fallback method?
Partitioning by size is used only when no stable and meaningful business dimension is available. It does control growth, but it is not business-intuitive and requires metadata to identify which data resides in which partition. For this reason, it is more difficult to manage and less natural than time-based or dimension-based partitioning.
27. Why is vertical partitioning not always a good idea?
Vertical partitioning reduces row width and can improve access when only some columns are frequently needed. However, if most queries require columns from both partitions, the system must perform frequent joins to reconstruct the full row. In that case, the intended performance benefit may be lost.
28. Why are too many indexes undesirable even in a data warehouse?
Indexes improve retrieval speed, but they increase storage use and slow down data loading and maintenance. Every inserted or updated row may require corresponding index updates. Therefore, indexing must be selective and workload-driven.
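The retrieval side of this trade-off can be seen in a small SQLite session: before the index the optimiser scans the whole table, afterwards it searches the index, while every subsequent insert must now also maintain that index. The table and index names are illustrative, and SQLite merely stands in for a warehouse DBMS here.

```python
# Sketch: an index changes the query plan from a full scan to an index search.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(i, "EAST" if i % 2 else "WEST", i * 10) for i in range(1000)])

plan_before = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE region = 'EAST'").fetchall()

con.execute("CREATE INDEX idx_sales_region ON sales(region)")

plan_after = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE region = 'EAST'").fetchall()

scans_table = any("SCAN" in row[-1] for row in plan_before)            # full scan
uses_index = any("idx_sales_region" in row[-1] for row in plan_after)  # index search
```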
29. Why is a bitmap index not the default choice for every column?
A bitmap index is effective only when a column has low cardinality. If a column has many distinct values, bitmap structures become less efficient. In such cases, B-Tree indexing is generally more suitable.
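A minimal sketch of why bitmap indexes suit low-cardinality columns: one bit-vector per distinct value, so an OR predicate becomes a single bitwise operation. With many distinct values there would be one bitmap per value, which is where the structure stops paying off. The column values below are invented for illustration.

```python
# Sketch of a bitmap index on a low-cardinality "region" column.
regions = ["EAST", "WEST", "EAST", "NORTH", "WEST", "EAST"]  # values in row order

# One bitmap (a Python int used as a bit-vector) per distinct value.
bitmaps = {}
for row_id, value in enumerate(regions):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)

# WHERE region = 'EAST' OR region = 'NORTH'  ->  a single bitwise OR.
match = bitmaps["EAST"] | bitmaps["NORTH"]
matching_rows = [i for i in range(len(regions)) if match & (1 << i)]
```

Only three bitmaps exist here because the column has three distinct values; a high-cardinality column such as a transaction identifier would need one bitmap per row value, defeating the purpose.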
30. Why is aggregation essential for decision support queries?
Decision support queries often ask for summaries across time, location, customer groups, product categories, and other dimensions. If every such query scans detailed records from scratch, response time becomes unacceptable. Aggregation reduces the processing burden by storing commonly required summaries in advance.
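The aggregates plan from the overview can be sketched in SQLite: daily totals are materialised once, and subsequent decision-support queries read a few summary rows instead of scanning every transaction. Table names and values are illustrative assumptions.

```python
# Sketch: precomputing a daily sales summary table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (sale_date TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("2024-01-01", 100), ("2024-01-01", 200),
    ("2024-01-02", 300), ("2024-01-02", 50),
])

# Aggregate once, store the result -- the "aggregates plan" step.
con.execute("""CREATE TABLE daily_sales AS
               SELECT sale_date, SUM(amount) AS total, COUNT(*) AS txns
               FROM sales GROUP BY sale_date""")

# Decision-support query now hits the small summary, not raw transactions.
totals = dict(con.execute("SELECT sale_date, total FROM daily_sales").fetchall())
```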
31. Why should summary tables not be created in excessive numbers?
Each summary table improves some queries but also increases storage, refresh, and maintenance cost. If too many summary tables are created, operational complexity and overhead rise sharply. Therefore, summary tables must be created only where the performance gain justifies the maintenance effort.
32. Why is a conventional view less suitable in an OLAP environment?
A conventional view stores only the query definition. Its result must be computed at runtime whenever it is queried. In OLAP environments, where queries often involve large joins and aggregations over massive data sets, this runtime computation can produce very high response times.
33. Why are materialised views useful in a warehouse?
Materialised views store precomputed query results. They are useful because expensive joins and aggregations do not need to be repeated every time a query is issued. This reduces response time and improves analytical usability.
34. What is the main disadvantage of materialised views?
The main disadvantage is maintenance overhead. When the base tables change, the materialised view must be refreshed to remain consistent, which adds processing cost and complexity to the load cycle.
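The contrast between a conventional view and a materialised view, including the staleness that makes refresh necessary, can be sketched in SQLite. SQLite has no native materialised views, so one is simulated here with CREATE TABLE AS; names and data are illustrative.

```python
# Sketch: conventional view (recomputed each query) vs. stored result (stale
# until refreshed).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EAST", 100), ("WEST", 200), ("EAST", 50)])

# Conventional view: only the query definition is stored.
con.execute("""CREATE VIEW v_region_sales AS
               SELECT region, SUM(amount) AS total FROM sales GROUP BY region""")

# Simulated materialised view: the result is stored physically.
con.execute("CREATE TABLE mv_region_sales AS SELECT * FROM v_region_sales")

# Base table changes after materialisation.
con.execute("INSERT INTO sales VALUES ('EAST', 1000)")

view_total = con.execute(
    "SELECT total FROM v_region_sales WHERE region='EAST'").fetchone()[0]
mv_total = con.execute(
    "SELECT total FROM mv_region_sales WHERE region='EAST'").fetchone()[0]
```

The conventional view reflects the new row immediately (at the cost of recomputation on every access), while the stored result keeps its old total until it is refreshed: exactly the maintenance obligation described above.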
35. Why can nested materialised views be more efficient than separate repeated aggregation?
Nested materialised views allow multiple higher-level summaries to be built on top of a previously materialised lower-level result. This avoids repeating the same base join or lower-level computation several times.
Part C: Further Q and A
36. Distinguish between logical design and physical design.
Logical design defines data entities, relationships, and overall structure. Physical design defines storage layout, indexing, partitioning, clustering, and implementation details needed for performance and manageability.
37. Distinguish between horizontal and vertical partitioning.
Horizontal partitioning divides a table by rows and is mainly used to reduce scan volume in large tables. Vertical partitioning divides a table by columns and is used to reduce row width when not all attributes are frequently needed together.
38. Distinguish between partitioning by time and partitioning by dimension.
Partitioning by time divides data according to time periods and is generally stable and safe. Partitioning by dimension divides data according to a business category such as region or product group and is useful only when that grouping is heavily used and unlikely to change.
39. Distinguish between B-Tree index and bitmap index.
A B-Tree index is suitable for highly selective columns with many distinct values. A bitmap index is suitable for low-selectivity columns with a small number of repeated values.
40. Distinguish between aggregation and materialisation.
Aggregation refers specifically to storing summarised data such as totals, averages, or grouped results. Materialisation is broader and refers to storing the result of a query, which may include joins, aggregates, or both.
41. Distinguish between a conventional view and a materialised view.
A conventional view stores only the query definition and computes results at runtime. A materialised view stores the result physically and therefore provides faster access but requires maintenance.
42. Distinguish between materialised views with aggregates and materialised views containing only joins.
A materialised view with aggregates stores precomputed summaries such as sums or counts. A join-only materialised view stores the joined result of tables without aggregation, mainly to avoid repeated expensive joins.
43. Distinguish between normalisation and row splitting in vertical partitioning.
Normalisation divides data into related tables to remove redundancy and update anomalies. Row splitting separates frequently used and infrequently used columns mainly to improve access performance.
Part D: Exam Style Q and A
44. Explain the physical design process in a data warehouse.
The physical design process converts the logical warehouse model into an efficient implementation on actual hardware and DBMS infrastructure. It includes defining standards, planning aggregates, deciding partitioning schemes, selecting clustering options, preparing indexing strategies, assigning storage structures, and completing the physical schema. Its purpose is to ensure that the warehouse performs efficiently, scales properly, uses storage effectively, and remains easy to administer. Unlike logical design, physical design focuses on operational behaviour rather than only structural correctness.
45. Explain why partitioning is necessary in a data warehouse.
Partitioning is necessary because warehouse fact tables grow to very large sizes and become difficult to manage as single units. It improves manageability by breaking large tables into smaller segments, supports backup and recovery by allowing operations on selected partitions, and improves performance by ensuring that queries scan only relevant subsets of data. It also supports easier loading and archival operations. Thus partitioning is both a performance technique and an administrative control mechanism.
46. Explain the different methods of horizontal partitioning.
Horizontal partitioning divides a table by rows. One method is partitioning by time into equal segments, such as monthly or yearly partitions; this is useful when queries follow fixed time windows. Another is partitioning by time into different-sized segments, where recent data is kept in small detailed partitions and older data in larger ones. A third is partitioning on another stable dimension, such as region or product group, which is useful when access is strongly concentrated on that dimension. A fourth is partitioning by table size, where a new partition is created whenever a size threshold is reached; this is used only when no meaningful business dimension is available.
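Two of the methods above, equal time segments and the size-based fallback, can be sketched side by side. Note how the size-based method needs extra metadata to record which dates landed in which partition, which is exactly why it is harder to manage. All names and values are illustrative assumptions.

```python
# Sketch: time-based routing vs. size-based fallback routing.
rows = [{"sale_date": "2024-01-05"}, {"sale_date": "2024-01-20"},
        {"sale_date": "2024-02-01"}, {"sale_date": "2024-02-15"}]

# Method 1: equal time segments -- one partition per calendar month.
by_time = {}
for row in rows:
    by_time.setdefault(row["sale_date"][:7], []).append(row)

# Method 4 (fallback): open a new partition each time a size threshold is
# reached, and keep metadata saying which dates each partition holds.
THRESHOLD = 3
by_size, metadata = {}, {}
for i, row in enumerate(rows):
    pid = i // THRESHOLD                     # partition id from row count alone
    by_size.setdefault(pid, []).append(row)
    metadata.setdefault(pid, []).append(row["sale_date"])
```

With time-based routing, the partition name itself says what it contains; with size-based routing, only the metadata does.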
47. Explain vertical partitioning and its usefulness.
Vertical partitioning divides a table by columns. It is useful when a table contains many attributes but most queries use only a subset of them. By separating frequently used columns from infrequently used columns, the main table becomes narrower and faster to access. Vertical partitioning can be achieved through normalisation or row splitting. It is beneficial only when the system does not frequently need major joins between the resulting partitions.
48. Explain the importance of choosing the correct partition key.
The partition key determines how the table is physically divided and how many partitions a query must access. A good partition key aligns with the most common access pattern so that most queries touch only relevant partitions. For example, if most users query region-specific data, region is a better partition key than transaction date. If a wrong key is selected, queries may need to scan many irrelevant partitions, reducing performance and possibly forcing later repartitioning.
49. Explain indexing strategy in a data warehouse.
Indexing strategy in a data warehouse involves deciding which columns should be indexed, what type of indexes should be used, and how many indexes can be supported without harming loading performance. B-Tree indexes are appropriate for highly selective columns, while bitmap indexes are suitable for low-cardinality attributes common in warehouse analysis. Fact tables often need indexing on composite keys formed from connected dimension keys. The strategy must improve retrieval speed while controlling maintenance overhead.
50. Explain the role of aggregation in data warehouse performance.
Aggregation improves warehouse performance by reducing the amount of detailed data that must be processed during analytical queries. Common summaries are precomputed and stored so that repeated calculations do not need to be performed each time. This is especially important in decision support systems where users frequently ask multidimensional summary questions. Aggregation must be planned carefully because while it reduces query time, it also increases storage and maintenance cost.
51. Explain how summary tables should be designed.
Summary tables should be designed by first identifying which dimensions are to be aggregated, then deciding whether multiple related measures should be stored together, and choosing the correct level of summarisation. Dimensions not affected by aggregation should remain available in the summary table so that analysis remains flexible. Time must be represented clearly, and the summary table itself should be indexed. The design must ensure that the benefit in query speed is greater than the cost of maintaining the summary.
52. Explain materialised views and their importance in OLAP environments.
Materialised views are stored results of queries that are physically maintained in the database. They are important in OLAP environments because many queries involve expensive joins and aggregations over very large data sets. By precomputing frequently needed results, materialised views reduce response time significantly. Their usefulness depends on workload patterns, and their main limitation is the need for maintenance whenever underlying tables change.
53. Explain the different types of materialised views.
Materialised views may be of three main types. Materialised views with aggregates store precomputed grouped results such as totals or averages. Materialised views containing only joins store the result of joining base tables without summarisation, mainly to avoid repeated expensive join operations. Nested materialised views are built on top of other materialised views and are useful when multiple higher-level summaries depend on a common lower-level precomputed result.
54. Explain the importance of the TOTALSALES example in understanding materialisation.
The TOTALSALES example shows how one precomputed result can support multiple later queries. Instead of repeatedly joining the Sales table with Product and Location and aggregating each time, sales are first summarised by product identifier and location identifier and stored in a materialised structure. Later category-wise and state-wise analyses can be performed on this stored result. This demonstrates that materialisation can reduce repeated lower-level processing and accelerate several related analytical queries at once.
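The TOTALSALES idea can be sketched in SQLite: Sales is summarised once by product and location identifiers, and the category-wise and state-wise analyses are then answered from that stored result without rejoining the detailed Sales table. The dimension values and amounts are invented for illustration.

```python
# Sketch of the TOTALSALES materialisation pattern.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Product  (prod_id INTEGER, category TEXT);
CREATE TABLE Location (loc_id INTEGER, state TEXT);
CREATE TABLE Sales    (prod_id INTEGER, loc_id INTEGER, amount INTEGER);
INSERT INTO Product  VALUES (1, 'Food'), (2, 'Toys');
INSERT INTO Location VALUES (10, 'KA'), (20, 'TN');
INSERT INTO Sales    VALUES (1, 10, 100), (1, 20, 200), (2, 10, 50), (2, 20, 25);

-- Lower-level materialised result, built once from the detailed Sales table.
CREATE TABLE TOTALSALES AS
  SELECT prod_id, loc_id, SUM(amount) AS total
  FROM Sales GROUP BY prod_id, loc_id;
""")

# Two higher-level analyses reuse TOTALSALES instead of re-aggregating Sales.
by_category = dict(con.execute("""
    SELECT p.category, SUM(t.total) FROM TOTALSALES t
    JOIN Product p ON p.prod_id = t.prod_id GROUP BY p.category""").fetchall())
by_state = dict(con.execute("""
    SELECT l.state, SUM(t.total) FROM TOTALSALES t
    JOIN Location l ON l.loc_id = t.loc_id GROUP BY l.state""").fetchall())
```

Both roll-ups read the small precomputed result, which is the nested-materialisation benefit described in question 35.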
Part E: Common Misconceptions
55. Is partitioning the same as indexing?
No. Partitioning divides the table physically into segments. Indexing creates access structures to locate data efficiently.
56. Is aggregation the same as a materialised view?
No. Aggregation is a type of summarisation. A materialised view is any stored query result and may contain aggregates, joins, or both.
57. Is vertical partitioning mainly used for time-based division?
No. Time-based division is a form of horizontal partitioning. Vertical partitioning divides columns, not time periods.
58. Is bitmap indexing better than B-Tree indexing in all cases?
No. Bitmap indexing is best only for low-cardinality columns. B-Tree indexing is better for highly selective columns.
59. Is partitioning by dimension always better than partitioning by time?
No. Partitioning by dimension is beneficial only when that dimension drives most queries and is stable over time. Otherwise time-based partitioning is safer.
60. Do materialised views remove the need for base tables?
No. Materialised views are derived structures. Base tables remain the primary source of detailed data.