Model Answers with Explanations
Q1(A) – Why must an OLTP system have a Normalized Schema? [5 Marks]
The Core Idea
OLTP = Online Transaction Processing. Think of it as the system your bank uses every time you swipe your card. It handles thousands of tiny, real-time transactions every second – inserts, updates, deletes.
Normalization means removing redundancy: each piece of data is stored in exactly ONE place.
An OLTP system is designed to process day-to-day business transactions such as insert, update, delete, and short retrieval operations. Because of this, it must use a highly normalized schema to reduce data redundancy and maintain data consistency.
In a normalized structure, the same piece of information is stored only once. This prevents update anomalies, insertion anomalies, and deletion anomalies. This is very important in transactional systems because the data changes frequently and must always remain accurate.
A normalized schema also improves integrity by enforcing relationships through keys and constraints. OLTP applications handle many concurrent users performing transactions. A compact and structured schema helps maintain correctness. It also enhances efficiency in transaction processing. It also reduces storage wastage because duplicate data is minimized.
In contrast, a data warehouse is designed mainly for analysis and querying rather than frequent updates. Therefore, it is often denormalised into star or snowflake schema. This allows complex analytical queries to run faster and more easily. So, OLTP requires normalization for operational efficiency and integrity, whereas DWH prefers denormalisation for analytical performance.
5 Key Reasons for Normalization
1. Avoid Data Anomalies
- Insertion Anomaly: Without normalization, you can’t add a new customer unless they have placed an order in the past. Insertion anomaly means the table structure forces you to enter unrelated data together. You cannot store a new customer’s details unless that customer has already placed an order. The table is designed to save customer information and order information in the same row. Even if you want to add a customer now, you must wait because the database requires an existing order.
- Update Anomaly: A customer’s city stored in 1000 rows – change one, risk mismatch.
- Deletion Anomaly: Deleting the last order could accidentally delete the customer record.
2. Data Integrity
Each fact lives in one place. When a salary is updated, it updates everywhere automatically – no stale copies floating around.
3. Efficient Writes (INSERT/UPDATE/DELETE)
OLTP is write-heavy. Normalized tables are smaller, so writes touch fewer rows and fewer pages – faster!
4. Reduces Storage
No duplicate data = smaller disk usage. Huge benefit when you’re storing millions of daily transactions.
5. Supports Concurrent Users
Normalization leads to finer-grained locking. Multiple users can update different rows of different tables simultaneously without stepping on each other’s toes.
Q2(A) – Why do companies deploy Data Warehousing systems? [2 Marks]
Companies use Data Warehouses (DWH) to answer strategic questions that OLTP systems can’t handle efficiently:
- Historical Analysis: OLTP only keeps current state. DWH stores years of data for trend analysis.
- Fast Reporting: Running a complex sales report on OLTP would slow down live transactions. DWH is optimized for heavy reads.
- Unified View: Data comes from multiple sources (CRM, ERP, website) – DWH integrates it all into one clean dataset.
- Decision Making: Managers use it for dashboards, KPIs, and forecasts – not possible with raw OLTP data.
💡 Memory Hook: OLTP asks ‘What happened just now?’ – DWH asks ‘What has been happening over the past 5 years?’
Q2(B) – What is a Staging Area and its Purpose? [3 Marks]
What is it?
The Staging Area is a temporary storage zone between the source systems (OLTP databases) and the Data Warehouse. Think of it as a ‘holding room’ or ‘kitchen prep counter’ before data is served to the warehouse.
Purpose / Why it matters:
- Data Extraction Without Impact: Data is first pulled from source OLTP systems into the staging area. This way, the OLTP is not burdened by complex transformation queries.
- Data Cleansing: Here, dirty, inconsistent, or duplicate data gets cleaned. E.g., ‘M’/’Male’/’m’ → standardized to ‘Male’.
- Data Transformation: Format changes happen here – date formats, currency conversions, code lookups.
- Data Integration: Multiple source systems are combined here before loading into the DWH.
- Auditability & Recovery: If the DWH load fails, staging data can be reused without re-extracting from source.
💡 Analogy: Think of a staging area as an airport transit lounge. Passengers (data) wait here before they continue to their final destination (DWH).
Q3 – Hospital Data Warehouse (Star Schema + OLAP) [5 Marks]
(a) Star Schema Diagram – Described
3. Suppose that a hospital data warehouse consists of the three dimensions time, doctor, and patient and a measure for fees charged for a visit?
(a) Draw a schema diagram for the above data warehouse. [2marks]
(b) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004? [3marks]

How to build the STAR schema
For a question like this, build the star schema in three simple steps.
First, identify the measure.
Here, the measure is fees charged for a visit. The measure always goes into the fact table.
Second, identify the dimensions given in the question.
Here, the dimensions are time, doctor, and patient. Each of these becomes a separate dimension table.
Third, connect all dimension tables to the central fact table.
So the fact table will contain the foreign keys of all dimensions plus the measure.
Visit_Fact
- Time_ID
- Doctor_ID
- Patient_ID
- Fees_Charged
Then create the dimension tables:
Time_Dim with attributes like Date, Month, Quarter, Year
Doctor_Dim with attributes like Doctor_ID, Doctor_Name, Specialisation
Patient_Dim with attributes like Patient_ID, Patient_Name, Gender, Age
So the rule is:
Dimensions become surrounding dimension tables
Numeric value to be analyzed becomes the fact table measure
Fact table stays in the center and links to all dimensions
That is the reason for its name, star schema. One central fact table sits with dimension tables around it, resembling a star.
For this question, the final idea is:
Visit_Fact(Time_ID, Doctor_ID, Patient_ID, Fees_Charged)
connected to
Time_Dim, Doctor_Dim, and Patient_Dim.
Sample Snowflake Schema Question
The hospital data warehouse includes three dimensions: time, doctor, and patient. It also measures the fees charged for a visit. If the dimensions are further normalized such that time is divided into date, month, quarter, and year, and doctor is divided into doctor, specialization, and department, then:
(a) Draw a snowflake schema diagram for the above data warehouse.
(b) Identify the fact table, dimension tables, and the measure.
For this hospital data warehouse, the fact table is:
Visit_Fact(Time_ID, Doctor_ID, Patient_ID, Fees_Charged)
Fees_Charged is the measure. Time_ID, Doctor_ID, and Patient_ID are foreign keys. They link the fact table to the dimensions. The correct main dimension tables are:
Date_Dim(Time_ID, Date, Month_ID)
Doctor_Dim(Doctor_ID, Doctor_Name, Specialization_ID, Department_ID)
Patient_Dim(Patient_ID, Patient_Name, Gender, Age)
These are the tables that should connect directly to Visit_Fact.
Then the normalized branches should be shown as follows:
Date_Dim -> Month_Dim(Month_ID, Month, Quarter_ID)
Month_Dim -> Quarter_Dim(Quarter_ID, Quarter, Year_ID)
Quarter_Dim -> Year_Dim(Year_ID, Year)
and
Doctor_Dim -> Specialisation_Dim(Specialisation_ID, Specialisation_Name)
Doctor_Dim -> Department_Dim(Department_ID, Department_Name)
The Patient_Dim table can remain as a single dimension table. The question does not require it to be further normalized.
So the correct structure is:
Visit_Fact -> Date_Dim -> Month_Dim -> Quarter_Dim -> Year_Dim
Visit_Fact -> Doctor_Dim -> Specialisation_Dim
Visit_Fact -> Doctor_Dim -> Department_Dim
Visit_Fact -> Patient_Dim

(b) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004?
OLAP Operations to list total fee per doctor in 2004
Starting from base cuboid: [day, doctor, patient]
Step 1: Roll-Up on Time
Roll up from ‘day’ to ‘year’. This collapses daily data into annual totals.
Result: [year, doctor, patient]
Step 2: Slice on Time = 2004
Fix the year dimension to 2004. This filters only the year 2004 records.
Result: [doctor, patient] for year = 2004
Step 3: Roll-Up on Patient (Drill Across / Aggregate)
Sum the fees across all patients for each doctor. This aggregates patient dimension away.
Final Result: [doctor] → total_fees for 2004
💡 Memory Hook: Roll Up = Zooming Out (day → year). Slice = Applying a filter. Aggregate = Summing up values.
Q4 – Slowly Changing Dimensions (SCD) [5 Marks]
Initial Dimension Table Structure
| Surrogate Key | Emp_ID | Name | Salary | Designation | Joining_Date | Office_Location | Start_Date | End_Date | Is_Current |
|---|---|---|---|---|---|---|---|---|---|
| 1 | E101 | Alice | 10,000 | Developer | 2020-01-01 | Mumbai | 2020-01-01 | NULL | Y |
Change (A): Alice’s Office Location changes from Mumbai to Delhi
Type: SCD Type 1 – Overwrite
Why? Office location is not historically significant. Companies don’t need to track ‘where was the employee 3 years ago.’
How? Simply UPDATE the existing row. No new row is added.
| Surrogate Key | Emp_ID | Name | Office_Location | Note |
|---|---|---|---|---|
| BEFORE | E101 | Alice | Mumbai | |
| AFTER | E101 | Alice | Delhi | ← Row overwritten |
Change (B): Alice gets promoted from Developer to Lead
Type: SCD Type 2 – Add New Row
Why? Promotion is historically important. HR needs to know: ‘What was Alice’s role when she worked on Project X in 2021?’ The old row must be preserved.
How? Mark the old row as expired (Is_Current = N, set End_Date). Insert a new row with new designation and updated salary.
Before and After (showing salary: Developer=10k, Lead=40k):
| Surr_Key | Emp_ID | Name | Salary | Designation | Start_Date | End_Date | Is_Current |
|---|---|---|---|---|---|---|---|
| 1 | E101 | Alice | 10,000 | Developer | 2020-01-01 | 2025-09-19 | N |
| 2 | E101 | Alice | 40,000 | Lead | 2025-09-20 | NULL | Y |
💡 Memory Hook: SCD Type 1 = Forget the past (overwrite). SCD Type 2 = Remember the past (add a new row with versioning).
Q5 – Distinguish Between Concepts [5 Marks]
(a) OLTP vs OLAP [1 Mark]
| Aspect | OLTP | OLAP |
|---|---|---|
| Purpose | Day-to-day operations | Business analysis & reporting |
| Query Type | Simple: INSERT, UPDATE, DELETE | Complex: GROUP BY, aggregations |
| Data Volume | Current data (MB-GB) | Historical data (GB-TB) |
| Users | Clerks, customers, employees | Analysts, managers, executives |
| Schema | Normalized (3NF) | Denormalized (Star/Snowflake) |
(b) Data Warehouse (DWH) vs Data Mart (DM) [2 Marks]
| Aspect | Data Warehouse | Data Mart |
|---|---|---|
| Scope | Enterprise-wide, all departments | Subject-specific (e.g., Sales only) |
| Size | Very large (TB-PB) | Smaller subset |
| Users | All business analysts | Specific department (e.g., Finance team) |
| Build Time | Long (months) | Faster (weeks) |
| Example | Central corporate DWH | Sales DM, HR DM, Finance DM |
💡 Analogy: DWH = A large central library. DM = A specialized section within the library (e.g., just the Science section).
(c) ROLAP vs MOLAP [2 Marks]
| Aspect | ROLAP (Relational OLAP) | MOLAP (Multidimensional OLAP) |
|---|---|---|
| Storage | Relational DB (SQL tables) | Multidimensional cube (array) |
| Query Speed | Slower (SQL joins at runtime) | Very fast (pre-aggregated cubes) |
| Scalability | High – handles huge datasets | Limited – cube size can explode |
| Flexibility | More flexible SQL queries | Less flexible but faster |
| Example | Oracle BI, Teradata | Microsoft SSAS, IBM TM1 |
💡 Memory Hook: ROLAP = Real-time SQL. MOLAP = Pre-baked cubes for speed.
Q6 – Role of Metadata in DWH and its Types [5 Marks]
What is Metadata?
Metadata is ‘data about data.’ In a Data Warehouse, it acts as the central nervous system. Without it, no one knows what data exists. People don’t know where it came from or how to use it.
Analogy: Think of metadata as the index and labels in a library. Without them, you have thousands of books (data) but no way to find or understand them.
Role of Metadata in DWH:
- Guides the ETL process – tells the system what to extract, how to transform, and where to load.
- Helps business users understand the meaning of data – what does ‘Total Revenue’ actually include?
- Enables data lineage tracking – where did this number come from?
- Supports query optimization – tells the query engine about indexes, partitions, and structure.
- Facilitates data governance and auditing.
Types of Metadata:
| Type | What it Contains | Example |
|---|---|---|
| Technical Metadata | Physical structure of DWH objects | Table names, column names, data types, indexes, ETL schedules |
| Business Metadata | Business meaning and rules | ‘Revenue’ = Gross Sales minus Returns; KPI definitions |
| Operational Metadata | Runtime info – how and when processes ran | ETL job start/end time, rows processed, errors logged |
| Process Metadata | Tracks data transformation steps | Source-to-target mappings, transformation logic used in ETL |
💡 Memory Hook: Technical = What it is. Business = What it means. Operational = What happened when. Process = How it got here.
Metadata is typically stored in a Metadata Repository and accessed by BI tools and ETL engines alike. Without it, a DWH is essentially a black box.
Prepared as intuitive model answers for CC ZG515 – Data Warehousing | BITS Pilani
1 Comment