Model Answers with Explanations

Q1(A) – Why must an OLTP system have a Normalized Schema? [5 Marks]

The Core Idea

OLTP = Online Transaction Processing. Think of it as the system your bank uses every time you swipe your card. It handles thousands of tiny, real-time transactions every second – inserts, updates, deletes.

Normalization means removing redundancy: each piece of data is stored in exactly ONE place.

An OLTP system is designed to process day-to-day business transactions such as insert, update, delete, and short retrieval operations. Because of this, it must use a highly normalized schema to reduce data redundancy and maintain data consistency. 

In a normalized structure, the same piece of information is stored only once. This prevents update anomalies, insertion anomalies, and deletion anomalies. This is very important in transactional systems because the data changes frequently and must always remain accurate.

A normalized schema also improves integrity by enforcing relationships through keys and constraints. OLTP applications handle many concurrent users performing transactions. A compact and structured schema helps maintain correctness. It also enhances efficiency in transaction processing. It also reduces storage wastage because duplicate data is minimized.

In contrast, a data warehouse is designed mainly for analysis and querying rather than frequent updates. Therefore, it is often denormalised into star or snowflake schema. This allows complex analytical queries to run faster and more easily. So, OLTP requires normalization for operational efficiency and integrity, whereas DWH prefers denormalisation for analytical performance.

5 Key Reasons for Normalization

1. Avoid Data Anomalies

  • Insertion Anomaly: Without normalization, you can’t add a new customer unless they have placed an order in the past. Insertion anomaly means the table structure forces you to enter unrelated data together. You cannot store a new customer’s details unless that customer has already placed an order. The table is designed to save customer information and order information in the same row. Even if you want to add a customer now, you must wait because the database requires an existing order.
  • Update Anomaly: A customer’s city stored in 1000 rows – change one, risk mismatch.
  • Deletion Anomaly: Deleting the last order could accidentally delete the customer record.

2. Data Integrity

Each fact lives in one place. When a salary is updated, it updates everywhere automatically – no stale copies floating around.

3. Efficient Writes (INSERT/UPDATE/DELETE)

OLTP is write-heavy. Normalized tables are smaller, so writes touch fewer rows and fewer pages – faster!

4. Reduces Storage

No duplicate data = smaller disk usage. Huge benefit when you’re storing millions of daily transactions.

5. Supports Concurrent Users

Normalization leads to finer-grained locking. Multiple users can update different rows of different tables simultaneously without stepping on each other’s toes.

Q2(A) – Why do companies deploy Data Warehousing systems? [2 Marks]

Companies use Data Warehouses (DWH) to answer strategic questions that OLTP systems can’t handle efficiently:

  • Historical Analysis: OLTP only keeps current state. DWH stores years of data for trend analysis.
  • Fast Reporting: Running a complex sales report on OLTP would slow down live transactions. DWH is optimized for heavy reads.
  • Unified View: Data comes from multiple sources (CRM, ERP, website) – DWH integrates it all into one clean dataset.
  • Decision Making: Managers use it for dashboards, KPIs, and forecasts – not possible with raw OLTP data.

💡 Memory Hook: OLTP asks ‘What happened just now?’ – DWH asks ‘What has been happening over the past 5 years?’

Q2(B) – What is a Staging Area and its Purpose? [3 Marks]

What is it?

The Staging Area is a temporary storage zone between the source systems (OLTP databases) and the Data Warehouse. Think of it as a ‘holding room’ or ‘kitchen prep counter’ before data is served to the warehouse.

Purpose / Why it matters:

  • Data Extraction Without Impact: Data is first pulled from source OLTP systems into the staging area. This way, the OLTP is not burdened by complex transformation queries.
  • Data Cleansing: Here, dirty, inconsistent, or duplicate data gets cleaned. E.g., ‘M’/’Male’/’m’ → standardized to ‘Male’.
  • Data Transformation: Format changes happen here – date formats, currency conversions, code lookups.
  • Data Integration: Multiple source systems are combined here before loading into the DWH.
  • Auditability & Recovery: If the DWH load fails, staging data can be reused without re-extracting from source.

💡 Analogy: Think of a staging area as an airport transit lounge. Passengers (data) wait here before they continue to their final destination (DWH).

Q3 – Hospital Data Warehouse (Star Schema + OLAP) [5 Marks]

(a) Star Schema Diagram – Described

3.  Suppose that a hospital data warehouse consists of the three dimensions time, doctor, and patient and a measure for fees charged for a visit?  

(a)   Draw a schema diagram for the above data warehouse. [2marks]

(b) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004? [3marks]

How to build the STAR schema

For a question like this, build the star schema in three simple steps.

First, identify the measure.
Here, the measure is fees charged for a visit. The measure always goes into the fact table.

Second, identify the dimensions given in the question.
Here, the dimensions are time, doctor, and patient. Each of these becomes a separate dimension table.

Third, connect all dimension tables to the central fact table.
So the fact table will contain the foreign keys of all dimensions plus the measure.

Visit_Fact

  • Time_ID
  • Doctor_ID
  • Patient_ID
  • Fees_Charged

Then create the dimension tables:

Time_Dim with attributes like Date, Month, Quarter, Year
Doctor_Dim with attributes like Doctor_ID, Doctor_Name, Specialisation
Patient_Dim with attributes like Patient_ID, Patient_Name, Gender, Age

So the rule is:

Dimensions become surrounding dimension tables
Numeric value to be analyzed becomes the fact table measure
Fact table stays in the center and links to all dimensions

That is the reason for its name, star schema. One central fact table sits with dimension tables around it, resembling a star.

For this question, the final idea is:

Visit_Fact(Time_ID, Doctor_ID, Patient_ID, Fees_Charged)
connected to
Time_Dim, Doctor_Dim, and Patient_Dim.

Sample Snowflake Schema Question

The hospital data warehouse includes three dimensions: time, doctor, and patient. It also measures the fees charged for a visit. If the dimensions are further normalized such that time is divided into date, month, quarter, and year, and doctor is divided into doctor, specialization, and department, then:

(a) Draw a snowflake schema diagram for the above data warehouse.
(b) Identify the fact table, dimension tables, and the measure.

For this hospital data warehouse, the fact table is:

Visit_Fact(Time_ID, Doctor_ID, Patient_ID, Fees_Charged)

Fees_Charged is the measure. Time_ID, Doctor_ID, and Patient_ID are foreign keys. They link the fact table to the dimensions. The correct main dimension tables are:

Date_Dim(Time_ID, Date, Month_ID)
Doctor_Dim(Doctor_ID, Doctor_Name, Specialization_ID, Department_ID)
Patient_Dim(Patient_ID, Patient_Name, Gender, Age)

These are the tables that should connect directly to Visit_Fact.

Then the normalized branches should be shown as follows:

Date_Dim -> Month_Dim(Month_ID, Month, Quarter_ID)
Month_Dim -> Quarter_Dim(Quarter_ID, Quarter, Year_ID)
Quarter_Dim -> Year_Dim(Year_ID, Year)

and

Doctor_Dim -> Specialisation_Dim(Specialisation_ID, Specialisation_Name)
Doctor_Dim -> Department_Dim(Department_ID, Department_Name)

The Patient_Dim table can remain as a single dimension table. The question does not require it to be further normalized.

So the correct structure is:

Visit_Fact -> Date_Dim -> Month_Dim -> Quarter_Dim -> Year_Dim
Visit_Fact -> Doctor_Dim -> Specialisation_Dim
Visit_Fact -> Doctor_Dim -> Department_Dim
Visit_Fact -> Patient_Dim

(b) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004?

OLAP Operations to list total fee per doctor in 2004

Starting from base cuboid: [day, doctor, patient]

Step 1: Roll-Up on Time

Roll up from ‘day’ to ‘year’. This collapses daily data into annual totals.

Result: [year, doctor, patient]

Step 2: Slice on Time = 2004

Fix the year dimension to 2004. This filters only the year 2004 records.

Result: [doctor, patient] for year = 2004

Step 3: Roll-Up on Patient (Drill Across / Aggregate)

Sum the fees across all patients for each doctor. This aggregates patient dimension away.

Final Result: [doctor] → total_fees for 2004

💡 Memory Hook: Roll Up = Zooming Out (day → year). Slice = Applying a filter. Aggregate = Summing up values.

Q4 – Slowly Changing Dimensions (SCD) [5 Marks]

Initial Dimension Table Structure

Surrogate KeyEmp_IDNameSalaryDesignationJoining_DateOffice_LocationStart_DateEnd_DateIs_Current
1E101Alice10,000Developer2020-01-01Mumbai2020-01-01NULLY

Change (A): Alice’s Office Location changes from Mumbai to Delhi

Type: SCD Type 1 – Overwrite

Why? Office location is not historically significant. Companies don’t need to track ‘where was the employee 3 years ago.’

How? Simply UPDATE the existing row. No new row is added.

Surrogate KeyEmp_IDNameOffice_LocationNote
BEFOREE101AliceMumbai
AFTERE101AliceDelhi← Row overwritten

Change (B): Alice gets promoted from Developer to Lead

Type: SCD Type 2 – Add New Row

Why? Promotion is historically important. HR needs to know: ‘What was Alice’s role when she worked on Project X in 2021?’ The old row must be preserved.

How? Mark the old row as expired (Is_Current = N, set End_Date). Insert a new row with new designation and updated salary.

Before and After (showing salary: Developer=10k, Lead=40k):

Surr_KeyEmp_IDNameSalaryDesignationStart_DateEnd_DateIs_Current
1E101Alice10,000Developer2020-01-012025-09-19N
2E101Alice40,000Lead2025-09-20NULLY

💡 Memory Hook: SCD Type 1 = Forget the past (overwrite). SCD Type 2 = Remember the past (add a new row with versioning).

Q5 – Distinguish Between Concepts [5 Marks]

(a) OLTP vs OLAP [1 Mark]

AspectOLTPOLAP
PurposeDay-to-day operationsBusiness analysis & reporting
Query TypeSimple: INSERT, UPDATE, DELETEComplex: GROUP BY, aggregations
Data VolumeCurrent data (MB-GB)Historical data (GB-TB)
UsersClerks, customers, employeesAnalysts, managers, executives
SchemaNormalized (3NF)Denormalized (Star/Snowflake)

(b) Data Warehouse (DWH) vs Data Mart (DM) [2 Marks]

AspectData WarehouseData Mart
ScopeEnterprise-wide, all departmentsSubject-specific (e.g., Sales only)
SizeVery large (TB-PB)Smaller subset
UsersAll business analystsSpecific department (e.g., Finance team)
Build TimeLong (months)Faster (weeks)
ExampleCentral corporate DWHSales DM, HR DM, Finance DM

💡 Analogy: DWH = A large central library. DM = A specialized section within the library (e.g., just the Science section).

(c) ROLAP vs MOLAP [2 Marks]

AspectROLAP (Relational OLAP)MOLAP (Multidimensional OLAP)
StorageRelational DB (SQL tables)Multidimensional cube (array)
Query SpeedSlower (SQL joins at runtime)Very fast (pre-aggregated cubes)
ScalabilityHigh – handles huge datasetsLimited – cube size can explode
FlexibilityMore flexible SQL queriesLess flexible but faster
ExampleOracle BI, TeradataMicrosoft SSAS, IBM TM1

💡 Memory Hook: ROLAP = Real-time SQL. MOLAP = Pre-baked cubes for speed.

Q6 – Role of Metadata in DWH and its Types [5 Marks]

What is Metadata?

Metadata is ‘data about data.’ In a Data Warehouse, it acts as the central nervous system. Without it, no one knows what data exists. People don’t know where it came from or how to use it.

Analogy: Think of metadata as the index and labels in a library. Without them, you have thousands of books (data) but no way to find or understand them.

Role of Metadata in DWH:

  • Guides the ETL process – tells the system what to extract, how to transform, and where to load.
  • Helps business users understand the meaning of data – what does ‘Total Revenue’ actually include?
  • Enables data lineage tracking – where did this number come from?
  • Supports query optimization – tells the query engine about indexes, partitions, and structure.
  • Facilitates data governance and auditing.

Types of Metadata:

TypeWhat it ContainsExample
Technical MetadataPhysical structure of DWH objectsTable names, column names, data types, indexes, ETL schedules
Business MetadataBusiness meaning and rules‘Revenue’ = Gross Sales minus Returns; KPI definitions
Operational MetadataRuntime info – how and when processes ranETL job start/end time, rows processed, errors logged
Process MetadataTracks data transformation stepsSource-to-target mappings, transformation logic used in ETL

💡 Memory Hook: Technical = What it is. Business = What it means. Operational = What happened when. Process = How it got here.

Metadata is typically stored in a Metadata Repository and accessed by BI tools and ETL engines alike. Without it, a DWH is essentially a black box.

Prepared as intuitive model answers for CC ZG515 – Data Warehousing | BITS Pilani

Leave a Reply