Question 1
A group of 12 sales price records is sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
Partition the data into 3 bins using:
- Equal-frequency (equal-depth) partitioning
- Equal-width partitioning
- Clustering
Further, using the equal-frequency bins, smooth the data by:
- Bin means
- Bin medians
1. Equal-frequency partitioning
- Idea: each bin must contain the same number of data points.
- Total values = 12, bins = 3 → 12 / 3 = 4 values per bin
- Since the data is already sorted, just split every 4 values:
- Bin 1: 5, 10, 11, 13
- Bin 2: 15, 35, 50, 55
- Bin 3: 72, 92, 204, 215
- Explanation: We don’t care about the numeric range here; we only ensure each bin holds the same number of values, which balances the distribution across bins.
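A minimal sketch of the split in Python, assuming the input list is already sorted as given:

```python
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
n_bins = 3

# Equal depth: 12 values / 3 bins = 4 values per bin
depth = len(data) // n_bins
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

print(bins)
# [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]
```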
2. Equal-width partitioning
- Idea: each bin covers the same numeric range (width).
- Compute width: Width = (max − min) / bins = (215 − 5) / 3 = 210 / 3 = 70
- Create intervals starting from minimum value:
- Bin 1: 5 to 75 → contains: 5, 10, 11, 13, 15, 35, 50, 55, 72
- Bin 2: 75 to 145 → contains: 92
- Bin 3: 145 to 215 → contains: 204, 215
- Place values into bins based on where they fall.
- Explanation: Here we ignore how many values go into each bin. Instead, we divide the value range evenly, which can lead to uneven data distribution.
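The same assignment can be sketched in Python; the maximum value is clamped into the last bin so that 215 does not fall outside the final interval:

```python
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
n_bins = 3
lo, hi = min(data), max(data)
width = (hi - lo) / n_bins          # (215 - 5) / 3 = 70.0

bins = [[] for _ in range(n_bins)]
for x in data:
    # Index of the interval containing x; clamp the maximum into the last bin
    idx = min(int((x - lo) // width), n_bins - 1)
    bins[idx].append(x)

print(bins)
# [[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]
```

Note the skew: nine values land in the first bin and only one in the second, which illustrates the uneven distribution mentioned above.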
3. Clustering
- Idea: group values based on natural similarity or gaps in data.
- Observing the data:
- Small values close together → 5, 10, 11, 13, 15
- Medium values grouped → 35, 50, 55, 72, 92
- Large values far apart → 204, 215
- So clusters are:
- Cluster 1: 5, 10, 11, 13, 15
- Cluster 2: 35, 50, 55, 72, 92
- Cluster 3: 204, 215
- Explanation: Unlike binning, clustering does not follow fixed rules. It groups values based on data distribution, often where there are large gaps.
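One simple way to formalise this gap-based grouping is to cut the sorted data at its two largest gaps. This is a sketch, not a standard clustering algorithm; note that the 15→35 gap ties the 72→92 gap (both 20), so ties are broken toward the earlier position to reproduce the grouping above:

```python
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

# Gap between each pair of consecutive sorted values, with its position
gaps = [(data[i + 1] - data[i], i) for i in range(len(data) - 1)]

# Cut at the two largest gaps; ties broken toward the earlier position
cuts = sorted(i for _, i in sorted(gaps, key=lambda g: (-g[0], g[1]))[:2])

clusters, start = [], 0
for cut in cuts:
    clusters.append(data[start:cut + 1])
    start = cut + 1
clusters.append(data[start:])

print(clusters)
# [[5, 10, 11, 13, 15], [35, 50, 55, 72, 92], [204, 215]]
```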
4. Smoothing by bin means
- Idea: replace all values in a bin with the average (mean) to reduce variation.
- Using equal-frequency bins:
- Bin 1 mean: (5 + 10 + 11 + 13) / 4 = 9.75
- Bin 2 mean: (15 + 35 + 50 + 55) / 4 = 38.75
- Bin 3 mean: (72 + 92 + 204 + 215) / 4 = 145.75
- Replace each value with the bin mean:
- Bin 1 → 9.75, 9.75, 9.75, 9.75
- Bin 2 → 38.75, 38.75, 38.75, 38.75
- Bin 3 → 145.75, 145.75, 145.75, 145.75
- Explanation: This reduces noise by making all values in a bin uniform, but may be affected by extreme values (e.g. 204, 215).
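The mean smoothing above is a one-liner over the equal-frequency bins:

```python
bins = [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]

# Replace every value in a bin with that bin's mean
smoothed = [[sum(b) / len(b)] * len(b) for b in bins]

print(smoothed)
# [[9.75, 9.75, 9.75, 9.75], [38.75, 38.75, 38.75, 38.75],
#  [145.75, 145.75, 145.75, 145.75]]
```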
5. Smoothing by bin medians
- Idea: replace values with the median (middle value), which is more robust to outliers.
- Each bin has 4 values → median = average of middle two:
- Bin 1 median: (10 + 11) / 2 = 10.5
- Bin 2 median: (35 + 50) / 2 = 42.5
- Bin 3 median: (92 + 204) / 2 = 148
- Replace values:
- Bin 1 → 10.5, 10.5, 10.5, 10.5
- Bin 2 → 42.5, 42.5, 42.5, 42.5
- Bin 3 → 148, 148, 148, 148
- Explanation: Median smoothing is preferred when data has extreme values, because it is less sensitive to them than the mean.
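Median smoothing follows the same pattern, using `statistics.median`, which averages the two middle values for even-sized bins:

```python
from statistics import median

bins = [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]

# Replace every value in a bin with that bin's median
smoothed = [[median(b)] * len(b) for b in bins]

print(smoothed)
# [[10.5, 10.5, 10.5, 10.5], [42.5, 42.5, 42.5, 42.5],
#  [148.0, 148.0, 148.0, 148.0]]
```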
Key Points
- Equal-frequency → equal number of values per bin
- Equal-width → equal numeric range per bin
- Clustering → groups based on natural data distribution
- Mean smoothing → reduces variation but sensitive to outliers
- Median smoothing → reduces variation and handles outliers better
Question 2
Give reasons why an OLTP system must have a normalized schema structure compared to a DWH. [5 marks]
Answer
An OLTP system must use a normalised schema because it handles a very large number of short insert, update, and delete transactions. Normalisation reduces redundancy, prevents update anomalies, and supports data integrity in multi-user transactional environments. OLTP databases store detailed and current operational data and are usually designed using entity models such as 3NF.
A data warehouse, in contrast, is built for analysis rather than transaction processing. It stores historical and aggregated data and is usually organised using denormalised dimensional schemas such as star or snowflake. Denormalisation reduces the number of joins and improves performance for complex analytical queries and reporting.
Question 3
Why do companies deploy data warehousing systems? [2 marks]
Answer
Companies deploy data warehousing systems to integrate data from multiple sources into a common repository for analysis and decision-making. Transactional databases cannot answer many complex business questions efficiently, whereas a data warehouse provides consolidated, historical, and analysable data for reporting and business intelligence.
Question 3(b)
What is a staging area and what is its purpose in data warehousing? [3 marks]
Answer
A staging area is an intermediate storage area where data extracted from different source systems is accumulated before loading into the warehouse. At a predefined cutoff time, the staged data is transformed and then loaded into the warehouse. There is usually no end-user access to the staging area.
Its purpose is to support validation, cleansing, transformation, and integration without degrading the performance of the source systems. It also provides a controlled point where extracted data can be checked before it moves into the warehouse.
Question 4
Suppose that a hospital data warehouse consists of the three dimensions time, doctor, and patient, and a measure for the fees charged for a visit.
(a) Draw a schema diagram for the above data warehouse. [2 marks]
(b) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004? [3 marks]
Answer to 4(a)
This should be modelled as a star schema.
Fact table: Visit_Fact
- Time_Key
- Doctor_Key
- Patient_Key
- Fees_Charged
Dimension tables
- Time_Dimension
- Doctor_Dimension
- Patient_Dimension
A star schema has a central fact table connected directly to dimension tables. The fact table stores the measure and foreign keys; the dimensions store descriptive attributes.
Answer to 4(b)
Starting from base cuboid [day, doctor, patient], the required steps are:
- Roll up the time dimension from day to year. Roll-up aggregates data to a higher level of a concept hierarchy.
- Slice for time = 2004. Slice selects one particular value of a dimension and produces a sub-cube.
- Roll up the patient dimension from individual patient to all patients, so that the result is grouped only by doctor. This is summarisation by dimension reduction.
Final output: total fee collected by each doctor in 2004.
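The slice and aggregation steps can be sketched with plain Python over hypothetical fact rows (the dates, doctor names, and fees below are illustrative, not from the question):

```python
# Hypothetical fact rows at the base-cuboid grain [day, doctor, patient]
visits = [
    ("2004-01-05", "Smith", "P1", 100),
    ("2004-01-05", "Jones", "P2", 150),
    ("2004-02-10", "Smith", "P3", 200),
    ("2003-12-01", "Jones", "P1", 120),  # outside 2004, removed by the slice
]

# Slice on time: keep only visits whose day falls in year 2004
in_2004 = [(doctor, fee) for day, doctor, _patient, fee in visits
           if day.startswith("2004")]

# Dimension reduction: aggregate away patient, grouping only by doctor
total_by_doctor = {}
for doctor, fee in in_2004:
    total_by_doctor[doctor] = total_by_doctor.get(doctor, 0) + fee

print(total_by_doctor)
# {'Smith': 300, 'Jones': 150}
```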
Question 5
The attributes of Employee in the source system are Employee id, Name, Salary, Designation, Joining date and Office location.
Assume you created a dimension table in the data warehouse for Employee and the initial load is performed. Give the dimension table structure.
Identify the type of changes (A) and (B) mentioned below and explain HOW the changes, once identified, will be implemented in the DWH and WHY? [2 Marks]
Show the before and after image of the dimension table. Additional fact: Salary of Developer is 10k and of a Lead is 40k. [3 Marks]
The uploaded question text does not include the actual descriptions of changes (A) and (B), so they cannot be identified exactly from the file. What can be answered confidently is the correct employee dimension structure and the SCD handling logic that should be applied once A and B are known.
Employee Dimension Structure
A suitable employee dimension table is:
| Emp_Key | Employee_ID | Name | Salary | Designation | Joining_Date | Office_Location | Start_Date | End_Date | Current_Flag |
|---|---|---|---|---|---|---|---|---|---|
Emp_Key is the surrogate key. The business key is Employee_ID. The extra columns Start_Date, End_Date, and Current_Flag are useful if Type 2 history needs to be maintained. Type 2 commonly uses a new surrogate key for each historical version together with administrative history columns.
How to Identify the Change Type
If the change is a correction of a wrong value, it is handled as Type 1 SCD. Type 1 overwrites the old value and does not preserve history. This is suitable for spelling corrections or cleanup changes.
If the change represents a real historical change such as employee transfer, promotion, or salary revision that must be preserved over time, it is handled as Type 2 SCD. Type 2 inserts a new row with a new surrogate key and keeps the old row for historical reporting.
Illustrative Before and After
Before change
| Emp_Key | Employee_ID | Name | Salary | Designation | Joining_Date | Office_Location | Start_Date | End_Date | Current_Flag |
|---|---|---|---|---|---|---|---|---|---|
| 101 | E001 | Ravi | 10000 | Developer | 2022-01-10 | Chennai | 2022-01-10 | NULL | Y |
If the change is Type 1
Suppose only a correction is needed, such as fixing office location spelling. The row is overwritten.
| Emp_Key | Employee_ID | Name | Salary | Designation | Joining_Date | Office_Location | Start_Date | End_Date | Current_Flag |
|---|---|---|---|---|---|---|---|---|---|
| 101 | E001 | Ravi | 10000 | Developer | 2022-01-10 | Chennai HQ | 2022-01-10 | NULL | Y |
If the change is Type 2
Suppose Ravi is promoted from Developer to Lead, and salary changes from 10k to 40k. The old row is closed and a new row is inserted.
| Emp_Key | Employee_ID | Name | Salary | Designation | Joining_Date | Office_Location | Start_Date | End_Date | Current_Flag |
|---|---|---|---|---|---|---|---|---|---|
| 101 | E001 | Ravi | 10000 | Developer | 2022-01-10 | Chennai | 2022-01-10 | 2025-09-20 | N |
| 205 | E001 | Ravi | 40000 | Lead | 2022-01-10 | Chennai | 2025-09-21 | NULL | Y |
This preserves history, which is the reason Type 2 is chosen when past values matter for reporting.
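A minimal sketch of the Type 2 update logic, using the rows from the tables above (the helper function and its parameters are illustrative, not a standard API):

```python
# Current dimension row, matching the "before" table above
dim = [{"Emp_Key": 101, "Employee_ID": "E001", "Name": "Ravi",
        "Salary": 10000, "Designation": "Developer",
        "Joining_Date": "2022-01-10", "Office_Location": "Chennai",
        "Start_Date": "2022-01-10", "End_Date": None, "Current_Flag": "Y"}]

def apply_scd_type2(dim, employee_id, changes, end_date, start_date, next_key):
    """Close the current row and insert a new version with a new surrogate key."""
    for row in dim:
        if row["Employee_ID"] == employee_id and row["Current_Flag"] == "Y":
            new_row = {**row, **changes, "Emp_Key": next_key,
                       "Start_Date": start_date, "End_Date": None,
                       "Current_Flag": "Y"}
            row["End_Date"] = end_date    # close the old version
            row["Current_Flag"] = "N"
            dim.append(new_row)
            return

# Ravi is promoted to Lead; salary changes from 10k to 40k
apply_scd_type2(dim, "E001", {"Salary": 40000, "Designation": "Lead"},
                end_date="2025-09-20", start_date="2025-09-21", next_key=205)
```

After the call, `dim` holds both rows: the closed Developer row (Current_Flag = N) and the new Lead row with surrogate key 205, exactly as in the "after" table.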
Question 5
Distinguish between
a) OLTP vs OLAP [1 Mark]
b) DWH vs DM [2 Marks]
c) ROLAP vs MOLAP. [2 Marks]
5(a) OLTP vs OLAP
| OLTP | OLAP |
|---|---|
| Supports day-to-day transactions | Supports analytical queries |
| Detailed and current data | Aggregated and historical data |
| Many short transactions | Fewer but complex queries |
| Highly normalised schema | Star or snowflake schema |
| Optimised for transaction speed | Optimised for analytical response time |
5(b) DWH vs DM
A Data Warehouse is an enterprise-wide integrated repository containing historical data from multiple subject areas. A Data Mart is a smaller, subject-oriented or department-oriented subset focused on a specific business process or analytical area. Rows in the bus matrix often correspond to data marts, and first-level data marts are derived from single primary source systems.
5(c) ROLAP vs MOLAP
| ROLAP | MOLAP |
|---|---|
| Stores data in standard relational tables | Stores data in proprietary multidimensional structures |
| Uses star or snowflake schemas | Uses pre-calculated cubes |
| Better scalability for large data volumes | Very fast query response |
| Slower for complex analytical operations | Better for slicing, dicing, and complex calculations |
| Easier drill-through to detail | Practical size limitations and extra investment |
Question 6
Describe the role of Meta Data in DWH and explain its types. [5 marks]
Answer
Metadata is data about data. In a data warehouse, metadata describes the warehouse structure, source-to-target mappings, business meanings, transformation logic, load history, and operational status. It supports integration, ETL control, report consistency, governance, and user understanding. ETL tools in the lesson explicitly refer to metadata management and business/process metadata management, which shows that metadata is central to warehouse operation.
Types of Metadata
1. Technical metadata
This describes structural and technical details such as table definitions, column names, data types, keys, indexes, schema structures, and ETL mappings. It is mainly used by developers, DBAs, and ETL designers. The lesson discusses schema integration, attribute naming inconsistency, and logical data maps, all of which belong to technical metadata.
2. Business metadata
This describes the business meaning of data elements, calculations, KPIs, report labels, and business definitions. It helps users understand what warehouse data means in business terms rather than technical terms. The lecture material on conformed dimensions, measures, and reporting logic depends on this business layer of meaning.
3. Operational metadata
This records operational details such as ETL run time, refresh history, load status, validation results, error logs, and job execution information. The ETL lesson discusses load verification, recovery from load failure, and ongoing ETL control, which are all operational metadata concerns.
Role of Metadata in DWH
Metadata is important because it:
- explains the structure and meaning of warehouse data,
- supports ETL mapping and transformation,
- records lineage from source to warehouse,
- supports load monitoring and recovery,
- improves governance and consistency,
- helps users trust and interpret reports correctly.