Question 1
A group of 12 sales price records is sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
Partition the data into 3 bins using:
- Equal-frequency (equal-depth) partitioning
- Equal-width partitioning
- Clustering
Further, using the equal-frequency bins, smooth the data by:
- Bin means
- Bin medians
1. Equal-frequency partitioning
- Idea: each bin must contain the same number of data points.
- Total values = 12, bins = 3 → 12 / 3 = 4 values per bin
- Since the data is already sorted, just split every 4 values:
- Bin 1: 5, 10, 11, 13
- Bin 2: 15, 35, 50, 55
- Bin 3: 72, 92, 204, 215
- Explanation: We don’t care about the numeric range here; we only ensure each bin holds the same number of values, which balances the distribution across bins.
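A minimal sketch of the split in Python, assuming the input list is already sorted as given:

```python
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
n_bins = 3

# Equal depth: 12 values / 3 bins = 4 values per bin
depth = len(data) // n_bins
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

print(bins)
# [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]
```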
2. Equal-width partitioning
- Idea: each bin covers the same numeric range (width).
- Compute width: Width = (max − min) / bins = (215 − 5) / 3 = 210 / 3 = 70
- Create intervals starting from minimum value:
- Bin 1: 5 to 75 → contains: 5, 10, 11, 13, 15, 35, 50, 55, 72
- Bin 2: 75 to 145 → contains: 92
- Bin 3: 145 to 215 → contains: 204, 215
- Place values into bins based on where they fall.
- Explanation: Here we ignore how many values go into each bin. Instead, we divide the value range evenly, which can lead to uneven data distribution.
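The same assignment can be sketched in Python; the maximum value is clamped into the last bin so that 215 does not fall outside the final interval:

```python
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
n_bins = 3
lo, hi = min(data), max(data)
width = (hi - lo) / n_bins          # (215 - 5) / 3 = 70.0

bins = [[] for _ in range(n_bins)]
for x in data:
    # Index of the interval containing x; clamp the maximum into the last bin
    idx = min(int((x - lo) // width), n_bins - 1)
    bins[idx].append(x)

print(bins)
# [[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]
```

Note the skew: nine values land in the first bin and only one in the second, which illustrates the uneven distribution mentioned above.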
3. Clustering
- Idea: group values based on natural similarity or gaps in data.
- Observing the data:
- Small values close together → 5, 10, 11, 13, 15
- Medium values grouped → 35, 50, 55, 72, 92
- Large values far apart → 204, 215
- So clusters are:
- Cluster 1: 5, 10, 11, 13, 15
- Cluster 2: 35, 50, 55, 72, 92
- Cluster 3: 204, 215
- Explanation: Unlike binning, clustering does not follow fixed rules. It groups values based on data distribution, often where there are large gaps.
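One simple way to formalise this gap-based grouping is to cut the sorted data at its two largest gaps. This is a sketch, not a standard clustering algorithm; note that the 15→35 gap ties the 72→92 gap (both 20), so ties are broken toward the earlier position to reproduce the grouping above:

```python
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

# Gap between each pair of consecutive sorted values, with its position
gaps = [(data[i + 1] - data[i], i) for i in range(len(data) - 1)]

# Cut at the two largest gaps; ties broken toward the earlier position
cuts = sorted(i for _, i in sorted(gaps, key=lambda g: (-g[0], g[1]))[:2])

clusters, start = [], 0
for cut in cuts:
    clusters.append(data[start:cut + 1])
    start = cut + 1
clusters.append(data[start:])

print(clusters)
# [[5, 10, 11, 13, 15], [35, 50, 55, 72, 92], [204, 215]]
```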
4. Smoothing by bin means
- Idea: replace all values in a bin with the average (mean) to reduce variation.
- Using equal-frequency bins:
- Bin 1 mean: (5 + 10 + 11 + 13) / 4 = 9.75
- Bin 2 mean: (15 + 35 + 50 + 55) / 4 = 38.75
- Bin 3 mean: (72 + 92 + 204 + 215) / 4 = 145.75
- Replace each value with the bin mean:
- Bin 1 → 9.75, 9.75, 9.75, 9.75
- Bin 2 → 38.75, 38.75, 38.75, 38.75
- Bin 3 → 145.75, 145.75, 145.75, 145.75
- Explanation: This reduces noise by making all values in a bin uniform, but may be affected by extreme values (e.g. 204, 215).
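The mean smoothing above is a one-liner over the equal-frequency bins:

```python
bins = [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]

# Replace every value in a bin with that bin's mean
smoothed = [[sum(b) / len(b)] * len(b) for b in bins]

print(smoothed)
# [[9.75, 9.75, 9.75, 9.75], [38.75, 38.75, 38.75, 38.75],
#  [145.75, 145.75, 145.75, 145.75]]
```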
5. Smoothing by bin medians
- Idea: replace values with the median (middle value), which is more robust to outliers.
- Each bin has 4 values → median = average of middle two:
- Bin 1 median: (10 + 11) / 2 = 10.5
- Bin 2 median: (35 + 50) / 2 = 42.5
- Bin 3 median: (92 + 204) / 2 = 148
- Replace values:
- Bin 1 → 10.5, 10.5, 10.5, 10.5
- Bin 2 → 42.5, 42.5, 42.5, 42.5
- Bin 3 → 148, 148, 148, 148
- Explanation: Median smoothing is preferred when data has extreme values, because it is less sensitive to them than the mean.
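Median smoothing follows the same pattern, using `statistics.median`, which averages the two middle values for even-sized bins:

```python
from statistics import median

bins = [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]

# Replace every value in a bin with that bin's median
smoothed = [[median(b)] * len(b) for b in bins]

print(smoothed)
# [[10.5, 10.5, 10.5, 10.5], [42.5, 42.5, 42.5, 42.5],
#  [148.0, 148.0, 148.0, 148.0]]
```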
Key Points
- Equal-frequency → equal number of values per bin
- Equal-width → equal numeric range per bin
- Clustering → groups based on natural data distribution
- Mean smoothing → reduces variation but sensitive to outliers
- Median smoothing → reduces variation and handles outliers better
Question 2
Give reasons why an OLTP system must have a normalized schema structure compared to a DWH. [5 marks]
Answer
An OLTP system must use a normalised schema because it handles a very large number of short insert, update, and delete transactions. Normalisation reduces redundancy, prevents update anomalies, and supports data integrity in multi-user transactional environments. OLTP databases store detailed and current operational data and are usually designed using entity models such as 3NF.
A data warehouse, in contrast, is built for analysis rather than transaction processing. It stores historical and aggregated data and is usually organised using denormalised dimensional schemas such as star or snowflake. Denormalisation reduces the number of joins and improves performance for complex analytical queries and reporting.
Question 3
Why do companies deploy data warehousing systems? [2 marks]
Answer
Companies deploy data warehousing systems to integrate data from multiple sources into a common repository for analysis and decision-making. Transactional databases cannot answer many complex business questions efficiently, whereas a data warehouse provides consolidated, historical, and analysable data for reporting and business intelligence.
Question 3(b)
What is a staging area and what is its purpose in data warehousing? [3 marks]
Answer
A staging area is an intermediate storage area where data extracted from different source systems is accumulated before loading into the warehouse. At a predefined cutoff time, the staged data is transformed and then loaded into the warehouse. There is usually no end-user access to the staging area.
Its purpose is to support validation, cleansing, transformation, and integration without degrading the performance of the source systems. It also provides a controlled point where extracted data can be checked before it moves into the warehouse.
Question 4
Suppose that a hospital data warehouse consists of the three dimensions time, doctor, and patient, and a measure for the fees charged for a visit.
(a) Draw a schema diagram for the above data warehouse. [2 marks]
(b) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004? [3 marks]
Answer to 4(a)
This should be modelled as a star schema.
Fact table: Visit_Fact
- Time_Key
- Doctor_Key
- Patient_Key
- Fees_Charged
Dimension tables
- Time_Dimension
- Doctor_Dimension
- Patient_Dimension
A star schema has a central fact table connected directly to dimension tables. The fact table stores the measure and foreign keys; the dimensions store descriptive attributes.
Answer to 4(b)
Starting from base cuboid [day, doctor, patient], the required steps are:
- Roll up the time dimension from day to year. Roll-up aggregates data to a higher level of a concept hierarchy.
- Slice for time = 2004. Slice selects one particular value of a dimension and produces a sub-cube.
- Roll up the patient dimension from individual patient to all patients, so that the result is grouped only by doctor. This is summarisation by dimension reduction.
Final output: total fee collected by each doctor in 2004.
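The slice and aggregation steps can be sketched with plain Python over hypothetical fact rows (the dates, doctor names, and fees below are illustrative, not from the question):

```python
# Hypothetical fact rows at the base-cuboid grain [day, doctor, patient]
visits = [
    ("2004-01-05", "Smith", "P1", 100),
    ("2004-01-05", "Jones", "P2", 150),
    ("2004-02-10", "Smith", "P3", 200),
    ("2003-12-01", "Jones", "P1", 120),  # outside 2004, removed by the slice
]

# Slice on time: keep only visits whose day falls in year 2004
in_2004 = [(doctor, fee) for day, doctor, _patient, fee in visits
           if day.startswith("2004")]

# Dimension reduction: aggregate away patient, grouping only by doctor
total_by_doctor = {}
for doctor, fee in in_2004:
    total_by_doctor[doctor] = total_by_doctor.get(doctor, 0) + fee

print(total_by_doctor)
# {'Smith': 300, 'Jones': 150}
```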
Question 5
The attributes of Employee in the source system are Employee id, Name, Salary, Designation, Joining date and Office location.
Assume you created a dimension table in the data warehouse for Employee and the initial load is performed. Give the dimension table structure.
Identify the type of changes (A) and (B) mentioned below and explain HOW the changes, once identified, will be implemented in the DWH and WHY? [2 Marks]
Show the before and after image of the dimension table. Additional fact: Salary of Developer is 10k and of a Lead is 40k. [3 Marks]
The uploaded question text does not include the actual descriptions of changes (A) and (B), so they cannot be identified exactly from the file. What can be answered confidently is the correct employee dimension structure and the SCD handling logic that should be applied once A and B are known.
Employee Dimension Structure
A suitable employee dimension table is:
| Emp_Key | Employee_ID | Name | Salary | Designation | Joining_Date | Office_Location | Start_Date | End_Date | Current_Flag |
|---|---|---|---|---|---|---|---|---|---|
Emp_Key is the surrogate key. The business key is Employee_ID. The extra columns Start_Date, End_Date, and Current_Flag are useful if Type 2 history needs to be maintained. Type 2 commonly uses a new surrogate key for each historical version together with administrative history columns.
How to Identify the Change Type
If the change is a correction of a wrong value, it is handled as Type 1 SCD. Type 1 overwrites the old value and does not preserve history. This is suitable for spelling corrections or cleanup changes.
If the change represents a real historical change such as employee transfer, promotion, or salary revision that must be preserved over time, it is handled as Type 2 SCD. Type 2 inserts a new row with a new surrogate key and keeps the old row for historical reporting.
Illustrative Before and After
Before change
| Emp_Key | Employee_ID | Name | Salary | Designation | Joining_Date | Office_Location | Start_Date | End_Date | Current_Flag |
|---|---|---|---|---|---|---|---|---|---|
| 101 | E001 | Ravi | 10000 | Developer | 2022-01-10 | Chennai | 2022-01-10 | NULL | Y |
If the change is Type 1
Suppose only a correction is needed, such as fixing office location spelling. The row is overwritten.
| Emp_Key | Employee_ID | Name | Salary | Designation | Joining_Date | Office_Location | Start_Date | End_Date | Current_Flag |
|---|---|---|---|---|---|---|---|---|---|
| 101 | E001 | Ravi | 10000 | Developer | 2022-01-10 | Chennai HQ | 2022-01-10 | NULL | Y |
If the change is Type 2
Suppose Ravi is promoted from Developer to Lead, and salary changes from 10k to 40k. The old row is closed and a new row is inserted.
| Emp_Key | Employee_ID | Name | Salary | Designation | Joining_Date | Office_Location | Start_Date | End_Date | Current_Flag |
|---|---|---|---|---|---|---|---|---|---|
| 101 | E001 | Ravi | 10000 | Developer | 2022-01-10 | Chennai | 2022-01-10 | 2025-09-20 | N |
| 205 | E001 | Ravi | 40000 | Lead | 2022-01-10 | Chennai | 2025-09-21 | NULL | Y |
This preserves history, which is the reason Type 2 is chosen when past values matter for reporting.
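A minimal sketch of the Type 2 update logic, using the rows from the tables above (the helper function and its parameters are illustrative, not a standard API):

```python
# Current dimension row, matching the "before" table above
dim = [{"Emp_Key": 101, "Employee_ID": "E001", "Name": "Ravi",
        "Salary": 10000, "Designation": "Developer",
        "Joining_Date": "2022-01-10", "Office_Location": "Chennai",
        "Start_Date": "2022-01-10", "End_Date": None, "Current_Flag": "Y"}]

def apply_scd_type2(dim, employee_id, changes, end_date, start_date, next_key):
    """Close the current row and insert a new version with a new surrogate key."""
    for row in dim:
        if row["Employee_ID"] == employee_id and row["Current_Flag"] == "Y":
            new_row = {**row, **changes, "Emp_Key": next_key,
                       "Start_Date": start_date, "End_Date": None,
                       "Current_Flag": "Y"}
            row["End_Date"] = end_date    # close the old version
            row["Current_Flag"] = "N"
            dim.append(new_row)
            return

# Ravi is promoted to Lead; salary changes from 10k to 40k
apply_scd_type2(dim, "E001", {"Salary": 40000, "Designation": "Lead"},
                end_date="2025-09-20", start_date="2025-09-21", next_key=205)
```

After the call, `dim` holds both rows: the closed Developer row (Current_Flag = N) and the new Lead row with surrogate key 205, exactly as in the "after" table.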
Question 5
Distinguish between
a) OLTP vs OLAP [1 Mark]
b) DWH vs DM [2 Marks]
c) ROLAP vs MOLAP. [2 Marks]
5(a) OLTP vs OLAP
| OLTP | OLAP |
|---|---|
| Supports day-to-day transactions | Supports analytical queries |
| Detailed and current data | Aggregated and historical data |
| Many short transactions | Fewer but complex queries |
| Highly normalised schema | Star or snowflake schema |
| Optimised for transaction speed | Optimised for analytical response time |
5(b) DWH vs DM
A Data Warehouse is an enterprise-wide integrated repository containing historical data from multiple subject areas. A Data Mart is a smaller, subject-oriented or department-oriented subset focused on a specific business process or analytical area. Rows in the bus matrix often correspond to data marts, and first-level data marts are derived from single primary source systems.
5(c) ROLAP vs MOLAP
| ROLAP | MOLAP |
|---|---|
| Stores data in standard relational tables | Stores data in proprietary multidimensional structures |
| Uses star or snowflake schemas | Uses pre-calculated cubes |
| Better scalability for large data volumes | Very fast query response |
| Slower for complex analytical operations | Better for slicing, dicing, and complex calculations |
| Easier drill-through to detail | Practical size limitations and extra investment |
Question 6
Describe the role of Meta Data in DWH and explain its types. [5 marks]
Answer
Metadata is data about data. In a data warehouse, metadata describes the warehouse structure, source-to-target mappings, business meanings, transformation logic, load history, and operational status. It supports integration, ETL control, report consistency, governance, and user understanding. ETL tools in the lesson explicitly refer to metadata management and business/process metadata management, which shows that metadata is central to warehouse operation.
Types of Metadata
1. Technical metadata
This describes structural and technical details such as table definitions, column names, data types, keys, indexes, schema structures, and ETL mappings. It is mainly used by developers, DBAs, and ETL designers. The lesson discusses schema integration, attribute naming inconsistency, and logical data maps, all of which belong to technical metadata.
2. Business metadata
This describes the business meaning of data elements, calculations, KPIs, report labels, and business definitions. It helps users understand what warehouse data means in business terms rather than technical terms. The lecture material on conformed dimensions, measures, and reporting logic depends on this business layer of meaning.
3. Operational metadata
This records operational details such as ETL run time, refresh history, load status, validation results, error logs, and job execution information. The ETL lesson discusses load verification, recovery from load failure, and ongoing ETL control, which are all operational metadata concerns.
Role of Metadata in DWH
Metadata is important because it:
- explains the structure and meaning of warehouse data,
- supports ETL mapping and transformation,
- records lineage from source to warehouse,
- supports load monitoring and recovery,
- improves governance and consistency,
- helps users trust and interpret reports correctly.