Q 1 a): A data engineer posits that “streaming data is often loosely structured.” Critically evaluate this assertion by presenting well-reasoned arguments either in support of or against the claim.

A: The assertion that “streaming data is often loosely structured” can be supported by several arguments.

  1. Variety and Volume: Streaming data often comes from various sources (e.g., sensors, social media, IoT devices), which means it can vary greatly in format and structure. Unlike traditional batch processing, which typically operates on structured data, streaming data may include unstructured or semi-structured elements such as JSON, XML, or plain text.
  2. Dynamic Nature: The dynamic nature of streaming data leads to variations in data types and schema. As new data flows continuously, it may not conform to a predefined schema, making it loosely structured. For instance, data from a live tweet feed can include hashtags, mentions, and URLs, which can change rapidly and are not uniformly structured.
  3. Flexibility in Processing: Loosely structured data allows for more flexibility in processing and real-time analytics. Data engineers often need to adapt to changing data landscapes, handling variations without predefined schemas using techniques like schema-on-read.

In summary, the assertion holds: streaming data is often loosely structured, and this very characteristic enables the flexibility, adaptability, and real-time processing capabilities essential to modern data-driven applications.
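
The schema-on-read approach mentioned in point 3 can be sketched as below; the record formats, field names, and hashtag extraction are illustrative assumptions, not tied to any particular system:

```python
import json

def read_record(raw: str) -> dict:
    """Schema-on-read: interpret a raw stream record only at consumption
    time, instead of enforcing a schema when the data is written."""
    raw = raw.strip()
    # Semi-structured: JSON payloads (e.g., clickstream or IoT events)
    if raw.startswith("{"):
        return {"format": "json", **json.loads(raw)}
    # Loosely structured: key=value pairs (e.g., sensor telemetry)
    if "=" in raw:
        fields = dict(pair.split("=", 1) for pair in raw.split())
        return {"format": "kv", **fields}
    # Unstructured: free text (e.g., a tweet); extract hashtags on the fly
    return {"format": "text",
            "text": raw,
            "hashtags": [w for w in raw.split() if w.startswith("#")]}

stream = [
    '{"user": "u1", "page": "/home"}',
    "sensor=t1 temp=21.5",
    "Goal! What a match #football",
]
records = [read_record(r) for r in stream]
```

Note that no schema is declared anywhere: each consumer decides how to interpret the bytes, which is exactly what makes loosely structured streams workable.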

Q1 b): List two benefits of Sequential nodes of Zookeeper.

A:

  1. Ordering Guarantee: Sequential nodes (znodes) in ZooKeeper are created with a monotonically increasing counter, so their names reflect the exact order in which they were created. This is crucial for scenarios where the order of operations matters, such as distributed locking mechanisms, ensuring that processes access shared resources in a consistent manner.
  2. Unique Identification: Each sequential node is assigned a unique and monotonically increasing number appended to its name. This enables easy identification and tracking of nodes, facilitating their use in coordinating distributed applications, such as leader election and task assignment.
  3. Leader Selection: Sequential nodes can be used to implement leader selection algorithms in distributed systems. By creating a sequential node for each instance, the system can easily identify the node with the lowest number as the leader, facilitating coordination among distributed processes.
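
The lowest-sequence-number rule behind leader selection can be sketched in plain Python. The `SequentialNodeRegistry` class below is a toy stand-in for a ZooKeeper parent znode, not a real client (a production system would use a library such as kazoo):

```python
class SequentialNodeRegistry:
    """Toy stand-in for a ZooKeeper parent znode handing out sequential
    children; illustrates the naming and leader-election rule only."""
    def __init__(self):
        self._counter = 0
        self.nodes = {}          # node name -> owning process

    def create_sequential(self, prefix: str, owner: str) -> str:
        # ZooKeeper appends a zero-padded, monotonically increasing
        # 10-digit counter to the requested name.
        name = f"{prefix}{self._counter:010d}"
        self._counter += 1
        self.nodes[name] = owner
        return name

    def leader(self) -> str:
        # Leader election rule: the owner of the lowest-numbered node wins.
        return self.nodes[min(self.nodes)]

registry = SequentialNodeRegistry()
for worker in ["worker-a", "worker-b", "worker-c"]:
    registry.create_sequential("/election/n_", worker)
```

Because the counter never repeats, ties are impossible, and every process can determine the leader locally just by listing the children and sorting their names.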

Q1 c): Explain the most common interaction patterns associated with the data ingestion layer with an example.

A: The most common interaction patterns associated with the data ingestion layer include:

  • Push pattern (often implemented with pub‑sub)
    In this pattern, the source system sends data to the ingestion layer as soon as it is available. In many modern architectures, this is done through a pub‑sub message broker such as Kafka, Pulsar, or Google Pub/Sub.
    Example: A web application publishes user‑click events to a Kafka topic, and an ingestion pipeline subscribes to that topic to continuously write data into a data lake or warehouse.
  • Pull pattern
    Here, the ingestion layer actively connects to the source on a schedule and fetches data.
    Example: A daily ETL job pulls updated customer records from a relational database and loads them into a data warehouse.
  • Polling / near‑real‑time pattern
    The ingestion layer repeatedly checks the source at short intervals (for example, every few seconds) for new data, usually using a timestamp or offset.
    Example: A process polls an IoT gateway every few seconds for sensor readings with timestamps greater than the last processed value and appends them to a streaming table.

These interaction patterns enable organizations to utilize data efficiently, depending on their specific needs for timeliness and volume.
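
The polling pattern can be sketched as below; `IoTGateway`, `fetch_since`, and `poll_once` are hypothetical names used only for illustration:

```python
class IoTGateway:
    """Hypothetical source system holding timestamped sensor readings."""
    def __init__(self):
        self.readings = []       # list of (timestamp, value)

    def fetch_since(self, last_ts: int):
        # Return only readings newer than the caller's offset.
        return [r for r in self.readings if r[0] > last_ts]

def poll_once(gateway: IoTGateway, last_ts: int):
    """One polling cycle: fetch new readings, then advance the offset
    to the newest timestamp seen so nothing is fetched twice."""
    new = gateway.fetch_since(last_ts)
    if new:
        last_ts = max(ts for ts, _ in new)
    return new, last_ts

gateway = IoTGateway()
gateway.readings = [(1, 20.1), (2, 20.4), (3, 20.9)]
batch1, offset = poll_once(gateway, 0)       # picks up all three readings
gateway.readings.append((4, 21.3))
batch2, offset = poll_once(gateway, offset)  # picks up only the new one
```

The offset (here a timestamp, in Kafka a numeric offset) is what makes repeated polling efficient: each cycle touches only data that arrived since the previous cycle.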

Q2. Consider a real estate portal say, 150hectares.com, which allows users to list and view the various types of properties on the portal. The inbuilt data processing system captures the events associated with the visitor interaction with the portal so that the business owners can infer customer location, preferences and investment capacities.  The insights on customer reviews, locality and price trends etc., will be showcased and the firm plans to propose suitable housing solutions to the customers with competitive pricing based on the site exploration. List out various workloads in the use case and propose an architecture with diagram, clearly discussing the data processing class. Discuss various activities associated with the components in the architecture pertaining to the use case.

Real Estate Portal Data Processing Architecture

In the context of a real estate portal like 150hectares.com, various workloads can be identified based on user interactions and data processing requirements. Below is a comprehensive overview, along with a proposed architecture and detailed activities associated with each component.

Workloads

  1. User Interaction Tracking
    • Capturing user clicks, page views, and interaction patterns to analyze preferences.
  2. Data Ingestion
    • Collecting data from user interactions, property listings, and customer reviews in real-time.
  3. Data Storage
    • Storing structured and unstructured data from various sources for easy retrieval and analysis.
  4. Data Processing and Analytics
    • Processing the ingested data to uncover insights related to customer preferences, locality trends, and pricing.
  5. Machine Learning
    • Implementing models to predict customer behavior and recommend suitable properties based on past interactions.
  6. Reporting and Dashboarding
    • Providing visual insights on trends, user behavior, and property performance to stakeholders.

Proposed Architecture

            +---------------+
            | User Frontend |
            +-------+-------+
                    |
                    v
            +----------------+
            | Data Ingestion |
            +-------+--------+
                    |
          +---------+----------+
          |                    |
          v                    v
+-------------------+  +----------------------+
|   Data Storage    |  | Streaming Processing |
|   (Data Lake /    |  +----------+-----------+
|  Data Warehouse)  |             |
+---------+---------+             v
          |            +----------------------+
          |            |      Analytics       |
          |            |   (ML, Reporting)    |
          |            +----------+-----------+
          |                       |
          |                       v
          |            +----------------------+
          |            |    Visualization &   |
          +----------->|    Reporting Layer   |
                       +----------------------+

Component Activities

  1. User Frontend
    • Users interact with listings, submit inquiries, and leave reviews.
    • This layer captures user events via JavaScript or similar technologies.
  2. Data Ingestion Layer
    • Collects real-time data from the user frontend, leveraging tools like Apache Kafka for streaming or batch processes.
    • Ingests data such as user interactions (clicks, searches), property listings, and reviews.
  3. Data Storage Layer
    • Stores data in a structured format (e.g., relational databases) and an unstructured format (data lakes).
    • Uses technologies like Amazon S3 for data lakes and PostgreSQL for structured data storage.
  4. Streaming Processing Layer
    • Processes incoming data streams to enrich and transform data for immediate insights or later use.
    • This may involve real-time aggregations and processing using tools like Apache Flink or Spark Streaming.
  5. Analytics Layer
    • Utilizes machine learning models to analyze preferences, recommend properties, and generate insights on trends.
    • This encompasses training models on historical data and applying them to current interaction streams.
  6. Visualization & Reporting Layer
    • Develops dashboards to represent insights graphically for business stakeholders.
    • Uses BI tools like Tableau or Power BI for reporting on performance metrics and customer analysis.
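
The hand-off from the frontend to the ingestion layer can be sketched as below, with an in-process queue standing in for a message broker such as Kafka; all names and fields are illustrative:

```python
import json
import queue
import time

event_bus = queue.Queue()   # stand-in for a Kafka topic

def capture_event(user_id: str, event_type: str, property_id: str) -> dict:
    """Frontend layer: wrap a user interaction as an event and publish it."""
    event = {
        "user_id": user_id,
        "event": event_type,          # e.g. "view", "inquiry", "review"
        "property_id": property_id,
        "ts": time.time(),
    }
    event_bus.put(json.dumps(event))  # publish to the ingestion layer
    return event

def ingest_batch() -> list:
    """Ingestion layer: drain currently available events for storage
    and downstream stream processing."""
    batch = []
    while not event_bus.empty():
        batch.append(json.loads(event_bus.get()))
    return batch

capture_event("u42", "view", "prop-101")
capture_event("u42", "inquiry", "prop-101")
batch = ingest_batch()
```

In the real architecture the producer and consumer would run in separate services, with the broker providing durability and replay between them.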

Conclusion

The proposed architecture supports a comprehensive approach to processing data for a real estate portal. By managing user interactions effectively and leveraging analytics, stakeholders can gain valuable insights into customer behavior and market trends, ultimately aiding in crafting suitable housing solutions and competitive pricing. Each component’s activities contribute to building a robust system for data-driven decision-making.

Q3. (a) Consider a stream processing system design, where the maximum throughput among the producers in the system is 5MB/sec. Given that it can support 14 consumers and the expected throughput for the system is 42GB/minute. Estimate the number of producers that can be associated with the system.

Part 1: Estimate the number of producers

Given:

  • Maximum throughput per producer: 5 MB/sec
  • System supports 14 consumers
  • Expected system throughput: 42 GB/min

Step 1: Convert the system throughput to MB/sec.

42 GB/min = 42,000 MB/min
42,000 MB / 60 s = 700 MB/sec

Step 2: Each producer can send 5 MB/sec, so the number of producers needed is:

Number of producers = 700 MB/sec ÷ 5 MB/sec = 140

Answer:
The system can be associated with 140 producers.
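
The arithmetic can be verified in a few lines (note that the 14 consumers do not enter the producer calculation):

```python
# Expected throughput: 42 GB/min; each producer delivers at most 5 MB/sec.
GB_TO_MB = 1000                              # 1 GB = 1000 MB, as in the answer

system_mb_per_min = 42 * GB_TO_MB            # 42,000 MB/min
system_mb_per_sec = system_mb_per_min / 60   # 700 MB/sec
producers = system_mb_per_sec / 5            # producers needed at 5 MB/sec each
```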

(b) In a streaming system, the average message size is 32kB. Discuss storage requirements for dataflow layer for each of the following scenarios considering the data ingestion rate (Ri) and data processing rate (Rp) are given below:
(i) Ri = 360 MB/hr, Rp = 30 MB/min  (ii) Ri = 180 MB/min, Rp = 36 MB/hr

The storage requirement in the dataflow layer depends on the mismatch between data ingestion rate and data processing rate. If ingestion rate is greater than processing rate, backlog accumulates and additional storage is required. If processing rate is greater than ingestion rate, backlog does not build up and only transient buffering is needed.

Average message size = 32 kB = 0.032 MB.
We assume the dataflow layer needs to store all unprocessed data until the processor "catches up", i.e., the backlog grows at a rate of Ri − Rp during the period of imbalance.

We’ll work in MB/min for both cases.
(i) Ri = 360 MB/hr, Rp = 30 MB/min

Convert Ri to MB/min:

Ri = 360 MB / 60 min = 6 MB/min
Rp = 30 MB/min

Since Rp > Ri, the processing rate is higher than the ingestion rate.
So the system can keep up and there is no accumulating backlog over time.

Storage requirement:

Since the processing rate (30 MB/min) is five times the ingestion rate (6 MB/min), the system processes incoming data faster than it arrives, so no persistent backlog accumulates in the dataflow layer. Only minimal temporary buffering (e.g., in-memory queues or small disk buffers) is needed for message queuing and transmission smoothing.

(ii) Ri = 180 MB/min, Rp = 36 MB/hr

Convert Rp to MB/min:

Rp = 36 MB / 60 min = 0.6 MB/min
Ri = 180 MB/min

Now Ri ≫ Rp, so a large backlog accumulates over time.

Backlog growth per minute:

Backlog rate = Ri − Rp = 180 − 0.6 = 179.4 MB/min

With an average message size of 32 kB (0.032 MB), this corresponds to roughly 179.4 / 0.032 ≈ 5,606 backlogged messages per minute.

If the system runs for t minutes, the unprocessed backlog is:

Storage needed ≈ 179.4 × t MB

Discussion:

  • The dataflow layer needs significant storage for the backlog.
    • The backlog, and hence the required storage, grows linearly over time if the processing rate does not increase.
  • Practical designs would either:
    • Scale up the processor (increase Rp), or
    • add sliding retention windows (e.g., drop data older than N minutes) so the layer stores only a bounded amount of recent data.
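
Both scenarios can be checked with a small helper; the function name and the 60-minute horizon below are arbitrary choices for illustration:

```python
def backlog_after(ri_mb_per_min: float, rp_mb_per_min: float,
                  minutes: float) -> float:
    """Backlog (MB) after `minutes`, assuming unprocessed data is retained.
    If the processor keeps up (Rp >= Ri), the backlog stays at zero."""
    return max(0.0, (ri_mb_per_min - rp_mb_per_min) * minutes)

# Scenario (i): Ri = 360 MB/hr = 6 MB/min, Rp = 30 MB/min -> no backlog
case_i = backlog_after(360 / 60, 30, minutes=60)

# Scenario (ii): Ri = 180 MB/min, Rp = 36 MB/hr = 0.6 MB/min
# -> backlog grows at 179.4 MB/min
case_ii = backlog_after(180, 36 / 60, minutes=60)
```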

Q4. Discuss the aspects of message delivery semantics for the following use cases: A live streaming platform broadcasts football match scores to millions of users. Scores change frequently (e.g., when goals are scored or match time updates). It has provision to refresh the screen periodically. An online shopping platform records a customer’s order once they confirm the checkout process.

Message Delivery Semantics

Message delivery semantics mainly include at-most-once, at-least-once, and exactly-once delivery.

At-most-once means a message may be lost but is never redelivered.
At-least-once means a message is never lost, but duplicates may occur.
Exactly-once means each message is delivered and processed once and only once.
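
In practice, the difference between these guarantees often shows up as duplicate handling. A common pattern, sketched below without reference to any particular broker, is at-least-once delivery paired with a consumer that deduplicates on a message ID:

```python
class IdempotentConsumer:
    """Under at-least-once delivery, producer retries can create duplicates;
    deduplicating on a message ID yields effectively-once processing."""
    def __init__(self):
        self.seen_ids = set()
        self.processed = []

    def handle(self, msg_id: str, payload: str) -> bool:
        if msg_id in self.seen_ids:
            return False              # duplicate from a retry: ignore it
        self.seen_ids.add(msg_id)
        self.processed.append(payload)
        return True

consumer = IdempotentConsumer()
# The broker redelivers m2 because the first acknowledgement was lost.
for msg_id, payload in [("m1", "score 1-0"), ("m2", "score 2-0"),
                        ("m2", "score 2-0")]:
    consumer.handle(msg_id, payload)
```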

Below, we discuss the key aspects of message delivery semantics for both scenarios.

1. Live Streaming Platform (Football Match Scores)

Use Case: A live streaming platform broadcasts football match scores to millions of users, with frequent score updates:

  • Delivery Guarantees: Because each update carries the latest score, which supersedes all earlier values, and the platform refreshes the screen periodically, at-most-once delivery is generally sufficient here. An occasionally lost update is corrected at the next refresh, whereas enforcing at-least-once or exactly-once delivery to millions of users would add acknowledgement and retry overhead that the use case does not justify.
  • Message Ordering: The scores must appear in the order they happen (e.g., a goal scored at the 30th minute should be reported before one scored at the 45th minute). Ensuring message ordering is critical for the integrity of the information being conveyed.
  • Latency Requirements: Low latency is essential. Users expect real-time updates; delays in score reporting can lead to dissatisfaction. Hence, the platform should prioritize rapid message delivery, ideally in real-time with very minimal delays.
  • Refresh Mechanism: The screen refresh provision allows users to get the latest updates; hence, the application should handle the refreshing of the display without missing any significant updates. Utilizing techniques like windowing can help in grouping updates to optimize network utilization while ensuring timely delivery.

Windowing in Stream Processing

What is Windowing?

Windowing is a technique used in stream processing to group a continuous stream of data into manageable segments known as “windows.” These windows allow for the organization of data based on specific criteria, such as time or count, enabling the processing of data in discrete chunks rather than as an unbounded stream. Windowing facilitates the application of aggregations, computations, and analytics on the grouped data over defined intervals.

Types of Windows

  1. Time-based Windows:
    • Tumbling Window: Fixed-size, non-overlapping windows that do not allow for any data to be shared between windows. For example, a window that captures data every minute.
    • Sliding Window: Overlapping windows that slide over the data stream, allowing data points to be part of multiple windows. For example, a one-minute window that advances every 30 seconds, so each event falls into two consecutive windows.
    • Session Window: Groups events that are related to each other within a defined inactivity period. If no event occurs within a specified duration, the window closes.
  2. Count-based Windows: Windows are defined by the number of events rather than time. For example, a window that processes every 100 events.
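
A tumbling window, the simplest of the types above, can be sketched as follows (timestamps in seconds, counts per fixed 60-second window):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds: int) -> dict:
    """Group timestamped events into fixed, non-overlapping windows
    and count the events in each one."""
    counts = defaultdict(int)
    for ts, _payload in events:
        # Each event belongs to exactly one window, keyed by its start time.
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Click events as (timestamp_in_seconds, payload)
events = [(2, "a"), (15, "b"), (59, "c"), (61, "d"), (118, "e")]
per_minute = tumbling_window_counts(events, window_seconds=60)
```

A sliding window would differ only in assigning each event to every window whose interval covers its timestamp, rather than to exactly one.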

Where is Windowing Useful?

  1. Real-time Analytics: Windowing is essential in real-time data analysis applications, such as monitoring streaming metrics (e.g., user clicks, sensor data) to perform calculations like averages or sums over specific time frames.
  2. Event Aggregation: In scenarios where data arrives continuously, windowing helps in aggregating events for reporting purposes. For instance, tracking total sales per minute or the number of active users per hour.
  3. Handling Late Data: With windowing, late-arriving data can still be processed by defining windows that allow adjustments or corrections based on the arrival of new data after the initial calculation.
  4. Stateful Computations: In stateful stream processing architectures, windowing enables the maintenance of intermediate states, which are essential for operations like joins, group by, and maintaining historical context.
  5. Batch-Inspired Operations: Windowing allows streaming systems to apply batch-like operations over streams, making it suitable for scenarios where batch processing logic is required for real-time data.

Conclusion

Windowing is a powerful concept that enhances the capability of stream processing systems by providing a structured approach to manage and analyze continuous data streams effectively. Its application across various domains, including finance, IoT, and web analytics, showcases its significance in extracting meaningful insights from transient data.

2. Online Shopping Platform (Order Confirmation)

Use Case: An online shopping platform records a customer’s order once they confirm the checkout process:

  • Delivery Guarantees: This use case typically employs exactly-once delivery semantics. It is crucial that each confirmed order is recorded only once to prevent duplicate orders or, worse, financial discrepancies. The system must ensure that every transaction is transactionally consistent.
  • Message Ordering: Events within a single checkout flow (order creation, payment authorization, confirmation) should be processed in strict order, so that the recorded order reflects the correct sequence of operations and item details.
  • Latency Requirements: While low latency is important, the application can afford slightly higher delays in comparison to live streaming. Order confirmation messages need to reach the server, but instantaneous feedback to the user could be achieved through client-side notifications and buffering.
  • Transactional Integrity: The order processing system should ensure robust transaction handling, confirming order records and payment processes collectively. Implementing mechanisms for retries and compensations in case of failures can enhance reliability.
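
One standard way to approximate exactly-once order recording is an idempotency key generated once at checkout and reused on client retries; the sketch below uses hypothetical names:

```python
import uuid

class OrderStore:
    """Sketch of exactly-once order recording via an idempotency key.
    The client generates one key per checkout and reuses it on retries."""
    def __init__(self):
        self._by_key = {}     # idempotency key -> order id

    def record_order(self, idempotency_key: str, cart: list) -> str:
        # A retried request with the same key returns the existing order
        # instead of creating a duplicate.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        order_id = f"order-{len(self._by_key) + 1}"
        self._by_key[idempotency_key] = order_id
        return order_id

store = OrderStore()
key = str(uuid.uuid4())                      # generated once at checkout
first = store.record_order(key, ["sofa"])
retry = store.record_order(key, ["sofa"])    # network timeout -> client retried
```

The underlying delivery may still be at-least-once; the key-based lookup is what makes the observed effect exactly-once.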

Conclusion

Both use cases highlight different priorities in message delivery semantics. The live streaming platform prioritizes speed and delivery frequency in a real-time context, while the online shopping platform emphasizes transactional accuracy and consistency. Understanding the specific requirements and semantics of each use case is crucial for designing effective communication protocols and architectures.
