Introduction

Apache Kafka is a powerful and versatile distributed messaging system designed to handle the demands of modern data processing. It enables seamless communication between applications by facilitating the transfer of messages through various messaging models. With its robust architecture, Kafka supports high throughput and low latency, making it suitable for real-time data pipelines and analytics. This post explores the fundamental concepts of Kafka, including its messaging models, components, and workflows.

Messaging Systems

A messaging system is a mechanism for transferring data from one application to another. It allows an application to concentrate on its core business logic while the messaging system handles data transfer and sharing. This separation improves modularity and supports communication between distributed components. Two common messaging models are point-to-point messaging and publish-subscribe messaging.

Point-to-Point Messaging System

In a point-to-point messaging system, messages are stored in a queue. One or more consumers may be connected to the queue, but each message is consumed by only one consumer. Once a consumer reads the message, that message is removed from the queue. This model is useful when each piece of work should be processed only once.

Publish-Subscribe Messaging System

In a publish-subscribe system, messages are stored in a topic rather than a queue. Multiple consumers can read the same message at the same time. Message producers are called publishers, and message consumers are called subscribers. This model is suitable when the same data must be delivered to many consumers simultaneously.
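The difference between the two models can be illustrated with a small simulation (toy Python code, not a Kafka API): in point-to-point, each message is delivered to exactly one consumer and then removed from the queue, while in pub-sub every subscriber receives every message.

```python
from collections import deque

# Point-to-point: a shared queue; each message goes to exactly one consumer.
# Here messages are handed out round-robin for illustration.
def point_to_point(messages, num_consumers):
    queue = deque(messages)
    received = {c: [] for c in range(num_consumers)}
    c = 0
    while queue:
        received[c].append(queue.popleft())  # message is removed once read
        c = (c + 1) % num_consumers
    return received

# Publish-subscribe: a topic; every subscriber sees every message.
def pub_sub(messages, num_subscribers):
    return {s: list(messages) for s in range(num_subscribers)}
```

With three messages and two consumers, point-to-point splits the work, whereas pub-sub gives both subscribers the full stream.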

Apache Kafka

  • Apache Kafka is a distributed publish-subscribe and queuing messaging system designed for high-throughput distributed environments.
  • It is suitable for large-scale data processing and supports both offline and online message transfer.
  • Messages are persisted on disk and replicated across the cluster to reduce the risk of message loss.
  • Kafka also integrates easily with systems such as Storm and Spark.

Benefits of Kafka

Kafka provides several benefits:

  • It provides reliability because it is distributed, partitioned, replicated, and fault tolerant.
  • It provides scalability because machines can be added to the cluster without downtime.
  • It provides durability because messages are persisted to disk quickly.
  • It offers high performance by supporting high throughput for both publishing and subscribing to messages.

These properties make Kafka suitable for large-scale real-time data systems.

Kafka Fundamentals

The main concepts in Kafka are producers, topics, consumers, brokers, partitions, replicas, leaders, and followers.

  • Producers send messages.
  • Topics organize messages into categories.
  • Consumers read messages.
  • Brokers store and manage published data.
  • Partitions divide a topic into ordered units of storage and processing.
  • Replicas serve as backups of partitions.
  • Leaders handle read and write operations for a partition.
  • Followers replicate the leader’s data. They can take over if the leader fails.

Topics

A topic is a stream of messages belonging to a particular category. Data is stored in topics. Topics are divided into partitions, and Kafka keeps at least one partition for each topic. Each partition contains messages in an immutable ordered sequence.

A partition is implemented as a set of segment files of equal size.

Topics allow Kafka to organize data logically while partitions allow it to scale physically.

Partitions

A topic may contain many partitions so that it can handle an arbitrary amount of data. Every message in a partition has a unique sequence identifier called an offset.

Replicas act as backups of partitions. These replicas are not used for normal reading or writing; their purpose is to prevent data loss.

Partitioning allows Kafka to distribute workload across brokers and improve scalability.
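The append-only, offset-indexed structure of a partition can be sketched as a toy class (illustrative only, not the real implementation):

```python
# A toy partition: an append-only log in which each message is assigned
# the next sequential offset, and messages are read back by offset.
class Partition:
    def __init__(self):
        self.log = []  # index in this list == offset

    def append(self, message):
        offset = len(self.log)
        self.log.append(message)
        return offset

    def read(self, offset):
        return self.log[offset]
```

Appending never modifies earlier entries, which mirrors the immutable ordered sequence described above.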

Brokers

A broker is a Kafka server responsible for maintaining published data. Each broker may contain zero or more partitions for a topic.

Brokers help maintain load balance in the cluster and are stateless, so they rely on ZooKeeper to maintain cluster state.

If there are N partitions and N brokers, each broker can hold one partition. If the number of partitions and brokers differs, partitions are distributed across brokers accordingly. Brokers are the central storage and management units of Kafka.

Producers

A producer is the publisher of messages to one or more Kafka topics. Producers send data to Kafka brokers. When a producer publishes a message, the broker appends it to the last segment file of the relevant partition. Producers may also choose the partition to which the message should be sent. Depending on the acknowledgment setting, producers can send messages without waiting for broker acknowledgments, allowing them to publish as fast as the broker can handle.
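Partition selection is commonly done by hashing the message key, so that all messages with the same key land in the same partition and stay ordered relative to each other. A minimal sketch (using MD5 here for simplicity; the Java client's default partitioner actually uses murmur2):

```python
import hashlib

# Deterministic key -> partition mapping, analogous in spirit to Kafka's
# default key-hash partitioning. Same key always yields the same partition.
def choose_partition(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Because the mapping is deterministic, repeated sends with the same key are routed consistently.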

Consumers

Consumers read data from brokers. A consumer subscribes to one or more topics and consumes published messages by pulling data from the brokers. Brokers are stateless regarding consumer progress. Therefore, the consumer must keep track of how many messages have been consumed by using partition offsets. If a consumer acknowledges a certain offset, it means all previous messages have been consumed. Consumers may also rewind or skip to any point in a partition by supplying an offset value. In older Kafka versions, this offset information was stored in ZooKeeper.
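Because brokers are stateless about consumer progress, the consumer itself tracks the next offset to read. A toy sketch of this position tracking, including rewinding with a seek (illustrative, not the real client API):

```python
# A toy consumer that tracks its own position in a partition's log.
# The "broker" (the plain list) stores nothing about the consumer's progress.
class Consumer:
    def __init__(self, log):
        self.log = log        # the partition's message list
        self.position = 0     # next offset to read

    def poll(self, max_messages=1):
        batch = self.log[self.position:self.position + max_messages]
        self.position += len(batch)
        return batch

    def seek(self, offset):
        # Rewind or skip to an arbitrary offset in the partition.
        self.position = offset
```

Calling seek(0) after consuming replays the partition from the beginning, which is exactly the rewind behavior described above.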

Kafka Cluster

A Kafka system containing more than one broker is called a Kafka cluster. A cluster can be expanded without downtime. Kafka clusters are used to manage message persistence and replication across multiple servers. This cluster-based design supports both high availability and scalability.

Leader and Follower Nodes

For every partition, one server acts as the leader. The leader is responsible for all read and write operations for that partition. Other servers act as followers. A follower obeys the leader’s instructions, pulls messages from the leader, and updates its own data store. If the leader fails, one of the followers automatically becomes the new leader. This arrangement supports fault tolerance and continuity of service.
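The failover rule can be sketched as a toy election: replicas are considered in preference order, and the first one still alive becomes the leader (illustrative only; real Kafka elects leaders from the in-sync replica set via the cluster controller).

```python
# Toy leader election for one partition: given the partition's replicas in
# preference order and the set of brokers currently alive, pick the first
# live replica as leader. Returns None if no replica is available.
def elect_leader(replicas, alive):
    for r in replicas:
        if r in alive:
            return r
    return None
```

If the current leader broker drops out of the alive set, re-running the election promotes the next live follower.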

Role of ZooKeeper in Kafka

ZooKeeper is used to manage and coordinate Kafka brokers. It notifies producers and consumers when a new broker joins the system or when an existing broker fails. Based on these notifications, producers and consumers can coordinate their work with another available broker. ZooKeeper therefore plays a key coordination role in the functioning of Kafka clusters.

Kafka Workflow

Kafka can be understood as a collection of topics split into one or more partitions. A partition is a linearly ordered sequence of messages, and each message is identified by its index or offset. All data in a Kafka cluster is the disjoint union of partitions. Incoming messages are written at the end of a partition, and consumers read them sequentially. Kafka combines pub-sub and queue-based messaging in a fast, reliable, persistent, fault-tolerant, and zero-downtime manner.

Workflow of Pub-Sub Messaging

In pub-sub mode, the workflow proceeds as follows:

  • Producers send messages to a topic at regular intervals.
  • Kafka brokers store the messages in the partitions configured for that topic, distributing them among the partitions.
  • A consumer subscribes to the topic and receives the current offset, which is stored in ZooKeeper.
  • The consumer requests new messages regularly; Kafka forwards the messages, and the consumer processes them and sends an acknowledgment.
  • Kafka updates the offset value and stores the new value in ZooKeeper.
  • Because offsets are maintained, the consumer can continue correctly even after server outages.
  • The cycle continues until the consumer stops requesting messages. A consumer may also rewind or skip to a chosen offset and continue reading from that point.

Workflow of Queue Messaging and Consumer Groups

In queue-based operation, a group of consumers with the same Group ID subscribes to a topic.

Consumers with the same Group ID are treated as a single group, and messages are shared among them. Initially, a single consumer may subscribe to a topic. Kafka interacts with this single consumer as it would in pub-sub mode.

When another consumer with the same Group ID joins, Kafka changes to share mode. It then divides the messages between the consumers. This sharing continues until the number of consumers equals the number of partitions for that topic.

If the number of consumers exceeds the number of partitions, the extra consumers remain idle. They do not receive messages until an existing consumer unsubscribes.
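The sharing behavior can be sketched as a toy round-robin assignment of partitions to the consumers in a group (illustrative; real Kafka uses pluggable assignor strategies such as range or round-robin):

```python
# Toy consumer-group assignment: each partition goes to exactly one consumer
# in the group, round-robin. Consumers beyond the partition count get
# nothing, i.e. they remain idle until a rebalance frees a partition.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment
```

With a single consumer in the group, it receives every partition (the pub-sub-like case); with more consumers than partitions, the surplus consumers end up with empty assignments.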

Kafka Server Properties File

Kafka server operation depends on configuration values stored in the server properties file. Important properties include the broker identifier, the port on which the socket server listens, and the directories in which log files are stored. Each broker must have a unique broker ID. These configuration values define the identity and operating parameters of a Kafka broker.
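An illustrative excerpt of such a file is shown below; the property names are real Kafka broker settings, but the values are placeholders.

```properties
# Illustrative excerpt of a server.properties file; values are examples.

# Unique identifier for this broker in the cluster
broker.id=0

# Address and port on which the socket server listens
listeners=PLAINTEXT://localhost:9092

# Directories in which partition log segment files are stored
log.dirs=/tmp/kafka-logs

# ZooKeeper connection string used for cluster coordination
zookeeper.connect=localhost:2181
```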

Kafka Command Line Interface

Kafka provides command-line tools for server management, topic lifecycle management, and message production and consumption. The ZooKeeper server can be started using the ZooKeeper configuration file. The Kafka server can be started using the Kafka server properties file. Topic lifecycle commands support creating, listing, updating, and deleting topics. Separate console tools are used for producers and consumers. These tools are important for testing, administration, and basic interaction with Kafka.
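For illustration, typical invocations look like the following (run from the Kafka installation directory; the topic name and addresses are examples, and older releases used a --zookeeper flag where newer ones use --bootstrap-server):

```shell
# Start ZooKeeper, then the Kafka broker
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties

# Topic lifecycle: create and list topics
bin/kafka-topics.sh --create --topic demo --bootstrap-server localhost:9092 \
    --partitions 3 --replication-factor 1
bin/kafka-topics.sh --list --bootstrap-server localhost:9092

# Console producer and consumer for basic interaction
bin/kafka-console-producer.sh --topic demo --bootstrap-server localhost:9092
bin/kafka-console-consumer.sh --topic demo --from-beginning \
    --bootstrap-server localhost:9092
```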

Kafka Java APIs

Kafka also provides Java APIs for building producers and consumers programmatically. The slide set lists this topic as part of the agenda but does not expand on it in detail. The broader workflow makes clear that Kafka applications use programmatic interfaces to connect to brokers, publish messages to topics, subscribe to topics, and consume messages using offsets.

Conclusion

Kafka is a distributed messaging system that combines the strengths of publish-subscribe and queue-based communication. Its main architectural ideas are topics, partitions, brokers, producers, consumers, replication, leader-follower coordination, and offset-based consumption. Kafka is reliable, scalable, and durable: it persists messages to disk, distributes data across brokers, replicates partitions, and supports consumer groups, all while sustaining high throughput. A strong exam answer should clearly explain the difference between point-to-point and pub-sub messaging, define the key Kafka components, and describe how the Kafka workflow operates in both pub-sub and queue-based modes.

Q&A for Apache Kafka

Q1: What is Apache Kafka?

A1: Apache Kafka is a distributed messaging system designed for high-throughput and low-latency data processing. It facilitates communication between applications by transferring messages through various messaging models like publish-subscribe and queuing.

Q2: What are the two common messaging models?

A2: The two common messaging models are point-to-point messaging and publish-subscribe messaging.

Q3: How does a point-to-point messaging system work?

A3: In a point-to-point messaging system, messages are stored in a queue, and each message is consumed by only one consumer. Once a consumer reads a message, it is removed from the queue. This ensures that each piece of work is processed only once.

Q4: What distinguishes a publish-subscribe messaging system from point-to-point?

A4: In a publish-subscribe system, messages are stored in a topic instead of a queue. This setup allows multiple consumers to read the same message simultaneously. Publishers send messages, while subscribers receive them.

Q5: What are the main components of Apache Kafka?

A5: The main components of Apache Kafka include producers, topics, consumers, brokers, partitions, replicas, leaders, and followers.

Q6: What is the role of a producer in Kafka?

A6: A producer publishes messages to one or more Kafka topics. It sends data to Kafka brokers, which append each message to the relevant partition.

Q7: How do consumers interact with Kafka?

A7: Consumers read messages from brokers by subscribing to one or more topics. They pull messages from the brokers and keep track of which messages have been consumed using partition offsets.

Q8: What is a Kafka cluster?

A8: A Kafka cluster consists of multiple brokers that work together to manage message persistence and replication. It can be expanded without downtime, supporting high availability and scalability.

Q9: What is the function of ZooKeeper in Kafka?

A9: ZooKeeper is used to manage and coordinate Kafka brokers. It notifies producers and consumers about broker availability, ensuring that message processing continues smoothly.

Q10: How does Kafka achieve fault tolerance?

A10: Kafka achieves fault tolerance through its leader-follower architecture. One broker acts as the leader for each partition. It manages all read and write operations. If the leader fails, a follower automatically takes over, ensuring continuity of service.

Q11: What are message offsets, and why are they important?

A11: Message offsets are unique sequence identifiers assigned to each message in a partition. They allow consumers to track which messages have been read. Consumers can also rewind or skip to any point in a partition.

Q12: How does the workflow differ between pub-sub messaging and queue-based messaging in Kafka?

A12: In pub-sub messaging, multiple consumers can read the same message from a topic. In queue-based messaging, consumers within the same Group ID share messages. Each message is processed by one consumer only. This allows for load balancing among consumers.
