Introduction
This article is part of a set of three, whose aim is to introduce Kafka and how to use it.
The three parts are:
There is a Git repository that comes along with them here.
This part explains what Kafka is and how it works.
A Bit of History
Kafka was created by LinkedIn to track user activity for later analysis. For some kinds of analysis, the order in which a user accesses parts of a site matters, for example when understanding how a user navigates the site or how an attacker abuses it. But as LinkedIn's user adoption grew, so did the need for speed and reliability in its systems, and adopting a Service Oriented Architecture (SOA) helped overcome that.
LinkedIn's Service Oriented Architecture [1]
But an ever-growing number of services creates the challenge of managing all the data pipelines between them. So the engineers decided to make Kafka the central point for managing data flows.
Fig. 1 - Kafka as a central point for data pipeline [1]
How it Works
Kafka is part of the general family of technologies known as queueing, messaging, or streaming engines. The concept works this way: a set of services (Producers) produce data and send it to a central repository. This repository stores the data sequentially, like a log. The data is then available to any service that requires it (a Consumer) and is also accessed sequentially.
Fig. 2 - Representation of a Kafka Cluster with its Brokers, message Topics, Producers, and Consumers
Producer, Broker and Consumer
A Broker, by definition, is an "agent who acts as an intermediary" [2], so in Kafka any computer node (AKA server) is called a "Broker". The mission of a Kafka Broker is to gather all the data it is given, keep it as long as it is needed, and make it available as soon as possible.
The entity that inserts data into a Broker is called a Producer.
And the entity that retrieves the data is called a Consumer.
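To make these roles concrete, here is a minimal Producer sketch in Java using the Apache Kafka client library. The Broker address (localhost:9092), the Topic name (user-activity) and the message contents are assumptions for illustration, not part of any setup described here:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ActivityProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Address of a Broker in the cluster; localhost:9092 is an assumption
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "user-activity" is a hypothetical Topic name
            producer.send(new ProducerRecord<>("user-activity", "user-42", "visited /jobs"));
        } // closing the producer flushes any buffered records
    }
}
```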
Topic and Partition
A Topic is a collection of messages grouped together. When a message arrives at a Topic, it is given a sequential integer called an offset and is stored at the end of a data structure called the Log [3].
The Log works like a list where messages are ordered by their offset and stored at the end. When a Consumer needs data, it consumes those messages respecting the order of the offset numbers. For example, a Consumer will consume messages in the order 0, 1, 2, 3; it will not consume the message with offset 3 and then 2. The Log is not intended to be consumed randomly, but in order.
Fig. 3 - Topic split into 3 Partitions, with each Log growing independently
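As a small illustration of this ordering, the following sketch attaches directly to a single Partition of a Topic and reads its Log from the beginning, printing offsets in increasing order. The Broker address and the user-activity Topic are the same assumptions as in the previous example:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LogReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed Broker address
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Attach directly to one Partition of the hypothetical "user-activity" Topic
            TopicPartition partition = new TopicPartition("user-activity", 0);
            consumer.assign(List.of(partition));
            consumer.seekToBeginning(List.of(partition)); // start reading at offset 0

            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                // Offsets come back in strictly increasing order: 0, 1, 2, 3, ...
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```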
A Topic is split into several parts called Partitions, which allows for easier scalability. By assigning Partitions to different Brokers of the cluster, it becomes easier to increase the number of Consumers of that same Topic.
Fig. 4 - Topics A and B split into Partitions in a Cluster
The best performance is achieved by having the same number of Consumers as Partitions. Having more Consumers than Partitions will lead to idle Consumers.
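As a sketch of how a Topic gets its Partitions, Kafka's AdminClient can create a Topic with an explicit Partition count. The Broker address, the Topic name and the replication factor of 1 (a single-Broker development setup) are assumptions:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed Broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 Partitions, replication factor 1 (fine for a single-Broker dev setup)
            NewTopic topic = new NewTopic("user-activity", 3, (short) 1);
            admin.createTopics(List.of(topic)).all().get(); // block until created
        }
    }
}
```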
Consuming Topics
Kafka Consumers are organized into Consumer Groups. A Consumer Group is a set of Consumers that cooperate to consume data from a Topic. The Consumer Group to which a Consumer belongs is defined by the group.id property.
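Here is a minimal Consumer sketch showing where the group.id property fits. The group name activity-analytics, the Broker address and the Topic name are assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ActivityConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed Broker address
        props.put("group.id", "activity-analytics");      // defines the Consumer Group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-activity"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Within each Partition, offsets arrive in increasing order
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```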
The main concept to understand is that a Consumer only consumes from the Partitions assigned to it, and it is the Group Coordinator of each group that oversees assigning Partitions to Consumers, making sure that each Partition is consumed by only one Consumer of a Consumer Group.
With this rule, "a Partition is only consumed by one Consumer of a Consumer Group", two possible scenarios can occur:
- A Partition is consumed by two Consumers belonging to different Consumer Groups.
Fig. 5 - Partition consumed by more than one Consumer
This scenario shows a Topic with two Partitions where Partition 1 is consumed by two Consumers. This is possible because the Consumers belong to different Consumer Groups.
- A Consumer Group has more Consumers than a Topic has Partitions.
Fig. 6 - Consumer 2 is inactive because it has no Partition available
As there are more Consumers than Partitions of the given Topic, Consumer 2 becomes inactive.
Partition Assignment
When a Consumer joins a Consumer Group (this is done by setting the group.id in the Consumer), it connects to the first Broker it can find, makes a FindCoordinator request, and receives the Group Coordinator's endpoint.
Fig. 7 - Find Group Coordinator
The Consumer then connects to the Group Coordinator. If the Consumer is the first of its Consumer Group, the Group Coordinator makes it the Group Leader, and it receives all the current state about available Partitions and Partition consumption for that Consumer Group. A Consumer that is not the Group Leader receives only a member id.
Fig. 8 - Members Join
The Group Leader decides which Partitions will be assigned to which Consumer and communicates this to the Group Coordinator in a SyncGroupRequest. The other Consumers perform a SyncGroupRequest only to receive their assignment: the Broker and Partition to consume from.
Fig. 9 - Partitions Assignment
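The client library performs these requests internally, but an application can observe the outcome of the assignment through a ConsumerRebalanceListener, as in this sketch (the configuration values are the same assumptions as in the previous examples):

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AssignmentWatcher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed Broker address
        props.put("group.id", "activity-analytics");      // assumed Consumer Group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-activity"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // Invoked when a rebalance takes Partitions away from this Consumer
                    System.out.println("Revoked: " + partitions);
                }
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Invoked once the group protocol has delivered this Consumer's assignment
                    System.out.println("Assigned: " + partitions);
                }
            });
            while (true) {
                consumer.poll(Duration.ofMillis(500)); // poll() drives the group protocol
            }
        }
    }
}
```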
Partition Consumption
While a Consumer is processing a Partition, it periodically informs the Group Coordinator of the last successfully processed offset (this is known as an offset commit).
Fig. 10 - Partition consumption
If for any reason a Consumer crashes (the new Consumer instance will be assigned a Partition by the Group Coordinator) or a rebalance is triggered (the Consumer may start processing a different Partition), the Consumer must know from which offset it must resume processing. So the Consumer sends an OffsetFetch request to the Group Coordinator and receives the offset from which it must start processing events.
Fig. 11 - Partition offset tracking
As the offset commit is not done after each successfully processed event, but after a batch of events is processed by a Consumer, a rebalanced or newly instantiated Consumer resumes work from the offset provided by the Group Coordinator, which holds the data from the last offset commit. That offset may be lower than the last processed one, leading to the reprocessing of previously processed messages, as the sketch below illustrates.
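The Consumer below disables auto-commit and commits offsets only after processing a whole batch; a crash mid-batch means those messages are processed again on restart (at-least-once delivery). The group name, Broker address and Topic are the same assumptions as in the earlier examples:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BatchCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed Broker address
        props.put("group.id", "activity-analytics");      // assumed Consumer Group
        props.put("enable.auto.commit", "false");         // commit manually, per batch
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-activity"));
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : batch) {
                    // A crash here means the whole batch is reprocessed after restart
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                consumer.commitSync(); // commit offsets only after the whole batch succeeded
            }
        }
    }
}
```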
There are strategies to mitigate that drawback, which can be analysed here.
References
[1] “A brief history of scaling linkedin,” LinkedIn Engineering. [Online]. Available: https://engineering.linkedin.com/architecture/brief-history-scaling-linkedin. [Accessed: 05-Jun-2023]
[2] “Broker definition & meaning,” Merriam-Webster. [Online]. Available: https://www.merriam-webster.com/dictionary/broker. [Accessed: 05-Jun-2023]
[3] “Documentation,” Apache Kafka. [Online]. Available: https://kafka.apache.org/documentation/#log. [Accessed: 05-Jun-2023]
[4] S. Kozlovski, “Apache Kafka Data Access Semantics: Consumers and membership,” Confluent. [Online]. Available: https://www.confluent.io/blog/apache-kafka-data-access-semantics-consumers-and-membership/. [Accessed: 05-Jun-2023]
[5] N. Narkhede, G. Shapira, and T. Palino, “Chapter 4: Kafka Consumers: Reading Data from Kafka,” in Kafka: The Definitive Guide, 1st ed., Sebastopol, CA: O’Reilly, 2017, p. 68.
[6] Confluent, “Consumer Group Protocol: Scalability and Fault Tolerance,” Confluent. [Online]. Available: https://developer.confluent.io/learn-kafka/architecture/consumer-group-protocol/. [Accessed: 06-Jun-2023]
[7] N. Narkhede, G. Shapira, and T. Palino, “Chapter 4: Kafka Consumers: Reading Data from Kafka,” in Kafka: The Definitive Guide, 1st ed., Sebastopol, CA: O’Reilly, 2017, pp. 77–81.