Apache Kafka from a developer's point of view.
The story will explain how Kafka is working in a bit of depth.
What is Apache Kafka?
Apache Kafka is a horizontally scalable,fault-tolerant, distributed streaming platform.
- Deliver the messages to consumers when they need them.
Kafka combines three key capabilities so you can implement use cases for event streaming end-to-end with a single battle-tested solution:
- To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.
- To store streams of events durably and reliably for as long as you want.
- To process streams of events as they occur or retrospectively.
And all this functionality is provided in a distributed, highly scalable, elastic, fault-tolerant, and secure manner. Kafka can be deployed on bare-metal hardware, virtual machines, and containers, and on-premises as well as in the cloud. You can choose between self-managing your Kafka environments and using fully managed services offered by a variety of vendors.
Kafka Storage Architecture
Apache Kafka organizes the messages in Topics, and the broker creates a log file for each Topic to store these messages. However, these log files are partitioned, replicated, and segmented
Kafka's topic is a logical grouping of messages. If we compare it with the database the topic is like the table where data is stored as the data file.
The topic created inside the Kafka folder will be like this. The topic is nothing more than a folder inside Kafka.
The Topic should be created a number of partition and replication factor.
Once the message is inserted it will go to the folder as below.
There are two types of index one is offset bases other is time-based. These are used by the Kafka broker to identify the message fast and re-execute on a time basis and offset basis.
Topic name -Partition number — offset number
Kafka Cluster Architecture
Kafka Cluster is formed and who is responsible for managing the Kafka Cluster and coordinating work in the cluster.
- Kafka cluster is a group of Kafka brokers who used to work together.
- ZooKeeper manages the Kafka cluster with a number of brokers for scalability. Manages the list of leaders and active brokers.
Kafka Work Distribution Architecture
understand how the work is distributed among the brokers in a Kafka cluster. Primarily focus of this lecture is to explain Fault tolerance and the Scalability of Apache Kafka.
How the topic distributed as shown below
- For achieving the goals even distribution and fault-tolerant Kafka will be used the above way to organize the topic.
Integration tools like Spring Kafka can be used for producer and consuming the message on the topic.
JAVA API for Kafka.
- Producer: https://kafka.apache.org/10/javadoc/?org/apache/kafka/clients/producer/KafkaProducer.html
- Consumer: https://kafka.apache.org/26/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
Kafka Producer Internal
The serializer is a must to transfer data to the remote.
Multithreaded Kafka producer level.
If we need to send data at multiple levels of high frequency.The Kafka topic threads safe.
Introduction to Kafka consumer API and you will be creating a typical consumer.
We could use consumer API to consume the data, transform, and perform the business. The consumer should take care of the rate at which the producer produces the data.
We could make the various consumers consume, each consumer will be attached from the set of consumers.
Explain the role of the offset.
There is a sequential ID number given to the messages in the partitions that we call, an offset. So, to identify each message in the partition uniquely, we use these offsets.
What is a Consumer Group?
Ans. The concept of Consumer Groups is exclusive to Apache Kafka. Basically, every Kafka consumer group consists of one or more consumers that jointly consume a set of subscribed topics.
What is Zookeeper and why is it needed for Apache Kafka?
Zookeeper acts as a centralized service and is used to maintain naming and configuration data and to provide flexible and robust synchronization within distributed systems. Zookeeper keeps track of status of the Kafka cluster nodes and it also keeps track of Kafka topics, partitions etc.
Let’s say that a producer is writing records to a Kafka topic at 10000 messages/sec while the consumer is only able to read 2500 messages per second. What are the different ways in which you can scale up your consumer?
The answer to this question encompasses two main aspects — Partitions in a topic and Consumer Groups.
A Kafka topic is divided into partitions. The message sent by the producer is distributed among the topic’s partitions based on the message key. Here we can assume that the key is such that messages would get equally distributed among the partitions.
Consumer Group is a way to bunch together consumers so as to increase the throughput of the consumer application. Each consumer in a group latches to a partition in the topic. i.e. if there are 4 partitions in the topic and 4 consumers in the group then each consumer would read from a single partition. However, if there are 6 partitions and 4 consumers, then the data would be read in parallel from 4 partitions only. Hence its ideal to maintain a 1 to 1 mapping of partition to the consumer in the group.
Now in order to scale up processing at the consumer end, two things can be done:
- No of partitions in the topic can be increased (say from existing 1 to 4).
- A consumer group can be created with 4 instances of the consumer attached to it.
Doing this would help read data from the topic in parallel and hence scale up the consumer from 2500 messages/sec to 10000 messages per second.
Kafka is a distributed system wherein data is stored across multiple nodes in the cluster. There is a high probability that one or more nodes in the cluster might fail. Fault tolerance means that the data is the system is protected and available even when some of the nodes in the cluster fail.
Download the latest Kafka release and extract it:
$ tar -xzf kafka_2.13-2.7.0.tgz
$ cd kafka_2.13-2.7.0
NOTE: Your local environment must have Java 8+ installed.
Run the following commands in order to start all services in the correct order:
# Start the ZooKeeper service
# Note: Soon, ZooKeeper will no longer be required by Apache Kafka.
$ bin/zookeeper-server-start.sh config/zookeeper.properties