Part 1 - Kafka Key Terminology

This section focuses on the key terminology and concepts about Apache Kafka. And also note that each has its contribution to Apache Kafka Architecture.

Distributed Messaging System

Kafka is a distributed messaging system which is designed for:

high throughput
reliability
resilient
fast movement of data through the more and more diverse system, particularly when data is growing

Producers / Publisher

A publisher creates some data and sends it to a broker. Producers are simply an applications that you write to send the messages.

Consumers / Subscriber

A subscriber retrieves the message and processes it from the broker to which it is interested and authorized. Consumers are simply applications that you write to receive the messages.

Topics

It is the collection of messages or grouping of messages. And every topic has its name, which is used by producers and consumers to send/receive messages. Topics can be defined upfront or on-demand as needed.

Broker or Worker or Node

Kafka broker is a software process also referred to as executable or daemon service that runs on a machine that can be physical or virtual.

Cluster

A kafka cluster is the grouping of multiple kafka brokers which can be on a single machine or different machines.

Controller

A controller is just a worker node or broker like any other. It just happened to be elected from amongst its peer to officiate in the administrative capacity of a controller. Below are the critical responsibilities of the controller:

To maintain an inventory of what workers are available to take on work.
To maintain a list of work items that have been assigned to workers.
To maintain the status of workers
To manage the replication factor and task redundancy in case of failure

Leader

A leader is also a worker node or broker like any other. It is elected by the controller for assigning the task. It will peer with other worker nodes based on the replication factor to complete its task. Once a peer has committed to its leader, the quorum is formed. And these peer takes all the task that is assigned to its leader.

Communication and Consensus

Virtually every component within a cluster has to keep some form of communication going between themselves. In a distributed system, these communications are called consensus. Besides the obvious data payloads being transferred as messages, there are other types of network communication happening that keep the cluster operating normally. These are some of them:

Worker node membership and naming
Configuration management - startup-config
Leader election
Health status

Zookeeper

In the context of Apache Kafka, ZooKeeper serves as a centralized service for maintaining metadata about a cluster of distributed nodes, which are:

Configuration information
Health and Synchronization Status
Cluster and Quorum Group membership
Roles of elected nodes

Also, it is the responsibility of the Zookeeper to assign a broker to be responsible for the topic.

Topic message order

Every message that comes to the broker is grouped by topics and is stored inside a partition. And each partition stores these messages in an ordered sequence (by time) and immutable by nature means cannot be updated once it is received by the broker. But in the case of multiple partitions, an order cannot be guaranteed and it’s the responsibility of the consumer to handle this use case.

Event Souring

An architectural style or approach to maintaining an application’s state by capturing all changes as a a sequence of time-ordered, immutable events.

Message Content

Every message that is consumed by the broker contains these three properties:

Timestamp
Reference-able identifier
Payload (binary)

Offset

It is the last read message position that is maintained by the Kafka consumer. It also corresponds to the message identifier.

Message Retention Policy

It is the period for how long messages will be stored in the broker. By default, it is set to 168 hours or seven days. The retention period is defined on a per topic basis. Physical storage can also constrain the message retention policy.

Distributed Commit Log (Transaction or Commit Logs)

It is the series of commit logs that can also be used for event sourcing. Following are the key properties of it:

Source of truth
Physically stored and maintained
Higher-order data structures derived from the log
- Tables, indexes, views, etc
Point of recovery
Basis for replication and distribution

Kafka Partitions

It is the physical representation of the topic as a commit log stored on one or more brokers. It can be found inside /tmp/kafka-logs/{topic}-{partition} directory, as per our example, /tmp/kafka-logs/my_topic-0 directory will be created. This directory will contain an index, log files, etc. These are the main properties of partitions:

Partitions == Log
Each topic has one or more log files called partitions
Basis for
- Scaling
- Fault-tolerant
- Higher levels of throughput
At-least one partition is required for one or more broker

Replication Factor

It is the factor that decides how much data redundancy should happen inside the cluster. And this is configured on a per-topic basis. These are the features that it provides:

Reliable work distribution
- Redundancy of messages
- Cluster resiliency
- Fault-tolerance
Guarantees
- N-1 broker failure tolerance
- 2 or 3 minimum
Broadcast In-Sync Replicas or ISR to the cluster is equal to the replication factor for that topic and partition

Notes

The scalability of Apache Kafka is determined by the number of partitions being managed by multiple broker nodes

Apache Kafka Series