My notes from the Udemy course: Apache Kafka for Beginners.
Kafka Theory
- Topics: a particular stream of data.
- It is similar to a table in a database, without all the constraints.
- A topic is identified by its name.
- Partitions: topics are split into partitions. They are ordered, and each message within a partition gets an incremental ID called an offset. You must define the number of partitions when creating a topic.
- Offsets only have meaning within a specific partition.
- Order is guaranteed only within a partition.
- Once data is written to a partition, it cannot be changed (immutability).
- Data is assigned randomly to a partition unless a key is provided.
- Brokers are just servers; a Kafka cluster is composed of multiple brokers.
- Each broker has its own ID (integer).
- Each broker contains only a subset of the topic partitions, not all of them.
- A good number to get started is 3 brokers.
- Topic replication factor
- Topics should have a replication factor >1 (usually between 2 and 3).
- This way, if a broker is down, another broker can serve the data.
- Leader for a partition: at any time, exactly one broker is the leader for a given partition, meaning it receives and serves the data for that partition. The other brokers (ISR - in-sync replicas) synchronize the data.
- Producers
- Producers write data to topics.
- Producers automatically know which broker and partition to write to.
- Producers automatically recover when a broker fails.
- Producers can choose to receive acknowledgement of data writes:
- acks=0: producer will not wait for acknowledgement (possible data loss)
- acks=1: producer will wait for leader acknowledgement (limited data loss)
- acks=all: producer will wait for leader and replicas acknowledgement (no data loss)
- Message keys: a key can be sent by the producer, which will route the message to a certain partition. If no key is sent, data is routed via round robin.
- A key must be sent if you need message ordering for a specific field.
- All messages with the same key go to the same partition (see the console-producer sketch just below).
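As a quick sketch of keyed production, the console producer (covered in the CLI section below) can parse keys from the input; `parse.key` and `key.separator` are standard console-producer properties, and the topic name is just the example used throughout these notes:
kafka-console-producer \
--broker-list 127.0.0.1:9092 \
--topic my-kafka-topic \
--property parse.key=true \
--property key.separator=:
Typing `user_1:hello` then produces a message with key `user_1`; every message with that key lands on the same partition.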
- Consumers
- Consumers read data from a topic (identified by a name).
- Consumers know which broker to read from.
- Consumers automatically recover when a broker fails.
- Data is read in order within each partition.
- Consumers read data in consumer groups. Each consumer within a group reads from exclusive partitions; if there are more consumers than partitions, some consumers will be inactive.
- Consumer offsets
- Kafka stores the offsets at which a consumer group has been reading; they are committed to a topic named __consumer_offsets.
- If a consumer dies, it will be able to read back from where it left off thanks to the committed consumer offsets.
- Consumers can choose when to commit offsets, which determines the delivery semantics: at most once, at least once (preferred), or exactly once.
- Kafka Broker Discovery
- Every Kafka broker is also a “bootstrap server”, which means it knows about every other broker, as well as all topics and partitions (metadata).
- Once you connect to one broker, you will be connected to the entire cluster.
- Zookeeper
- Zookeeper manages the brokers, and performs leader election for partitions.
- Zookeeper sends notifications to Kafka in case of changes (e.g. new broker, broker dies, delete topic, etc.).
- Kafka cannot work without Zookeeper.
- Zookeeper operates by design with an odd number of servers (3, 5, 7…).
- Kafka guarantees
- Messages are appended to a topic-partition in the order they are sent.
- Consumers read messages in the order stored in a topic-partition.
- With a replication factor of N, producers and consumers can tolerate up to N-1 brokers being down.
- As long as the number of partitions for a topic remains constant, the same key will always go to the same partition (the default partitioner hashes the key with murmur2 and takes the result modulo the number of partitions).
Kafka Command Line Interface
Kafka Topics
`kafka-topics` can be used to manage topics. You must specify Zookeeper when running any operation.
Create a topic
kafka-topics \
--zookeeper 127.0.0.1:2181 \
--topic my-kafka-topic \
--partitions 3 \
--replication-factor 1 \
--create
List topics
kafka-topics \
--zookeeper 127.0.0.1:2181 \
--list
Describe a topic
kafka-topics \
--zookeeper 127.0.0.1:2181 \
--topic my-kafka-topic \
--describe
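The describe output shows, for each partition, its leader broker and in-sync replicas. A sample output for the topic created above (single broker with ID 0; values are illustrative and the exact formatting varies across Kafka versions):
Topic: my-kafka-topic  PartitionCount: 3  ReplicationFactor: 1  Configs:
    Topic: my-kafka-topic  Partition: 0  Leader: 0  Replicas: 0  Isr: 0
    Topic: my-kafka-topic  Partition: 1  Leader: 0  Replicas: 0  Isr: 0
    Topic: my-kafka-topic  Partition: 2  Leader: 0  Replicas: 0  Isr: 0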
Delete a topic
kafka-topics \
--zookeeper 127.0.0.1:2181 \
--topic my-kafka-topic \
--delete
Note that the deletion only takes effect if `delete.topic.enable=true` (the default since Kafka 1.0).
Kafka Console Producer
`kafka-console-producer` reads data from standard input and publishes it to Kafka.
Produce a message
kafka-console-producer \
--broker-list 127.0.0.1:9092 \
--topic my-kafka-topic
`Ctrl + C` will interrupt the input.
Produce a message with properties
kafka-console-producer \
--broker-list 127.0.0.1:9092 \
--topic my-kafka-topic \
--producer-property acks=all
Kafka Console Consumer
`kafka-console-consumer` reads data from Kafka and outputs it to standard output.
Consume a topic (by default, only messages produced after the consumer starts are read)
kafka-console-consumer \
--bootstrap-server 127.0.0.1:9092 \
--topic my-kafka-topic
Consume a topic from the beginning
kafka-console-consumer \
--bootstrap-server 127.0.0.1:9092 \
--topic my-kafka-topic \
--from-beginning
Consume a topic as part of a consumer group
kafka-console-consumer \
--bootstrap-server 127.0.0.1:9092 \
--topic my-kafka-topic \
--group my-application
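To also display the message keys (for example, messages produced with `parse.key=true` as sketched earlier), the console consumer accepts the standard `print.key` and `key.separator` properties:
kafka-console-consumer \
--bootstrap-server 127.0.0.1:9092 \
--topic my-kafka-topic \
--from-beginning \
--property print.key=true \
--property key.separator=: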
Kafka Consumer Groups
`kafka-consumer-groups` lists, describes, deletes and resets offsets for consumer groups.
List consumer groups
kafka-consumer-groups \
--bootstrap-server 127.0.0.1:9092 \
--list
Describe a consumer group
kafka-consumer-groups \
--bootstrap-server 127.0.0.1:9092 \
--group my-application \
--describe
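The describe output reports, per partition, the group's committed offset and its lag behind the end of the log. A sample (values made up for illustration; columns vary slightly across Kafka versions):
TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST  CLIENT-ID
my-kafka-topic  0          10              10              0    -            -     -
my-kafka-topic  1          8               10              2    -            -     -
my-kafka-topic  2          10              10              0    -            -     -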
Reset the offset for a group
kafka-consumer-groups \
--bootstrap-server 127.0.0.1:9092 \
--group my-application \
--reset-offsets \
--to-earliest \
--topic my-kafka-topic \
--execute
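`--to-earliest` is only one of the reset strategies; `--shift-by`, for example, moves the committed offset by a positive or negative amount. A sketch, reusing the group and topic from above:
kafka-consumer-groups \
--bootstrap-server 127.0.0.1:9092 \
--group my-application \
--reset-offsets \
--shift-by -2 \
--topic my-kafka-topic \
--execute
Note that offsets can only be reset while the group has no active consumers.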
Advanced Configuration
Idempotent Producers
- An idempotent producer can be enabled so that retries on network errors do not introduce duplicate messages.
- Idempotent producers guarantee a safe and stable pipeline.
- It is enabled via the `enable.idempotence` property.
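Since this is a plain producer config, a minimal way to try it is to pass it to the console producer via `--producer-property` (reusing the example topic):
kafka-console-producer \
--broker-list 127.0.0.1:9092 \
--topic my-kafka-topic \
--producer-property enable.idempotence=true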
Message Compression
- Compression is enabled at the Producer level and does not require changes in the Brokers or in the Consumers.
- It is set via the `compression.type` property, which supports the following values: `none`, `gzip`, `lz4`, `snappy`.
- Tweaking the `linger.ms` and `batch.size` properties can help create bigger batches, which improves the compression ratio and the throughput.
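These are also plain producer configs, so they can be passed to the console producer the same way; the values below are illustrative, not tuned recommendations:
kafka-console-producer \
--broker-list 127.0.0.1:9092 \
--topic my-kafka-topic \
--producer-property compression.type=snappy \
--producer-property linger.ms=20 \
--producer-property batch.size=32768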
Kafka Ecosystem
- Kafka Connect: built on the Connect APIs, it simplifies getting data into and out of Kafka.
- Kafka Streams: provides data processing and transformation within Kafka.
- Schema Registry: the Confluent Schema Registry stores and serves schemas, with Apache Avro as the data format. Producers send schemas to the Schema Registry and push data in Avro format; Consumers fetch the schema from the registry and read the data as Avro.