Apache Kafka for Beginners | guh.me - gustavo's personal blog

My notes from the Udemy course: Apache Kafka for Beginners.

Kafka Theory

Topics: a particular stream of data.
- It is similar to a table in a database, without all the constraints.
- A topic is identified by its name.
Partitions: topics are split in partitions. They are ordered, and each message within a partition gets an incremental ID, called offset. You need to define the number of partitions when creating a topic.
- Offsets has only meaning for a specific partition.
- Order is guaranteed only within a partition.
- Once data is written to a partition, it cannot be changed (immutability).
- Data is assigned randomly to a partition unless a key is provided.
Brokers are nothing else than servers, which are used to compose a Kafka cluster.
- Each broker has its own ID (integer).
- Each broker contains only certain topic partitions, but not all of them.
- A good number to get started is 3 brokers.
Topic replication factor
- Topics should have a replication factor >1 (usually between 2 and 3).
- This way, if a broker is down, another broker can serve the data.
- Leader for a partition defines a broker that is a leader for a given partition, which means receiving and serving the data for a partition. The other broker]s (ISR - in-sync replicas) will synchronize the data.
Producers
- Producers write data to topics.
- Producers automatically know to which broker and partition to write to.
- Producers automatically recover when a broker fails.
- Producers can choose to receive acknowledgement of data writes:
  - acks=0: producer will not wait for acknowledgement (possible data loss)
  - acks=1: producer will wait for leader acknowledgement (limited data loss)
  - acks=all: producer will wait for leader and replicas acknowledgement (no data loss)
- Message keys: a key can be sent by the producer, which will route the message to a certain partition. If no key is sent, data is routed via round robin.
  - A key must be sent if you need message ordering for a specific field.
  - All messages with the same key go to the same partition.
Consumers
- Consumers read data from a topic (identified by a name).
- Consumers know which broker to read from.
- Consumers automatically recover when a broker fails.
- Data is read in order within each partition.
- Consumers read data in consumer groups. Each consumer within a group reads from exclusive partitions; if there are more consumers than partitions, some consumers will be inactive.
Consumer offsets
- Kafka stores the offsets at which a consumer group has been reading, which are commited to a topic called __consumer_offsets.
- If a consumer dies, it will be able to read back from where it left off thanks to the committed consumer offsets.
- Consumers can choose when to commit offsets: at most once, at least once (preferred) and exactly once.
Kafka Broker Discovery
- Every Kafka broker is also a “bootstrap server”, which means they about every other broke, topics and partitions (metadata).
- Once you connect to one broker, you will be connected to the entire cluster.
Zookeeper
- Zookeeper manages the brokers, and performs leader election for partitions.
- Zookeeper sends notification to Kafka in case of changes (e.g. new broker, broker dies, delete topic, etc).
- Kafka cannot work without Zookeeper.
- Zookeeper operates by design with an odd number of servers (3, 5, 7…).
Kafka guarantees
- Messages are appended to a topic-partition in the order they are sent.
- Consumers read messages in the order stored in a topic-partition.
- With a replication factor of N, producers and consumers can tolerate up to N-1 brokers being down.
- As long as the number of partitions for a topic remains constant, the same key will always go to the same partition.

Kafka Command Line Interface

Kafka Topics

kafka-topics can be used to manage topics. You must specify Zookeeper when running any operation.

Create a topic

kafka-topics \
    --zookeper 127.0.0.1:2181 \
    --topic my-kafka-topic \
    --partitions 3 \
    --replication-factor 1 \
    --create

List topics

kafka-topics \
    --zookeper 127.0.0.1:2181 \
    --list

Describe a topic

kafka-topics \
    --zookeper 127.0.0.1:2181 \
    --topic my-kafka-topic \
    --describe

Delete a topic

kafka-topics \
    --zookeper 127.0.0.1:2181 \
    --topic my-kafka-topic \
    --delete

Kafka Console Producer

kafka-console-producer reads data from standard input and publishes it to Kafka.

Produce a message

kafka-console-producer \
    --broker-list 127.0.0.1:9092 \
    --topic my-kafka-topic

‘CTRL + C` will interrupt the input.

Produce a message with properties

kafka-console-producer \
    --broker-list 127.0.0.1:9092 \
    --topic my-kafka-topic
    --producer-property acks=all

Kafka Console Consumer

kafka-console-consumer reads data from Kafka and outputs it to standard output.

Consume a topic

kafka-console-consumer \
    --bootstrap-server 127.0.0.1:9092 \
    --topic my-kafka-topic

Consume a topic from beginning

kafka-console-consumer \
    --bootstrap-server 127.0.0.1:9092 \
    --topic my-kafka-topic \
    --from-beginning

Consume a topic with group

kafka-console-consumer \
    --bootstrap-server 127.0.0.1:9092 \
    --topic my-kafka-topic \
    --group my-application

Kafka Consumer Groups

kafka-consumer-groups lists, describes, deletes and resets offsets for consumer groups.

List consumer groups

kafka-consumer-groups \
    --bootstrap-server 127.0.0.1:9092 \
    --list

Describe a consumer group

kafka-consumer-groups \
    --bootstrap-server 127.0.0.1:9092 \
    --group my-application \
    --describe

Reset the offset for a group

kafka-consumer-groups \
    --bootstrap-server 127.0.0.1:9092 \
    --group my-application \
    --reset-offsets \
    --to-earliest \
    --topic my-kafka-topic \
    --execute

Advanced Configuration

Idempotent Producers

An idempotent producer can be defined, so no duplicates are introduced on network errors.
Idempotent producers guarantee a safe and stable pipeline.
It can be set via the property enable.idempotent.

Message Compression

Compression is enabled at the Producer level and does not require changes in the Brokers or in the Consumers.
It can be set via the property compression.type, which supports the following values: none, gzip, lz4, snappy.
Tweaking the properties linger.ms and batch.size can help improve compression and higher throughput when using compression.

Kafka Ecosystem

Kafka Connect: using the Connect APIs, it simplifies and improves getting data in and out of Kafka.
Kafka Streams: provides data processing and transformation within Kafka.
Schema Registry: Confluent Schema Registry acts as a registry, and Apache Avro as the data format. Producers send schemas to the Schema Registry, and Consumers fetch the schema from it. Producers will push data into Avro format, and Consumers will read the data as Avro.