Kafka Fundamentals
Apache Kafka is a distributed commit log, not just a message queue. That distinction matters. A message queue delivers a message and it disappears. Kafka writes messages to a durable, ordered, append-only log that consumers can read and re-read at will. This fundamental design choice is why Kafka became the backbone of event-driven data engineering.
The Distributed Commit Log
At its core, Kafka is a distributed, partitioned, replicated log of records. Every message written to Kafka is appended to the end of a log and assigned a sequential offset number. Messages are immutable once written. This is the same concept as a database write-ahead log, but exposed as a first-class API.
Partition 0: [msg0] [msg1] [msg2] [msg3] [msg4] [msg5]
^
latest offset
Partition 1: [msg0] [msg1] [msg2] [msg3]
^
latest offset
This design means consumers can rewind and replay data, multiple consumers can read the same data independently, and Kafka retains messages for a configurable period regardless of whether anyone has read them.
Topics & Partitions
Topics
A topic is a named feed of messages. Think of it as a category. An e-commerce platform might have topics like orders, page-views, inventory-updates, and user-signups. Producers write to topics. Consumers read from topics.
Partitions
Each topic is split into one or more partitions. Partitions are the unit of parallelism in Kafka. A topic with 12 partitions can be read by up to 12 consumers simultaneously.
Messages within a single partition are strictly ordered. Messages across partitions have no ordering guarantee. This is a critical design point: if you need ordering for a specific entity (say, all events for a given user), you must ensure those events land in the same partition.
Topic: orders (3 partitions)
Partition 0: [order-101] [order-104] [order-107]
Partition 1: [order-102] [order-105] [order-108]
Partition 2: [order-103] [order-106] [order-109]
Kafka assigns messages to partitions using a partition key. If you set the key to user_id, all events for user 42 go to the same partition, preserving their order.
Choosing Partition Count
More partitions mean more parallelism but also more overhead. A common starting point:
- Low-throughput topics (< 10 MB/s): 6-12 partitions
- High-throughput topics (> 100 MB/s): 30-100 partitions
- You can increase partitions later, but you cannot decrease them
Producers
A producer writes messages to a Kafka topic. The producer decides which partition a message goes to, either explicitly (by specifying a partition), by key (Kafka hashes the key to a partition), or round-robin (if no key is set).
from confluent_kafka import Producer
config = {
'bootstrap.servers': 'kafka-broker-1:9092,kafka-broker-2:9092',
'acks': 'all', # Wait for all replicas to acknowledge
}
producer = Producer(config)
def delivery_report(err, msg):
if err:
print(f"Delivery failed: {err}")
else:
print(f"Delivered to {msg.topic()} [{msg.partition()}] @ offset {msg.offset()}")
# Key ensures all events for this order go to the same partition
producer.produce(
topic='orders',
key='order-12345',
value='{"status": "created", "amount": 99.99}',
callback=delivery_report,
)
producer.flush()
Producer Acknowledgments
The acks setting controls durability:
acks=0: Fire and forget. Fastest, but messages can be lost.acks=1: Wait for the leader replica to acknowledge. A good balance.acks=all: Wait for all in-sync replicas to acknowledge. Safest, but adds latency.
For data engineering pipelines where correctness matters, use acks=all.
Consumers & Consumer Groups
Consumers
A consumer reads messages from one or more topic partitions. It tracks its position (offset) in each partition.
from confluent_kafka import Consumer
config = {
'bootstrap.servers': 'kafka-broker-1:9092',
'group.id': 'order-processor',
'auto.offset.reset': 'earliest',
'enable.auto.commit': False,
}
consumer = Consumer(config)
consumer.subscribe(['orders'])
try:
while True:
msg = consumer.poll(timeout=1.0)
if msg is None:
continue
if msg.error():
print(f"Error: {msg.error()}")
continue
print(f"Received: {msg.value().decode('utf-8')}")
# Process the message, then commit
consumer.commit(asynchronous=False)
finally:
consumer.close()
Consumer Groups
A consumer group is a set of consumers that cooperate to consume a topic. Kafka assigns each partition to exactly one consumer in the group. If you have 12 partitions and 4 consumers in a group, each consumer gets 3 partitions.
Topic: orders (6 partitions)
Consumer Group: order-processor (3 consumers)
Consumer A: reads Partition 0, Partition 1
Consumer B: reads Partition 2, Partition 3
Consumer C: reads Partition 4, Partition 5
If Consumer B crashes, Kafka rebalances: Consumer A and C split B's partitions. When B comes back, Kafka rebalances again. This is how Kafka achieves fault-tolerant consumption.
Multiple consumer groups can read the same topic independently. The analytics pipeline and the notification service can each have their own group, reading the same data at their own pace.
Offset Management
The offset is a consumer's bookmark. It tells Kafka "I have processed all messages up to this point."
Auto-Commit
By default, Kafka auto-commits offsets periodically (every 5 seconds). This is convenient but dangerous. If your consumer crashes after committing but before finishing processing, you lose messages. If it crashes after processing but before committing, you process messages twice.
Manual Commit
For data engineering, manual commit is safer. Process the message, write the result, then commit the offset. This gives you at-least-once semantics: you might process a message twice after a crash, but you will never skip one.
# Process, then commit
for msg in messages:
process(msg)
consumer.commit() # Only after successful processing
Seek and Replay
Because Kafka retains messages, you can reset a consumer's offset to replay data. This is invaluable for data engineering:
- A bug in your pipeline? Fix the code, reset the offset, reprocess.
- Need to backfill a new table? Start from the earliest offset.
- Want to test a new transformation? Spin up a new consumer group and replay.
Retention Policies
Kafka does not keep messages forever (by default). Retention policies control how long data stays.
Time-Based Retention
The most common approach. The default is 7 days. For event sourcing or compliance, you might set it to 30 days or even years.
# Keep messages for 30 days
log.retention.hours=720
Size-Based Retention
Cap the total size of a topic's log.
# Keep up to 100 GB per partition
log.retention.bytes=107374182400
Compacted Topics
For topics that represent the latest state of an entity (like a user profile), log compaction keeps only the most recent message per key. Older messages with the same key are garbage collected.
cleanup.policy=compact
This turns Kafka into a key-value store that also supports streaming reads.
When You Need Kafka
Kafka is not always the right tool. Here is when it earns its complexity:
High-Throughput Event Ingestion
Clickstream data, IoT sensors, application logs. When you need to ingest millions of events per second and make them available to multiple downstream systems, Kafka is purpose-built for this.
Event Sourcing
Instead of storing the current state, you store the sequence of events that led to the current state. Kafka's durable, ordered log is a natural fit. An order goes through created -> paid -> shipped -> delivered, and each transition is an event in the log.
Log Aggregation
Collecting logs from hundreds of application instances into a central system. Kafka replaces ad-hoc log shipping with a reliable, scalable transport layer. Downstream, you can fan out to Elasticsearch for search, S3 for archival, and Spark for analytics.
Decoupling Microservices
When Service A needs to notify Services B, C, and D about an event, direct HTTP calls create tight coupling. With Kafka, Service A publishes an event. Services B, C, and D each consume it at their own pace.
When You Do Not Need Kafka
- Your throughput is under 1,000 events per second. A database and polling will work fine.
- You have a single consumer. A simple message queue (RabbitMQ, SQS) is operationally simpler.
- You need request-response communication. Kafka is for async event streams, not RPC.
- Your team is small and cannot afford the operational overhead of managing a Kafka cluster.
Kafka Architecture Internals
Brokers
A Kafka cluster consists of multiple brokers (servers). Each broker stores a subset of partitions. One broker is elected as the controller, responsible for managing partition leader elections.
Replication
Each partition has a configurable number of replicas spread across brokers. One replica is the leader (handles all reads and writes). The others are followers that replicate the leader's log. If the leader fails, a follower is promoted.
Topic: orders, Partition 0, Replication Factor: 3
Broker 1: Leader [msg0] [msg1] [msg2] [msg3]
Broker 2: Follower [msg0] [msg1] [msg2] [msg3]
Broker 3: Follower [msg0] [msg1] [msg2] <- slightly behind (ISR)
ZooKeeper vs KRaft
Historically, Kafka used ZooKeeper for metadata management. Since Kafka 3.3, KRaft mode replaces ZooKeeper with a built-in Raft consensus protocol. New deployments should use KRaft. It simplifies operations by removing the ZooKeeper dependency.
Common Pitfalls
- Too few partitions at creation time. Increasing partitions later breaks key-based ordering guarantees for existing data. Plan ahead.
- Using Kafka as a database. Kafka is a log, not a query engine. You cannot efficiently look up a specific message by anything other than offset.
- Ignoring consumer lag. If consumers fall behind producers, you have a problem. Monitor consumer lag and alert before it hits retention boundaries.
- Not setting a partition key. Without a key, messages round-robin across partitions. If you need ordering per entity, you need a key.
- Auto-committing offsets. For data engineering pipelines, this leads to data loss or duplication on failures. Use manual commits.
- Underestimating operational complexity. Kafka requires monitoring, disk management, broker upgrades, and partition rebalancing. Use managed services (Confluent Cloud, AWS MSK) if your team is small.
- Setting retention too short. If a consumer goes down for a week and your retention is 3 days, those messages are gone. Set retention based on your worst-case recovery time.
Key Takeaways
- Kafka is a distributed commit log, not a message queue. Messages are durable, ordered, and replayable.
- Topics are split into partitions for parallelism. Ordering is guaranteed only within a partition.
- Consumer groups enable parallel consumption with automatic failover.
- Manual offset management gives you control over exactly when data is marked as processed.
- Retention policies (time, size, compaction) control how long data lives in Kafka.
- Kafka shines for high-throughput event ingestion, event sourcing, and decoupling systems. It is overkill for low-volume, single-consumer use cases.
- Operational complexity is real. Budget for monitoring, alerting, and capacity planning.