-No replay capability; lost messages are lost forever
-Manual exactly-once theatre with brittle distributed transactions
After
+Decoupled in time via durable event log; producers and consumers scale independently
+New consumer groups added without producer changes
+Replay history for backfills, debugging, new use cases
+At-least-once + idempotency at the consumer; pragmatic, robust, simple
Synchronous communication couples services in time. Asynchronous communication couples them only in contract. The producer publishes an event; the consumer reads it whenever it is ready, retries if it fails, and runs at a different rate than the producer. That decoupling is what lets event-driven systems scale to billions of events per day — and what creates the operational failure modes you have to learn to recognise.
Queue vs Pub/Sub vs Event Streaming — Pick One Deliberately
Message queue (RabbitMQ, SQS): each message is delivered to one consumer. Used for work distribution: a queue of jobs, workers pull and process. Messages disappear after ack.
Pub/sub (Redis Pub/Sub, NATS, Google Pub/Sub): each message is delivered to all subscribers. Used for fan-out notifications. Often ephemeral — missed messages are missed.
Event streaming (Kafka, Kinesis, Pulsar): messages are appended to a durable log; consumers read at their own pace, can replay history, can have many independent groups. The dominant pattern for high-throughput data pipelines.
Kafka in Production
Kafka's mental model: a topic is a named, durable, append-only log split into partitions. Each partition is replicated across brokers (typically 3x). A producer writes records, optionally with a key; the key's hash determines the partition. A consumer group reads the topic; Kafka assigns each partition to one consumer in the group, so partition count caps consumer parallelism.
Three properties to internalise:
Ordering is per-partition. Records with the same key land in the same partition and are read in order. Across partitions, ordering is undefined. Choose your key to align with the units you need ordered (e.g. user_id for per-user event ordering).
Consumer groups are independent. Two consumer groups reading the same topic do not affect each other — each tracks its own offset. This is the foundation of event-driven architectures: the orders topic feeds a billing pipeline AND a search-indexer AND an audit log, all reading the same stream independently.
Replication is for durability, not for read scale. Reads always go to the partition leader. Replicas exist so you can survive broker loss; they do not load-balance reads.
Delivery Guarantees — What Exactly-Once Actually Means
Three levels:
At-most-once — fire and forget; on failure the message is lost. Acceptable for telemetry where loss is fine.
At-least-once — the message will be delivered; possibly more than once. The default in most systems. Requires consumer-side idempotency (deduplication via idempotency keys).
Exactly-once — the message is processed exactly once, end-to-end. Kafka's exactly-once semantics work between Kafka topics; once a message leaves Kafka and writes to an external system, you are back to “at-least-once + idempotency”.
Practical guidance: design for at-least-once with consumer-side idempotency as the default. Reach for exactly-once only when you genuinely cannot make consumers idempotent, and accept the operational cost.
Backpressure
Backpressure is the signal flowing upstream from a saturated consumer to a producer: “slow down, I cannot keep up”. Without it, producers happily fill queues until memory or disk runs out. Mechanisms:
Bounded queues with blocking enqueue; the producer blocks when the queue is full.
Reactive streams (Project Reactor, RxJava) with explicit demand signals.
HTTP/2 flow control built into the protocol.
Consumer-lag-driven producer throttling: if Kafka consumer lag exceeds a threshold, the producer service intentionally slows.
The opposite anti-pattern: an unbounded in-memory queue that quietly grows until the JVM OOMs. Always bound your queues.
Common Production Failures
Consumer lag spike: the canonical Kafka alert. A consumer group falls behind the head of the log. Causes: consumer slowed down, partition imbalance, downstream dependency degraded. The metric to watch: kafka_consumer_lag_max per group.
Rebalance storm: every time a consumer joins or leaves the group, all partition assignments are recomputed and consumers pause. Frequent rebalances kill throughput. Causes: aggressive session timeouts, slow processing exceeding heartbeat, scaling churn. Mitigation: tune session.timeout.ms, heartbeat.interval.ms, and use cooperative rebalance.
Hot partition: a partition key with skewed traffic (e.g. one big tenant) overloads one broker while others sit idle. Mitigation: salt the key, increase partition count, or use a different partitioning scheme.
Stuck consumer: a consumer hangs on one bad message and stops making progress. Mitigation: per-message timeouts, dead-letter queue, observability on per-message processing time.
Backpressure Propagation
Self-Check Quiz
You have a Kafka topic with 4 partitions and a consumer group of 6 consumers. What happens? (Answer: 4 consumers each own one partition; 2 are idle. Partition count caps consumer parallelism in a group.)
Your team wants exactly-once delivery for a billing pipeline. What should you actually build? (Answer: at-least-once + consumer-side idempotency via idempotency keys. Kafka exactly-once works topic-to-topic but not topic-to-database.)
Consumer lag is climbing for one group, while another group reading the same topic is fine. What do you check? (Answer: the lagging group's downstream dependency or processing logic. Same topic + different lags = consumer-side issue, not Kafka.)
Why is partitioning by random key dangerous for an event stream where order matters per user? (Answer: messages for the same user end up on different partitions; per-user ordering is lost. Key by user_id.)
Real world
Where this shows up
-LinkedIn (Kafka's birthplace) processes trillions of events per day across thousands of topics.
-Slack uses Kafka to fan out every message event to many internal consumers (search indexing, push notifications, analytics).
-Uber processes ride events through Kafka with strict per-rider ordering via partition keying.
-Netflix uses Kafka for the event bus underlying their playback telemetry, which feeds recommendations, observability, and billing.
Production notes
Keep these close
!Always bound queues. Unbounded in-memory queues are delayed OOMs.
!Default to at-least-once + consumer-side idempotency. Reach for exactly-once only when you genuinely cannot make consumers idempotent.
!Watch consumer lag as a first-class metric. Lag spikes precede every event-pipeline incident.
!Tune Kafka session timeouts and heartbeat intervals carefully — too aggressive triggers rebalance storms.
Common mistakes
What usually breaks
!Choosing exactly-once because it sounds safer; ignoring the operational cost and accepting the false sense of security.
!Single-partition topics for “simplicity”; they cap consumer parallelism at 1 and become bottlenecks.
!Ignoring partition keys; uniform random keys feel safe but break per-entity ordering.
Security risks
Threats to watch
!Topics with PII data and no ACLs let any service in the cluster read sensitive events. Default-deny + per-consumer-group ACLs.
!Producer credentials in env vars can be exfiltrated; rotate via secret managers, not by redeploy.
!Consumer offset tampering can replay or skip messages; secure the offset commit path.
!mTLS between Kafka clients and brokers is mandatory for any production deployment over an untrusted network.
Industry standard for high-throughput event streaming; partitioned, replayable, durable.
AWS Kinesis
Managed Kafka-equivalent on AWS; tighter integration with AWS services, less ecosystem flexibility.
Apache Pulsar
Tiered storage by default, multi-tenancy, geo-replication built in. Strong fit for multi-region from day one.
NATS JetStream
Lightweight alternative for lower-volume event streaming; simpler to operate.
Redis Streams
Best fit when you already run Redis and event volume is low; not a replacement for Kafka at scale.
Think like an engineer
Questions to answer before shipping
?Before designing topics, sketch the consumer graph. Each consumer group reads independently — what happens if one falls behind?
?Choose partition keys around the unit of ordering you actually need (per-user, per-order, per-tenant). The wrong key destroys ordering guarantees you needed.
?Treat the dead-letter queue depth as a SEV indicator. A growing DLQ means production data is silently being skipped.
Key terms
Vocabulary used in this module
Topic
A named, durable, append-only log in Kafka, split into partitions for parallelism.
Consumer group
A set of consumers that share partition assignments; each partition is read by exactly one consumer in the group.
Idempotency
Property where applying the same operation multiple times produces the same effect as applying it once.
Backpressure
Signal flowing upstream telling producers to slow down because consumers cannot keep up.
Consumer lag
How far behind the head of the log a consumer group is; the canonical Kafka health metric.
Labs
Hands-on labs
1
90 minutesIntermediate
Lab 3.1 — Kafka Event Pipeline End-to-End
Build a producer/consumer pipeline with proper key-based partitioning and consumer groups.
-Spin up Kafka via docker-compose
-Write a producer that emits orders keyed by user_id
-Write two consumer groups (billing, audit) reading the same topic