Module 3: Event-Driven & Asynchronous Systems
How Kafka, RabbitMQ, NATS, and pub/sub patterns let services decouple in time and scale — and the failure modes that come with them.
4 hours. 3 hands-on labs. Free course module.
Learning Objectives
- Choose between message queues, pub/sub, and event streaming for a given workload
- Reason about partitioning, ordering, and consumer groups in Kafka
- Implement backpressure correctly so producers do not melt consumers
- Design exactly-once semantics where you actually need them — and at-least-once where you do not
- Diagnose the canonical event-pipeline outages: lag spikes, rebalances, and stuck consumers
Why This Matters
Async event pipelines are how every modern company scales beyond the synchronous-RPC limits of microservices. The teams that get event streams right ship features 3x faster (independent producers and consumers, no tight coupling) and survive failures better (decoupled in time means downstream slow does not block upstream fast). The teams that get them wrong end up with stuck consumers, lost messages, exactly-once theatre, and data loss they only discover during a regulatory audit.
Lesson Content
Synchronous communication couples services in time. Asynchronous communication couples them only in contract. The producer publishes an event; the consumer reads it whenever it is ready, retries if it fails, and runs at a different rate than the producer. That decoupling is what lets event-driven systems scale to billions of events per day — and what creates the operational failure modes you have to learn to recognise.
Queue vs Pub/Sub vs Event Streaming — Pick One Deliberately
- Message queue (RabbitMQ, SQS): each message is delivered to one consumer. Used for work distribution: a queue of jobs, workers pull and process. Messages disappear after ack.
- Pub/sub (Redis Pub/Sub, NATS, Google Pub/Sub): each message is delivered to all subscribers. Used for fan-out notifications. Often ephemeral — missed messages are missed.
- Event streaming (Kafka, Kinesis, Pulsar): messages are appended to a durable log; consumers read at their own pace, can replay history, can have many independent groups. The dominant pattern for high-throughput data pipelines.
Kafka in Production
Kafka's mental model: a topic is a named, durable, append-only log split into partitions. Each partition is replicated across brokers (typically 3x). A producer writes records, optionally with a key; the key's hash determines the partition. A consumer group reads the topic; Kafka assigns each partition to one consumer in the group, so partition count caps consumer parallelism.
Three properties to internalise:
- Ordering is per-partition. Records with the same key land in the same partition and are read in order. Across partitions, ordering is undefined. Choose your key to align with the units you need ordered (e.g. user_id for per-user event ordering).
- Consumer groups are independent. Two consumer groups reading the same topic do not affect each other — each tracks its own offset. This is the foundation of event-driven architectures: the orders topic feeds a billing pipeline AND a search-indexer AND an audit log, all reading the same stream independently.
- Replication is for durability, not for read scale. Reads always go to the partition leader. Replicas exist so you can survive broker loss; they do not load-balance reads.
Delivery Guarantees — What Exactly-Once Actually Means
Three levels:
- At-most-once — fire and forget; on failure the message is lost. Acceptable for telemetry where loss is fine.
- At-least-once — the message will be delivered; possibly more than once. The default in most systems. Requires consumer-side idempotency (deduplication via idempotency keys).
- Exactly-once — the message is processed exactly once, end-to-end. Kafka's exactly-once semantics work between Kafka topics; once a message leaves Kafka and writes to an external system, you are back to “at-least-once + idempotency”.
Practical guidance: design for at-least-once with consumer-side idempotency as the default. Reach for exactly-once only when you genuinely cannot make consumers idempotent, and accept the operational cost.
Backpressure
Backpressure is the signal flowing upstream from a saturated consumer to a producer: “slow down, I cannot keep up”. Without it, producers happily fill queues until memory or disk runs out. Mechanisms:
- Bounded queues with blocking enqueue; the producer blocks when the queue is full.
- Reactive streams (Project Reactor, RxJava) with explicit demand signals.
- HTTP/2 flow control built into the protocol.
- Consumer-lag-driven producer throttling: if Kafka consumer lag exceeds a threshold, the producer service intentionally slows.
The opposite anti-pattern: an unbounded in-memory queue that quietly grows until the JVM OOMs. Always bound your queues.
Common Production Failures
- Consumer lag spike: the canonical Kafka alert. A consumer group falls behind the head of the log. Causes: consumer slowed down, partition imbalance, downstream dependency degraded. The metric to watch:
kafka_consumer_lag_maxper group. - Rebalance storm: every time a consumer joins or leaves the group, all partition assignments are recomputed and consumers pause. Frequent rebalances kill throughput. Causes: aggressive session timeouts, slow processing exceeding heartbeat, scaling churn. Mitigation: tune
session.timeout.ms,heartbeat.interval.ms, and use cooperative rebalance. - Hot partition: a partition key with skewed traffic (e.g. one big tenant) overloads one broker while others sit idle. Mitigation: salt the key, increase partition count, or use a different partitioning scheme.
- Stuck consumer: a consumer hangs on one bad message and stops making progress. Mitigation: per-message timeouts, dead-letter queue, observability on per-message processing time.
Backpressure Propagation
Self-Check Quiz
- You have a Kafka topic with 4 partitions and a consumer group of 6 consumers. What happens? (Answer: 4 consumers each own one partition; 2 are idle. Partition count caps consumer parallelism in a group.)
- Your team wants exactly-once delivery for a billing pipeline. What should you actually build? (Answer: at-least-once + consumer-side idempotency via idempotency keys. Kafka exactly-once works topic-to-topic but not topic-to-database.)
- Consumer lag is climbing for one group, while another group reading the same topic is fine. What do you check? (Answer: the lagging group's downstream dependency or processing logic. Same topic + different lags = consumer-side issue, not Kafka.)
- Why is partitioning by random key dangerous for an event stream where order matters per user? (Answer: messages for the same user end up on different partitions; per-user ordering is lost. Key by user_id.)
Real-World Use Cases
- LinkedIn (Kafka's birthplace) processes trillions of events per day across thousands of topics.
- Slack uses Kafka to fan out every message event to many internal consumers (search indexing, push notifications, analytics).
- Uber processes ride events through Kafka with strict per-rider ordering via partition keying.
- Netflix uses Kafka for the event bus underlying their playback telemetry, which feeds recommendations, observability, and billing.
Production Notes
- Always bound queues. Unbounded in-memory queues are delayed OOMs.
- Default to at-least-once + consumer-side idempotency. Reach for exactly-once only when you genuinely cannot make consumers idempotent.
- Watch consumer lag as a first-class metric. Lag spikes precede every event-pipeline incident.
- Tune Kafka session timeouts and heartbeat intervals carefully — too aggressive triggers rebalance storms.
Common Mistakes
- Choosing exactly-once because it sounds safer; ignoring the operational cost and accepting the false sense of security.
- Single-partition topics for “simplicity”; they cap consumer parallelism at 1 and become bottlenecks.
- Ignoring partition keys; uniform random keys feel safe but break per-entity ordering.
Security Risks to Watch
- Topics with PII data and no ACLs let any service in the cluster read sensitive events. Default-deny + per-consumer-group ACLs.
- Producer credentials in env vars can be exfiltrated; rotate via secret managers, not by redeploy.
- Consumer offset tampering can replay or skip messages; secure the offset commit path.
- mTLS between Kafka clients and brokers is mandatory for any production deployment over an untrusted network.
Design Tradeoffs
Kafka
Pros
- Durable replayable log
- Multiple independent consumer groups
- Massive throughput
- Mature ecosystem
Cons
- Operational complexity (ZooKeeper/KRaft, brokers)
- Heavy for small workloads
- Topic / partition design is permanent
RabbitMQ
Pros
- Flexible routing
- Lower operational footprint
- Simpler conceptually
Cons
- Lower max throughput
- No replay (messages disappear after ack)
- Per-queue scalability limits
NATS / Redis Streams
Pros
- Lightweight
- Easy to operate
- Low latency
Cons
- Less durable than Kafka
- Smaller ecosystem
- Not suited for huge volumes
Production Alternatives
- Apache Kafka: Industry standard for high-throughput event streaming; partitioned, replayable, durable.
- AWS Kinesis: Managed Kafka-equivalent on AWS; tighter integration with AWS services, less ecosystem flexibility.
- Apache Pulsar: Tiered storage by default, multi-tenancy, geo-replication built in. Strong fit for multi-region from day one.
- NATS JetStream: Lightweight alternative for lower-volume event streaming; simpler to operate.
- Redis Streams: Best fit when you already run Redis and event volume is low; not a replacement for Kafka at scale.
Think Like an Engineer
- Before designing topics, sketch the consumer graph. Each consumer group reads independently — what happens if one falls behind?
- Choose partition keys around the unit of ordering you actually need (per-user, per-order, per-tenant). The wrong key destroys ordering guarantees you needed.
- Treat the dead-letter queue depth as a SEV indicator. A growing DLQ means production data is silently being skipped.
Production Story
A logistics startup chose “exactly-once” for their order-tracking event stream because it sounded safer. Six months later a region-level Kafka outage exposed how brittle the exactly-once semantics were under partial failure: messages stuck in transactional limbo, consumer offsets out of sync with downstream state, and no one on the team understood the recovery flow well enough to act in under an hour. They eventually rewrote the consumer to be idempotent (UNIQUE constraint on order_id + version) and downgraded to at-least-once delivery. Recovery time went from hours to minutes.
Key Terms
- Topic
- A named, durable, append-only log in Kafka, split into partitions for parallelism.
- Consumer group
- A set of consumers that share partition assignments; each partition is read by exactly one consumer in the group.
- Idempotency
- Property where applying the same operation multiple times produces the same effect as applying it once.
- Backpressure
- Signal flowing upstream telling producers to slow down because consumers cannot keep up.
- Consumer lag
- How far behind the head of the log a consumer group is; the canonical Kafka health metric.
Hands-On Labs
-
Lab 3.1 — Kafka Event Pipeline End-to-End
Build a producer/consumer pipeline with proper key-based partitioning and consumer groups.
90 minutes - Intermediate
- Spin up Kafka via docker-compose
- Write a producer that emits orders keyed by user_id
- Write two consumer groups (billing, audit) reading the same topic
- Verify per-key ordering on each partition
- Kill a broker and verify durability
-
Lab 3.2 — Backpressure in a Reactive Pipeline
Reproduce a runaway producer; introduce bounded queues; observe stable throughput.
60 minutes - Intermediate
- Implement an unbounded in-memory queue between producer and consumer
- Run with producer at 10x consumer rate; watch memory grow
- Replace with a bounded queue
- Observe blocking on the producer; system stabilises
-
Lab 3.3 — Idempotent Consumer with Dedup
Design a consumer that processes at-least-once messages exactly once via idempotency keys.
60 minutes - Intermediate
- Use a Postgres table with unique constraint on idempotency_key
- Process orders; on duplicate, ignore
- Inject duplicate messages; verify only one effect per key
Key Takeaways
- Async decouples in time, not in contract — the contract still has to be designed carefully
- Pick queue / pub-sub / event-streaming based on whether you need work-distribution, fan-out, or replayable log
- Default to at-least-once + consumer-side idempotency; reach for exactly-once only with reason
- Always bound your queues; unbounded is a delayed OOM
- Watch consumer lag as a first-class metric; it is the early warning of every event-pipeline incident