Module 3 of 12

Event-Driven & Asynchronous Systems

How Kafka, RabbitMQ, NATS, and pub/sub patterns let services decouple in time and scale - and the failure modes that come with them.

4 hours3 labsFree

Watch as Slides Course overview Lab code

Start here

Learning objectives

Choose between message queues, pub/sub, and event streaming for a given workload
Reason about partitioning, ordering, and consumer groups in Kafka
Implement backpressure correctly so producers do not melt consumers
Design exactly-once semantics where you actually need them - and at-least-once where you do not
Diagnose the canonical event-pipeline outages: lag spikes, rebalances, and stuck consumers

Before

Synchronous RPC between services; one slow service blocks the chain
Tight coupling: producer changes require consumer changes
No replay capability; lost messages are lost forever
Manual exactly-once theatre with brittle distributed transactions

After

Decoupled in time via durable event log; producers and consumers scale independently
New consumer groups added without producer changes
Replay history for backfills, debugging, new use cases
At-least-once + idempotency at the consumer; pragmatic, robust, simple

Synchronous communication couples services in time. Asynchronous communication couples them only in contract. The producer publishes an event; the consumer reads it whenever it is ready, retries if it fails, and runs at a different rate than the producer. That decoupling is what lets event-driven systems scale to billions of events per day - and what creates the operational failure modes you have to learn to recognise.

Queue vs Pub/Sub vs Event Streaming - Pick One Deliberately

Message queue (RabbitMQ, SQS): each message is delivered to one consumer. Used for work distribution: a queue of jobs, workers pull and process. Messages disappear after ack.
Pub/sub (Redis Pub/Sub, NATS, Google Pub/Sub): each message is delivered to all subscribers. Used for fan-out notifications. Often ephemeral - missed messages are missed.
Event streaming (Kafka, Kinesis, Pulsar): messages are appended to a durable log; consumers read at their own pace, can replay history, can have many independent groups. The dominant pattern for high-throughput data pipelines.

Kafka in Production

Kafka's mental model: a topic is a named, durable, append-only log split into partitions. Each partition is replicated across brokers (typically 3x). A producer writes records, optionally with a key; the key's hash determines the partition. A consumer group reads the topic; Kafka assigns each partition to one consumer in the group, so partition count caps consumer parallelism.

Three properties to internalise:

Ordering is per-partition. Records with the same key land in the same partition and are read in order. Across partitions, ordering is undefined. Choose your key to align with the units you need ordered (e.g. user_id for per-user event ordering).
Consumer groups are independent. Two consumer groups reading the same topic do not affect each other - each tracks its own offset. This is the foundation of event-driven architectures: the orders topic feeds a billing pipeline AND a search-indexer AND an audit log, all reading the same stream independently.
Replication is for durability, not for read scale. Reads always go to the partition leader. Replicas exist so you can survive broker loss; they do not load-balance reads.

Delivery Guarantees - What Exactly-Once Actually Means

Three levels:

At-most-once - fire and forget; on failure the message is lost. Acceptable for telemetry where loss is fine.
At-least-once - the message will be delivered; possibly more than once. The default in most systems. Requires consumer-side idempotency (deduplication via idempotency keys).
Exactly-once - the message is processed exactly once, end-to-end. Kafka's exactly-once semantics work between Kafka topics; once a message leaves Kafka and writes to an external system, you are back to “at-least-once + idempotency”.

Practical guidance: design for at-least-once with consumer-side idempotency as the default. Reach for exactly-once only when you genuinely cannot make consumers idempotent, and accept the operational cost.

Backpressure

Backpressure is the signal flowing upstream from a saturated consumer to a producer: “slow down, I cannot keep up”. Without it, producers happily fill queues until memory or disk runs out. Mechanisms:

Bounded queues with blocking enqueue; the producer blocks when the queue is full.
Reactive streams (Project Reactor, RxJava) with explicit demand signals.
HTTP/2 flow control built into the protocol.
Consumer-lag-driven producer throttling: if Kafka consumer lag exceeds a threshold, the producer service intentionally slows.

The opposite anti-pattern: an unbounded in-memory queue that quietly grows until the JVM OOMs. Always bound your queues.

Common Production Failures

Consumer lag spike: the canonical Kafka alert. A consumer group falls behind the head of the log. Causes: consumer slowed down, partition imbalance, downstream dependency degraded. The metric to watch: kafka_consumer_lag_max per group.
Rebalance storm: every time a consumer joins or leaves the group, all partition assignments are recomputed and consumers pause. Frequent rebalances kill throughput. Causes: aggressive session timeouts, slow processing exceeding heartbeat, scaling churn. Mitigation: tune session.timeout.ms, heartbeat.interval.ms, and use cooperative rebalance.
Hot partition: a partition key with skewed traffic (e.g. one big tenant) overloads one broker while others sit idle. Mitigation: salt the key, increase partition count, or use a different partitioning scheme.
Stuck consumer: a consumer hangs on one bad message and stops making progress. Mitigation: per-message timeouts, dead-letter queue, observability on per-message processing time.

Backpressure Propagation

Self-Check Quiz

You have a Kafka topic with 4 partitions and a consumer group of 6 consumers. What happens? (Answer: 4 consumers each own one partition; 2 are idle. Partition count caps consumer parallelism in a group.)
Your team wants exactly-once delivery for a billing pipeline. What should you actually build? (Answer: at-least-once + consumer-side idempotency via idempotency keys. Kafka exactly-once works topic-to-topic but not topic-to-database.)
Consumer lag is climbing for one group, while another group reading the same topic is fine. What do you check? (Answer: the lagging group's downstream dependency or processing logic. Same topic + different lags = consumer-side issue, not Kafka.)
Why is partitioning by random key dangerous for an event stream where order matters per user? (Answer: messages for the same user end up on different partitions; per-user ordering is lost. Key by user_id.)

Real world

Where this shows up

LinkedIn (Kafka's birthplace) processes trillions of events per day across thousands of topics.
Slack uses Kafka to fan out every message event to many internal consumers (search indexing, push notifications, analytics).
Uber processes ride events through Kafka with strict per-rider ordering via partition keying.
Netflix uses Kafka for the event bus underlying their playback telemetry, which feeds recommendations, observability, and billing.

Production notes

Keep these close

Always bound queues. Unbounded in-memory queues are delayed OOMs.
Default to at-least-once + consumer-side idempotency. Reach for exactly-once only when you genuinely cannot make consumers idempotent.
Watch consumer lag as a first-class metric. Lag spikes precede every event-pipeline incident.
Tune Kafka session timeouts and heartbeat intervals carefully - too aggressive triggers rebalance storms.

Common mistakes

What usually breaks

Choosing exactly-once because it sounds safer; ignoring the operational cost and accepting the false sense of security.
Single-partition topics for “simplicity”; they cap consumer parallelism at 1 and become bottlenecks.
Ignoring partition keys; uniform random keys feel safe but break per-entity ordering.

Security risks

Threats to watch

Topics with PII data and no ACLs let any service in the cluster read sensitive events. Default-deny + per-consumer-group ACLs.
Producer credentials in env vars can be exfiltrated; rotate via secret managers, not by redeploy.
Consumer offset tampering can replay or skip messages; secure the offset commit path.
mTLS between Kafka clients and brokers is mandatory for any production deployment over an untrusted network.

Tradeoffs

Design choices you should be able to defend

Kafka

Pros

Durable replayable log
Multiple independent consumer groups
Massive throughput
Mature ecosystem

Cons

Operational complexity (ZooKeeper/KRaft, brokers)
Heavy for small workloads
Topic / partition design is permanent

RabbitMQ

Pros

Flexible routing
Lower operational footprint
Simpler conceptually

Cons

Lower max throughput
No replay (messages disappear after ack)
Per-queue scalability limits

NATS / Redis Streams

Pros

Lightweight
Easy to operate
Low latency

Cons

Less durable than Kafka
Smaller ecosystem
Not suited for huge volumes

Alternatives

Other production approaches

Apache Kafka

Industry standard for high-throughput event streaming; partitioned, replayable, durable.

AWS Kinesis

Managed Kafka-equivalent on AWS; tighter integration with AWS services, less ecosystem flexibility.

Apache Pulsar

Tiered storage by default, multi-tenancy, geo-replication built in. Strong fit for multi-region from day one.

NATS JetStream

Lightweight alternative for lower-volume event streaming; simpler to operate.

Redis Streams

Best fit when you already run Redis and event volume is low; not a replacement for Kafka at scale.

Think like an engineer

Questions to answer before shipping

Before designing topics, sketch the consumer graph. Each consumer group reads independently - what happens if one falls behind?
Choose partition keys around the unit of ordering you actually need (per-user, per-order, per-tenant). The wrong key destroys ordering guarantees you needed.
Treat the dead-letter queue depth as a SEV indicator. A growing DLQ means production data is silently being skipped.

Key terms

Vocabulary used in this module

Topic

A named, durable, append-only log in Kafka, split into partitions for parallelism.

Consumer group

A set of consumers that share partition assignments; each partition is read by exactly one consumer in the group.

Idempotency

Property where applying the same operation multiple times produces the same effect as applying it once.

Backpressure

Signal flowing upstream telling producers to slow down because consumers cannot keep up.

Consumer lag

How far behind the head of the log a consumer group is; the canonical Kafka health metric.

Labs

Hands-on labs

90 minutesIntermediate

Lab 3.1 - Kafka Event Pipeline End-to-End

Build a producer/consumer pipeline with proper key-based partitioning and consumer groups.

Spin up Kafka via docker-compose
Write a producer that emits orders keyed by user_id
Write two consumer groups (billing, audit) reading the same topic
Verify per-key ordering on each partition
Kill a broker and verify durability

View lab on GitHub

60 minutesIntermediate

Lab 3.2 - Backpressure in a Reactive Pipeline

Reproduce a runaway producer; introduce bounded queues; observe stable throughput.

Implement an unbounded in-memory queue between producer and consumer
Run with producer at 10x consumer rate; watch memory grow
Replace with a bounded queue
Observe blocking on the producer; system stabilises

View lab on GitHub

60 minutesIntermediate

Lab 3.3 - Idempotent Consumer with Dedup

Design a consumer that processes at-least-once messages exactly once via idempotency keys.

Use a Postgres table with unique constraint on idempotency_key
Process orders; on duplicate, ignore
Inject duplicate messages; verify only one effect per key

View lab on GitHub

Recap

Key takeaways

Async decouples in time, not in contract - the contract still has to be designed carefully
Pick queue / pub-sub / event-streaming based on whether you need work-distribution, fan-out, or replayable log
Default to at-least-once + consumer-side idempotency; reach for exactly-once only with reason
Always bound your queues; unbounded is a delayed OOM
Watch consumer lag as a first-class metric; it is the early warning of every event-pipeline incident

Related resources

Event-Driven & Asynchronous Systems

Learning objectives

Queue vs Pub/Sub vs Event Streaming - Pick One Deliberately

Kafka in Production

Delivery Guarantees - What Exactly-Once Actually Means

Backpressure

Common Production Failures

Backpressure Propagation

Self-Check Quiz

Where this shows up

Keep these close

What usually breaks

Threats to watch

Design choices you should be able to defend

Kafka

RabbitMQ

NATS / Redis Streams

Other production approaches

Apache Kafka

AWS Kinesis

Apache Pulsar

NATS JetStream

Redis Streams

Questions to answer before shipping

Vocabulary used in this module

Topic

Consumer group

Idempotency

Backpressure

Consumer lag

Hands-on labs

Lab 3.1 - Kafka Event Pipeline End-to-End

Lab 3.2 - Backpressure in a Reactive Pipeline

Lab 3.3 - Idempotent Consumer with Dedup

Key takeaways

Keep learning across CodersSecret

Related guides

Cheatsheets

Interactive labs

Glossary terms