System design interviews and architecture decisions share the same core knowledge: understanding how each layer of a modern web system works, when to introduce it, and what breaks when you get it wrong. This guide walks through every layer from the user’s browser to the database, and ties it all together with a complete URL shortener design.

The Framework: How to Think About System Design

Whether you are designing a system at work or in an interview, follow this sequence:

  1. Requirements: What exactly does the system need to do? (Functional + non-functional)
  2. Estimation: How much traffic, storage, and bandwidth? (Back-of-envelope math)
  3. High-Level Design: Draw the architecture with all major components
  4. Deep Dive: Zoom into the most critical or complex components
  5. Bottlenecks: What breaks at 10x scale? How do you fix it?

Now let us walk through each layer of the architecture.

Layer 1: DNS and CDN

Every request starts with a DNS lookup. The user types app.example.com and the browser resolves it to an IP address. In a production system, this is where your first optimization happens.

  • DNS-based load balancing: Managed DNS services such as Route 53 (AWS) or Cloud DNS (GCP) can return different IPs based on geography, health checks, or weighted distribution
  • CDN (Content Delivery Network): Static assets (JS, CSS, images) are cached at edge servers worldwide. CloudFront, Cloudflare, or Fastly serve files from the nearest edge, which for a distant user can mean a response in roughly 20ms instead of 200ms
# Typical CDN setup for a web app:
# Origin: your-server.example.com (one region)
# CDN: cdn.example.com (200+ edge locations)

# HTML references the CDN:
# <script src="https://cdn.example.com/app.js"></script>
# <link href="https://cdn.example.com/styles.css">

# First request: CDN fetches from origin, caches at edge
# Subsequent requests: served from edge cache (sub-20ms)
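
The miss-then-hit behavior described above can be sketched as a toy model of a single edge location. This is only an illustration: the `EdgeCache` class, its TTL default, and the `fetch_from_origin` callable are assumptions made for this example, not the API of any real CDN.

```python
import time


class EdgeCache:
    """Toy model of one CDN edge location (hypothetical, not a real CDN API)."""

    def __init__(self, fetch_from_origin, ttl_seconds=3600.0, clock=time.monotonic):
        self.fetch_from_origin = fetch_from_origin  # callable: path -> bytes
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store = {}          # path -> (expires_at, body)
        self.origin_fetches = 0   # how many times we had to go to the origin

    def get(self, path):
        now = self.clock()
        entry = self._store.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]  # edge hit: no trip to the origin
        # Miss or expired entry: fetch from the origin and cache at the edge
        body = self.fetch_from_origin(path)
        self.origin_fetches += 1
        self._store[path] = (now + self.ttl_seconds, body)
        return body
```

Calling `get("/app.js")` twice on the same instance triggers one origin fetch; the second call is answered from the edge until the TTL expires.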

Layer 2: Load Balancer

A load balancer distributes incoming requests across multiple application servers. It is the single most important component for horizontal scaling.

L4 vs L7 Load Balancing

  • L4 (Transport): Routes based on IP/port. Fast, but cannot inspect HTTP headers or URLs. Used for TCP/UDP traffic.
  • L7 (Application): Routes based on HTTP headers, URL paths, cookies. Can do SSL termination, request modification, and content-based routing.

Algorithms

  • Round Robin: Each request goes to the next server in sequence. Simple, works when servers are identical.
  • Least Connections: Routes to the server with the fewest active connections. Better for varying request durations.
  • Consistent Hashing: Routes based on a hash of the request key (user ID, session). Ensures the same user hits the same server — critical for WebSocket or cache-dependent workloads.
# nginx load balancer configuration
upstream backend {
    least_conn;

    server app1.internal:8080 weight=3;
    server app2.internal:8080 weight=3;
    server app3.internal:8080 weight=1;  # smaller instance

    # Backup server: receives traffic only when the servers above are unavailable
    server app4.internal:8080 backup;
}

server {
    listen 443 ssl;
    server_name api.example.com;

    # SSL termination happens here
    ssl_certificate     /etc/ssl/cert.pem;
    ssl_certificate_key /etc/ssl/key.pem;

    location / {
        proxy_pass http://backend;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
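
To make the three algorithms listed in this layer concrete, here is a minimal Python sketch of each. The server names and the number of ring points per server are assumptions chosen for illustration; real load balancers implement these with more care (weights, health state, connection tracking).

```python
import bisect
import hashlib
import itertools

SERVERS = ["app1", "app2", "app3"]  # hypothetical server names

# Round robin: cycle through the servers in order.
_rr = itertools.cycle(SERVERS)


def round_robin():
    return next(_rr)


# Least connections: pick the server with the fewest active connections.
def least_connections(active):
    """`active` maps server name -> current number of open connections."""
    return min(active, key=active.get)


# Consistent hashing: place each server at several points on a hash ring;
# a key is routed to the first server point at or after the key's hash.
def _hash(value):
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, servers, points_per_server=100):
        self._ring = sorted(
            (_hash(f"{s}#{i}"), s) for s in servers for i in range(points_per_server)
        )
        self._hashes = [h for h, _ in self._ring]

    def route(self, key):
        # Wrap around to the start of the ring when the key hashes past the end
        idx = bisect.bisect_left(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property that makes the ring attractive is that removing a server only remaps the keys that were routed to it; keys on the remaining servers stay where they were.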

Layer 3: API Gateway

An API gateway sits between clients and your microservices, handling cross-cutting concerns so your services do not have to.

  • Rate Limiting: Enforces a quota, for example 100 requests/minute per API key. Prevents abuse and protects downstream services.
  • Authentication: Validates JWT tokens, API keys, or OAuth tokens once at the gateway.
  • Request Routing: /api/users/* → User Service, /api/orders/* → Order Service
  • API Versioning: Route /v1/ and /v2/ to different service versions without client changes.
  • Response Caching: Cache GET responses for frequently accessed, rarely changing data.
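
As an illustration of the rate limiting bullet above, the following is a minimal in-memory fixed-window limiter. The class name and parameters are hypothetical; a real gateway would usually keep the counters in a shared store so that all gateway instances see the same counts, and would expire old windows.

```python
import time
from collections import defaultdict


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per API key in each fixed time window."""

    def __init__(self, limit=100, window_seconds=60, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self._counts = defaultdict(int)  # (api_key, window index) -> requests seen

    def allow(self, api_key):
        window = int(self.clock() // self.window_seconds)
        bucket = (api_key, window)
        if self._counts[bucket] >= self.limit:
            return False  # a gateway would typically answer HTTP 429 here
        self._counts[bucket] += 1
        return True
```

A fixed window is the simplest variant; it can let through up to twice the limit across a window boundary, which is why sliding-window or token-bucket schemes are common alternatives.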

Popular options: Kong, AWS API Gateway, Envoy Proxy (covered in our Envoy blog), or nginx with OpenResty.

Layer 4: Application Servers

Your actual business logic runs here. The key design principle: stateless servers.

  • Stateless: No server stores user sessions, uploaded files, or cache locally. Everything goes to external stores (Redis, S3, database). This means any server can handle any request — enabling horizontal scaling.
  • Horizontal Scaling: Add more servers behind the load balancer. Kubernetes makes this automatic with Horizontal Pod Autoscaler (HPA).
# Kubernetes HPA: auto-scale based on CPU usage
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
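
The scaling decision behind this manifest follows the formula in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), bounded by minReplicas and maxReplicas. The sketch below applies that formula with the values from the manifest as defaults; it deliberately omits the tolerance band and stabilization windows that the real controller also applies.

```python
import math


def desired_replicas(current_replicas, current_cpu_percent,
                     target_cpu_percent=70, min_replicas=3, max_replicas=20):
    """Simplified HPA calculation (no tolerance or stabilization window)."""
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    # Clamp to the bounds configured in the HPA spec
    return max(min_replicas, min(max_replicas, raw))
```

For example, 4 pods averaging 90% CPU against a 70% target gives ceil(4 × 90 / 70) = 6 pods, while very low utilization never drops the deployment below 3.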

Layer 5: Caching Layer

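Caching is often the most effective way to improve performance and reduce database load. An in-memory cache answers simple key lookups far more cheaply than a database query, so a well-designed cache can absorb most read traffic; figures such as 100x the read throughput of the database are plausible, but the actual gain depends on hit rate and workload.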

Cache Patterns

  • Cache-Aside (Lazy Loading): Application checks cache first. On miss, reads from database and populates cache. Most common pattern.
  • Write-Through: Every database write also writes to cache. Cache is always up to date, but adds write latency.
  • Write-Behind: Writes go to cache immediately, then asynchronously flushed to database. Fastest writes, but risk of data loss.
# Cache-aside pattern with Redis (Python)
# Note: `db` stands for your database client; it is a placeholder, not defined here.
import redis
import json

cache = redis.Redis(host='cache.internal', port=6379)

def get_user(user_id: str) -> dict:
    # Step 1: Check cache (compare with None so empty values still count as hits)
    cached = cache.get(f"user:{user_id}")
    if cached is not None:
        return json.loads(cached)

    # Step 2: Cache miss - read from database
    user = db.query("SELECT * FROM users WHERE id = %s", [user_id])

    # Step 3: Populate cache with TTL
    cache.setex(f"user:{user_id}", 3600, json.dumps(user))  # 1 hour TTL

    return user

def update_user(user_id: str, data: dict):
    # Update database first
    db.execute("UPDATE users SET ... WHERE id = %s", [user_id])

    # Invalidate the cache entry instead of updating it: the next read
    # repopulates it, which avoids most races between concurrent writers
    cache.delete(f"user:{user_id}")
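
The other two patterns from the list, write-through and write-behind, can be sketched in the same spirit. To keep the example runnable on its own, plain dictionaries stand in for the database and for Redis, and all names here are assumptions made for illustration.

```python
from collections import deque

database = {}      # stand-in for the real database
cache = {}         # stand-in for Redis
pending = deque()  # write-behind: writes accepted but not yet in the database


def write_through(key, value):
    """Write to the database and the cache in the same operation."""
    database[key] = value  # the caller waits for both writes
    cache[key] = value


def write_behind(key, value):
    """Write to the cache now and queue the database write for later."""
    cache[key] = value
    pending.append((key, value))  # lost if the process dies before a flush


def flush_pending():
    """Background step: apply queued writes to the database in order."""
    while pending:
        key, value = pending.popleft()
        database[key] = value
```

The `pending` queue is exactly where the data-loss risk of write-behind lives: anything still in it when the cache node fails never reaches the database.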

Layer 6: Message Queue

Message queues decouple services and enable asynchronous processing. Instead of Service A directly calling Service B, Service A publishes a message that Service B consumes when ready.

  • Decoupling: Services evolve independently. The order service does not need to know about the email service.
  • Load Leveling: Absorb traffic spikes. If 10,000 orders arrive in one second, the queue holds them while workers process at a sustainable rate.
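  • Reliability: If a consumer crashes, unprocessed messages remain in the queue and are delivered again, so work is not lost as long as the consumer acknowledges a message (or commits its offset) only after processing succeeds.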
# Producer: Order service publishes event
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='kafka.internal:9092',
    value_serializer=lambda v: json.dumps(v).encode()
)

def create_order(order_data):
    # Save to database
    order = db.save_order(order_data)

    # Publish event (async - does not block the response)
    producer.send('order-events', {
        'event': 'order_created',
        'order_id': order.id,
        'customer_id': order.customer_id,
        'total': order.total
    })

    return order

# Consumer: Email service listens for events
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'order-events',
    bootstrap_servers='kafka.internal:9092',
    group_id='email-service'
)

for message in consumer:
    event = json.loads(message.value)
    if event['event'] == 'order_created':
        send_order_confirmation_email(event['order_id'])
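
The load-leveling claim from the list above can be checked with a small simulation. This is a sketch under simplifying assumptions (whole-second ticks, a fixed worker rate, an in-memory deque instead of a broker), and the function name is hypothetical.

```python
from collections import deque


def simulate_load_leveling(arrivals_per_second, worker_rate):
    """Return the queue depth at the end of each second until the queue drains."""
    if worker_rate <= 0:
        raise ValueError("worker_rate must be positive")
    queue = deque()
    depths = []
    second = 0
    while second < len(arrivals_per_second) or queue:
        # Producers enqueue this second's arrivals (if any are left)
        if second < len(arrivals_per_second):
            queue.extend(range(arrivals_per_second[second]))
        # Workers dequeue at a fixed, sustainable rate
        for _ in range(min(worker_rate, len(queue))):
            queue.popleft()
        depths.append(len(queue))
        second += 1
    return depths
```

With a burst of 10,000 orders in one second and workers handling 2,000 per second, `simulate_load_leveling([10000], 2000)` returns `[8000, 6000, 4000, 2000, 0]`: the spike is absorbed and drained over five seconds rather than hitting the downstream service all at once.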

Layer 7: Database

SQL vs NoSQL Decision

  • SQL (PostgreSQL, MySQL): When you need ACID transactions, complex joins, and strong consistency. Default choice for most applications.
  • NoSQL (MongoDB, DynamoDB, Cassandra): When you need flexible schemas, horizontal write scaling, or specific access patterns (key-value, document, wide-column).

Scaling Reads: Replicas

# Primary handles writes, replicas handle reads
# Application-level read/write splitting:

def get_user(user_id):
    return read_replica.query("SELECT * FROM users WHERE id = %s", [user_id])

def update_user(user_id, data):
    return primary.execute("UPDATE users SET ... WHERE id = %s", [user_id])
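
One caveat with this split is that replicas usually lag slightly behind the primary, so a user who has just written may not see their own change on a replica. A common mitigation is to send that user's reads to the primary for a short period after a write. The sketch below shows the routing decision only; the class, the one-second window, and the string targets are assumptions for illustration.

```python
import itertools
import time


class ReadWriteRouter:
    """Pick a database target, pinning recent writers to the primary."""

    def __init__(self, replicas, lag_window_seconds=1.0, clock=time.monotonic):
        self._replicas = itertools.cycle(replicas)  # round robin over replicas
        self.lag_window_seconds = lag_window_seconds
        self.clock = clock
        self._last_write = {}  # user_id -> time of that user's latest write

    def target_for_write(self, user_id):
        self._last_write[user_id] = self.clock()
        return "primary"

    def target_for_read(self, user_id):
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.lag_window_seconds:
            return "primary"  # a replica may not have this user's write yet
        return next(self._replicas)
```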

Scaling Writes: Sharding

When a single primary cannot handle write volume, split data across multiple databases:

  • Hash Sharding: shard = hash(user_id) % num_shards. Even distribution, but hard to add shards.
  • Range Sharding: Users A-M on shard 1, N-Z on shard 2. Simple, but can create hotspots.
  • Directory Sharding: A lookup service maps each key to a shard. Most flexible, but the directory becomes a single point of failure unless it is itself replicated.
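
Why hash sharding is "hard to add shards" becomes clear when you count how many keys change shard as the shard count grows. The sketch below uses `hashlib` rather than Python's built-in `hash()`, because the built-in is randomized per process for strings; the key format and shard counts are made up for the example.

```python
import hashlib


def shard_for(key, num_shards):
    """Stable hash sharding: the same key always maps to the same shard."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fraction_moved(keys, old_shards, new_shards):
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


keys = [f"user-{i}" for i in range(10_000)]
moved = fraction_moved(keys, 4, 5)  # going from 4 shards to 5
```

Going from 4 to 5 shards moves roughly 80% of the keys, since a key stays put only when its hash gives the same remainder modulo 4 and modulo 5. That data migration is the cost that consistent hashing or a directory is meant to avoid.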

Putting It All Together: Design a URL Shortener

Requirements

  • Shorten a long URL to a short code (e.g., example.com/abc123)
  • Redirect short URLs to original URLs
  • Track click analytics
  • 100 million URLs created per month, 10:1 read-to-write ratio

Back-of-Envelope Estimation

# Writes: 100M / month
#   = ~3.3M / day
#   = ~38 / second (average)
#   = ~190 / second (5x peak)

# Reads: 1B / month (10:1 ratio)
#   = ~33M / day
#   = ~380 / second (average)
#   = ~1,900 / second (5x peak)

# Storage (5 years):
#   100M * 12 months * 5 years = 6 billion URLs
#   Each URL: ~500 bytes (short code + original URL + metadata)
#   Total: 6B * 500B = 3 TB

# Bandwidth:
#   Reads: 380 req/s * 500B = 190 KB/s (trivial)

# Short code length:
#   Base62 (a-z, A-Z, 0-9): 62^7 = 3.5 trillion combinations
#   7 characters is sufficient for 6 billion URLs
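
The same estimate can be reproduced in a few lines of Python, which makes it easy to change an assumption and see the effect. The 30-day month is an assumption of this sketch; the other inputs are the figures stated above.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption for back-of-envelope math

writes_per_month = 100_000_000
read_write_ratio = 10
peak_factor = 5
bytes_per_url = 500
years = 5

writes_per_second = writes_per_month / DAYS_PER_MONTH / SECONDS_PER_DAY
reads_per_second = writes_per_second * read_write_ratio
peak_reads_per_second = reads_per_second * peak_factor

total_urls = writes_per_month * 12 * years
storage_tb = total_urls * bytes_per_url / 1e12
read_bandwidth_kb_per_s = reads_per_second * bytes_per_url / 1e3

code_space = 62 ** 7  # number of distinct 7-character base62 codes

print(f"writes/s: {writes_per_second:.1f}, reads/s: {reads_per_second:.1f}")
print(f"peak reads/s: {peak_reads_per_second:.0f}")
print(f"URLs after {years} years: {total_urls:,}, storage: {storage_tb:.1f} TB")
print(f"read bandwidth: {read_bandwidth_kb_per_s:.0f} KB/s")
print(f"7-char code space: {code_space:,}")
```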

Architecture

[Diagram: URL Shortener Architecture]
Client (Browser / API) -> CDN (cache redirects) -> Load Balancer (L7 / nginx) -> App Server (stateless API) -> Redis Cache (URL lookups) and PostgreSQL (URL storage)

Write Path (Create Short URL)

# 1. Client sends POST /api/shorten with the long URL
# 2. App server generates a unique 7-character base62 code
# 3. Store in database: {code, original_url, created_at, user_id}
# 4. Store in Redis cache: code -> original_url
# 5. Return short URL to client

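import hashlib
import string

BASE62 = string.digits + string.ascii_letters  # 62 symbols: 0-9, a-z, A-Z

def generate_short_code(url: str, counter: int) -> str:
    """Generate a 7-char base62 code from URL + counter (retry on collision)."""
    digest = hashlib.md5(f"{url}{counter}".encode()).digest()
    num = int.from_bytes(digest, "big") % 62**7  # fold the hash into 62^7 values
    chars = []
    for _ in range(7):  # always emit exactly 7 symbols
        num, rem = divmod(num, 62)
        chars.append(BASE62[rem])
    return "".join(chars)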
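
An alternative to hashing, shown here only as a sketch and not as this design's chosen method, is to base62-encode a unique sequential ID. That removes collisions entirely, at the cost of making codes guessable. In the example, an in-process counter stands in for a database sequence and dictionaries stand in for PostgreSQL and Redis; all names are hypothetical.

```python
import itertools
import string

BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase
_next_id = itertools.count(1)  # stand-in for a database sequence
database = {}                  # code -> original URL (stand-in for PostgreSQL)
cache = {}                     # code -> original URL (stand-in for Redis)


def encode_base62(number, width=7):
    """Encode a non-negative integer in base62, left-padded to `width` symbols."""
    chars = []
    while number > 0:
        number, rem = divmod(number, 62)
        chars.append(BASE62[rem])
    return "".join(reversed(chars)).rjust(width, BASE62[0])


def shorten(original_url):
    """Write path steps 2-4: generate a code, store it, warm the cache."""
    code = encode_base62(next(_next_id))
    database[code] = original_url
    cache[code] = original_url
    return code
```

Because every ID is unique, two requests for the same long URL get two different codes; deduplicating them would need an extra lookup by URL.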

Read Path (Redirect)

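# 1. Client requests GET /abc123
# 2. Check CDN cache (hit? -> 301 redirect immediately)
# 3. Check Redis cache (hit? -> 301 redirect)
# 4. Check database (hit? -> populate caches, 301 redirect)
# 5. Not found -> 404
#
# Tradeoff: a 301 is cacheable by browsers and the CDN, so repeat clicks may
# never reach the app server and would not be counted. If click analytics must
# be complete, return a 302 (or a 301 with a short cache lifetime) instead.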

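# Async: publish click event to Kafka for analytics
# (`producer` is the KafkaProducer from the message queue layer; `code` is the
# short code taken from the request path; `request` is the web framework's request object)
import time

producer.send('click-events', {
    'code': code,
    'timestamp': time.time(),
    'ip': request.remote_addr,
    'user_agent': request.headers.get('User-Agent', '')
})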
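
The lookup order in steps 3 to 5 can be sketched as a single function. Dictionaries stand in for Redis and the database so the example runs on its own; the function name and the `redirect_status` parameter are assumptions for illustration, and the CDN step is left out because it happens before the request reaches the app server.

```python
def resolve(code, cache, database, redirect_status=301):
    """Return (status, location) for a short code: cache first, then database."""
    url = cache.get(code)
    if url is None:
        url = database.get(code)
        if url is None:
            return 404, None   # step 5: unknown code
        cache[code] = url      # step 4: populate the cache on a database hit
    return redirect_status, url


# Usage with in-memory stand-ins
db = {"abc123": "https://example.com/some/long/path"}
redis_like = {}
first = resolve("abc123", redis_like, db)    # database hit, cache populated
second = resolve("abc123", redis_like, db)   # served from the cache
missing = resolve("nope", redis_like, db)
```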

Common Interview Mistakes

  • Jumping to a solution without understanding requirements: Always spend the first 3-5 minutes asking clarifying questions.
  • Over-engineering from the start: Begin with a simple, working design. Add complexity (sharding, queues, caching) only when the numbers justify it.
  • Ignoring estimation: If your system handles 100 requests/second, you do not need Kafka and 20 microservices. A single PostgreSQL instance on reasonable hardware can often serve 10,000+ QPS of simple indexed reads.
  • Forgetting failure modes: What happens if Redis goes down? What if the database primary fails? Always discuss fallback strategies.
  • Not discussing tradeoffs: Every decision has a tradeoff. Cache invalidation vs. stale data. Strong consistency vs. availability. Acknowledge them.

Quick Reference: Component Cheat Sheet

Component      | When to Use                      | Popular Tools
CDN            | Static assets, global users      | CloudFront, Cloudflare, Fastly
Load Balancer  | Multiple app servers             | nginx, ALB, HAProxy
API Gateway    | Microservices, rate limiting     | Kong, Envoy, AWS API Gateway
Cache          | Read-heavy, repeated queries     | Redis, Memcached
Message Queue  | Async processing, decoupling     | Kafka, RabbitMQ, SQS
SQL Database   | Transactions, joins, consistency | PostgreSQL, MySQL
NoSQL Database | Flexible schema, massive scale   | MongoDB, DynamoDB, Cassandra
Object Storage | Files, images, backups           | S3, GCS, Azure Blob
Search Engine  | Full-text search, facets         | Elasticsearch, Meilisearch

Key Takeaways

  • Start simple, scale when needed — a single server with PostgreSQL handles more traffic than most people think
  • Stateless servers enable horizontal scaling — store all state in external systems (Redis, database, S3)
  • Caching is your biggest performance lever — a Redis cache in front of your database can handle 100x the read throughput
  • Message queues decouple and absorb spikes — essential for reliable async processing
  • Always do back-of-envelope math — it prevents both over-engineering and under-provisioning
  • Every component adds complexity — only add a layer when you have a concrete problem it solves
  • Discuss tradeoffs, not just solutions — strong consistency vs. availability, cost vs. performance, simplicity vs. scalability

System design is not about memorizing architectures — it is about understanding the building blocks, knowing their tradeoffs, and assembling them to meet specific requirements. Master the layers in this guide and you can design (or discuss) any system with confidence.