Rate limiting is one of the few infrastructure controls that simultaneously protects you from cost overruns, abuse, cascading failure, and noisy neighbours. Get it right and a single misbehaving client's burst is absorbed before it touches your application. Get it wrong and you either DDoS yourself with retries or open the door to credential stuffing, scraping, and inference-cost bombs.

This guide is a production-oriented walkthrough of the rate-limiting algorithms used in real API gateways, edge proxies, and service meshes: what each algorithm gets right, where it falls down, and how distributed systems implement these controls without becoming a coordination bottleneck themselves. Examples are grounded in Redis, Envoy, NGINX, Kubernetes ingress, and the patterns Cloudflare and Fastly publish about their edge networks.

Why Rate Limiting Matters in Production

Five concrete production failures that rate limiting prevents:

  • Credential stuffing: an attacker with a leaked password list testing them against your login endpoint at thousands of attempts per second.
  • Inference-cost abuse: a single AI API consumer running 4 million calls in a week against a paid LLM endpoint, racking up tens of thousands in compute cost.
  • Scraping: a competitor or LLM trainer downloading your entire product catalogue, search index, or pricing data.
  • Retry storms: a downstream brownout triggers exponential retries from every client; without rate limiting at the edge those retries amplify the load and prevent recovery.
  • Bug-induced runaway cost: a misconfigured cron job that calls your billing API in a tight loop. Your own code is the "attacker" and a circuit breaker would not have helped.

The right rate-limiter design depends on which of these you are defending against, where in the network you sit, and whether you can afford to be off by a factor of two during a partition.

The Four Foundational Algorithms

Fixed Window Counter

The simplest algorithm. For each (client, window) pair, increment a counter; when the counter exceeds the limit, reject. The window resets at fixed wall-clock boundaries (every minute, every hour). Implementation in pseudo-Redis:

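key   = "rl:" + client_id + ":" + floor(now / window_seconds)
count = INCR(key)
// Set the TTL on the first hit only. INCR and EXPIRE are separate commands, so wrap
// them in MULTI/EXEC or a Lua script; a crash between them leaves a key with no TTL.
if count == 1: EXPIRE(key, window_seconds)
if count > limit: return 429
return 200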

Pros: trivial to implement, O(1) memory per client per window, easy to reason about.

Cons: the boundary problem. With a 100-req/min limit, a client can send 100 requests in the last second of one window and 100 more in the first second of the next window — effectively 200 requests in 2 seconds while still appearing compliant. The denial-of-service surface is real.
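The boundary effect is easy to reproduce. The following is a minimal in-memory sketch, not a production limiter: the class name `FixedWindowLimiter` and the explicit `now` argument are assumptions made for illustration, and old window keys are never evicted here.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """In-memory fixed window counter (single process, illustrative only)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        # (client_id, window_index) -> request count; never evicted in this sketch
        self.counts = defaultdict(int)

    def allow(self, client_id, now):
        window_index = int(now // self.window_seconds)
        key = (client_id, window_index)
        self.counts[key] += 1
        return self.counts[key] <= self.limit


limiter = FixedWindowLimiter(limit=100, window_seconds=60)
# 100 requests in the last second of window 0, then 100 in the first second of window 1.
late = sum(limiter.allow("c1", 59.0 + i / 100) for i in range(100))
early = sum(limiter.allow("c1", 60.0 + i / 100) for i in range(100))
print(late + early)  # 200 accepted in about two seconds, under a 100-req/min limit
```

Every request is accepted because each batch lands in a different window, even though the two batches are less than two seconds apart.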

Sliding Window Log

Stores the timestamp of every request in a sorted set per client. To check, drop entries older than the window and count what remains. Conceptually accurate; operationally expensive.

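// Run as one Lua script or MULTI/EXEC so the check and the add are atomic.
now = time()
ZREMRANGEBYSCORE client_id 0 (now - window_seconds)   // drop entries outside the window
count = ZCARD client_id
if count >= limit: return 429                           // reject before recording
ZADD client_id now (now + ":" + request_id)            // unique member: same-timestamp requests must not merge
EXPIRE client_id window_seconds                         // idle clients clean themselves up
return 200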

Pros: exact. No boundary effect. Perfect for low-throughput precision-required scenarios (e.g. enforcing a hard 5-attempts-per-15-minutes login limit).

Cons: O(N) memory per client (N = limit). For a 1000-req/min limit, every active client costs 1000 timestamps in memory. At a million active clients you are storing a billion timestamps. Use only when the limit is small.
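For a single client, the same logic can be sketched in memory with a deque. This is an illustrative sketch under assumptions: `SlidingWindowLog` is a hypothetical name, time is passed in explicitly, and a real deployment would hold one log per client (for example in a dict or a Redis sorted set).

```python
from collections import deque


class SlidingWindowLog:
    """Exact sliding window for one client: keeps one timestamp per accepted request."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now):
        cutoff = now - self.window_seconds
        # Drop timestamps that have aged out of the window.
        while self.log and self.log[0] <= cutoff:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True


login = SlidingWindowLog(limit=5, window_seconds=900)  # 5 attempts per 15 minutes
results = [login.allow(float(t)) for t in range(6)]    # six attempts, one per second
print(results)             # [True, True, True, True, True, False]
print(login.allow(900.5))  # True: the attempt at t=0 has aged out
```

Memory grows with the limit, which is why this shape only fits small limits such as login attempts.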

Sliding Window Counter

The hybrid that production systems usually pick. Store the count for the current window and the previous window. Estimate the sliding count by linearly interpolating: if you are 25% of the way into the current window, count = current_window_count + 0.75 × previous_window_count.

now           = time()
this_window   = floor(now / window_seconds)
prev_window   = this_window - 1
elapsed_pct   = (now % window_seconds) / window_seconds

current = INCR("rl:" + client + ":" + this_window)
EXPIRE  ("rl:" + client + ":" + this_window, 2 * window_seconds)
prev    = GET ("rl:" + client + ":" + prev_window) or 0

estimate = current + prev * (1 - elapsed_pct)
if estimate > limit: return 429
return 200

Pros: O(1) memory per client (two counters), no hard boundary problem (the linear approximation introduces only a small error in practice), industry standard at large API gateways including Cloudflare.

Cons: the linear interpolation assumes uniform distribution within the previous window — if the previous window's requests were all in the last 10 seconds, the estimate is too low. In practice this matters less than the boundary problem of fixed windows.

Figure: sliding window counter. The previous window (T − 60s to T) holds prev = 80 requests and the current window (T to T + 60s) holds current = 30. At now = T + 25s, elapsed_pct = 25/60 ≈ 0.42, so estimate = current + prev × (1 − elapsed_pct) = 30 + 80 × (1 − 25/60) ≈ 30 + 46.7 = 76.7.
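The interpolation can be checked with a small in-memory sketch. The class name `SlidingWindowCounter` and the explicit `now` argument are assumptions for illustration; unlike the Redis pseudocode, this version checks the estimate before incrementing, so rejected requests do not count toward the limit.

```python
class SlidingWindowCounter:
    """Two-counter sliding window estimate for one client (illustrative only)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = {}  # window index -> requests accepted in that window

    def estimate(self, now):
        idx = int(now // self.window_seconds)
        elapsed_pct = (now % self.window_seconds) / self.window_seconds
        current = self.counts.get(idx, 0)
        prev = self.counts.get(idx - 1, 0)
        # Weight the previous window by the share of it still inside the sliding window.
        return current + prev * (1 - elapsed_pct)

    def allow(self, now):
        if self.estimate(now) + 1 > self.limit:
            return False
        idx = int(now // self.window_seconds)
        self.counts[idx] = self.counts.get(idx, 0) + 1
        # Only the current and previous windows are ever read; drop the rest.
        self.counts = {k: v for k, v in self.counts.items() if k >= idx - 1}
        return True


swc = SlidingWindowCounter(limit=100, window_seconds=60)
swc.counts = {0: 80, 1: 30}          # prev = 80, current = 30
print(round(swc.estimate(85.0), 1))  # 76.7 (25 seconds into window 1)
```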

Token Bucket

Each client (or scope) has a bucket of capacity C that refills at rate R tokens per second. Each request takes 1 token (or N tokens for weighted requests). If the bucket is empty, the request is rejected.

// Lazy refill on each request
elapsed     = now - last_refill
new_tokens  = elapsed * refill_rate
tokens      = min(capacity, tokens + new_tokens)
last_refill = now

if tokens >= cost:
  tokens -= cost
  return 200
return 429

Token bucket is the algorithm of choice when you want to allow bursts while enforcing a sustained average. A bucket with C=100 and R=10/s lets a client send 100 requests instantly when the bucket is full, then settle into 10/s. This matches the "real users browse in bursts" pattern much better than a strict per-second rate.

The bucket parameters encode an explicit policy: capacity is the burst budget, refill rate is the sustained throughput. Many systems expose both in their public limits (e.g. AWS DynamoDB's burst capacity, or AWS API Gateway's separate steady-state rate and burst throttling settings).

Figure: token bucket. A bucket of capacity C = 100 currently holds 75 tokens and refills at R = 10 tokens/sec. Each request removes its cost (1 token here) and returns 200 OK if tokens ≥ cost; an empty bucket returns 429. Burst budget = C; sustained throughput = R.
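A runnable version of the lazy-refill logic might look like the sketch below. It is a single-process illustration under assumptions: `TokenBucket` is a hypothetical class, time is passed in explicitly so the behaviour is deterministic, and the `cost` parameter models weighted requests.

```python
class TokenBucket:
    """Lazy-refill token bucket for one client (single process, illustrative only)."""

    def __init__(self, capacity, refill_rate, now=0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = float(capacity)   # start full
        self.last_refill = now

    def allow(self, now, cost=1.0):
        # Refill only when time has moved forward; ignore clock steps backwards.
        if now > self.last_refill:
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
            self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=100, refill_rate=10)
burst = sum(bucket.allow(0.0) for _ in range(150))  # 150 requests at the same instant
steady = sum(bucket.allow(1.0) for _ in range(20))  # 20 more, one second later
print(burst, steady)  # 100 10
```

The first number is the burst budget C, the second is one second's worth of refill at R.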

Leaky Bucket

The dual of token bucket: requests fill a queue (the "bucket") and are processed at a fixed rate. Excess requests overflow and are dropped. Leaky bucket smooths a bursty input into a uniform output — useful when downstream cannot handle bursts.

The classic example is shaping outbound traffic to a third-party API with a strict rate. Token bucket would happily fire all your requests as soon as you have tokens; leaky bucket paces them at exactly the allowed rate. Among common implementations, NGINX limit_req is leaky-bucket-based, while Envoy's local rate limit filter is a token bucket.
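One way to sketch the pacing behaviour is to compute a send time for each request rather than an allow/deny answer. This is an illustrative sketch only: `LeakyBucketPacer`, `schedule`, and the queue-size semantics (requests waiting, not counting the one being sent) are assumptions, not any library's API.

```python
class LeakyBucketPacer:
    """Leaky bucket as a queue: accepted requests are spaced at a fixed interval."""

    def __init__(self, rate_per_sec, queue_size):
        self.interval = 1.0 / rate_per_sec
        self.queue_size = queue_size  # maximum number of requests allowed to wait
        self.next_free = 0.0          # earliest time the next request may be sent

    def schedule(self, now):
        """Return the send time for a request arriving at `now`, or None on overflow."""
        send_at = max(now, self.next_free)
        waiting_ahead = (send_at - now) / self.interval
        if waiting_ahead > self.queue_size:
            return None  # bucket is full: drop the request
        self.next_free = send_at + self.interval
        return send_at


pacer = LeakyBucketPacer(rate_per_sec=4, queue_size=3)
times = [pacer.schedule(0.0) for _ in range(6)]  # six requests arrive at once
print(times)  # [0.0, 0.25, 0.5, 0.75, None, None]
```

A burst of six becomes four evenly spaced sends and two drops, which is the smoothing a token bucket does not provide.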

Quick Comparison

Rule of thumb:

  • Burst friendly + rate limit: token bucket.
  • Strict pacing: leaky bucket.
  • Boundary-safe + memory-cheap: sliding window counter.
  • Boundary-safe + exact: sliding window log (small limits only).
  • Crude + cheap: fixed window counter (only if you can tolerate the boundary effect).

Distributed Rate Limiting

The hard part is not the algorithm — it is making it work across many gateway nodes without each node maintaining its own private counter. If you have 10 ingress replicas and each enforces a 1000-req/min limit independently, your effective limit is 10,000 req/min, and the limit is not really enforced.

Three patterns dominate:

Pattern 1: Centralised Counter (Redis)

Every gateway sends an INCR or Lua-script call to a shared Redis instance per request. The Redis instance is the single source of truth. Strict and simple, but every request adds 1–2ms of network latency, and Redis becomes a single point of failure.

The canonical implementation is a single Lua script that performs the entire token-bucket update atomically:

-- KEYS[1] = bucket key
-- ARGV    = capacity, refill_rate (tokens/sec), now (seconds), cost
local capacity = tonumber(ARGV[1])
local refill   = tonumber(ARGV[2])
local now      = tonumber(ARGV[3])
local cost     = tonumber(ARGV[4])

local data = redis.call("HMGET", KEYS[1], "tokens", "ts")
local tokens = tonumber(data[1]) or capacity
local ts     = tonumber(data[2]) or now

-- Clamp elapsed time at zero: gateway clocks can disagree, and a negative
-- delta would otherwise drain tokens.
local elapsed = math.max(0, now - ts)
tokens = math.min(capacity, tokens + elapsed * refill)

local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

-- HSET with multiple fields replaces the deprecated HMSET (Redis 4.0+).
-- Never move the stored timestamp backwards.
redis.call("HSET", KEYS[1], "tokens", tokens, "ts", math.max(now, ts))
redis.call("EXPIRE", KEYS[1], 600)
return allowed

The Lua script ensures atomicity — no two gateways can race on the same client's bucket. With Redis Cluster, the bucket key's hash slot determines which Redis node owns it, so each client's checks are routed to a single node. For multi-region setups, use a per-region Redis with eventual cross-region reconciliation, accepting that limits are enforced regionally.

Pattern 2: Local Token Buckets with Periodic Reconciliation

Each gateway holds a local bucket sized for its share of the global limit (1/N of the global rate, where N is the number of gateways). Every few seconds, gateways exchange usage data via a gossip protocol or a shared store and adjust their local quotas based on observed traffic distribution.

Edge networks such as Cloudflare describe this approach for high-volume per-IP limits. Envoy's global rate limiting works differently: it delegates each decision to a separate gRPC rate-limit service, which makes it a variant of Pattern 1. The trade-off here: under partition or sudden topology change, the enforced limit can drift from the configured one until the next reconciliation round.
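The reconciliation step can be sketched as a pure function that splits the global limit in proportion to observed traffic. This is one possible policy, not a description of any particular product: the function name `rebalance_quotas` and the `floor_fraction` parameter (a guaranteed minimum share so an idle gateway can still admit a sudden burst) are assumptions, and `observed` is assumed to be non-empty.

```python
def rebalance_quotas(global_limit, observed, floor_fraction=0.1):
    """Split global_limit across gateways in proportion to observed traffic.

    observed: dict of gateway name -> requests seen in the last interval.
    floor_fraction: share of the global limit divided evenly as a per-node floor.
    """
    nodes = list(observed)
    reserved = global_limit * floor_fraction
    floor = reserved / len(nodes)
    remainder = global_limit - reserved
    total = sum(observed.values())
    quotas = {}
    for node in nodes:
        # With no traffic anywhere, fall back to an even split.
        share = observed[node] / total if total > 0 else 1 / len(nodes)
        quotas[node] = floor + remainder * share
    return quotas


quotas = rebalance_quotas(1000, {"gw1": 600, "gw2": 300, "gw3": 100})
print({name: round(q) for name, q in quotas.items()})
# {'gw1': 573, 'gw2': 303, 'gw3': 123}
```

The quotas always sum to the global limit, so the fleet as a whole stays within it as long as each gateway enforces its own share between reconciliation rounds.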

Pattern 3: Edge-First, Origin-Aware

The CDN or edge proxy enforces a permissive global rate to absorb obvious abuse (e.g. 10,000 req/min per IP). The origin enforces a stricter authenticated-user limit (e.g. 1000 req/min per API key). Each layer protects what is behind it from what would not survive without that layer.

Figure: distributed rate limiter architecture. The edge/CDN enforces a per-IP limit (10,000/min). Behind it, N API gateway replicas (GW 1 … GW N) enforce a per-API-key limit by calling EVAL on an atomic token-bucket Lua script in Redis Cluster, which answers allow or deny; each key (rl:<api_key>) hashes to a slot owned by a specific node. Origin services sit behind the gateways, protected from overload.

Production Implementations

NGINX

NGINX's built-in limit_req module (configured with the limit_req_zone and limit_req directives) is a leaky-bucket rate limiter. Its state lives in a shared-memory zone, so the limit is enforced across all worker processes on a host, but not across hosts.

http {
  limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;
  limit_req_zone $http_x_api_key     zone=per_key:10m rate=100r/s;

  server {
    location /api/ {
      limit_req zone=per_ip   burst=20 nodelay;
      limit_req zone=per_key  burst=50 nodelay;
      proxy_pass http://upstream;
    }
  }
}

The burst argument is the queue size (in leaky-bucket terms); nodelay means burst requests are processed immediately rather than paced. The shared-memory zone is local to one host, so for per-cluster correctness across many NGINX nodes, use a Redis-backed limiter (for example via Lua scripting on OpenResty) or move the limiter to a shared Redis-backed gateway.

Envoy

Envoy supports two layers: local rate limiting (per-instance token bucket, no coordination) and global rate limiting (delegates to an external gRPC service that holds the shared state, typically Lyft's open-source ratelimit service backed by Redis).

http_filters:
- name: envoy.filters.http.local_ratelimit
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
    stat_prefix: http_local_rate_limit
    token_bucket:
      max_tokens: 100
      tokens_per_fill: 10
      fill_interval: 1s
    filter_enabled: { default_value: { numerator: 100 } }
    filter_enforced: { default_value: { numerator: 100 } }

The local filter is the right tool for "protect this pod from overload"; the global filter is the right tool for "enforce a per-customer quota across the fleet". They compose — you typically run both.

Kubernetes Ingress

Most Kubernetes ingress controllers expose rate limiting via annotations. NGINX Ingress Controller:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "20"
    nginx.ingress.kubernetes.io/limit-rpm: "600"
    nginx.ingress.kubernetes.io/limit-connections: "10"
spec:
  ...

For Istio/Envoy ingress, define an EnvoyFilter or use the rate-limiting policy your Gateway API implementation provides (this is implementation-specific rather than part of the core API). For more granular control (per-API-key, per-tenant), most teams converge on a dedicated API gateway (Kong, Tyk, or a custom Envoy-based gateway) sitting behind the ingress.

Cloud-Managed Gateways

AWS API Gateway, Google Cloud Endpoints, and Azure API Management all expose rate limiting natively. Their algorithms are typically token-bucket-based with per-region scope. The trade-off versus self-hosting is operational cost vs control: managed gateways are easy to operate and hard to customise.

Multi-Region Rate Limiting

Multi-region adds a coordination problem on top of an already-coordination-heavy primitive. The honest engineering choice is between three models, each with a different latency / consistency trade-off:

Per-Region Independent Limits

Each region runs its own Redis (or its own gateway-local counters) and enforces a regional quota. A customer with a 1000 req/min global limit gets 1000 req/min in each region they hit. Simple, very low latency (no cross-region calls), but a customer can effectively multiply their limit by routing across regions. Acceptable when the limit is loose-by-design (DDoS volumetric, free-tier evaluation) and the cost asymmetry of cross-region travel discourages abuse anyway.

Global Aggregation with Regional Caches

Each region keeps a local counter. Periodically (every 1–5 seconds), regions push their deltas to a global aggregator that reconciles a true global count and pushes back per-region quotas based on observed traffic distribution. This is the pattern Cloudflare publishes for its edge fleet. The inconsistency is bounded: an attacker can briefly exceed the global limit within one aggregation window, but throughput stays regional and the request path makes no cross-region call.

Globally-Routed Rate-Limit Service

Every gateway, in every region, calls a single global rate-limit service for every check. Strict global consistency. Adds 50–200ms of cross-region latency to every request, which makes this the wrong choice for any user-facing path. Used only for authoritative scenarios where exact billing-grade enforcement matters (paid API quotas where overage costs real money) and the rate-limit service is not on the hot path of latency-sensitive traffic.

Practical Multi-Region Pattern

The pattern most production teams converge on is layered: per-region independent limits for volumetric / DDoS / abuse defense (no cross-region cost), global aggregation with regional caches for per-customer business quotas (small consistency window is fine), and billing-time reconciliation for hard contractual quotas (after-the-fact ledgers, not request-time enforcement). Each layer enforces what the latency budget allows.

For globally distributed systems, the choice is the same as the consistency choice covered in the Distributed Systems Algorithms guide — you cannot have both global strong consistency and per-region low latency. The rate-limiter design just makes the trade-off explicit.

Adaptive Rate Limiting

Static rate limits assume you know the right number in advance. Real systems learn it. Adaptive rate limiting (closely related to what some literature calls "load shedding") adjusts the allowed rate based on observed load: if the upstream is responding slowly, lower the rate; if everything is healthy, raise it.

The classic algorithm is AIMD (Additive Increase Multiplicative Decrease, borrowed from TCP congestion control): on every successful response, increase the limit by a small constant; on every error or timeout, multiply the limit by a factor below one (typically 0.5). Growth is slow and back-off is fast, so the limit oscillates just below the maximum sustainable rate.

Netflix's open-source concurrency-limits library applies AIMD as an in-process limiter; Sentinel (Alibaba) does similar at the framework level. Cloudflare's Bot Fight Mode applies adaptive limits per IP based on observed behaviour patterns.
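The AIMD update rule itself is only a few lines. The sketch below is a generic illustration, not the API of Netflix's library or Sentinel: `AimdLimiter`, its parameter names, and the default bounds are all assumptions.

```python
class AimdLimiter:
    """Additive-increase / multiplicative-decrease limit (illustrative only)."""

    def __init__(self, initial, min_limit=1.0, max_limit=1000.0,
                 increase=1.0, backoff=0.5):
        self.limit = float(initial)
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.increase = increase  # added on every success
        self.backoff = backoff    # multiplied in on every error or timeout

    def on_success(self):
        self.limit = min(self.max_limit, self.limit + self.increase)

    def on_failure(self):
        self.limit = max(self.min_limit, self.limit * self.backoff)


limiter = AimdLimiter(initial=100)
for _ in range(5):
    limiter.on_success()
print(limiter.limit)  # 105.0 after five successes
limiter.on_failure()
print(limiter.limit)  # 52.5 after one timeout
```

The caller would feed `limit` into whichever enforcement mechanism it uses, for example as a token bucket's refill rate or as a concurrency cap.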

Concurrency-Based Limits (vs Rate-Based)

Closely related: instead of rate-limiting requests-per-second, limit concurrent in-flight requests. This is more honest about back-pressure — if your service can handle 100 concurrent requests with acceptable latency, capping concurrency at 100 directly enforces that. Little's Law connects the two: average_concurrency = arrival_rate × average_latency.

Concurrency-based limits adapt naturally to slow upstreams. If the upstream slows from 10ms to 100ms, fewer requests fit in the same concurrency budget, automatically throttling the load. AWS uses concurrency limits extensively in their internal services.
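Rearranging Little's Law gives the throughput a fixed concurrency cap admits at a given latency. The helper below is a worked-arithmetic sketch; the function name and the millisecond units are assumptions chosen to keep the numbers exact.

```python
def max_throughput_rps(concurrency_limit, avg_latency_ms):
    """Little's Law rearranged: arrival_rate = concurrency / latency."""
    return concurrency_limit * 1000 / avg_latency_ms


print(max_throughput_rps(100, 10))   # 10000.0 req/s when the upstream answers in 10 ms
print(max_throughput_rps(100, 100))  # 1000.0 req/s when it slows to 100 ms
```

With the cap held at 100 in-flight requests, a tenfold slowdown in the upstream cuts admitted throughput tenfold without anyone retuning a rate.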

DDoS Mitigation

Application-layer rate limiting buys you minutes against a small attack and seconds against a large one. Real DDoS protection happens upstream: in the CDN layer (Cloudflare, Fastly, Akamai), in BGP-level scrubbing services (AWS Shield Advanced, Cloudflare Magic Transit), and in the kernel-level filters that reject obvious attack packets before they hit your application.

The application-layer rate limiter's job in a DDoS scenario is mostly to identify and isolate attack patterns so the CDN or scrubbing layer can act on the signal. Common patterns:

  • Per-IP + per-ASN limits to detect botnets that span many IPs from a small set of providers.
  • Per-User-Agent fingerprint limits to detect bot fleets that all advertise the same UA.
  • JA3 / TLS fingerprint limits for clients that use the same TLS configuration (often a tell of automated tooling).
  • Behavioural anomaly detection — clients that hit endpoints in a non-human pattern (e.g. all 50 product pages in 2 seconds) get flagged regardless of rate.

Common Pitfalls

Trusting the X-Forwarded-For Header

If your rate limiter keys on client IP and you read the IP from X-Forwarded-For, an attacker can spoof the header and rotate IPs trivially. Only trust XFF when it is set by an upstream proxy you control, and read it from the right: skip the entries appended by your own proxies and use the first address to their left, which is the peer your outermost proxy actually saw. Everything further left is the client's unverified claim.
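A right-to-left walk over the header can be sketched as follows. This is an illustrative helper under assumptions: the function name, the `trusted_proxies` CIDR list, and the example addresses are all hypothetical, and it assumes each of your proxies appends the address of its direct peer.

```python
import ipaddress


def client_ip_from_xff(xff_header, peer_ip, trusted_proxies):
    """Return the rightmost address in the chain that is not one of our own proxies."""
    networks = [ipaddress.ip_network(cidr) for cidr in trusted_proxies]

    def is_trusted(addr):
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            return False  # not an IP at all, so certainly not one of our proxies
        return any(ip in net for net in networks)

    # The TCP peer is the only address a client cannot forge via headers.
    if not is_trusted(peer_ip):
        return peer_ip
    hops = [h.strip() for h in xff_header.split(",") if h.strip()]
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop
    return peer_ip


# The client claims to be 1.2.3.4; our edge proxy (10.0.0.5) saw 203.0.113.7.
key = client_ip_from_xff("1.2.3.4, 203.0.113.7, 10.0.0.5", "10.0.0.9", ["10.0.0.0/8"])
print(key)  # 203.0.113.7
```

The spoofed leftmost entry never becomes the rate-limit key, because the walk stops at the first address that one of the trusted proxies recorded.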

Using Burst as Capacity

NGINX's limit_req with a high burst and nodelay behaves like a fixed window from the user's perspective — once the burst is consumed they get rate-limited, and the "average rate" intuition breaks. Tune burst conservatively and prefer separate limit zones for different scopes.

Counting Failed Requests Toward the Limit

If failed requests count, an attacker can deny service by making the victim's requests fail. Common with login endpoints: the attacker sends bad credentials with the victim's identifier, the limiter increments, and the legitimate user is rate-limited. Where the attacker controls the "client" identifier, do not let failures keyed on that identifier alone lock out its owner: key failure counts on the (identifier, source) pair, or exempt failures from the identifier-only limit.

Forgetting to Limit Logged-In Users

Most teams limit per-IP at the edge but not per-user inside the gateway. A compromised account or misbehaving paid customer can blow past per-IP limits using a residential proxy network. Combine: per-IP at the edge, per-API-key in the gateway, per-user-action quotas at the application layer.

Security Considerations

Rate limiting is a security control. Two specific patterns matter:

  1. Authentication endpoint rate limiting: enforce strict limits on /login, /forgot-password, /verify-otp — ideally with separate limits per IP and per username. The classic credential-stuffing defense.
  2. Cost-based rate limiting for paid services: not all requests are equal. An LLM call costs more than a simple GET. Rate limits should reflect cost — either by varying token cost in the bucket (an LLM call costs 10 tokens, a GET costs 1) or by separate buckets per cost class.

Walk through the scenarios in the API Attack & Defense Simulator to practice spotting JWT, OAuth, rate-limit, and CORS bypasses against rate-limited endpoints. For the broader API security picture, the Kubernetes Authentication & Authorization module in the free Cloud Native Security Engineering course covers the full stack.

Observability

The rate limiter that does not emit metrics is just guessing. Minimum signals:

  • Per-scope acceptance rate: ratio of allowed to total requests, broken down by client / API-key / route.
  • 429 response count with the specific rule that triggered it (so you can answer "which limit are we hitting most?").
  • Bucket utilisation distribution: are clients consistently near the limit (suggesting the limit is too tight) or rarely near it (suggesting wasted budget)?
  • Latency added by the rate-limit check: in the centralised-Redis pattern, this is your tax. If it grows past 5ms, investigate.
  • Errors from the rate limiter itself (Redis timeouts, gRPC errors): these should default to fail-open or fail-closed by deliberate choice, not by accident.

Frequently Asked Questions

Should I fail open or fail closed if the rate limiter is unavailable?

Depends on what you are protecting. For abuse protection on a hot endpoint (login, payment), fail closed — better to reject all traffic than let an attack through. For routine product traffic, fail open — better to let everyone through than block paying customers because of a Redis blip. Make the choice deliberately and document it.
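Making that choice explicit in code can be as simple as a per-route policy table consulted only when the limiter itself errors. The sketch below is hypothetical throughout: the table `FAIL_OPEN_BY_ROUTE`, the function `is_allowed`, and the routes are illustrative names, and `check` stands in for whatever call reaches your rate-limit backend.

```python
# Hypothetical policy table: True = fail open, False = fail closed.
FAIL_OPEN_BY_ROUTE = {
    "/login": False,
    "/payments": False,
}


def is_allowed(route, check, default_fail_open=True):
    """Run the rate-limit check; on backend failure, apply the documented policy."""
    try:
        return check()
    except (TimeoutError, ConnectionError):
        return FAIL_OPEN_BY_ROUTE.get(route, default_fail_open)


def broken_backend():
    raise TimeoutError("rate-limit store did not answer in time")


print(is_allowed("/login", broken_backend))     # False: fail closed
print(is_allowed("/products", broken_backend))  # True: fail open
```

Only backend errors trigger the fallback; a normal deny from the limiter is returned unchanged.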

How do I rate-limit by JWT subject without parsing the JWT on every gateway?

Either trust the upstream auth proxy to inject a verified header (e.g. X-User-ID) and key off that, or have the gateway verify the JWT once and cache the claims by JTI. The gateway-verifies pattern is more robust under federated auth.

Is sliding window log ever the right choice in production?

For low-volume strict-limit scenarios, yes: for example, enforcing "5 password reset requests per email per hour", where the limit is small and the precision matters. For anything high-volume, sliding window counter or token bucket is the better choice.

How do I rate-limit websockets / long-lived connections?

Limit two distinct things: connection rate (new connections per IP / API-key per second) and per-connection message rate (messages per connection per second). Token bucket per connection is a clean fit for the second; standard rate limiters for the first.

How do CDNs implement edge rate limiting at internet scale?

Edge nodes maintain local approximate counters per (IP, rule) and gossip aggregates back to a regional aggregator every few seconds. The aggregator computes the global rate and tells edge nodes to throttle if they cross thresholds. Counts are eventually consistent — an attacker can briefly exceed the limit before the aggregator catches up — but the system handles trillions of requests per day.

Should rate limits be public or hidden?

Public for legitimate users (so they can build clients that respect the limits): return the conventional X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers, plus Retry-After on 429 responses. Hidden for abuse-detection rules: if an attacker knows the threshold for triggering bot-detection, they can stay just under it.
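Building those headers might look like the sketch below. The function name and arguments are assumptions, and so is the meaning of X-RateLimit-Reset: here it is taken to be the epoch second at which the window resets, but the convention varies between APIs (some send seconds remaining instead), so document whichever you choose.

```python
import math


def rate_limit_headers(limit, remaining, reset_epoch, now, rejected=False):
    """Build response headers describing the caller's current rate-limit state."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),  # assumed: epoch seconds
    }
    if rejected:
        # Sent with a 429: how many whole seconds to wait before retrying.
        headers["Retry-After"] = str(max(0, math.ceil(reset_epoch - now)))
    return headers


h = rate_limit_headers(100, 0, 1700000060, now=1700000000.4, rejected=True)
print(h["Retry-After"])  # 60
```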

Conclusion

Rate limiting is the most universally applicable defensive control in modern infrastructure. Get the algorithm right and a misbehaving client's burst is absorbed before it touches your application. Get it wrong and you DDoS yourself with retries, leak free inference compute, or block legitimate users while attackers rotate IPs and walk past you.

The high-leverage takeaways:

  • Token bucket for burst-friendly user-facing APIs, leaky bucket for strict downstream pacing, sliding-window counter as the boundary-safe default at scale.
  • Centralised Redis with a Lua script is the safest distributed pattern; local + reconcile is the lowest-latency one.
  • Combine layers: CDN for volumetric, gateway for per-API-key, application for per-action.
  • Treat authentication endpoints as a special class with stricter limits and dual per-IP / per-username scoping.
  • Use cost-based weighting for paid services so an LLM call costs more bucket capacity than a GET.

Decide deliberately whether the limiter fails open or closed, instrument the actual acceptance rate, and tune from real data, not guesses.

Where to Go Next