You need to rename a column, add a NOT NULL constraint, or change a data type. In development, you run the migration and it takes 200 milliseconds. In production with 50 million rows, that same migration locks the table for 45 minutes and your app goes down.

This guide teaches the expand-contract pattern — the industry-standard approach to zero-downtime schema changes used by companies like GitHub, Shopify, and Stripe.

Why Migrations Cause Downtime

Most schema changes in PostgreSQL and MySQL acquire exclusive locks on the table. While the lock is held, no reads or writes can proceed. On large tables, the migration itself can take minutes or hours.

-- This innocent-looking migration:
ALTER TABLE orders ADD COLUMN discount_code VARCHAR(50) NOT NULL DEFAULT '';

-- On a 50M row table in PostgreSQL < 11:
-- 1. Acquires ACCESS EXCLUSIVE lock (blocks ALL queries)
-- 2. Rewrites every single row to add the default value
-- 3. Rebuilds all indexes
-- Duration: 30-60 minutes of TOTAL DOWNTIME

PostgreSQL 11+ improvement: Adding a column with a constant DEFAULT no longer rewrites the table. But many other operations (adding constraints, changing types, renaming) still require careful handling.

The Expand-Contract Pattern

Instead of making a breaking change in one step, split it into three safe phases:

  1. Expand: Add the new structure alongside the old one (backward compatible)
  2. Migrate: Update application code to use the new structure. Backfill data.
  3. Contract: Remove the old structure once nothing depends on it

Example: Renaming a Column

You want to rename user_name to display_name. A direct rename locks the table and breaks all queries referencing the old name.

-- WRONG: Direct rename (causes downtime + app errors)
ALTER TABLE users RENAME COLUMN user_name TO display_name;
-- Every query using "user_name" immediately breaks!

-- RIGHT: Expand-Contract (zero downtime)

-- Phase 1: EXPAND - Add new column
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);

-- Phase 2: MIGRATE - Backfill data in batches
UPDATE users SET display_name = user_name
WHERE display_name IS NULL
AND id BETWEEN 1 AND 100000;  -- Batch by ID range

-- Phase 2b: Deploy code that writes to BOTH columns
-- Phase 2c: Deploy code that reads from display_name (falls back to user_name)
-- Phase 2d: Deploy code that reads ONLY from display_name

-- Phase 3: CONTRACT - Remove old column (after verification)
ALTER TABLE users DROP COLUMN user_name;

Example: Adding a NOT NULL Constraint

-- WRONG: Direct NOT NULL (scans entire table with lock)
ALTER TABLE orders ALTER COLUMN customer_email SET NOT NULL;

-- RIGHT: Add constraint without validation, then validate separately

-- Phase 1: Add constraint as NOT VALID (instant, no scan)
ALTER TABLE orders
ADD CONSTRAINT orders_email_not_null
CHECK (customer_email IS NOT NULL) NOT VALID;

-- Phase 2: Backfill any NULL values
UPDATE orders SET customer_email = 'unknown@example.com'
WHERE customer_email IS NULL;

-- Phase 3: Validate constraint (reads table but does NOT lock writes)
ALTER TABLE orders VALIDATE CONSTRAINT orders_email_not_null;

Backfilling Data Safely

Never backfill an entire table in one UPDATE. It creates a massive transaction that bloats WAL, locks rows, and can overwhelm your database.

# Django management command for batched backfill
from django.db import connection

def backfill_display_name(batch_size=10000):
    while True:
        with connection.cursor() as cursor:
            cursor.execute("""
                UPDATE users
                SET display_name = user_name
                WHERE display_name IS NULL
                AND id IN (
                    SELECT id FROM users
                    WHERE display_name IS NULL
                    LIMIT %s
                )
            """, [batch_size])

            rows_updated = cursor.rowcount

        if rows_updated == 0:
            break

        print(f"Updated {rows_updated} rows")
        time.sleep(0.1)  # Brief pause to reduce load

Index Changes Without Downtime

-- WRONG: CREATE INDEX locks the table for the entire duration
CREATE INDEX idx_orders_email ON orders(customer_email);
-- On 50M rows: 10-30 minutes of blocked writes

-- RIGHT: CREATE INDEX CONCURRENTLY (no lock on writes)
CREATE INDEX CONCURRENTLY idx_orders_email ON orders(customer_email);
-- Takes the same time but does NOT block any queries
-- Caveat: cannot run inside a transaction block

Django Migration Best Practices

# migrations/0042_add_display_name.py
from django.db import migrations, models

class Migration(migrations.Migration):
    # CRITICAL: Prevent Django from wrapping in a transaction
    # Required for CREATE INDEX CONCURRENTLY
    atomic = False

    dependencies = [
        ('users', '0041_previous'),
    ]

    operations = [
        # Add nullable column (instant, no rewrite)
        migrations.AddField(
            model_name='user',
            name='display_name',
            field=models.CharField(max_length=255, null=True),
        ),

        # Add index concurrently
        migrations.RunSQL(
            sql='CREATE INDEX CONCURRENTLY idx_user_display ON users_user(display_name);',
            reverse_sql='DROP INDEX IF EXISTS idx_user_display;',
        ),
    ]

Dangerous Operations Cheat Sheet

Operation Risk Level Safe Alternative
ADD COLUMN (nullable) Safe Direct (instant in PostgreSQL)
ADD COLUMN with DEFAULT Safe (PG 11+) Direct (PG 11+) or add nullable + backfill
ADD COLUMN NOT NULL Dangerous Add nullable → backfill → add CHECK NOT VALID → validate
DROP COLUMN Moderate Remove from code first, then drop in next deploy
RENAME COLUMN Dangerous Expand-contract: add new → dual write → drop old
CHANGE TYPE Dangerous Add new column → backfill → swap
ADD INDEX Dangerous CREATE INDEX CONCURRENTLY
ADD FOREIGN KEY Dangerous ADD CONSTRAINT NOT VALID → VALIDATE
ADD UNIQUE CONSTRAINT Dangerous CREATE UNIQUE INDEX CONCURRENTLY → ADD CONSTRAINT USING INDEX

Deployment Order Matters

The golden rule: migration and code must be backward compatible at every step. Never deploy code that depends on a migration that has not run yet, and never run a migration that breaks currently deployed code.

# Safe deployment order for renaming a column:

# Deploy 1: Add new column (migration only)
# - Old code works fine (ignores new column)
# - New column exists but is empty

# Deploy 2: Write to both columns (code change)
# - Writes go to both user_name AND display_name
# - Reads still from user_name

# Deploy 3: Backfill historical data (script/migration)
# - All rows now have display_name populated

# Deploy 4: Read from new column (code change)
# - Reads from display_name
# - Still writes to both (in case of rollback)

# Deploy 5: Stop writing to old column (code change)
# - Only writes to display_name

# Deploy 6: Drop old column (migration)
# - Safe: nothing references user_name anymore

Key Takeaways

  • Never run ALTER TABLE on large tables without checking lock behavior — test migrations against production-sized data first
  • Use the expand-contract pattern for any schema change that could break running code
  • Backfill in batches — never update millions of rows in a single transaction
  • Always use CONCURRENTLY for index creation in production
  • Add constraints as NOT VALID first, then validate separately to avoid full table locks
  • Deploy in the right order: migration before code that needs it, code change before migration that removes something
  • PostgreSQL 11+ is your friend — many operations that used to be dangerous are now safe, but always verify

Zero-downtime migrations require more steps than a simple ALTER TABLE, but the alternative is explaining to your users why the app was down for 45 minutes during a “routine update.” The expand-contract pattern is not complex — it is disciplined. And discipline beats downtime every time.