Database Indexing Secrets Guide

You added an index. The query is still slow. Sound familiar? Most developers treat indexes like magic - add one and hope for the best. But indexes are data structures with specific rules, and violating those rules means your “indexed” query is still doing a full table scan.

This guide covers what your senior dev never explained: how indexes actually work internally, why column order in composite indexes matters more than you think, and how to read EXPLAIN ANALYZE output like a database engineer.

How Databases Find Your Data

When you run a query, the database has two fundamental strategies:

Sequential Scan: Read every single row in the table and check if it matches your WHERE clause. Simple, but devastating on large tables.
Index Scan: Use a pre-built data structure to jump directly to matching rows. Like using a book index instead of reading every page.

Here is the difference in practice:

-- Table: orders (10 million rows)
-- No index on customer_id

EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

-- Seq Scan on orders  (cost=0.00..185432.00 rows=52 width=96)
--   Filter: (customer_id = 42)
--   Rows Removed by Filter: 9999948
--   Planning Time: 0.085 ms
--   Execution Time: 1247.531 ms   ← Over 1 second!

-- Now with an index:
CREATE INDEX idx_orders_customer_id ON orders(customer_id);

EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

-- Index Scan using idx_orders_customer_id on orders
--   Index Cond: (customer_id = 42)
--   Planning Time: 0.092 ms
--   Execution Time: 0.128 ms      ← 10,000x faster!

The sequential scan examined 10 million rows to find 52 matches. The index scan went straight to those 52 rows. That is the power of proper indexing.

B-Tree Indexes: The Workhorse

B-tree (balanced tree) is the default index type in PostgreSQL, MySQL, and virtually every relational database. Understanding its structure explains most indexing behavior.

A B-tree is a sorted, self-balancing tree structure:

          [50, 100]              ← Root node
         /    |     \
   [20,35] [60,80] [120,150]    ← Internal nodes
   /  |  \   / | \   /  |  \
  ... ... ... ... ... ... ...   ← Leaf nodes (actual row pointers)

Root and internal nodes act as signposts, directing the search left or right
Leaf nodes contain the indexed values and pointers (ctid) to the actual table rows
All leaf nodes are linked in a doubly-linked list for efficient range scans
Tree depth is typically 3-4 levels even for millions of rows

This means finding any single value requires only 3-4 disk reads, regardless of table size. A table with 100 rows and a table with 100 million rows both need roughly the same number of index lookups.

What B-Trees Are Great At

-- Equality: direct lookup
WHERE email = 'user@example.com'

-- Range: walk the linked leaf nodes
WHERE created_at BETWEEN '2026-01-01' AND '2026-03-31'

-- Sorting: leaf nodes are already sorted
ORDER BY created_at DESC LIMIT 20

-- Prefix matching
WHERE name LIKE 'John%'    -- Uses index
WHERE name LIKE '%John'    -- Cannot use index (no prefix)

Hash Indexes: The Specialist

Hash indexes use a hash function to map values to buckets. They are faster than B-trees for exact equality lookups but useless for everything else.

CREATE INDEX idx_users_email_hash ON users USING hash(email);

-- Uses the hash index (equality only):
WHERE email = 'user@example.com'     -- Yes

-- Cannot use hash index:
WHERE email LIKE 'user%'             -- No (pattern)
WHERE email > 'a' AND email < 'm'   -- No (range)
ORDER BY email                       -- No (sorting)

When to use hash indexes: Only when you exclusively do exact-match lookups on high-cardinality columns (like UUIDs or email addresses) and never need range queries or sorting. In practice, B-tree is almost always the better choice because the performance difference is marginal and B-trees are far more versatile.

Composite Indexes: Column Order Is Everything

A composite index indexes multiple columns together. The column order determines which queries can use the index. This is the leftmost prefix rule - the most misunderstood concept in database indexing.

CREATE INDEX idx_orders_composite ON orders(customer_id, status, created_at);

This single index can satisfy these queries:

-- Uses index (all three columns, left to right):
WHERE customer_id = 42 AND status = 'shipped' AND created_at > '2026-01-01'

-- Uses index (first two columns):
WHERE customer_id = 42 AND status = 'shipped'

-- Uses index (first column only):
WHERE customer_id = 42

-- CANNOT use index (skips first column):
WHERE status = 'shipped'                        -- Skips customer_id!

-- CANNOT use index (skips middle column):
WHERE customer_id = 42 AND created_at > '2026-01-01'  -- Only uses customer_id part

Think of it like a phone book sorted by last name, then first name, then city. You can look up everyone named “Smith” (last name). You can look up “Smith, John” (last + first). But you cannot efficiently look up everyone named “John” without a last name - the book is not sorted that way.

Ordering Strategy

Put columns in this order:

Equality conditions first (WHERE status = 'active')
Range conditions last (WHERE created_at > '2026-01-01')
High-selectivity columns before low-selectivity (customer_id before status)

Covering Indexes: Skip the Table Entirely

Normally, an index scan finds the matching row pointers, then fetches the actual rows from the table (a “heap fetch”). A covering index includes all the columns your query needs, eliminating the heap fetch entirely.

-- Query that needs customer_id, status, and total_amount:
SELECT customer_id, status, total_amount
FROM orders
WHERE customer_id = 42 AND status = 'shipped';

-- Regular index: finds rows, then fetches total_amount from table
CREATE INDEX idx_orders_cust_status ON orders(customer_id, status);

-- Covering index: includes total_amount, no table fetch needed
CREATE INDEX idx_orders_covering ON orders(customer_id, status)
  INCLUDE (total_amount);

In EXPLAIN output, a covering index shows “Index Only Scan” instead of “Index Scan” - this is significantly faster because it avoids random I/O to the heap.

EXPLAIN ANALYZE SELECT customer_id, status, total_amount
FROM orders WHERE customer_id = 42 AND status = 'shipped';

-- Index Only Scan using idx_orders_covering on orders
--   Index Cond: ((customer_id = 42) AND (status = 'shipped'))
--   Heap Fetches: 0            ← Zero table access!
--   Execution Time: 0.045 ms

Partial Indexes: Index Only What Matters

Why index 10 million rows when your query only ever looks at 50,000 of them? A partial index includes a WHERE clause that limits which rows are indexed.

-- Only 2% of orders are 'pending', but you query them constantly
CREATE INDEX idx_orders_pending ON orders(created_at)
  WHERE status = 'pending';

-- This index is 50x smaller than a full index
-- Faster to scan, faster to maintain, less disk space

-- Uses the partial index:
SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at;

-- Cannot use it:
SELECT * FROM orders WHERE status = 'shipped' ORDER BY created_at;

Partial indexes are perfect for: active/inactive flags, unprocessed queues, soft-deleted records, and any column where you only query a small subset of values.

Index Bloat and Maintenance

PostgreSQL uses MVCC (Multi-Version Concurrency Control), which means UPDATE and DELETE operations leave dead tuples in indexes. Over time, indexes bloat - they grow larger without holding more useful data.

-- Check index bloat using pg_stat_user_indexes:
SELECT
  schemaname || '.' || relname AS table,
  indexrelname AS index,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan AS times_used,
  idx_tup_read AS rows_read
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 10;

If an index is large but idx_scan is zero, you are paying storage and write overhead for an index nobody uses. Drop it.

-- Rebuild a bloated index (locks the table briefly):
REINDEX INDEX idx_orders_customer_id;

-- Non-blocking rebuild (PostgreSQL 12+):
REINDEX INDEX CONCURRENTLY idx_orders_customer_id;

Common Anti-Patterns

1. Functions on Indexed Columns

-- This CANNOT use an index on created_at:
WHERE YEAR(created_at) = 2026
WHERE LOWER(email) = 'user@example.com'
WHERE amount * 1.1 > 100

-- Fix: rewrite to keep the column bare
WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01'

-- Or create an expression index:
CREATE INDEX idx_users_email_lower ON users(LOWER(email));

2. Indexing Low-Cardinality Columns

-- Bad: boolean column with only 2 possible values
CREATE INDEX idx_users_active ON users(is_active);
-- The index doesn't help much: 50% of rows match each value
-- The planner will choose a sequential scan anyway

-- Better: partial index if you only query one side
CREATE INDEX idx_users_active ON users(created_at) WHERE is_active = true;

3. Over-Indexing

Every index slows down INSERT, UPDATE, and DELETE operations because the database must update every affected index. A table with 15 indexes means every single write operation triggers 15 index updates. Only create indexes that serve actual query patterns.

4. Implicit Type Casting

-- Column user_id is INTEGER, but you pass a string:
WHERE user_id = '42'
-- PostgreSQL may cast and still use the index, but MySQL often will not.
-- Always match types exactly.

EXPLAIN ANALYZE: Reading the Execution Plan

Here is a real optimization walkthrough. Suppose you have this slow query:

EXPLAIN ANALYZE
SELECT o.id, o.total_amount, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
  AND o.created_at > '2026-01-01'
ORDER BY o.created_at DESC
LIMIT 20;

-- BEFORE optimization:
-- Sort  (cost=45123.45..45123.50 rows=20)
--   Sort Key: o.created_at DESC
--   Sort Method: top-N heapsort  Memory: 27kB
--   -> Hash Join  (cost=1234.56..45000.00 rows=48576)
--        Hash Cond: (o.customer_id = c.id)
--        -> Seq Scan on orders o  (cost=0.00..43210.00 rows=48576)
--              Filter: ((status = 'pending') AND (created_at > '2026-01-01'))
--              Rows Removed by Filter: 9951424
--        -> Hash  (cost=1000.00..1000.00 rows=50000)
--              -> Seq Scan on customers c
-- Planning Time: 0.234 ms
-- Execution Time: 892.456 ms

The bottleneck is the Seq Scan on orders - examining 10 million rows. Let us fix it:

-- Create a targeted composite index:
CREATE INDEX idx_orders_status_created ON orders(status, created_at DESC)
  INCLUDE (total_amount, customer_id);

-- AFTER optimization:
-- Limit  (cost=0.56..45.23 rows=20)
--   -> Nested Loop  (cost=0.56..109.45 rows=48576)
--        -> Index Only Scan Backward using idx_orders_status_created on orders o
--              Index Cond: ((status = 'pending') AND (created_at > '2026-01-01'))
--              Heap Fetches: 0
--        -> Index Scan using customers_pkey on customers c
--              Index Cond: (id = o.customer_id)
-- Planning Time: 0.187 ms
-- Execution Time: 0.342 ms

From 892ms to 0.3ms - a 2,600x improvement. The key changes: the composite index matches the WHERE + ORDER BY, and the INCLUDE clause makes it a covering index (Index Only Scan, zero heap fetches).

Index Type Cheat Sheet

Index Type	Best For	Supports Range	Supports Sort	Size
B-tree (default)	Almost everything	Yes	Yes	Medium
Hash	Equality-only lookups	No	No	Small
GIN	Full-text search, arrays, JSONB	No	No	Large
GiST	Geospatial, range types, nearest-neighbor	Yes	No	Medium
BRIN	Large tables with natural ordering (timestamps)	Yes	No	Tiny
Partial	Queries targeting a small subset of rows	Depends on base type	Depends on base type	Tiny
Covering	Avoiding heap fetches for known queries	Yes (B-tree)	Yes (B-tree)	Larger

Key Takeaways

Always check EXPLAIN ANALYZE before and after adding indexes - do not guess
Column order in composite indexes matters - equality columns first, range columns last
Covering indexes eliminate heap fetches - use INCLUDE for frequently selected columns
Partial indexes save space and speed - index only the rows you actually query
Audit unused indexes regularly - every index has a write cost
Never apply functions to indexed columns in WHERE clauses - use expression indexes instead
B-tree is the right choice 95% of the time - only reach for specialized types when you have a specific need

The difference between a junior and senior database engineer is not knowing that indexes exist - it is knowing which index to create, in what order, with which columns included. Master these fundamentals and you will never fear a slow query again.

Database Indexing Secrets: Why Your Queries Are Slow and How to Fix Them

How Databases Find Your Data

B-Tree Indexes: The Workhorse

What B-Trees Are Great At

Hash Indexes: The Specialist

Composite Indexes: Column Order Is Everything

Ordering Strategy

Covering Indexes: Skip the Table Entirely

Partial Indexes: Index Only What Matters

Index Bloat and Maintenance

Common Anti-Patterns

1. Functions on Indexed Columns

2. Indexing Low-Cardinality Columns

3. Over-Indexing

4. Implicit Type Casting

EXPLAIN ANALYZE: Reading the Execution Plan

Index Type Cheat Sheet

Key Takeaways

Stuck on implementation?

Related Production Resources

Free learning tracks

Interactive engineering labs

Production cheatsheets

Key terms

Discussion

Discussion is unavailable

How Databases Find Your Data

B-Tree Indexes: The Workhorse

What B-Trees Are Great At

Hash Indexes: The Specialist

Composite Indexes: Column Order Is Everything

Ordering Strategy

Covering Indexes: Skip the Table Entirely

Partial Indexes: Index Only What Matters

Index Bloat and Maintenance

Common Anti-Patterns

1. Functions on Indexed Columns

2. Indexing Low-Cardinality Columns

3. Over-Indexing

4. Implicit Type Casting

EXPLAIN ANALYZE: Reading the Execution Plan

Index Type Cheat Sheet

Key Takeaways

Stuck on implementation?

Related Production Resources

Free learning tracks

Interactive engineering labs

Production cheatsheets

Key terms

Discussion

Discussion is unavailable

Continue Reading

OAuth2 Private Key JWT: Build Client Authentication Without Shared Secrets

Why Spark Jobs Become Slow: Shuffle, Skew, Partitions, and Memory