Database Indexing Secrets: Why Your Queries Are Slow and How to Fix Them

Your database has indexes but queries are still slow. Learn how B-tree internals, composite index ordering, covering indexes, and EXPLAIN ANALYZE can transform query performance from seconds to milliseconds.

Database Indexing Secrets: Why Your Queries Are Slow and How to Fix Them illustration
On this page17 sections

You added an index. The query is still slow. Sound familiar? Most developers treat indexes like magic — add one and hope for the best. But indexes are data structures with specific rules, and violating those rules means your “indexed” query is still doing a full table scan.

This guide covers what your senior dev never explained: how indexes actually work internally, why column order in composite indexes matters more than you think, and how to read EXPLAIN ANALYZE output like a database engineer.

How Databases Find Your Data

When you run a query, the database has two fundamental strategies:

  • Sequential Scan: Read every single row in the table and check if it matches your WHERE clause. Simple, but devastating on large tables.
  • Index Scan: Use a pre-built data structure to jump directly to matching rows. Like using a book index instead of reading every page.

Here is the difference in practice:

-- Table: orders (10 million rows)
-- No index on customer_id

EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

-- Seq Scan on orders  (cost=0.00..185432.00 rows=52 width=96)
--   Filter: (customer_id = 42)
--   Rows Removed by Filter: 9999948
--   Planning Time: 0.085 ms
--   Execution Time: 1247.531 ms   ← Over 1 second!

-- Now with an index:
CREATE INDEX idx_orders_customer_id ON orders(customer_id);

EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

-- Index Scan using idx_orders_customer_id on orders
--   Index Cond: (customer_id = 42)
--   Planning Time: 0.092 ms
--   Execution Time: 0.128 ms      ← 10,000x faster!

The sequential scan examined 10 million rows to find 52 matches. The index scan went straight to those 52 rows. That is the power of proper indexing.

B-Tree Indexes: The Workhorse

B-tree (balanced tree) is the default index type in PostgreSQL, MySQL, and virtually every relational database. Understanding its structure explains most indexing behavior.

A B-tree is a sorted, self-balancing tree structure:

          [50, 100]              ← Root node
         /    |     \
   [20,35] [60,80] [120,150]    ← Internal nodes
   /  |  \   / | \   /  |  \
  ... ... ... ... ... ... ...   ← Leaf nodes (actual row pointers)
  • Root and internal nodes act as signposts, directing the search left or right
  • Leaf nodes contain the indexed values and pointers (ctid) to the actual table rows
  • All leaf nodes are linked in a doubly-linked list for efficient range scans
  • Tree depth is typically 3-4 levels even for millions of rows

This means finding any single value requires only 3-4 disk reads, regardless of table size. A table with 100 rows and a table with 100 million rows both need roughly the same number of index lookups.

What B-Trees Are Great At

-- Equality: direct lookup
WHERE email = 'user@example.com'

-- Range: walk the linked leaf nodes
WHERE created_at BETWEEN '2026-01-01' AND '2026-03-31'

-- Sorting: leaf nodes are already sorted
ORDER BY created_at DESC LIMIT 20

-- Prefix matching
WHERE name LIKE 'John%'    -- Uses index
WHERE name LIKE '%John'    -- Cannot use index (no prefix)

Hash Indexes: The Specialist

Hash indexes use a hash function to map values to buckets. They are faster than B-trees for exact equality lookups but useless for everything else.

CREATE INDEX idx_users_email_hash ON users USING hash(email);

-- Uses the hash index (equality only):
WHERE email = 'user@example.com'     -- Yes

-- Cannot use hash index:
WHERE email LIKE 'user%'             -- No (pattern)
WHERE email > 'a' AND email < 'm'   -- No (range)
ORDER BY email                       -- No (sorting)

When to use hash indexes: Only when you exclusively do exact-match lookups on high-cardinality columns (like UUIDs or email addresses) and never need range queries or sorting. In practice, B-tree is almost always the better choice because the performance difference is marginal and B-trees are far more versatile.

Composite Indexes: Column Order Is Everything

A composite index indexes multiple columns together. The column order determines which queries can use the index. This is the leftmost prefix rule — the most misunderstood concept in database indexing.

CREATE INDEX idx_orders_composite ON orders(customer_id, status, created_at);

This single index can satisfy these queries:

-- Uses index (all three columns, left to right):
WHERE customer_id = 42 AND status = 'shipped' AND created_at > '2026-01-01'

-- Uses index (first two columns):
WHERE customer_id = 42 AND status = 'shipped'

-- Uses index (first column only):
WHERE customer_id = 42

-- CANNOT use index (skips first column):
WHERE status = 'shipped'                        -- Skips customer_id!

-- CANNOT use index (skips middle column):
WHERE customer_id = 42 AND created_at > '2026-01-01'  -- Only uses customer_id part

Think of it like a phone book sorted by last name, then first name, then city. You can look up everyone named “Smith” (last name). You can look up “Smith, John” (last + first). But you cannot efficiently look up everyone named “John” without a last name — the book is not sorted that way.

Ordering Strategy

Put columns in this order:

  1. Equality conditions first (WHERE status = 'active')
  2. Range conditions last (WHERE created_at > '2026-01-01')
  3. High-selectivity columns before low-selectivity (customer_id before status)

Covering Indexes: Skip the Table Entirely

Normally, an index scan finds the matching row pointers, then fetches the actual rows from the table (a “heap fetch”). A covering index includes all the columns your query needs, eliminating the heap fetch entirely.

-- Query that needs customer_id, status, and total_amount:
SELECT customer_id, status, total_amount
FROM orders
WHERE customer_id = 42 AND status = 'shipped';

-- Regular index: finds rows, then fetches total_amount from table
CREATE INDEX idx_orders_cust_status ON orders(customer_id, status);

-- Covering index: includes total_amount, no table fetch needed
CREATE INDEX idx_orders_covering ON orders(customer_id, status)
  INCLUDE (total_amount);

In EXPLAIN output, a covering index shows “Index Only Scan” instead of “Index Scan” — this is significantly faster because it avoids random I/O to the heap.

EXPLAIN ANALYZE SELECT customer_id, status, total_amount
FROM orders WHERE customer_id = 42 AND status = 'shipped';

-- Index Only Scan using idx_orders_covering on orders
--   Index Cond: ((customer_id = 42) AND (status = 'shipped'))
--   Heap Fetches: 0            ← Zero table access!
--   Execution Time: 0.045 ms

Partial Indexes: Index Only What Matters

Why index 10 million rows when your query only ever looks at 50,000 of them? A partial index includes a WHERE clause that limits which rows are indexed.

-- Only 2% of orders are 'pending', but you query them constantly
CREATE INDEX idx_orders_pending ON orders(created_at)
  WHERE status = 'pending';

-- This index is 50x smaller than a full index
-- Faster to scan, faster to maintain, less disk space

-- Uses the partial index:
SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at;

-- Cannot use it:
SELECT * FROM orders WHERE status = 'shipped' ORDER BY created_at;

Partial indexes are perfect for: active/inactive flags, unprocessed queues, soft-deleted records, and any column where you only query a small subset of values.

Index Bloat and Maintenance

PostgreSQL uses MVCC (Multi-Version Concurrency Control), which means UPDATE and DELETE operations leave dead tuples in indexes. Over time, indexes bloat — they grow larger without holding more useful data.

-- Check index bloat using pg_stat_user_indexes:
SELECT
  schemaname || '.' || relname AS table,
  indexrelname AS index,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan AS times_used,
  idx_tup_read AS rows_read
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 10;

If an index is large but idx_scan is zero, you are paying storage and write overhead for an index nobody uses. Drop it.

-- Rebuild a bloated index (locks the table briefly):
REINDEX INDEX idx_orders_customer_id;

-- Non-blocking rebuild (PostgreSQL 12+):
REINDEX INDEX CONCURRENTLY idx_orders_customer_id;

Common Anti-Patterns

1. Functions on Indexed Columns

-- This CANNOT use an index on created_at:
WHERE YEAR(created_at) = 2026
WHERE LOWER(email) = 'user@example.com'
WHERE amount * 1.1 > 100

-- Fix: rewrite to keep the column bare
WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01'

-- Or create an expression index:
CREATE INDEX idx_users_email_lower ON users(LOWER(email));

2. Indexing Low-Cardinality Columns

-- Bad: boolean column with only 2 possible values
CREATE INDEX idx_users_active ON users(is_active);
-- The index doesn't help much: 50% of rows match each value
-- The planner will choose a sequential scan anyway

-- Better: partial index if you only query one side
CREATE INDEX idx_users_active ON users(created_at) WHERE is_active = true;

3. Over-Indexing

Every index slows down INSERT, UPDATE, and DELETE operations because the database must update every affected index. A table with 15 indexes means every single write operation triggers 15 index updates. Only create indexes that serve actual query patterns.

4. Implicit Type Casting

-- Column user_id is INTEGER, but you pass a string:
WHERE user_id = '42'
-- PostgreSQL may cast and still use the index, but MySQL often will not.
-- Always match types exactly.

EXPLAIN ANALYZE: Reading the Execution Plan

Here is a real optimization walkthrough. Suppose you have this slow query:

EXPLAIN ANALYZE
SELECT o.id, o.total_amount, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
  AND o.created_at > '2026-01-01'
ORDER BY o.created_at DESC
LIMIT 20;

-- BEFORE optimization:
-- Sort  (cost=45123.45..45123.50 rows=20)
--   Sort Key: o.created_at DESC
--   Sort Method: top-N heapsort  Memory: 27kB
--   -> Hash Join  (cost=1234.56..45000.00 rows=48576)
--        Hash Cond: (o.customer_id = c.id)
--        -> Seq Scan on orders o  (cost=0.00..43210.00 rows=48576)
--              Filter: ((status = 'pending') AND (created_at > '2026-01-01'))
--              Rows Removed by Filter: 9951424
--        -> Hash  (cost=1000.00..1000.00 rows=50000)
--              -> Seq Scan on customers c
-- Planning Time: 0.234 ms
-- Execution Time: 892.456 ms

The bottleneck is the Seq Scan on orders — examining 10 million rows. Let us fix it:

-- Create a targeted composite index:
CREATE INDEX idx_orders_status_created ON orders(status, created_at DESC)
  INCLUDE (total_amount, customer_id);

-- AFTER optimization:
-- Limit  (cost=0.56..45.23 rows=20)
--   -> Nested Loop  (cost=0.56..109.45 rows=48576)
--        -> Index Only Scan Backward using idx_orders_status_created on orders o
--              Index Cond: ((status = 'pending') AND (created_at > '2026-01-01'))
--              Heap Fetches: 0
--        -> Index Scan using customers_pkey on customers c
--              Index Cond: (id = o.customer_id)
-- Planning Time: 0.187 ms
-- Execution Time: 0.342 ms

From 892ms to 0.3ms — a 2,600x improvement. The key changes: the composite index matches the WHERE + ORDER BY, and the INCLUDE clause makes it a covering index (Index Only Scan, zero heap fetches).

Index Type Cheat Sheet

Index Type Best For Supports Range Supports Sort Size
B-tree (default) Almost everything Yes Yes Medium
Hash Equality-only lookups No No Small
GIN Full-text search, arrays, JSONB No No Large
GiST Geospatial, range types, nearest-neighbor Yes No Medium
BRIN Large tables with natural ordering (timestamps) Yes No Tiny
Partial Queries targeting a small subset of rows Depends on base type Depends on base type Tiny
Covering Avoiding heap fetches for known queries Yes (B-tree) Yes (B-tree) Larger

Key Takeaways

  • Always check EXPLAIN ANALYZE before and after adding indexes — do not guess
  • Column order in composite indexes matters — equality columns first, range columns last
  • Covering indexes eliminate heap fetches — use INCLUDE for frequently selected columns
  • Partial indexes save space and speed — index only the rows you actually query
  • Audit unused indexes regularly — every index has a write cost
  • Never apply functions to indexed columns in WHERE clauses — use expression indexes instead
  • B-tree is the right choice 95% of the time — only reach for specialized types when you have a specific need

The difference between a junior and senior database engineer is not knowing that indexes exist — it is knowing which index to create, in what order, with which columns included. Master these fundamentals and you will never fear a slow query again.

Share this article

Stuck on implementation?

Get private, 1-on-1 help with system design, performance, scaling, or any technical challenge.

Book a Session

Related Production Resources

Course

Free learning tracks

Turn this guide into a structured production engineering path.

Lab

Interactive engineering labs

Practice the same ideas through scenario-based simulators.

Reference

Production cheatsheets

Keep the operational commands and checks nearby.

Glossary

Key terms

Review the vocabulary behind the architecture.

Discussion

Questions, corrections, or production notes? Add them here so other learners can benefit.

Continue Reading

Related practical guides from the same production engineering path.

DevOps 10 min read

Why Spark Jobs Become Slow: Shuffle, Skew, Partitions, and Memory

Spark jobs usually slow down for predictable reasons: too much shuffle, skewed keys, bad partition sizing, expensive file layouts, and memory pressure. Learn how to debug each one.

Spark Data Engineering
Backend 22 min read

Distributed Systems Algorithms: Consensus, Replication, and Coordination at Production Scale

How real distributed systems agree, replicate, and coordinate. Raft and Paxos consensus, leader election in etcd and Kafka, quorum reads in Cassandra, gossip in Redis Cluster, vector clocks, CRDTs, and the consistency models that determine what your system can promise.

Distributed Systems Consensus