Why Spark Jobs Become Slow: Shuffle, Skew, Partitions, and Memory

Spark jobs usually slow down for predictable reasons: too much shuffle, skewed keys, bad partition sizing, expensive file layouts, and memory pressure. Learn how to debug each one.

Why Spark Jobs Become Slow: Shuffle, Skew, Partitions, and Memory illustration
On this page18 sections

Spark jobs usually become slow for boring reasons. A job shuffles too much data. One partition gets most of the work. Files are too small. Executors spill to disk. A join chooses the wrong strategy. The cluster is not always the first problem; the physical plan usually is.

This guide gives you a practical mental model for debugging slow Spark jobs without guessing.

First Check: Is the Job CPU, IO, Shuffle, or Memory Bound?

Open the Spark UI and look at stages. Slow Spark jobs usually show one of these patterns:

  • Many tasks, all similarly slow: likely IO, expensive parsing, or heavy computation.
  • A few tasks much slower than the rest: likely data skew.
  • Large shuffle read/write: join, groupBy, distinct, repartition, or window operation is moving data.
  • High spill to disk: executor memory pressure or aggregation/join state is too large.
  • Too many tiny tasks: small files or over-partitioning.

Shuffle: The Expensive Data Movement

A shuffle happens when Spark must move rows across the cluster so rows with the same key land together. Joins, aggregations, repartitions, window functions, and distinct operations often trigger shuffles.

-- This usually shuffles orders by customer_id
SELECT customer_id, SUM(order_total)
FROM orders
GROUP BY customer_id;

Shuffles are not bad by themselves. They become a problem when the data moved is huge, the partition count is wrong, or the same shuffle happens repeatedly because intermediate results are not persisted or written in a usable layout.

Skew: One Key Does Most of the Work

Skew means the data is not evenly distributed. One customer, one country, one null key, or one event type may have far more rows than the rest. Spark runs many tasks in parallel, but the whole stage waits for the slowest task.

Common signs:

  • One task processes gigabytes while others process megabytes.
  • A join stage sits at 99% for a long time.
  • Most executors are idle while one task keeps running.

Fixes depend on the query. You can filter bad null keys, broadcast a small side, split hot keys, salt keys, pre-aggregate, or let Adaptive Query Execution handle skewed joins when it applies.

Partitions: Too Few, Too Many, or Badly Sized

Partitions are Spark's unit of parallel work. Too few partitions underuse the cluster. Too many partitions create scheduler overhead and tiny output files. Bad partitions make some tasks huge and others empty.

# Useful inspection pattern
df.rdd.getNumPartitions()

# Repartition by a high-cardinality key before a heavy keyed operation
df = df.repartition(400, "customer_id")

# Coalesce down before writing small outputs
df.coalesce(32).write.mode("overwrite").parquet(output_path)

There is no universal partition count. Use data size, file size targets, cluster cores, and shuffle volume. For SQL workloads, Spark's adaptive execution can coalesce post-shuffle partitions when enabled.

Memory: When Spark Spills to Disk

Spark uses memory for execution state, joins, aggregations, sorting, and caching. When memory is insufficient, Spark spills to disk. Some spilling is normal. Heavy spill usually means the physical plan is holding too much state per task.

Useful fixes include reducing columns early, filtering before joins, using broadcast joins for genuinely small tables, avoiding unnecessary cache, improving partitioning, and rewriting huge aggregations into staged aggregations.

File Layout: The Hidden Spark Tax

Even a good query is slow if the table layout is bad. Thousands of tiny files increase planning and task overhead. Oversized files reduce parallelism. Missing partition pruning causes full scans. Poor table statistics can lead to bad join choices.

For lakehouse tables, compaction, clustering, partition evolution, statistics, and table maintenance matter as much as Spark configuration.

Debugging Checklist

  • Read the physical plan before changing cluster size.
  • Check shuffle read/write size per stage.
  • Look for skewed tasks and long tails.
  • Check input file count and average file size.
  • Confirm partition pruning is actually happening.
  • Review spills, executor lost events, GC time, and failed tasks.
  • Enable or review Adaptive Query Execution settings where supported.
  • Write intermediate data in a layout that matches downstream access.

Diagnostic Flow Diagram

Slow Spark jobs are easier to debug if you follow the symptom to the likely physical cause. The flow below is intentionally simple. It keeps you from changing ten settings at once and losing the signal.

Read the Physical Plan Like a Debugger

The logical query is what you asked for. The physical plan is how Spark intends to do it. A slow query usually becomes obvious when you look for exchanges, sorts, broadcast decisions, full scans, and repeated expensive operations. In Spark SQL, the word Exchange usually means data movement. Large exchanges deserve attention.

-- SQL inspection pattern
EXPLAIN FORMATTED
SELECT customer_id, SUM(order_total)
FROM orders
WHERE order_date >= DATE '2026-01-01'
GROUP BY customer_id;

-- PySpark inspection pattern
df.explain("formatted")

Do not treat every exchange as a bug. A group by needs rows with the same key together. A join may need data movement. The question is whether the shuffle is expected, whether it is sized correctly, and whether the same data movement could be avoided through filtering, pre-aggregation, partition pruning, broadcast joins, or better table layout.

Skew Debugging: Find the Hot Key First

Skew fixes are dangerous when you apply them blindly. Salting every key, repartitioning every table, or increasing shuffle partitions can make the job more expensive without removing the long tail. Start by finding the hot key or the hot partition. Nulls, default values, bot traffic, one huge customer, and one country can all create a single partition that does most of the work.

# Find suspiciously hot join keys before the join
from pyspark.sql import functions as F

orders.groupBy("customer_id").count()   .orderBy(F.desc("count"))   .show(20, truncate=False)

If the hot key is invalid, filter it or handle it separately. If the hot key is legitimate, split that part of the workload. For joins, try broadcasting the genuinely small side, pre-aggregating before the join, or salting only the hot keys. Adaptive Query Execution can help with skewed joins in supported cases, but it is not a substitute for understanding the data distribution.

Partition Sizing Rules That Survive Production

Partition advice is often repeated as a magic number. In practice, the right partition count depends on data size, cluster cores, compression, file format, operation type, and downstream write target. A better rule is to make partitions large enough to avoid scheduler overhead and small enough to keep tasks balanced and memory-safe.

Symptom Likely partition issue Practical response
Cluster underused Too few partitions or one huge partition. Increase parallelism or repartition by a useful key before the heavy stage.
Scheduler overhead high Too many tiny tasks. Coalesce after shuffle where safe, compact small files, avoid needless repartition.
Many tiny output files Write partitioning does not match output size. Coalesce or tune write distribution and run table compaction.
Stage stuck at 99% Skewed partition. Inspect task input size and hot keys before changing cluster size.

Memory Pressure Is Often a Query Shape Problem

Adding executor memory is sometimes correct, but it should not be the first reflex. Memory pressure often means the query is holding too much state per task: a huge aggregation group, a sort over too many columns, an accidental cross join, a cache of a wide dataframe, or a join strategy that builds a large hash table.

Before increasing memory, reduce the width of the data. Select only needed columns. Filter early. Avoid caching wide intermediate data unless multiple downstream steps reuse it. Replace repeated transformations with a persisted table when the same expensive stage is reused by many jobs. Look at spill metrics after each change. If the spill disappears and the stage time drops, you fixed the shape of the work rather than merely buying more memory.

Production Runbook for a Slow Job

When a Spark job suddenly gets slower, avoid starting with code changes. First decide whether the data changed, the cluster changed, the engine changed, or the query plan changed. A new partition, a new source file pattern, a changed join key, a larger dimension table, or a disabled table maintenance job can all make yesterday's acceptable query slow today.

A good runbook captures comparable evidence. Save the Spark application ID, input data range, table snapshot or version where available, cluster shape, Spark version, key configuration, physical plan, stage metrics, and output row count. That record lets another engineer reproduce the investigation instead of relying on memory.

slow_spark_runbook:
  capture:
    - application_id
    - input_tables_and_versions
    - cluster_cores_and_memory
    - spark_sql_shuffle_partitions
    - physical_plan
    - slowest_stage_metrics
    - output_row_count
  compare:
    - "last successful run with similar data"
    - "shuffle bytes and spill bytes"
    - "max task time versus median task time"
    - "input file count and average file size"

Join Strategy Mistakes

Many slow Spark jobs are join problems. A small table that used to broadcast becomes too large and starts shuffling. A large table joins on a low-cardinality key and creates skew. A filter is applied after the join instead of before it. A join includes unnecessary columns, increasing shuffle size. These mistakes are not fixed by more executors as cleanly as they are fixed by changing the plan.

Before tuning, ask whether the join can be made smaller. Filter both sides before the join. Select only needed columns. Check whether the dimension table is genuinely small enough to broadcast. Confirm that the join key is not mostly null. For slowly changing dimensions, verify that the effective-date condition is not turning a simple lookup into a huge range join.

# Reduce data before the join
orders_small = orders.select("order_id", "customer_id", "order_total")
customers_small = customers.select("customer_id", "region")

result = orders_small.join(customers_small, "customer_id")

File Layout and Table Maintenance

Spark is often blamed for table layout problems. A table with tens of thousands of tiny files creates too many tasks and expensive planning. A table with huge files can reduce parallelism. A partition strategy based on a high-cardinality column can create endless small directories. A missing compaction job can turn normal streaming ingestion into a slow batch query weeks later.

For lakehouse tables, maintenance is part of performance engineering. Compact small files, keep statistics current, remove obsolete snapshots according to policy, and avoid partition columns that create unbounded cardinality. Use partitioning for pruning, not as a substitute for indexing. If most queries filter by date and region, design around that access pattern instead of copying the source-system folder layout.

What Not to Do

  • Do not raise executor memory before checking whether the job is spilling because of a bad join or aggregation shape.
  • Do not increase shuffle partitions globally because one query is slow; that can make other workloads worse.
  • Do not cache every intermediate dataframe. Cache only when reuse is real and the cached data fits safely.
  • Do not repartition by a low-cardinality key and expect balanced tasks.
  • Do not assume Adaptive Query Execution removes the need for table statistics and good query shape.
  • Do not compare benchmark runs unless the data volume, layout, cluster, and configuration are comparable.

The best Spark teams treat performance as an observability problem. They keep historical run metrics, review query plans during code review for critical pipelines, and tie table maintenance failures to data freshness alerts. That is how performance stays stable after the first optimization sprint ends.

Sources and Further Reading

Share this article

Stuck on implementation?

Get private, 1-on-1 help with system design, performance, scaling, or any technical challenge.

Book a Session

Related Production Resources

Course

Free learning tracks

Turn this guide into a structured production engineering path.

Lab

Interactive engineering labs

Practice the same ideas through scenario-based simulators.

Reference

Production cheatsheets

Keep the operational commands and checks nearby.

Glossary

Key terms

Review the vocabulary behind the architecture.

Discussion

Questions, corrections, or production notes? Add them here so other learners can benefit.

Continue Reading

Related practical guides from the same production engineering path.

DevOps 8 min read

Modern Data Platforms Compared: Snowflake, Databricks, BigQuery, and e6data

Compare Snowflake, Databricks, BigQuery, and e6data through the production decisions that matter: storage, compute, governance, table formats, cost control, and workload fit.

Data Engineering Snowflake
Tutorials 9 min read

Bronze, Silver, and Gold Data Layers Explained

Learn how bronze, silver, and gold layers organize raw events, cleaned facts, and business-ready datasets without turning your lakehouse into a pile of duplicated tables.

Data Engineering Lakehouse