You wrote a Python script that processes 10,000 files, but it takes 30 minutes because it handles them one by one. You've heard about "threading" and "multiprocessing" but you're not sure which to use — or what the difference even is. This guide explains Python's concurrency models from the ground up, with diagrams and real code you can run.

First: What Does "Concurrency" Mean?

Imagine a restaurant kitchen. Sequential processing means one chef does everything — chops vegetables, then cooks meat, then plates the dish. Concurrency means multiple tasks make progress at the same time. But there are two ways to achieve this:

Concurrency vs Parallelism

🔄 Concurrency (Threading)
  • 👨‍🍳 One chef, switching tasks rapidly: chop a bit, stir the pot, chop more
  • 💻 One CPU core, time-slicing
  • 🎯 Best for: waiting tasks (I/O)

⚡ Parallelism (Multiprocessing)
  • 👨‍🍳👨‍🍳 Multiple chefs, working simultaneously: each chef handles a full dish
  • 💻 Multiple CPU cores, true parallel
  • 🎯 Best for: computation tasks (CPU)

The GIL: Python's Biggest Gotcha

Before we dive into code, you need to understand the Global Interpreter Lock (GIL). It's the single most important concept for Python concurrency.

The GIL is a mutex (lock) in standard CPython builds that allows only one thread to execute Python bytecode at a time. Even if you create 10 threads on a machine with 10 CPU cores, only one thread runs Python code at any given moment.

How the GIL Works
The GIL (Global Interpreter Lock): only ONE thread can hold the GIL and execute Python code at a time.
  • Thread 1: Hold GIL → Run code → Release GIL → Wait... (gets the lock, runs for a bit, gives it up)
  • Thread 2: Wait... → Hold GIL → Run code → Release GIL (waits its turn, then runs when Thread 1 releases)
  • Thread 3: Wait... → Wait... → Hold GIL → Run code
Threads take turns: no true parallelism for CPU work!

Why does the GIL exist? It simplifies CPython's memory management. Python objects use reference counting for garbage collection, and the GIL prevents race conditions on reference counts. Without it, every object access would need its own lock — much slower.

Key insight: The GIL only blocks CPU-bound work. When a thread does I/O (network request, file read, database query), it releases the GIL while waiting. This is why threading works great for I/O but not for computation.
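
A quick way to see this without any network access is to let time.sleep stand in for a blocking I/O call (an assumption made for the sake of a self-contained sketch; wait_for_io is a hypothetical helper). Four threads that each wait 0.2 seconds should finish in roughly 0.2 seconds in total, not 0.8.

```python
import threading
import time

def wait_for_io(seconds):
    # time.sleep stands in for a blocking I/O call; the GIL is released while it waits
    time.sleep(seconds)

start = time.perf_counter()
threads = [threading.Thread(target=wait_for_io, args=(0.2,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"4 waits of 0.2s took {elapsed:.2f}s")
```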

Threading: Perfect for I/O-Bound Work

Use threading when your program spends most of its time waiting — for network responses, file I/O, database queries, or API calls.

import threading
import time
import requests

# ── Sequential (SLOW) ──────────────────────────
def fetch_url(url):
    response = requests.get(url)
    return f"{url}: {response.status_code}"

urls = [
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
]

# Sequential: each request waits for the previous one
start = time.time()
for url in urls:
    print(fetch_url(url))
print(f"Sequential: {time.time() - start:.1f}s")
# Output: ~5.0s (1 second per request x 5)

# ── Threaded (FAST) ───────────────────────────
results = []

def fetch_and_store(url):
    result = fetch_url(url)
    results.append(result)

start = time.time()
threads = []
for url in urls:
    t = threading.Thread(target=fetch_and_store, args=(url,))
    threads.append(t)
    t.start()

# Wait for all threads to finish
for t in threads:
    t.join()

for r in results:
    print(r)
print(f"Threaded: {time.time() - start:.1f}s")
# Output: ~1.1s (all 5 requests run simultaneously!)
# That's roughly a 5x speedup, because threads release the GIL during I/O

Why Threading Works for I/O
  • Thread 1: sends its HTTP request, then waits (GIL released)
  • Thread 2: sends its HTTP request while Thread 1 is still waiting
  • Thread 3: sends its HTTP request while the other two wait
  • All responses arrive at roughly the same time

ThreadPoolExecutor: The Modern Way

Instead of manually creating threads, use concurrent.futures.ThreadPoolExecutor — it manages a pool of reusable threads and returns results cleanly:

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests
import time

def fetch_url(url):
    """Fetch a URL and return its status code and body size."""
    response = requests.get(url, timeout=10)
    return {"url": url, "status": response.status_code, "size": len(response.content)}

urls = [f"https://httpbin.org/delay/{i % 3}" for i in range(10)]

# ── ThreadPoolExecutor ─────────────────────────
start = time.time()

with ThreadPoolExecutor(max_workers=5) as executor:
    # Submit all tasks
    future_to_url = {executor.submit(fetch_url, url): url for url in urls}

    # Collect results as they complete (not in submission order!)
    for future in as_completed(future_to_url):
        url = future_to_url[future]
        try:
            result = future.result()
            print(f"  {result['url']}: {result['status']} ({result['size']} bytes)")
        except Exception as e:
            print(f"  {url}: ERROR - {e}")

print(f"\nCompleted in {time.time() - start:.1f}s")
# Delays cycle through 0s, 1s, 2s (about 9s in total if run one by one).
# With 5 workers the batch finishes in roughly 3-4s instead.
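
The difference between submission order and completion order is easy to miss. The sketch below assumes a hypothetical slow_echo task, with time.sleep standing in for a network call, and compares executor.map (results in input order) with as_completed (results as they finish).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def slow_echo(delay):
    """Hypothetical task: wait, then return the delay it was given."""
    time.sleep(delay)
    return delay

delays = [0.3, 0.1, 0.2]

with ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in the same order as the inputs
    in_input_order = list(executor.map(slow_echo, delays))

    # as_completed() yields each future as soon as it finishes
    futures = [executor.submit(slow_echo, d) for d in delays]
    in_finish_order = [f.result() for f in as_completed(futures)]

print(in_input_order)    # same order as `delays`
print(in_finish_order)   # usually shortest delay first; depends on timing
```

Use map when the position of each result matters, and as_completed when you want to react to each result as early as possible.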

Multiprocessing: True Parallelism for CPU Work

When your program is CPU-bound (number crunching, image processing, data transformation), threads won't help because of the GIL. Instead, use multiprocessing — it spawns separate Python processes, each with its own GIL and its own CPU core.

Threading vs Multiprocessing: Under the Hood

🧵 Threading
  • 💻 Same process, shared memory
  • 🔒 Shared GIL (one thread at a time)
  • Lightweight (fast to create)
  • 💾 Low memory overhead
  • Race conditions possible
  • 🎯 Best for: I/O-bound work

⚙ Multiprocessing
  • 💻 Separate processes, isolated memory
  • 🔓 Separate GIL per process (true parallel!)
  • 🐢 Heavier (slower to create)
  • 💾 Higher memory (copies of data)
  • 🛡 No shared-memory race conditions unless you share state explicitly
  • 🎯 Best for: CPU-bound work

import multiprocessing
import time
import math
from concurrent.futures import ThreadPoolExecutor  # used by the threaded comparison

# ── CPU-heavy function ─────────────────────────
def is_prime(n):
    """Check if a number is prime (CPU-intensive for large numbers)."""
    if n < 2:
        return False
    for i in range(2, int(math.sqrt(n)) + 1):
        if n % i == 0:
            return False
    return True

def count_primes(start, end):
    """Count primes in a range."""
    count = sum(1 for n in range(start, end) if is_prime(n))
    return count

RANGE_END = 500_000

# The __main__ guard is required on platforms that start workers with "spawn"
# (Windows, macOS): each worker re-imports this module, and unguarded code
# would run again inside every worker.
if __name__ == "__main__":
    chunk_size = RANGE_END // 4
    chunks = [(i * chunk_size, (i + 1) * chunk_size) for i in range(4)]

    # ── Sequential ─────────────────────────────────
    start = time.time()
    result = count_primes(0, RANGE_END)
    print(f"Sequential: {result} primes in {time.time() - start:.1f}s")
    # Output: 41538 primes in ~3.5s

    # ── Threaded (NO improvement for CPU work!) ────
    start = time.time()
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(count_primes, lo, hi) for lo, hi in chunks]
        result = sum(f.result() for f in futures)
    print(f"Threaded (4 threads): {result} primes in {time.time() - start:.1f}s")
    # Output: 41538 primes in ~3.8s (SLOWER! GIL prevents parallelism)

    # ── Multiprocessing (REAL speedup!) ────────────
    start = time.time()
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.starmap(count_primes, chunks)
        result = sum(results)
    print(f"Multiprocessing (4 processes): {result} primes in {time.time() - start:.1f}s")
    # Output: 41538 primes in ~1.0s (3.5x speedup on 4 cores!)

CPU-Bound Benchmark: Count Primes to 500K (lower is better)
  • Sequential: ~3.5s
  • Threading (4): ~3.8s
  • Multiprocessing (4): ~1.0s

Notice: Threading is actually slower than sequential for CPU work! The GIL means threads take turns, plus there's overhead from context switching. Multiprocessing gives a near-linear speedup because each process has its own GIL on its own CPU core.

ProcessPoolExecutor: The Clean Way

from concurrent.futures import ProcessPoolExecutor
import time

def heavy_computation(n):
    """Simulate CPU-intensive work."""
    total = 0
    for i in range(n):
        total += i ** 2
    return total

numbers = [10_000_000] * 8  # 8 heavy tasks

# Guard required where workers start via "spawn" (Windows, macOS)
if __name__ == "__main__":
    # ── ProcessPoolExecutor ────────────────────────
    start = time.time()
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(heavy_computation, numbers))
    print(f"ProcessPool: {time.time() - start:.1f}s")

    # Compare with sequential:
    start = time.time()
    results = [heavy_computation(n) for n in numbers]
    print(f"Sequential: {time.time() - start:.1f}s")
    # ProcessPool is ~3-4x faster on a 4-core machine
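
The examples above hard-code 4 workers. One possible way to size the pool from the machine instead is sketched below; pick_worker_count and its reserve parameter are hypothetical names chosen for illustration, not part of any library.

```python
import os

def pick_worker_count(task_count, reserve=0):
    """Return a process count for CPU-bound work.

    Never more workers than tasks, never more than the available cores
    (minus `reserve` cores kept free for other work), and at least one.
    """
    cores = os.cpu_count() or 1   # cpu_count() can return None
    return max(1, min(task_count, cores - reserve))

print(pick_worker_count(8))              # depends on the machine
print(pick_worker_count(1))              # a single task never needs more than one worker
print(pick_worker_count(8, reserve=1))   # leave one core for the rest of the system
```

The result can be passed as max_workers to ProcessPoolExecutor. Leaving max_workers unset also works: the executor then picks a default based on the machine's processor count.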

asyncio: The Third Option

asyncio is Python's built-in async/await framework. Like threading, it's for I/O-bound work — but instead of creating OS threads, it uses a single-threaded event loop with cooperative multitasking. It's lighter than threading and scales to thousands of concurrent connections.

import asyncio
import aiohttp
import time

async def fetch_url(session, url):
    """Fetch a URL asynchronously."""
    async with session.get(url) as response:
        content = await response.read()
        return {"url": url, "status": response.status, "size": len(content)}

async def main():
    urls = [f"https://httpbin.org/delay/{i % 3}" for i in range(10)]

    async with aiohttp.ClientSession() as session:
        # Launch ALL requests concurrently
        tasks = [fetch_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)

    for r in results:
        print(f"  {r['url']}: {r['status']} ({r['size']} bytes)")

start = time.time()
asyncio.run(main())
print(f"\nasyncio: {time.time() - start:.1f}s")
# ~2s in total: all 10 requests are in flight at once, so the slowest one sets the pace.
# This runs on a single thread and can scale to 10,000+ concurrent connections.
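
To see cooperative multitasking without installing aiohttp, the following sketch replaces the HTTP call with asyncio.sleep (an assumption for the sake of a self-contained example; fake_request is a hypothetical stand-in). It also records the thread ID of every task to show that a single thread does all the work.

```python
import asyncio
import threading
import time

async def fake_request(name, delay):
    await asyncio.sleep(delay)            # hands control back to the event loop
    return name, threading.get_ident()    # remember which thread ran this task

async def main():
    tasks = (fake_request(f"task-{i}", 0.2) for i in range(50))
    return await asyncio.gather(*tasks)

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

thread_ids = {thread_id for _, thread_id in results}
print(f"{len(results)} tasks on {len(thread_ids)} thread(s) in {elapsed:.2f}s")
```

Fifty waits of 0.2 seconds overlap because each await gives the event loop a chance to run another task. A blocking call such as time.sleep inside a coroutine would stall all of them.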

Real-World Example: Image Processing Pipeline

Let's build a practical pipeline that downloads images (I/O-bound) and resizes them (CPU-bound) using the right tool for each:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from PIL import Image
import requests
import io
import time

def download_image(url):
    """Download an image (I/O-bound — use threads)."""
    response = requests.get(url, timeout=10)
    return response.content

def resize_image(image_bytes):
    """Resize an image to 300x300 (CPU-bound — use processes)."""
    img = Image.open(io.BytesIO(image_bytes))
    img = img.resize((300, 300), Image.LANCZOS)
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=85)
    return buffer.getvalue()

# Sample image URLs
image_urls = [
    "https://picsum.photos/2000/2000",  # Random 2000x2000 images
] * 20  # 20 images

# Guard required where worker processes start via "spawn" (Windows, macOS)
if __name__ == "__main__":
    # Step 1: Download all images using THREADS (I/O-bound)
    start = time.time()
    with ThreadPoolExecutor(max_workers=10) as executor:
        raw_images = list(executor.map(download_image, image_urls))
    print(f"Downloaded {len(raw_images)} images in {time.time() - start:.1f}s")

    # Step 2: Resize all images using PROCESSES (CPU-bound)
    start = time.time()
    with ProcessPoolExecutor(max_workers=4) as executor:
        resized = list(executor.map(resize_image, raw_images))
    print(f"Resized {len(resized)} images in {time.time() - start:.1f}s")

# The RIGHT tool for each job:
# - Threads for downloading (waiting for network)
# - Processes for resizing (CPU-intensive pixel manipulation)

Thread Safety: Race Conditions and Locks

When multiple threads share data, you can get race conditions — bugs where the result depends on which thread runs first:

import threading

# ── BROKEN: Race condition ─────────────────────
counter = 0

def increment():
    global counter
    for _ in range(100_000):
        counter += 1  # NOT atomic! Read + Modify + Write

threads = [threading.Thread(target=increment) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

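print("Expected: 500,000")
print(f"Actual:   {counter}")  # May be lower, e.g. 387,421 (WRONG!)
# Why? Two threads read counter=100, both write 101, losing one increment
# Note: newer CPython versions (3.10+) switch threads less often mid-statement,
# so this exact loop may happen to print 500,000. It is still not guaranteed.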

# ── FIXED: Using a Lock ───────────────────────
counter = 0
lock = threading.Lock()

def safe_increment():
    global counter
    for _ in range(100_000):
        with lock:  # Only one thread can be inside this block at a time
            counter += 1

threads = [threading.Thread(target=safe_increment) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"Expected: 500,000")
print(f"Actual:   {counter}")  # Exactly 500,000 ✅

# ── BEST: Use thread-safe data structures ──────
from queue import Queue, Empty

# Queue does its own locking, so put/get need no extra lock
task_queue = Queue()
for i in range(1000):
    task_queue.put(i)

results = []
results_lock = threading.Lock()

def worker():
    while True:
        try:
            item = task_queue.get_nowait()
        except Empty:  # Queue drained: stop this worker
            break
        result = item ** 2  # Process the item
        with results_lock:
            results.append(result)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Processed {len(results)} items")  # 1000 ✅
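
The "Read + Modify + Write" claim about counter += 1 can be checked with the standard dis module. This is a small sketch with a hypothetical increment_once function; the exact opcode names vary between CPython versions, but the load and the store are always separate instructions.

```python
import dis

counter = 0

def increment_once():
    global counter
    counter += 1

# List the bytecode instructions behind the single line `counter += 1`
opnames = [ins.opname for ins in dis.get_instructions(increment_once)]
print(opnames)

# The global is loaded in one instruction and stored in a later one
print(opnames.index("LOAD_GLOBAL") < opnames.index("STORE_GLOBAL"))
```

A thread switch between the load and the store is what loses an increment, and the lock in safe_increment closes exactly that window.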

Sharing Data Between Processes

Processes have isolated memory — they can't share variables like threads can. Use these mechanisms to communicate:

import multiprocessing

# ── Method 1: Shared Value ─────────────────────
def increment_shared(counter, lock):
    for _ in range(100_000):
        with lock:
            counter.value += 1

# ── Method 2: Queue (producer-consumer) ────────
def producer(queue):
    for i in range(100):
        queue.put(f"item-{i}")
    queue.put(None)  # Poison pill = "stop"

def consumer(queue, results):
    while True:
        item = queue.get()
        if item is None:
            break
        results.append(item.upper())

# Guard required where workers start via "spawn" (Windows, macOS):
# each worker re-imports this module and must not start workers itself.
if __name__ == "__main__":
    # Method 1: Shared Value
    counter = multiprocessing.Value('i', 0)  # 'i' = integer
    lock = multiprocessing.Lock()
    processes = [
        multiprocessing.Process(target=increment_shared, args=(counter, lock))
        for _ in range(4)
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(f"Shared counter: {counter.value}")  # 400,000 ✅

    # Method 2: Queue (producer-consumer)
    queue = multiprocessing.Queue()
    manager = multiprocessing.Manager()
    results = manager.list()  # Shared list across processes

    p1 = multiprocessing.Process(target=producer, args=(queue,))
    p2 = multiprocessing.Process(target=consumer, args=(queue, results))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    print(f"Processed: {len(results)} items")  # 100

    # ── Method 3: Pool.map (simplest for batch work) ──
    with multiprocessing.Pool(4) as pool:
        results = pool.map(str.upper, ["hello", "world", "python"])
    print(results)  # ['HELLO', 'WORLD', 'PYTHON']

The Complete Decision Guide

Which Concurrency Model Should You Use?
What kind of work?
  • I/O-bound (network, files, DB)? → Threading or asyncio (threads release the GIL during I/O)
  • CPU-bound (math, data, images)? → Multiprocessing (separate GIL per process)
  • Both I/O + CPU mixed? → Threads for I/O + processes for CPU (combine both!)

Threading vs Multiprocessing vs asyncio: Complete Comparison
Feature | Threading | Multiprocessing | asyncio
Best for | I/O-bound | CPU-bound | I/O-bound (high concurrency)
GIL impact | Blocked for CPU work | Bypassed (separate GILs) | Same as threading
Memory | Shared (lightweight) | Isolated (heavy) | Shared (lightest)
Overhead | Low | High (process spawn) | Lowest
Max concurrent | ~100-1000 threads | About one per CPU core | 10,000+ tasks
Data sharing | Shared (need locks) | Queue / Pipe / Manager | Shared (single thread)
Learning curve | Easy | Medium | Medium (async/await)
Use when | API calls, file I/O, web scraping | Math, image processing, ML | Web servers, chat, 1000s of connections

Common Mistakes Beginners Make

  • Using threading for CPU work: The GIL means threads take turns for CPU tasks. Use multiprocessing instead.
  • Creating too many processes: Each process runs its own interpreter with its own memory. 100 processes on a 4-core machine wastes RAM and adds scheduling overhead. For CPU-bound work, match max_workers to your CPU core count (os.cpu_count()).
  • Forgetting to join threads/processes: Always call .join() or use a context manager (with) to wait for completion. Otherwise your main code may read results before the workers have finished, and daemon threads are killed when the program exits.
  • Sharing mutable state without locks: If two threads modify the same variable, you'll get race conditions. Use threading.Lock() or thread-safe structures like Queue.
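  • Not handling exceptions in workers: With an executor, a worker's exception is stored in its future and only surfaces when you call future.result(). A plain threading.Thread just prints the traceback while the main thread carries on. Check future.result() or wrap the worker body in try/except.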
  • Using multiprocessing for I/O: It works, but you're paying process spawn overhead for no benefit. Use threads or asyncio for I/O.
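
The point about worker exceptions is worth seeing in code. In this sketch parse_positive is a hypothetical task that fails for some inputs; the failure surfaces in the main thread only at the future.result() call.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_positive(text):
    """Hypothetical task: parse a positive integer or raise ValueError."""
    value = int(text)                      # raises ValueError for non-numeric input
    if value <= 0:
        raise ValueError(f"not positive: {value}")
    return value

inputs = ["3", "oops", "-1", "7"]
parsed, errors = [], []

with ThreadPoolExecutor(max_workers=2) as executor:
    futures = {executor.submit(parse_positive, text): text for text in inputs}
    for future, text in futures.items():
        try:
            parsed.append(future.result())   # the worker's exception is re-raised here
        except ValueError as exc:
            errors.append((text, str(exc)))

print(parsed)   # [3, 7]
print(errors)   # one (input, message) pair per failed task
```

If the loop never called future.result(), both failures would go unnoticed and the program would simply end up with fewer results than inputs.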

Quick Reference

# ── I/O-bound: Use ThreadPoolExecutor ──────────
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch_url, urls))

# ── CPU-bound: Use ProcessPoolExecutor ─────────
from concurrent.futures import ProcessPoolExecutor
with ProcessPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(heavy_computation, data))

# ── High-concurrency I/O: Use asyncio ──────────
import asyncio
async def main():
    results = await asyncio.gather(*[fetch(url) for url in urls])
asyncio.run(main())

# ── Mixed workload: Combine both ───────────────
# Step 1: ThreadPool for I/O (download files)
# Step 2: ProcessPool for CPU (process files)
# This is the most common real-world pattern!

Python's concurrency story is simpler than it looks: threads for waiting, processes for computing, asyncio for massive I/O scale. The GIL is not a bug — it's a design choice that makes single-threaded Python fast and safe. Once you understand it, choosing the right tool becomes second nature. Start with ThreadPoolExecutor and ProcessPoolExecutor — they handle 95% of real-world concurrency needs with clean, readable code.