You wrote a Python script that processes 10,000 files, but it takes 30 minutes because it handles them one by one. You've heard about "threading" and "multiprocessing" but you're not sure which to use — or what the difference even is. This guide explains Python's concurrency models from the ground up, with diagrams and real code you can run.
First: What Does "Concurrency" Mean?
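Imagine a restaurant kitchen. Sequential processing means one chef does everything: chops vegetables, then cooks meat, then plates the dish. Concurrency means multiple tasks make progress over the same period instead of strictly one after another. There are two ways to achieve this: a single chef can switch between tasks while something simmers (concurrency in the narrow sense), or several chefs can work at the same time (parallelism).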
The GIL: Python's Biggest Gotcha
Before we dive into code, you need to understand the Global Interpreter Lock (GIL). It's the single most important concept for Python concurrency.
The GIL is a mutex (lock) in CPython that allows only one thread to execute Python bytecode at a time. Even if you create 10 threads on a machine with 10 CPU cores, only one thread runs Python code at any given moment.
Why does the GIL exist? It simplifies CPython's memory management. Python objects use reference counting for garbage collection, and the GIL prevents race conditions on reference counts. Without it, every object access would need its own lock — much slower.
Key insight: The GIL only blocks CPU-bound work. When a thread does I/O (network request, file read, database query), it releases the GIL while waiting. This is why threading works great for I/O but not for computation.
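A small way to see this for yourself, offered as a sketch rather than a benchmark: the snippet below uses time.sleep as a stand-in for a blocking I/O call, since sleeping releases the GIL just as waiting on a socket does. The function names and the 0.2-second delay are illustrative assumptions.

```python
import threading
import time

def wait_like_io(delay):
    """Stand-in for a blocking I/O call; time.sleep releases the GIL while waiting."""
    time.sleep(delay)

def run_in_threads(n_threads, delay):
    """Start n_threads threads that each wait `delay` seconds; return total wall time."""
    threads = [
        threading.Thread(target=wait_like_io, args=(delay,))
        for _ in range(n_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

elapsed = run_in_threads(4, 0.2)
print(f"4 threads x 0.2s wait: {elapsed:.2f}s total")
```

If the waits ran one after another, the total would be about 0.8s. Because each thread gives up the GIL while it waits, the total should land close to 0.2s.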
Threading: Perfect for I/O-Bound Work
Use threading when your program spends most of its time waiting — for network responses, file I/O, database queries, or API calls.
import threading
import time
import requests

# ── Sequential (SLOW) ──────────────────────────
def fetch_url(url):
    response = requests.get(url)
    return f"{url}: {response.status_code}"

urls = [
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
]

# Sequential: each request waits for the previous one
start = time.time()
for url in urls:
    print(fetch_url(url))
print(f"Sequential: {time.time() - start:.1f}s")
# Output: ~5.0s (1 second per request x 5)

# ── Threaded (FAST) ───────────────────────────
results = []

def fetch_and_store(url):
    result = fetch_url(url)
    results.append(result)

start = time.time()
threads = []
for url in urls:
    t = threading.Thread(target=fetch_and_store, args=(url,))
    threads.append(t)
    t.start()

# Wait for all threads to finish
for t in threads:
    t.join()

for r in results:
    print(r)
print(f"Threaded: {time.time() - start:.1f}s")
# Output: ~1.1s (all 5 requests run simultaneously!)
# That's roughly a 5x speedup, because threads release the GIL during I/O
ThreadPoolExecutor: The Modern Way
Instead of manually creating threads, use concurrent.futures.ThreadPoolExecutor — it manages a pool of reusable threads and returns results cleanly:
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests
import time

def fetch_url(url):
    """Fetch a URL and return the status code."""
    response = requests.get(url, timeout=10)
    return {"url": url, "status": response.status_code, "size": len(response.content)}

urls = [f"https://httpbin.org/delay/{i % 3}" for i in range(10)]

# ── ThreadPoolExecutor ─────────────────────────
start = time.time()
with ThreadPoolExecutor(max_workers=5) as executor:
    # Submit all tasks
    future_to_url = {executor.submit(fetch_url, url): url for url in urls}
    # Collect results as they complete (not in submission order!)
    for future in as_completed(future_to_url):
        url = future_to_url[future]
        try:
            result = future.result()
            print(f" {result['url']}: {result['status']} ({result['size']} bytes)")
        except Exception as e:
            print(f" {url}: ERROR - {e}")
print(f"\nCompleted in {time.time() - start:.1f}s")
# The delays (0s, 1s, 2s, repeating) add up to about 9s if run one by one.
# With 5 workers, expect roughly 3-4s in total.
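The "not in submission order" comment is worth a closer look. The sketch below contrasts as_completed with executor.map, which returns results in input order. It replaces the network call with a hypothetical slow_echo helper built on time.sleep so that it runs offline; the delays are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow_echo(delay):
    """Hypothetical stand-in for a request: wait `delay` seconds, then return it."""
    time.sleep(delay)
    return delay

delays = [0.3, 0.1, 0.2]

with ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in the same order as the inputs
    in_input_order = list(executor.map(slow_echo, delays))

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(slow_echo, d) for d in delays]
    # as_completed() yields each future as soon as it finishes
    in_finish_order = [f.result() for f in as_completed(futures)]

print("map:         ", in_input_order)
print("as_completed:", in_finish_order)
```

The map result always matches the input order. The as_completed result typically comes back shortest delay first, which is useful when you want to start handling results before the slowest task is done.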
Multiprocessing: True Parallelism for CPU Work
When your program is CPU-bound (number crunching, image processing, data transformation), threads won't help because of the GIL. Instead, use multiprocessing: it starts separate Python processes, each with its own interpreter and its own GIL, so the operating system can run them on different CPU cores at the same time.
import math
import multiprocessing
import time
from concurrent.futures import ThreadPoolExecutor

# ── CPU-heavy function ─────────────────────────
def is_prime(n):
    """Check if a number is prime (CPU-intensive for large numbers)."""
    if n < 2:
        return False
    for i in range(2, int(math.sqrt(n)) + 1):
        if n % i == 0:
            return False
    return True

def count_primes(start, end):
    """Count primes in a range."""
    count = sum(1 for n in range(start, end) if is_prime(n))
    return count

RANGE_END = 500_000

# The __main__ guard is required wherever worker processes are spawned
# (Windows, macOS): each child re-imports this module on startup.
if __name__ == "__main__":
    # ── Sequential ─────────────────────────────────
    start = time.time()
    result = count_primes(0, RANGE_END)
    print(f"Sequential: {result} primes in {time.time() - start:.1f}s")
    # Output: 41538 primes in ~3.5s

    # ── Threaded (NO improvement for CPU work!) ────
    start = time.time()
    with ThreadPoolExecutor(max_workers=4) as executor:
        chunk_size = RANGE_END // 4
        futures = [
            executor.submit(count_primes, i * chunk_size, (i + 1) * chunk_size)
            for i in range(4)
        ]
        result = sum(f.result() for f in futures)
    print(f"Threaded (4 threads): {result} primes in {time.time() - start:.1f}s")
    # Output: 41538 primes in ~3.8s (SLOWER! GIL prevents parallelism)

    # ── Multiprocessing (REAL speedup!) ────────────
    start = time.time()
    with multiprocessing.Pool(processes=4) as pool:
        chunk_size = RANGE_END // 4
        chunks = [(i * chunk_size, (i + 1) * chunk_size) for i in range(4)]
        results = pool.starmap(count_primes, chunks)
        result = sum(results)
    print(f"Multiprocessing (4 processes): {result} primes in {time.time() - start:.1f}s")
    # Output: 41538 primes in ~1.0s (3.5x speedup on 4 cores!)
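Notice: threading is no faster than sequential for CPU work, and in this run it is slightly slower. The GIL forces threads to take turns, and switching between them adds overhead. Multiprocessing gets close to a 4x speedup on 4 cores because each process has its own GIL. It falls short of 4x because starting processes costs time and because the chunks holding larger numbers take longer to check. Exact timings depend on your machine.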
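One detail of the chunking in that example: RANGE_END // 4 happens to divide evenly, but for other sizes integer division would silently leave the last few numbers out. Here is a sketch of a helper that always covers the whole range. The name make_chunks is hypothetical and not part of the original code.

```python
def make_chunks(total, parts):
    """Split range(0, total) into `parts` contiguous (start, end) pairs.

    The first `total % parts` chunks get one extra element, so nothing is dropped.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    base, extra = divmod(total, parts)
    chunks = []
    start = 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append((start, end))
        start = end
    return chunks

print(make_chunks(500_000, 4))
# [(0, 125000), (125000, 250000), (250000, 375000), (375000, 500000)]
print(make_chunks(10, 4))
# [(0, 3), (3, 6), (6, 8), (8, 10)]
```

The returned list has the same shape as the chunks list in the example, so it could be passed to pool.starmap(count_primes, ...) unchanged.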
ProcessPoolExecutor: The Clean Way
from concurrent.futures import ProcessPoolExecutor
import time

def heavy_computation(n):
    """Simulate CPU-intensive work."""
    total = 0
    for i in range(n):
        total += i ** 2
    return total

numbers = [10_000_000] * 8  # 8 heavy tasks

# Guard the entry point: spawned worker processes re-import this module
if __name__ == "__main__":
    # ── ProcessPoolExecutor ────────────────────────
    start = time.time()
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(heavy_computation, numbers))
    print(f"ProcessPool: {time.time() - start:.1f}s")

    # Compare with sequential:
    start = time.time()
    results = [heavy_computation(n) for n in numbers]
    print(f"Sequential: {time.time() - start:.1f}s")
    # ProcessPool is ~3-4x faster on a 4-core machine
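The example hardcodes max_workers=4. As a sketch under the assumption that the work is purely CPU-bound, the worker count can instead be derived from the machine and the number of tasks. The helper name pick_process_workers is made up for illustration. Note that os.cpu_count() can return None, hence the fallback to 1.

```python
import os

def pick_process_workers(task_count):
    """Return a worker count no larger than the CPU count or the number of tasks."""
    cores = os.cpu_count() or 1  # cpu_count() may return None on some platforms
    return max(1, min(task_count, cores))

print(pick_process_workers(8))
```

Leaving max_workers out of ProcessPoolExecutor altogether also works: it then defaults to the number of processors on the machine.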
asyncio: The Third Option
asyncio is Python's built-in async/await framework. Like threading, it's for I/O-bound work — but instead of creating OS threads, it uses a single-threaded event loop with cooperative multitasking. It's lighter than threading and scales to thousands of concurrent connections.
import asyncio
import aiohttp
import time

async def fetch_url(session, url):
    """Fetch a URL asynchronously."""
    async with session.get(url) as response:
        content = await response.read()
        return {"url": url, "status": response.status, "size": len(content)}

async def main():
    urls = [f"https://httpbin.org/delay/{i % 3}" for i in range(10)]
    async with aiohttp.ClientSession() as session:
        # Launch ALL requests concurrently
        tasks = [fetch_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
    for r in results:
        print(f" {r['url']}: {r['status']} ({r['size']} bytes)")

start = time.time()
asyncio.run(main())
print(f"\nasyncio: {time.time() - start:.1f}s")
# About 2s: all 10 requests run at once, so the slowest one (2s delay) sets the total.
# Uses only 1 thread, and can handle 10,000+ concurrent connections efficiently
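To make "single-threaded event loop with cooperative multitasking" more concrete, here is an offline sketch. It assumes asyncio.sleep as a stand-in for a network wait and uses a hypothetical fake_request coroutine. Each coroutine records which thread it runs on, so you can check that everything happens on one thread.

```python
import asyncio
import threading
import time

async def fake_request(name, delay, log):
    """Hypothetical stand-in for a network call; asyncio.sleep yields to the event loop."""
    log.append((name, "start", threading.get_ident()))
    await asyncio.sleep(delay)  # control goes back to the event loop here
    log.append((name, "done", threading.get_ident()))
    return name

async def main():
    log = []
    start = time.perf_counter()
    names = await asyncio.gather(
        fake_request("a", 0.2, log),
        fake_request("b", 0.2, log),
        fake_request("c", 0.2, log),
    )
    elapsed = time.perf_counter() - start
    return names, log, elapsed

names, log, elapsed = asyncio.run(main())
thread_ids = {entry[2] for entry in log}
print(names, f"{elapsed:.2f}s", f"threads used: {len(thread_ids)}")
```

All three coroutines start before any of them finishes, because each one hands control back to the loop at its await. The total time is therefore close to one delay, not three.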
Real-World Example: Image Processing Pipeline
Let's build a practical pipeline that downloads images (I/O-bound) and resizes them (CPU-bound) using the right tool for each:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from PIL import Image
import requests
import io
import time

def download_image(url):
    """Download an image (I/O-bound: use threads)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Fail loudly instead of resizing an error page
    return response.content

def resize_image(image_bytes):
    """Resize an image to 300x300 (CPU-bound: use processes)."""
    img = Image.open(io.BytesIO(image_bytes))
    img = img.resize((300, 300), Image.LANCZOS)
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=85)
    return buffer.getvalue()

# Sample image URLs
image_urls = [
    "https://picsum.photos/2000/2000",  # Random 2000x2000 images
] * 20  # 20 images

# Guard the entry point: spawned worker processes re-import this module
if __name__ == "__main__":
    # Step 1: Download all images using THREADS (I/O-bound)
    start = time.time()
    with ThreadPoolExecutor(max_workers=10) as executor:
        raw_images = list(executor.map(download_image, image_urls))
    print(f"Downloaded {len(raw_images)} images in {time.time() - start:.1f}s")

    # Step 2: Resize all images using PROCESSES (CPU-bound)
    start = time.time()
    with ProcessPoolExecutor(max_workers=4) as executor:
        resized = list(executor.map(resize_image, raw_images))
    print(f"Resized {len(resized)} images in {time.time() - start:.1f}s")

# The RIGHT tool for each job:
# - Threads for downloading (waiting for network)
# - Processes for resizing (CPU-intensive pixel manipulation)
Thread Safety: Race Conditions and Locks
When multiple threads share data, you can get race conditions — bugs where the result depends on which thread runs first:
import threading

# ── BROKEN: Race condition ─────────────────────
counter = 0

def increment():
    global counter
    for _ in range(100_000):
        counter += 1  # NOT atomic! Read + Modify + Write

threads = [threading.Thread(target=increment) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("Expected: 500,000")
print(f"Actual: {counter}")  # Can come out lower, e.g. 387,421 (WRONG!)
# Why? Two threads read counter=100, both write 101, losing one increment.
# Some CPython versions switch threads rarely enough that the loss seldom
# shows up in this loop, but the code is still unsafe.

# ── FIXED: Using a Lock ───────────────────────
counter = 0
lock = threading.Lock()

def safe_increment():
    global counter
    for _ in range(100_000):
        with lock:  # Only one thread can be inside this block at a time
            counter += 1

threads = [threading.Thread(target=safe_increment) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("Expected: 500,000")
print(f"Actual: {counter}")  # Exactly 500,000 ✅

# ── BEST: Use thread-safe data structures ──────
from queue import Queue, Empty

# Queue is thread-safe by default, so no locks are needed around it
task_queue = Queue()
for i in range(1000):
    task_queue.put(i)

results = []
results_lock = threading.Lock()

def worker():
    while True:
        try:
            item = task_queue.get_nowait()
        except Empty:  # Queue drained: this worker is done
            break
        result = item ** 2  # Process the item
        with results_lock:
            results.append(result)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Processed {len(results)} items")  # 1000 ✅
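The "NOT atomic" comment in the broken example can be checked directly. The sketch below uses the standard dis module to list the bytecode instructions CPython generates for a single counter += 1. The exact instruction names in between vary by Python version, so only the load and store steps are relied on here; increment_once is a name chosen for this illustration.

```python
import dis

counter = 0

def increment_once():
    global counter
    counter += 1

# Collect the names of the bytecode instructions for the function body
opnames = [ins.opname for ins in dis.get_instructions(increment_once)]
print(opnames)

load_index = opnames.index("LOAD_GLOBAL")    # read the current value
store_index = opnames.index("STORE_GLOBAL")  # write the new value back
print(f"{store_index - load_index} instructions separate the read from the write")
```

The read and the write are separate instructions with the addition between them. A thread switch that lands in that gap lets another thread read the same old value, which is how an increment gets lost.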
Sharing Data Between Processes
Processes have isolated memory — they can't share variables like threads can. Use these mechanisms to communicate:
import multiprocessing

# ── Method 1: Shared Value ─────────────────────
def increment_shared(counter, lock):
    for _ in range(100_000):
        with lock:
            counter.value += 1

# ── Method 2: Queue (producer-consumer) ────────
def producer(queue):
    for i in range(100):
        queue.put(f"item-{i}")
    queue.put(None)  # Poison pill = "stop"

def consumer(queue, results):
    while True:
        item = queue.get()
        if item is None:
            break
        results.append(item.upper())

# Guard the entry point: spawned worker processes re-import this module
if __name__ == "__main__":
    # Method 1: a shared integer protected by a lock
    counter = multiprocessing.Value('i', 0)  # 'i' = integer
    lock = multiprocessing.Lock()
    processes = [
        multiprocessing.Process(target=increment_shared, args=(counter, lock))
        for _ in range(4)
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(f"Shared counter: {counter.value}")  # 400,000 ✅

    # Method 2: one producer and one consumer connected by a queue
    queue = multiprocessing.Queue()
    manager = multiprocessing.Manager()
    results = manager.list()  # Shared list across processes
    p1 = multiprocessing.Process(target=producer, args=(queue,))
    p2 = multiprocessing.Process(target=consumer, args=(queue, results))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    print(f"Processed: {len(results)} items")  # 100

    # ── Method 3: Pool.map (simplest for batch work) ──
    with multiprocessing.Pool(4) as pool:
        results = pool.map(str.upper, ["hello", "world", "python"])
    print(results)  # ['HELLO', 'WORLD', 'PYTHON']
The Complete Decision Guide
| Feature | Threading | Multiprocessing | asyncio |
|---|---|---|---|
| Best for | I/O-bound | CPU-bound | I/O-bound (high concurrency) |
| GIL impact | Blocked for CPU work | Bypassed (separate GILs) | Same as threading |
| Memory | Shared (lightweight) | Isolated (heavy) | Shared (lightest) |
| Overhead | Low | High (process spawn) | Lowest |
| Max concurrent | ~100-1000 threads | = CPU cores | 10,000+ tasks |
| Data sharing | Shared (need locks) | Queue / Pipe / Manager | Shared (single thread) |
| Learning curve | Easy | Medium | Medium (async/await) |
| Use when | API calls, file I/O, web scraping | Math, image processing, ML | Web servers, chat, 1000s of connections |
Common Mistakes Beginners Make
- Using threading for CPU work: The GIL means threads take turns for CPU tasks. Use multiprocessing instead.
- Creating too many processes: Each process runs its own interpreter and holds its own copy of the data it works on. 100 processes on a 4-core machine waste RAM and add scheduling overhead. Match max_workers to your CPU core count.
- Forgetting to join threads/processes: Always call .join() or use a context manager (with) to wait for completion. Otherwise your program may move on, or exit, before workers finish.
- Sharing mutable state without locks: If two threads modify the same variable, you'll get race conditions. Use threading.Lock() or thread-safe structures like Queue.
- Not handling exceptions in workers: With an executor, an exception raised in a worker is stored on its future and stays invisible until you call future.result(). A bare Thread only prints a traceback and the rest of the program carries on. Check future.result() or wrap the worker body in try/except.
- Using multiprocessing for I/O: It works, but you're paying process spawn overhead for no benefit. Use threads or asyncio for I/O.
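The point about exceptions in workers is easy to miss, so here is a minimal offline sketch. It assumes a hypothetical risky_task that fails for one input. The exception does not appear when the task runs; it is re-raised at the moment future.result() is called.

```python
from concurrent.futures import ThreadPoolExecutor

def risky_task(n):
    """Hypothetical worker that fails for one specific input."""
    if n == 3:
        raise ValueError(f"cannot process {n}")
    return n * 10

successes, failures = [], []

with ThreadPoolExecutor(max_workers=2) as executor:
    futures = {executor.submit(risky_task, n): n for n in range(5)}
    for future, n in futures.items():
        try:
            successes.append(future.result())  # re-raises the worker's exception here
        except ValueError as exc:
            failures.append((n, str(exc)))

print(successes)  # [0, 10, 20, 40]
print(failures)   # [(3, 'cannot process 3')]
```

If the loop never called future.result(), the ValueError would go unnoticed and the failed input would simply be missing from the output.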
Quick Reference
# ── I/O-bound: Use ThreadPoolExecutor ──────────
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch_url, urls))

# ── CPU-bound: Use ProcessPoolExecutor ─────────
from concurrent.futures import ProcessPoolExecutor
with ProcessPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(heavy_computation, data))

# ── High-concurrency I/O: Use asyncio ──────────
import asyncio
async def main():
    results = await asyncio.gather(*[fetch(url) for url in urls])
asyncio.run(main())

# ── Mixed workload: Combine both ───────────────
# Step 1: ThreadPool for I/O (download files)
# Step 2: ProcessPool for CPU (process files)
# This is the most common real-world pattern!
Python's concurrency story is simpler than it looks: threads for waiting, processes for computing, asyncio for massive I/O scale. The GIL is not a bug. It's a design choice that keeps single-threaded Python fast and its memory management safe. Once you understand it, choosing the right tool becomes second nature. Start with ThreadPoolExecutor and ProcessPoolExecutor: they cover the large majority of real-world concurrency needs with clean, readable code.