Python performance depends less on the language than on how you use it. The gap between naive Python and optimized Python is often 10–100x, and most of that gap can be closed without leaving Python at all. These are 20 techniques working engineers use in production.
⚡ TL;DR: Profile first (80% of time is in 20% of code). Fix algorithmic complexity next. Then: use local variables, list comprehensions, built-ins, NumPy for numeric work, generators for large data, __slots__ for many objects, and multiprocessing for CPU-bound work.
1. Profile before optimizing
import cProfile, pstats, io
from functools import wraps

def profile(func):
    @wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        pr = cProfile.Profile()
        pr.enable()
        result = func(*args, **kwargs)
        pr.disable()
        s = io.StringIO()
        ps = pstats.Stats(pr, stream=s).sort_stats('cumulative')
        ps.print_stats(20)  # Top 20 entries by cumulative time
        print(s.getvalue())
        return result
    return wrapper

@profile
def my_function():
    pass
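As a variation on the decorator, here is a minimal sketch that profiles a single call with the context-manager form of cProfile.Profile (available since Python 3.8). The names busy_work and profile_call are hypothetical, chosen only for this illustration.

```python
import cProfile
import io
import pstats


def busy_work(n):
    """Hypothetical workload: sum of squares below n."""
    return sum(i * i for i in range(n))


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    with cProfile.Profile() as pr:  # enables on entry, disables on exit
        result = func(*args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(pr, stream=buf).sort_stats("cumulative")
    stats.print_stats(5)  # five entries with the largest cumulative time
    return result, buf.getvalue()


result, report = profile_call(busy_work, 100_000)
print(report)
```

Returning the report as a string, rather than printing inside the helper, makes it easier to log or compare runs.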
2. Local variables beat global lookups
import math

# SLOW: looks up the global `math`, then its `sqrt` attribute, on every iteration
def slow_sqrt(numbers):
    return [math.sqrt(n) for n in numbers]

# FAST: resolve the attribute once and call a local name.
# About 33% faster here; the gain depends on the Python version, so measure it.
def fast_sqrt(numbers):
    local_sqrt = math.sqrt
    return [local_sqrt(n) for n in numbers]
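Because the size of this gain varies between interpreter versions, it is worth measuring on your own setup. A minimal sketch using timeit follows; the input size and repeat counts are arbitrary assumptions.

```python
import math
import timeit


def slow_sqrt(numbers):
    # Global lookup of `math` plus attribute lookup of `sqrt` per element
    return [math.sqrt(n) for n in numbers]


def fast_sqrt(numbers):
    local_sqrt = math.sqrt  # resolved once, then a fast local lookup
    return [local_sqrt(n) for n in numbers]


numbers = list(range(100_000))

# Take the minimum of several repeats to reduce noise from other processes
t_slow = min(timeit.repeat(lambda: slow_sqrt(numbers), number=10, repeat=3))
t_fast = min(timeit.repeat(lambda: fast_sqrt(numbers), number=10, repeat=3))
print(f"module lookup: {t_slow:.4f}s  local alias: {t_fast:.4f}s")
```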
3. NumPy vectorization — biggest win for numeric code
import numpy as np

# Python loop: ~320ms on 1M elements
def python_loop(data):
    return [x**2 + 2*x + 1 for x in data]

# NumPy vectorized: ~3ms, roughly 100x faster.
# The timing assumes data is already an array; converting a large list costs extra.
def numpy_vectorized(data):
    arr = np.asarray(data)  # no copy when data is already an ndarray
    return arr**2 + 2*arr + 1

# Boolean indexing instead of filtering:
data = np.random.randn(1_000_000)
positive = data[data > 0]  # Much faster than filtering a Python list
4. __slots__ for memory and speed
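import sys

class PointSlotted:
    __slots__ = ('x', 'y', 'z')

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

print(sys.getsizeof(PointSlotted(1, 2, 3)))  # shallow size of one slotted instance

# Approximate figures; exact sizes depend on the Python version:
# Regular class: ~360 bytes per instance
# Slotted class: ~64 bytes per instance (5.6x smaller)
# For 1M objects: 360MB vs 64MB
# Attribute access also ~30% faster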
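The per-instance figures depend heavily on the interpreter version, so here is a rough sketch for measuring them yourself with tracemalloc. PointRegular and bytes_per_instance are hypothetical names added for the comparison, and the result includes the list slot that holds each object.

```python
import tracemalloc


class PointRegular:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z


class PointSlotted:
    __slots__ = ("x", "y", "z")

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z


def bytes_per_instance(cls, count=100_000):
    """Approximate heap bytes per instance, including its list slot."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    points = [cls(1.0, 2.0, 3.0) for _ in range(count)]  # kept alive while measuring
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return (after - before) / count


regular = bytes_per_instance(PointRegular)
slotted = bytes_per_instance(PointSlotted)
print(f"regular: {regular:.0f} B/instance  slotted: {slotted:.0f} B/instance")
```

The trade-off is that a slotted instance has no __dict__, so attributes outside __slots__ cannot be added at runtime.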
5. Generators for large datasets
# Pipeline: O(1) memory regardless of data size
# (clean, is_valid, transform, read_file and write_output are placeholders
# for your own functions; read_file must itself yield rows lazily)
def pipeline(source):
    cleaned = (clean(row) for row in source)
    filtered = (row for row in cleaned if is_valid(row))
    transformed = (transform(row) for row in filtered)
    return transformed

# Process 100GB file without loading into memory:
for result in pipeline(read_file('huge.csv')):
    write_output(result)
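To make the pipeline concrete, here is a self-contained sketch with stand-in helpers. The bodies of clean, is_valid and transform are assumptions made up for this example, and an in-memory io.StringIO takes the place of the large file.

```python
import io


def clean(row):
    # Assumed cleaning step: strip surrounding whitespace and the newline
    return row.strip()


def is_valid(row):
    # Assumed validity rule: skip blank lines and '#' comments
    return bool(row) and not row.startswith("#")


def transform(row):
    # Assumed transformation: upper-case the row
    return row.upper()


def pipeline(source):
    cleaned = (clean(row) for row in source)
    filtered = (row for row in cleaned if is_valid(row))
    return (transform(row) for row in filtered)


# A file object yields one line at a time, so only one row is in flight
source = io.StringIO("alpha\n\n# comment\n beta \n")
results = list(pipeline(source))
print(results)  # ['ALPHA', 'BETA']
```

Nothing runs until the final loop (here, list) pulls items through, so each row passes through all three stages before the next one is read.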
6. Multiprocessing for CPU-bound work
from multiprocessing import Pool
import os

def cpu_intensive(n):
    return sum(i**2 for i in range(n))

if __name__ == "__main__":  # required where workers are spawned (Windows, macOS)
    data = [10_000_000] * 8

    # Sequential: runs on one core (threads would not help here because of the GIL)
    results = [cpu_intensive(n) for n in data]

    # Multiprocessing: true parallelism, up to ~8x speedup on an 8-core machine
    with Pool(processes=os.cpu_count()) as pool:
        results = pool.map(cpu_intensive, data)
7. Use set for membership testing
# x in list: O(n)
# x in set: O(1) on average
# (user_ids and process are placeholders for your own data and handler)

# SLOW: O(n) per check
valid_ids = list(range(1, 100_001))  # list
for user_id in user_ids:
    if user_id in valid_ids:  # O(n) every time!
        process(user_id)

# FAST: O(1) per check
valid_ids = set(range(1, 100_001))
for user_id in user_ids:
    if user_id in valid_ids:  # O(1)
        process(user_id)
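A quick way to see the difference on your own machine is to time a worst-case lookup, where the value is absent and the list must be scanned to the end. This is a rough sketch; the collection size and repeat count are arbitrary.

```python
import timeit

ids_list = list(range(100_000))
ids_set = set(ids_list)
missing = -1  # absent value: the list compares every element before giving up

t_list = timeit.timeit(lambda: missing in ids_list, number=200)
t_set = timeit.timeit(lambda: missing in ids_set, number=200)
print(f"list: {t_list:.5f}s  set: {t_set:.7f}s")
```

Building the set costs O(n) once, so the conversion pays off when the same collection is checked many times.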
8. deque for queue operations
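from collections import deque

# list.pop(0): O(n) per pop, because every remaining element shifts left
queue = list(range(100_000))
while queue:
    queue.pop(0)
# Total for 100K pops: O(n²), about 2.4 seconds

# deque.popleft(): O(1)
queue = deque(range(100_000))
while queue:
    queue.popleft()
# Total for 100K pops: O(n), about 8ms
# Roughly 300x faster (timings are machine-dependent)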
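A typical place where this matters is a FIFO work queue. The sketch below uses a deque for a breadth-first traversal; the graph and the function name bfs_order are hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical adjacency list for the example
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}


def bfs_order(graph, start):
    """Return nodes in breadth-first order starting from `start`."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()  # O(1) removal from the left end
        order.append(node)
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)  # O(1) append on the right end
    return order


print(bfs_order(graph, "a"))  # ['a', 'b', 'c', 'd']
```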
9. functools.cache (lru_cache) for repeated calls
from functools import cache  # Python 3.9+; equivalent to lru_cache(maxsize=None)

@cache
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

# Without cache: O(2ⁿ) exponential time
# With cache: O(n), each value computed once
fibonacci(300)  # Returns instantly (much larger n can still hit the recursion limit)
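On Python versions before 3.9, or when the cache should be bounded, lru_cache with an explicit maxsize does the same job. A small sketch follows; the maxsize of 128 is an arbitrary choice, and cache_info() is used to check that the cache is actually being hit.

```python
from functools import lru_cache


@lru_cache(maxsize=128)  # bounded: least recently used entries are evicted
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


value = fibonacci(30)
info = fibonacci.cache_info()  # named tuple: hits, misses, maxsize, currsize
print(value)  # 832040
print(info)
```

Arguments must be hashable, and the cache keeps references to results, so an unbounded cache on a function with many distinct inputs can grow without limit.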
10. String joining beats concatenation
# SLOW: up to O(n²), since each concatenation may copy the whole string
result = ""
for part in parts:
    result += part

# FAST: O(n), one allocation
# (CPython can sometimes extend the string in place for +=, so measure the gain)
result = "".join(parts)  # can be 10-100x faster for long lists

# f-strings are fastest for formatting:
name, value = "x", 42
f"{name} = {value}"  # Fastest
"{} = {}".format(name, value)  # Slower
11-20: Power user techniques
- 11. List comprehensions over loops: they compile to specialized bytecode and skip the repeated append lookup and call, often ~40% faster than equivalent for loops
- 12. Use built-ins: sum(), min(), max(), any(), all() run in C and are usually faster than equivalent Python loops
- 13. operator.methodcaller vs lambda: key=methodcaller('lower') is about 15% faster than key=lambda x: x.lower()
- 14. Avoid imports inside hot functions: import re inside a function re-executes the import statement on every call. The module is cached in sys.modules after the first import, but the lookup still costs time. Import at module level.
- 15. orjson instead of json: 3–10x faster JSON serialization (Rust-backed)
- 16. bytearray for mutable bytes: much faster than repeated bytes concatenation
- 17. itertools chains: chain, islice, groupby are C-level iteration, faster than manual Python
- 18. Cython for hot paths: annotate types in .pyx, compile to C. 10–100x speedup on numeric loops
- 19. PyPy for long-running scripts: a largely drop-in CPython replacement with a JIT. 5–50x speedup for pure Python; check C-extension compatibility first
- 20. asyncio for I/O-bound concurrency: thousands of concurrent I/O operations on a single thread via the event loop
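Several of the standard-library items above (12, 13, 16 and 17) fit in one short sketch. The sample data is made up for illustration, and the snippet shows usage only, not timings.

```python
from itertools import chain, islice
from operator import methodcaller

words = ["banana", "Apple", "cherry"]

# Item 13: methodcaller builds the key function without a lambda
by_lower = sorted(words, key=methodcaller("lower"))

# Item 12: built-ins do the looping in C
lengths = [len(w) for w in words]
total, longest = sum(lengths), max(lengths)
has_capital = any(w[0].isupper() for w in words)

# Item 17: lazily join two iterables and take the first four items
first_four = list(islice(chain(words, ["date", "elderberry"]), 4))

# Item 16: grow a mutable buffer in place, convert to bytes once at the end
buf = bytearray()
for chunk in (b"ab", b"cd", b"ef"):
    buf += chunk
payload = bytes(buf)

print(by_lower, total, longest, has_capital, first_four, payload)
```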
These techniques build on the foundation from the Python __slots__ deep dive and the GIL threading guide — understanding the GIL is essential for knowing when multiprocessing beats threading. External reference: Python cProfile documentation.
