Python Performance Secrets: 20 Techniques to Make Your Code 10x Faster

Python performance is less about the language than about how you use it. The gap between naive and optimized Python can be 10–100x, and most of that gap is closeable without leaving Python at all. These are the 20 techniques working engineers use in production.

TL;DR: Profile first (80% of time is in 20% of code). Fix algorithmic complexity first. Then: use local variables, list comprehensions, built-ins, NumPy for numeric, generators for large data, __slots__ for many objects, and multiprocessing for CPU-bound work.

1. Profile before optimizing

import cProfile, functools, io, pstats

def profile(func):
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        pr = cProfile.Profile()
        pr.enable()
        try:
            result = func(*args, **kwargs)
        finally:
            pr.disable()  # stop profiling even if func raises
        s = io.StringIO()
        ps = pstats.Stats(pr, stream=s).sort_stats('cumulative')
        ps.print_stats(20)  # Top 20 entries by cumulative time
        print(s.getvalue())
        return result
    return wrapper

@profile
def my_function():
    pass
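
For a one-off measurement, the same kind of report can be produced without a decorator. This is a minimal sketch, assuming Python 3.8+ (where cProfile.Profile works as a context manager). The names build_squares and profile_call are made up for illustration and are not from the article.

```python
import cProfile
import io
import pstats


def build_squares(n):
    """Hypothetical workload used only for illustration."""
    return [i * i for i in range(n)]


def profile_call(func, *args, top=10):
    """Run func under cProfile and return (result, report text)."""
    with cProfile.Profile() as pr:  # context-manager form needs Python 3.8+
        result = func(*args)
    buf = io.StringIO()
    stats = pstats.Stats(pr, stream=buf).sort_stats("cumulative")
    stats.print_stats(top)  # limit the report to the top entries
    return result, buf.getvalue()


result, report = profile_call(build_squares, 100_000)
print(report)
```

Returning the report as a string instead of printing it inside the helper makes it easy to log or save it.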

2. Local variables beat global lookups

import math

# SLOW: attribute lookup on module every iteration
def slow_sqrt(numbers):
    return [math.sqrt(n) for n in numbers]

# FAST: cache the attribute lookup in a local (roughly 33% faster in some benchmarks; the gain varies by Python version)
def fast_sqrt(numbers):
    local_sqrt = math.sqrt
    return [local_sqrt(n) for n in numbers]
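
The size of this gain depends on the interpreter, so it is worth measuring rather than assuming. The sketch below times both versions with timeit. The helper name compare and the input size are arbitrary choices for illustration.

```python
import math
import timeit


def slow_sqrt(numbers):
    return [math.sqrt(n) for n in numbers]


def fast_sqrt(numbers):
    local_sqrt = math.sqrt  # one attribute lookup instead of one per item
    return [local_sqrt(n) for n in numbers]


def compare(numbers, repeat=5, number=20):
    """Return the best-of-`repeat` time in seconds for each version."""
    slow = min(timeit.repeat(lambda: slow_sqrt(numbers), repeat=repeat, number=number))
    fast = min(timeit.repeat(lambda: fast_sqrt(numbers), repeat=repeat, number=number))
    return slow, fast


numbers = list(range(10_000))
slow, fast = compare(numbers)
print(f"math.sqrt: {slow:.4f}s  local alias: {fast:.4f}s")
```

Taking the minimum of several repeats reduces noise from other processes on the machine.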

3. NumPy vectorization — biggest win for numeric code

import numpy as np

# Python loop: 320ms on 1M elements
def python_loop(data):
    return [x**2 + 2*x + 1 for x in data]

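# NumPy vectorized: ~3ms, about 100x faster once the data is already an array
def numpy_vectorized(data):
    arr = np.asarray(data)  # no copy for an ndarray; converting a list adds its own cost
    return arr**2 + 2*arr + 1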

# Boolean indexing instead of filtering:
data = np.random.randn(1_000_000)
positive = data[data > 0]  # Much faster than list filter

4. __slots__ for memory and speed

class PointSlotted:
    __slots__ = ('x', 'y', 'z')  # no per-instance __dict__
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

# Approximate figures; exact sizes depend on the Python version:
# Regular class: ~360 bytes per instance
# Slotted class: ~64 bytes per instance (5.6x smaller)
# For 1M objects: ~360MB vs ~64MB
# Attribute access is also up to ~30% faster
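
Per-instance sizes differ between Python versions, so the numbers are best checked on your own interpreter. The following is a rough sketch using tracemalloc. The Point class and the bytes_per_instance helper are assumptions added for the comparison, and the result includes the list slot that holds each object.

```python
import tracemalloc


class Point:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z


class PointSlotted:
    __slots__ = ("x", "y", "z")

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z


def bytes_per_instance(cls, count=100_000):
    """Approximate heap bytes per instance, including the list slot holding it."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    objs = [cls(1.0, 2.0, 3.0) for _ in range(count)]
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return (after - before) / len(objs)


regular = bytes_per_instance(Point)
slotted = bytes_per_instance(PointSlotted)
print(f"regular: {regular:.0f} B  slotted: {slotted:.0f} B")
```

One trade-off to keep in mind: a slotted instance rejects attributes that are not listed in __slots__.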

5. Generators for large datasets

# Pipeline: O(1) memory regardless of data size
# (clean, is_valid, transform, read_file and write_output are placeholders for your own functions)
def pipeline(source):
    cleaned = (clean(row) for row in source)
    filtered = (row for row in cleaned if is_valid(row))
    transformed = (transform(row) for row in filtered)
    return transformed

# Process a 100GB file without loading it into memory (read_file must yield rows lazily):
for result in pipeline(read_file('huge.csv')):
    write_output(result)
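
The pipeline above leaves its helpers undefined. Here is one runnable sketch of the same shape. The bodies of clean, is_valid and transform, the read_rows helper, and the sample data are all invented for illustration; io.StringIO stands in for a real file.

```python
import io


def read_rows(stream):
    """Yield one line at a time; the whole file is never held in memory."""
    for line in stream:
        yield line.rstrip("\n")


def clean(row):
    return row.strip().lower()


def is_valid(row):
    # Skip blank lines and comment lines.
    return bool(row) and not row.startswith("#")


def transform(row):
    return row.split(",")


def pipeline(source):
    cleaned = (clean(row) for row in source)
    filtered = (row for row in cleaned if is_valid(row))
    return (transform(row) for row in filtered)


# io.StringIO stands in for a file object returned by open(path).
sample = io.StringIO("A,1\n\n# comment\n  B,2  \n")
results = list(pipeline(read_rows(sample)))
print(results)
```

Nothing runs until the final consumer iterates, so each row passes through all three stages before the next row is read.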

6. Multiprocessing for CPU-bound work

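from multiprocessing import Pool
import os

def cpu_intensive(n):
    return sum(i**2 for i in range(n))

if __name__ == "__main__":  # required where workers are spawned (Windows, macOS)
    data = [10_000_000] * 8

    # Sequential: one core does all the work
    results = [cpu_intensive(n) for n in data]

    # Multiprocessing: true parallelism, up to ~8x speedup on an 8-core machine
    # (threads would not help here: the GIL serializes CPU-bound Python code)
    with Pool(processes=os.cpu_count()) as pool:
        results = pool.map(cpu_intensive, data)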

7. Use set for membership testing

# x in list: O(n)
# x in set:  O(1)

# SLOW: O(n) per check
valid_ids = list(range(1, 100_001))  # list of 100,000 IDs
for user_id in user_ids:
    if user_id in valid_ids:  # O(n) every time!
        process(user_id)

# FAST: O(1) per check
valid_ids = set(range(1, 100_001))  # build the set once, outside the loop
for user_id in user_ids:
    if user_id in valid_ids:  # O(1)
        process(user_id)
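
To see the difference on your own machine, the comparison can be timed directly. This is a small sketch with made-up sizes and IDs; count_valid is a hypothetical helper standing in for the loop above.

```python
import timeit

valid_list = list(range(10_000))
valid_set = set(valid_list)  # built once, reused for every lookup
user_ids = list(range(9_500, 10_500))  # hypothetical IDs, half of them valid


def count_valid(ids, valid):
    """Count how many ids appear in the container `valid`."""
    return sum(1 for user_id in ids if user_id in valid)


list_time = timeit.timeit(lambda: count_valid(user_ids, valid_list), number=1)
set_time = timeit.timeit(lambda: count_valid(user_ids, valid_set), number=1)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

Building the set costs O(n) once, so the conversion pays off when there are many lookups against the same collection.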

8. deque for queue operations

from collections import deque

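# list.pop(0): O(n) per pop, because every remaining element shifts left
queue = list(range(100_000))
while queue:
    queue.pop(0)
# Total for 100K pops: O(n²), about 2.4 seconds (machine-dependent)

# deque.popleft(): O(1)
queue = deque(range(100_000))
while queue:
    queue.popleft()
# Total for 100K pops: O(n), about 8ms (machine-dependent)
# Roughly 300x faster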
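
The timings quoted above will vary by machine. A sketch like the following measures both approaches; drain_list and drain_deque are illustrative names, and the queue size is reduced so the slow case finishes quickly.

```python
from collections import deque
import timeit


def drain_list(n):
    queue = list(range(n))
    out = []
    while queue:
        out.append(queue.pop(0))  # O(n): shifts the remaining items
    return out


def drain_deque(n):
    queue = deque(range(n))
    out = []
    while queue:
        out.append(queue.popleft())  # O(1)
    return out


n = 20_000
list_time = timeit.timeit(lambda: drain_list(n), number=1)
deque_time = timeit.timeit(lambda: drain_deque(n), number=1)
print(f"list.pop(0): {list_time:.4f}s  deque.popleft(): {deque_time:.4f}s")
```

Both functions return items in the same first-in, first-out order; only the cost per pop differs.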

9. functools.lru_cache for repeated calls

from functools import cache  # Python 3.9+; equivalent to lru_cache(maxsize=None)

@cache
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

# Without cache: O(2ⁿ) exponential time
# With cache: O(n) — each value computed once
fibonacci(300)  # Returns instantly
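
The heading mentions lru_cache, which also works on Python versions before 3.9 and lets you bound the cache size. A minimal sketch follows; the maxsize of 128 is an arbitrary choice for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=128)  # keep at most 128 results; least recently used are evicted
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


value = fibonacci(30)
info = fibonacci.cache_info()  # named tuple: hits, misses, maxsize, currsize
print(value, info)

fibonacci.cache_clear()  # drop cached results, e.g. between test cases
```

cache_info() is a quick way to confirm that the cache is actually being hit. Arguments must be hashable for either decorator to work.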

10. String joining beats concatenation

# SLOW: O(n²) in the worst case, since each concatenation may copy the whole string
result = ""
for part in parts:
    result += part

# FAST: O(n) — one allocation
result = "".join(parts)  # 10-100x faster for long lists

# f-strings are fastest for formatting:
name, value = "x", 42
f"{name} = {value}"  # Fastest
"{} = {}".format(name, value)  # Slower

11-20: Power user techniques

  • 11. List comprehensions over loops: they append through a dedicated bytecode instead of a list.append method call, often around 40% faster than the equivalent for loop
  • 12. Use built-ins: sum(), min(), max(), any(), all() run in C and are usually faster than equivalent Python loops
  • 13. operator.methodcaller vs lambda: key=methodcaller('lower') can be about 15% faster than key=lambda x: x.lower(); measure on your Python version
  • 14. Avoid imports inside hot functions: import re inside a function re-executes the import statement on every call (a sys.modules lookup after the first). Import at module level.
  • 15. orjson instead of json: 3–10x faster JSON serialization (Rust-backed)
  • 16. bytearray for mutable bytes: much faster than repeated bytes concatenation
  • 17. itertools chains: chain, islice, groupby iterate at C level, faster than manual Python loops
  • 18. Cython for hot paths: annotate types in .pyx, compile to C. 10–100x speedup on numeric loops
  • 19. PyPy for long-running scripts: a largely drop-in CPython replacement with a JIT. 5–50x speedup for pure Python; code that leans on C extensions may gain less
  • 20. asyncio for I/O-bound concurrency: thousands of concurrent I/O operations on a single thread via the event loop
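
A few of the items above need only the standard library and can be sketched briefly. The sample data, the fetch coroutine, and the use of asyncio.sleep as a stand-in for real I/O are assumptions made for illustration.

```python
import asyncio
from itertools import chain, islice
from operator import methodcaller

# Item 13: methodcaller builds a callable equivalent to lambda s: s.lower()
words = ["banana", "Apple", "cherry"]
sorted_words = sorted(words, key=methodcaller("lower"))

# Item 16: grow a bytearray in place instead of rebuilding bytes each time
buf = bytearray()
for chunk in (b"ab", b"cd", b"ef"):
    buf += chunk  # extends the same buffer
payload = bytes(buf)

# Item 17: chain two iterables lazily and take only the first five items
first_five = list(islice(chain(range(3), range(100, 200)), 5))


# Item 20: run several waits concurrently on one thread
async def fetch(i, delay=0.01):
    await asyncio.sleep(delay)  # stands in for a network or disk wait
    return i * 2


async def fetch_all(n):
    # gather returns results in the order the awaitables were passed
    return await asyncio.gather(*(fetch(i) for i in range(n)))


doubled = asyncio.run(fetch_all(5))
print(sorted_words, payload, first_five, doubled)
```

In the asyncio part, the five waits overlap, so the total time is close to one delay and not five.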

These techniques build on the foundation from the Python __slots__ deep dive and the GIL threading guide — understanding the GIL is essential for knowing when multiprocessing beats threading. External reference: Python cProfile documentation.
