Debugging Production Postgres Slow Queries Without Causing Downtime

A Postgres query that ran fine for months suddenly takes 30 seconds. Users are complaining. You need answers fast — but the wrong diagnostic approach adds load to an already-struggling database. These techniques find the problem without making things worse.

TL;DR: Use pg_stat_activity for live query inspection, pg_stat_statements for historical slow query data, auto_explain for automatic EXPLAIN on slow queries, and pg_locks for lock contention. None of these require downtime or significant additional load.

Step 1: See What’s Running Right Now

-- See all active queries sorted by duration
SELECT
  pid,
  now() - pg_stat_activity.query_start AS duration,
  query,
  state,
  wait_event_type,
  wait_event
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start IS NOT NULL
ORDER BY duration DESC;

-- Queries running for over 30 seconds
SELECT pid, duration, query
FROM (
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
) t
WHERE duration > interval '30 seconds';

-- Kill a specific runaway query (never use terminate in production unless certain)
SELECT pg_cancel_backend(PID); -- Sends SIGINT, allows graceful cancel
-- pg_terminate_backend(PID)   -- SIGTERM, only if cancel doesn't work
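Beyond killing queries one at a time, timeouts can be set per role so future runaways cancel themselves. A minimal sketch with illustrative values; `app_user` is a hypothetical application role, so substitute your own:

```sql
-- Guardrails (illustrative values; tune for your workload):
-- cancel any statement from this role that runs past 30 seconds
ALTER ROLE app_user SET statement_timeout = '30s';
-- give up quickly when waiting on a lock instead of queueing behind it
ALTER ROLE app_user SET lock_timeout = '5s';
```

Settings take effect on new connections for that role; existing sessions keep their old values.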

Step 2: Find Locks Causing Blocking

-- Find blocking queries and their victims
SELECT
  blocked_locks.pid AS blocked_pid,
  blocked_activity.query AS blocked_query,
  blocking_locks.pid AS blocking_pid,
  blocking_activity.query AS blocking_query,
  now() - blocked_activity.query_start AS blocked_duration
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
  AND blocking_locks.relation = blocked_locks.relation
  AND blocking_locks.pid != blocked_locks.pid
  AND blocking_locks.granted  -- the blocker must actually hold the lock
JOIN pg_catalog.pg_stat_activity blocking_activity
  ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

-- Common causes of blocking:
-- 1. Long-running transaction holding lock
-- 2. ALTER TABLE (takes ACCESS EXCLUSIVE lock)
-- 3. VACUUM FULL (takes ACCESS EXCLUSIVE lock)
-- 4. Missing index causing full table scan while holding row locks
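On PostgreSQL 9.6 and later, a simpler option is `pg_blocking_pids()`, which does the lock matching internally and covers cases the relation-based join above misses (tuple-level and transaction-ID locks):

```sql
-- For each waiting session, list the PIDs that are blocking it
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```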

Step 3: pg_stat_statements — Historical Slow Queries

-- Enable pg_stat_statements (requires a server restart after editing
-- shared_preload_libraries in postgresql.conf; a reload is not enough)
-- shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 slowest queries by total time
-- (column names below are for PostgreSQL 13+; on 12 and older use
--  total_time / mean_time / stddev_time instead)
SELECT
  round(total_exec_time::numeric, 2) AS total_ms,
  calls,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  round(stddev_exec_time::numeric, 2) AS stddev_ms,
  rows,
  round(100 * total_exec_time / sum(total_exec_time) OVER (), 2) AS pct_total,
  LEFT(query, 100) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Queries with highest variance (inconsistently slow)
SELECT LEFT(query, 100), calls, mean_exec_time, stddev_exec_time
FROM pg_stat_statements
WHERE calls > 100
ORDER BY stddev_exec_time DESC
LIMIT 10;
-- High stddev = sometimes fast, sometimes slow: look for lock waits,
-- cold cache, plan changes, or parameter-sensitive queries
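pg_stat_statements also tracks buffer I/O, which helps separate CPU-bound queries from disk-bound ones. A sketch of finding the heaviest disk readers (column names as of PostgreSQL 13+):

```sql
-- Queries doing the most reads from disk (cache misses)
SELECT LEFT(query, 80) AS query,
       calls,
       shared_blks_read,
       shared_blks_hit,
       round(100.0 * shared_blks_hit /
             NULLIF(shared_blks_hit + shared_blks_read, 0), 1) AS hit_pct
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 10;
```

A low hit percentage on a hot query often points at a missing index or a working set that no longer fits in shared_buffers.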

Step 4: auto_explain — Capture Plans of Slow Queries Automatically

-- Enable auto_explain for the current session only (LOAD does not
-- affect other sessions; see below for making it global)
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 1000;  -- Log queries over 1 second
SET auto_explain.log_analyze = true;       -- Include actual row counts
                                           -- (adds timing overhead to every query)
SET auto_explain.log_buffers = true;       -- Include buffer hit/miss stats
SET auto_explain.log_timing = true;
SET auto_explain.log_nested_statements = true;

-- These settings are session-level. For global coverage, add the library
-- to shared_preload_libraries (restart) or session_preload_libraries
-- (reload; affects new connections only) in postgresql.conf, plus:
-- auto_explain.log_min_duration = 1000
-- auto_explain.log_analyze = on

-- Now any query taking > 1s gets its EXPLAIN ANALYZE logged automatically
-- Check PostgreSQL logs for entries like:
-- LOG: duration: 2341.123 ms  plan:
--   Seq Scan on orders (cost=0.00..250000.00 rows=5000000 ...)
--   (actual time=0.027..2340.891 rows=4999999 ...)
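One way to roll auto_explain out to all new connections without a restart (assuming superuser access) is `session_preload_libraries` via ALTER SYSTEM; a sketch:

```sql
-- Preload auto_explain for every new session
ALTER SYSTEM SET session_preload_libraries = 'auto_explain';
ALTER SYSTEM SET auto_explain.log_min_duration = '1s';
SELECT pg_reload_conf();  -- no restart; applies to sessions opened after this
```

Connections opened before the reload keep running without it, so there is no disruption to in-flight work.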

Step 5: Index Bloat — The Silent Performance Killer

-- Index sizes and usage stats: large indexes that are rarely scanned
-- are the first suspects (this shows size and scan counts, not bloat itself)
SELECT
  schemaname,
  tablename,
  indexname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan AS scans,
  idx_tup_read AS tuples_read,
  idx_tup_fetch AS tuples_fetched
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;

-- Unused indexes (never been scanned)
SELECT indexrelid::regclass AS index, relid::regclass AS table, idx_scan
FROM pg_stat_user_indexes
JOIN pg_index USING (indexrelid)
WHERE idx_scan = 0
  AND indisunique IS FALSE;  -- Exclude unique constraints
-- Drop unused indexes: they slow down writes for zero read benefit.
-- Caveat: idx_scan is per-server and resets with pg_stat_reset(),
-- so check replicas too before dropping anything.

-- Rebuild a bloated index without blocking reads or writes
-- (takes only brief locks, unlike plain REINDEX)
REINDEX INDEX CONCURRENTLY idx_orders_status;  -- PostgreSQL 12+
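The usage stats above only flag suspects; actual bloat can be measured with the pgstattuple contrib extension. A sketch, reusing the idx_orders_status index named above:

```sql
CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- avg_leaf_density well below ~90% (the default B-tree fillfactor)
-- means the index is carrying dead space and is worth rebuilding
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('idx_orders_status');
```

Note that pgstatindex scans the whole index, so run it off-peak or on a replica.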

Production Postgres Debug Cheat Sheet

  • pg_stat_activity — see live queries and wait events
  • pg_stat_statements — historical aggregates of query performance
  • auto_explain — automatic EXPLAIN on slow queries in production logs
  • Lock detection query — find blocking + blocked query pairs
  • pg_cancel_backend(pid) to cancel gracefully, pg_terminate_backend as last resort
  • REINDEX CONCURRENTLY to rebuild indexes without locking
  • ❌ Never use VACUUM FULL on production during peak traffic
  • ❌ Never drop an index without checking idx_scan first

This guide pairs directly with the PostgreSQL query planner optimization guide: that one covers why queries are slow, this one covers how to find which ones are slow in production. The API debugging guide shows how to isolate whether the slowness lives in the database or the application layer. Official reference: PostgreSQL statistics collector documentation.
