AWS CloudWatch Insights Queries That Actually Find Production Bugs

Most developers use CloudWatch Logs like a text file. CloudWatch Logs Insights transforms your logs into a queryable database. These queries find production bugs that manual log searching misses entirely.

TL;DR: CloudWatch Logs Insights runs a pipe-delimited query language against your log groups. filter, stats, sort, and parse cover most production debugging. Chain them with | to aggregate raw logs the way you would query a database table.

The 4 Core Commands

# The 4 commands that handle everything:
filter @message like /ERROR/           # Select matching lines
stats count() as errors by bin(5m)    # Aggregate data
sort @timestamp desc                   # Order results  
parse @message "duration: * ms" as ms  # Extract fields

# Chain them:
filter @type = "REPORT"
| stats avg(@duration) as avgMs, pct(@duration, 99) as p99
        by bin(1h)
| sort avgMs desc
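
To run a chained query like this outside the console, the query string has to be submitted together with an explicit time window. The sketch below only builds the arguments; the function name, log group name, and default values are hypothetical, while the keys match the parameters of the CloudWatch Logs StartQuery API.

```python
import time


def build_query_params(log_group, query, lookback_minutes=60, limit=1000, now=None):
    """Build keyword arguments for a time-bounded Logs Insights query.

    `now` is epoch seconds; it is injectable so the window is reproducible.
    """
    end = int(now if now is not None else time.time())
    start = end - lookback_minutes * 60
    return {
        "logGroupName": log_group,
        "startTime": start,  # epoch seconds
        "endTime": end,      # epoch seconds
        "queryString": query,
        "limit": limit,
    }


# Hypothetical log group; the query is a chained filter + stats.
LATENCY_QUERY = "\n".join([
    'filter @type = "REPORT"',
    "| stats avg(@duration) as avgMs by bin(1h)",
])

params = build_query_params(
    "/aws/lambda/my-function",
    LATENCY_QUERY,
    lookback_minutes=180,
    now=1_700_000_000,
)
```

With boto3 installed and credentials configured, these arguments could be passed as boto3.client("logs").start_query(**params), and the returned queryId polled with get_query_results until the query completes.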

Query 1: Find All Lambda Cold Starts

filter @type = "REPORT"
| filter ispresent(@initDuration)
| stats count() as coldStarts,
        avg(@initDuration) as avgInitMs,
        max(@initDuration) as maxInitMs,
        pct(@initDuration, 95) as p95InitMs
| sort coldStarts desc
# @initDuration only appears on cold start invocations

Query 2: API Latency Percentiles Over Time

filter @type = "REPORT"
| stats pct(@duration, 50) as p50,
        pct(@duration, 95) as p95,
        pct(@duration, 99) as p99,
        count() as requests
        by bin(5m) as timeBucket
| sort timeBucket asc
# Use p99, not avg — averages hide tail latency problems
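
Why the average misleads can be shown with a small worked example. The durations below are made up, and the nearest-rank percentile is an assumption chosen for simplicity; CloudWatch may compute percentiles differently.

```python
import math
import statistics

# Hypothetical durations in ms: 98 fast requests and 2 very slow ones.
durations = [20.0] * 98 + [3000.0] * 2


def nearest_rank_percentile(values, pct):
    """Return the nearest-rank percentile of values, for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


avg = statistics.mean(durations)
p50 = nearest_rank_percentile(durations, 50)
p95 = nearest_rank_percentile(durations, 95)
p99 = nearest_rank_percentile(durations, 99)

print(f"avg={avg:.1f} p50={p50} p95={p95} p99={p99}")
```

The average lands near 80 ms, a latency no request actually experienced, while p99 exposes the 3-second outliers that p50 and p95 both miss.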

Query 3: Lambda Memory Utilization

filter @type = "REPORT"
| parse @message "Memory Size: * MB" as memorySize
| parse @message "Max Memory Used: * MB" as memoryUsed
| stats max(memoryUsed) as maxUsedMB,
        max(memorySize) as allocatedMB,
        maxUsedMB / allocatedMB * 100 as utilizationPct
| filter utilizationPct > 80
# Above 80% of allocated memory = OOM risk. Consider increasing the allocation.
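
To see what the two parse patterns pull out of a REPORT line, here is a minimal offline sketch. The sample line uses made-up values and a placeholder request ID, and the helper name is hypothetical.

```python
import re

# Made-up REPORT line in the shape Lambda writes (tab-separated fields).
SAMPLE_REPORT = (
    "REPORT RequestId: 00000000-0000-0000-0000-000000000000\t"
    "Duration: 102.25 ms\tBilled Duration: 103 ms\t"
    "Memory Size: 512 MB\tMax Memory Used: 431 MB"
)


def memory_utilization_pct(report_line):
    """Return max memory used as a percentage of allocated memory, or None."""
    size = re.search(r"Memory Size: (\d+) MB", report_line)
    used = re.search(r"Max Memory Used: (\d+) MB", report_line)
    if not size or not used:
        return None  # not a REPORT line, or an unexpected format
    return int(used.group(1)) / int(size.group(1)) * 100


utilization = memory_utilization_pct(SAMPLE_REPORT)
print(f"{utilization:.1f}% of allocated memory used")
```

In this sample the function used 431 MB of 512 MB, which is above the 80% threshold the query filters on.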

Query 4: Slow-Request Rate by Time Window

filter @type = "REPORT"
| fields (@duration > 3000) as isSlow
| stats sum(isSlow) as slowRequests,
        count() as total,
        slowRequests / total * 100 as slowRate
        by bin(5m)
| sort @timestamp asc
# Shows exactly when your API slowed down

Query 5: Top Endpoints by Error Rate

parse @message "[*] * * * *" as ts, method, path, status, latency
| fields (status >= 400) as isError
| stats sum(isError) as errors,
        count() as total,
        avg(latency) as avgLatency,
        errors / total * 100 as errorRate
        by path, method
| filter total > 10
| sort errorRate desc

Query 6: Detect Memory Leaks Over Time

filter @type = "REPORT"
| parse @message "Max Memory Used: * MB" as memUsed
| stats max(memUsed) as maxMem,
        min(memUsed) as minMem,
        count() as invocations,
        maxMem - minMem as memGrowth
        by @logStream
| filter invocations > 10
| filter memGrowth > 50
| sort memGrowth desc
# Each log stream comes from one execution environment, so 50 MB+ growth within a stream suggests a memory leak
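
The same grouping logic can be sketched offline, which is handy for checking the thresholds against exported results. The function name, sample data, and stream names below are hypothetical; the thresholds mirror the query.

```python
from collections import defaultdict


def memory_growth_by_stream(samples, min_invocations=10, min_growth_mb=50):
    """Return {log_stream: growth_mb} for streams that look like leaks.

    samples: iterable of (log_stream, max_memory_used_mb) tuples.
    """
    by_stream = defaultdict(list)
    for stream, used_mb in samples:
        by_stream[stream].append(used_mb)

    suspects = {}
    for stream, values in by_stream.items():
        growth = max(values) - min(values)
        if len(values) > min_invocations and growth > min_growth_mb:
            suspects[stream] = growth
    # Largest growth first, like `sort memGrowth desc`.
    return dict(sorted(suspects.items(), key=lambda kv: kv[1], reverse=True))


samples = (
    [("stream-a", 100 + 10 * i) for i in range(12)]  # climbs 100 -> 210 MB
    + [("stream-b", 128)] * 12                       # flat
    + [("stream-c", 100), ("stream-c", 300)]         # too few invocations
)
suspects = memory_growth_by_stream(samples)
print(suspects)
```

Note that max minus min flags any large spread, not only steady growth, so a flagged stream is a candidate to inspect rather than proof of a leak.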

Query 7: Daily Compute Cost for a Function

filter @type = "REPORT"
| parse @message "Billed Duration: * ms" as billedMs
| parse @message "Memory Size: * MB" as memMB
| fields (billedMs / 1000) * (memMB / 1024) as gbSeconds
| fields gbSeconds * 0.0000166667 as costUSD
| stats sum(costUSD) as totalCostUSD,
        count() as invocations
        by bin(1d)
| sort totalCostUSD desc
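# Find your most expensive days. Covers compute (GB-second) charges only; per-request charges are excluded,
# and the GB-second rate varies by Region and architecture.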
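
The cost arithmetic in the query can be checked by hand. The sketch below reuses the GB-second rate from the query; the function name and the workload numbers are hypothetical.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # rate used in the query; check current pricing


def invocation_cost_usd(billed_ms, memory_mb, price_per_gb_second=PRICE_PER_GB_SECOND):
    """Compute cost of one invocation from billed duration and memory size."""
    gb_seconds = (billed_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * price_per_gb_second


# Hypothetical workload: 1,000,000 invocations billed at 200 ms with 512 MB.
per_invocation = invocation_cost_usd(billed_ms=200, memory_mb=512)
total = per_invocation * 1_000_000
print(f"${total:.2f} for one million invocations")
```

Each invocation consumes 0.2 s × 0.5 GB = 0.1 GB-seconds, so a million of them come to 100,000 GB-seconds, or about $1.67 in compute at this rate.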

CloudWatch Insights Cheat Sheet

  • ✅ Use bin(5m) for time series: spikes show up clearly
  • ✅ Use pct(@duration, 99), not avg: averages hide tail latency
  • ✅ Use ispresent(@initDuration) to filter cold starts specifically
  • ✅ Use parse to extract fields from unstructured messages
  • ✅ Save frequent queries as CloudWatch saved queries
  • ✅ Switch to structured JSON logging for production: JSON fields are discovered automatically, so no parse step is needed
  • ❌ Never scan a wider time range or more log groups than you need: Logs Insights bills by the amount of data scanned
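
The structured-logging advice can be sketched with the Python standard library alone. The formatter class, the "fields" convention for extra data, and the field names are all hypothetical choices for illustration, not a prescribed format.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(name="app", stream=sys.stdout):
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = make_logger()
logger.info(
    "request finished",
    extra={"fields": {"path": "/orders", "status": 502, "latencyMs": 812}},
)
```

Once lines like this reach CloudWatch, a query such as filter status >= 500 | stats count() by path can reference the JSON keys directly, with no parse pattern to maintain.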

These queries pair directly with the Lambda cold start optimization guide — use Query 1 to measure cold starts before and after applying fixes. For DynamoDB-backed functions, the AWS security guide shows how to log presigned URL generation events. Official reference: CloudWatch Insights query syntax.


