How to Build a Real-Time Fraud Detection and Decisioning Architecture on AWS with EventBridge, Step Functions, and DynamoDB


Fraud systems fail when they do too much synchronously or when they treat explainability as optional. In production, the hard part is staying fast while preserving enough evidence to defend every decision later. A payments or marketplace platform must score transactions in real time, combine historical risk context with streaming behavior, and return a decision within milliseconds to seconds while preserving full auditability.

TL;DR: Keep the real-time decision path tiny with Lambda, Redis, and DynamoDB, then push enrichment and analyst workflows into Step Functions, SQS, EventBridge, and an S3-backed analytics trail.

Why Naive Solutions Break

Embedding all fraud rules inside a single transaction service leads to constant redeployments, poor experimentation velocity, and inconsistent latency as the rule set grows. Synchronous dependency chains also make the checkout path fragile when third-party risk providers or feature stores degrade.

Architecture Overview

Use API Gateway or service-to-service ingestion to submit decision requests, run lightweight synchronous checks in Lambda, fetch hot features from ElastiCache and DynamoDB, orchestrate deeper checks with Step Functions, emit fraud events on EventBridge, and archive signals in S3 for analytics with Glue and Athena.

Architecture Diagram

[Diagram: Real-Time Fraud Decisioning]

Service-by-Service Breakdown

  • API Gateway: Receives risk scoring requests from checkout or account systems.
  • Lambda: Handles lightweight feature extraction, static rules, and decision prechecks.
  • DynamoDB: Stores account risk profiles, device fingerprints, rule versions, and decision records.
  • ElastiCache Redis: Holds hot counters like login velocity, card attempt windows, and device abuse frequencies.
  • Step Functions: Orchestrates optional third-party checks, manual review routing, and compensation workflows.
  • EventBridge: Distributes FraudDecisionCreated, ManualReviewRequested, and ChargebackReceived events.
  • SQS: Buffers non-blocking workflows such as analyst notifications or case enrichment.
  • S3: Stores raw decision artifacts, feature snapshots, and explainability documents.
  • Glue and Athena: Query historical decisions and train or evaluate new risk heuristics.
  • CloudWatch and X-Ray: Track decision latency budgets, downstream timeout rates, and rule-specific error hotspots.
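
The EventBridge events named above can be built with a small helper before calling PutEvents; the bus name, source string, and detail fields here are assumptions for illustration, not from the post:

```javascript
// Sketch: build a FraudDecisionCreated entry for EventBridge PutEvents.
// 'fraud-events' and 'fraud.decisioning' are assumed names.
function buildDecisionEvent({ decisionId, accountId, action, score }) {
  return {
    EventBusName: 'fraud-events',
    Source: 'fraud.decisioning',
    DetailType: 'FraudDecisionCreated',
    Detail: JSON.stringify({
      decisionId,
      accountId,
      action,
      score,
      emittedAt: new Date().toISOString(), // lets consumers order events without trusting delivery order
    }),
  };
}
```

Keeping `Detail` a flat, versionable JSON document makes it easy for ManualReviewRequested and ChargebackReceived consumers to share one parsing path.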

Request Lifecycle and Data Flow

  1. Checkout sends a decision request with transaction, device, and account metadata.
  2. Lambda performs low-latency checks using Redis counters and DynamoDB account state.
  3. If the decision is obvious, Lambda returns approve, deny, or challenge immediately.
  4. If extra checks are needed, Step Functions fans out to enrichment or manual review paths.
  5. The final decision is stored in DynamoDB and emitted on EventBridge.
  6. Async consumers update cases, notify analysts, and archive artifacts in S3.
  7. Athena queries historical fraud outcomes for rule tuning and post-incident analysis.
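
Step 5 of the lifecycle benefits from an idempotent write, so a retried request cannot overwrite the original decision. A sketch of the DynamoDB parameters, assuming a single-table key layout and a fallback table name that are illustrative only:

```javascript
// Sketch: parameters for an idempotent DynamoDB decision write.
// The condition expression rejects duplicate decision IDs.
function buildDecisionPut({ decisionId, accountId, action, score }) {
  return {
    TableName: process.env.RISK_TABLE || 'risk-decisions', // assumed fallback name
    Item: {
      pk: `DECISION#${decisionId}`,
      sk: 'DECISION',
      accountId,
      action,
      score,
      createdAt: new Date().toISOString(),
    },
    // Fails the write if this decision ID was already persisted.
    ConditionExpression: 'attribute_not_exists(pk)',
  };
}
```

On a `ConditionalCheckFailedException`, the caller can read back and return the original decision instead of minting a new one.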

Production Code Patterns

Low-latency risk scoring with Redis counters

async function scoreTransaction({ accountId, deviceId, amount, redis, dynamo }) {
  const deviceKey = `fraud:device:${deviceId}:5m`;

  // Bump the device counter and fetch the account risk profile in parallel.
  const [deviceAttempts, accountProfile] = await Promise.all([
    redis.incr(deviceKey),
    dynamo.get({ TableName: process.env.RISK_TABLE, Key: { pk: `ACCOUNT#${accountId}`, sk: 'PROFILE' } }).promise(),
  ]);

  // First hit in the window: start the 5-minute TTL so the counter actually expires.
  if (deviceAttempts === 1) await redis.expire(deviceKey, 300);

  const score = (deviceAttempts > 20 ? 40 : 0) + (accountProfile.Item?.chargebackRate || 0) * 100;
  return { score, action: score >= 70 ? 'REVIEW' : 'APPROVE' };
}

Step Functions branch for manual review escalation

The function ARN and queue URL below are placeholders; `OutputPath` unwraps the Lambda invoke result so the Choice state can read `$.action` directly.

{
  "StartAt": "SynchronousChecks",
  "States": {
    "SynchronousChecks": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "${ScoringFunctionArn}", "Payload.$": "$" },
      "OutputPath": "$.Payload",
      "Next": "NeedsReview?"
    },
    "NeedsReview?": {
      "Type": "Choice",
      "Choices": [{ "Variable": "$.action", "StringEquals": "REVIEW", "Next": "QueueManualReview" }],
      "Default": "PersistDecision"
    },
    "QueueManualReview": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sqs:sendMessage",
      "Parameters": { "QueueUrl": "${ReviewQueueUrl}", "MessageBody.$": "$" },
      "Next": "PersistDecision"
    },
    "PersistDecision": { "Type": "Succeed" }
  }
}

Scaling Strategy

  • Keep the synchronous decision path tiny; push enrichment and non-critical side effects async.
  • Use Redis for sliding-window counters instead of repeatedly scanning DynamoDB.
  • Partition DynamoDB by account, device, or merchant scope with time-bucketed sort keys.
  • Use SQS to isolate downstream review systems from real-time scoring.
  • Parallelize independent checks inside Step Functions where latency budget allows.
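
The time-bucketed partition scheme from the list above might look like the helper below; the hourly granularity and the `SIGNAL#` prefix are illustrative assumptions:

```javascript
// Sketch: build DynamoDB keys with a time-bucketed sort key so range
// queries can target a narrow window instead of scanning all history.
function riskSignalKeys({ accountId, deviceId, ts = Date.now() }) {
  const bucket = new Date(ts).toISOString().slice(0, 13); // hourly bucket, e.g. "2024-05-01T09"
  return {
    pk: `ACCOUNT#${accountId}`,
    sk: `SIGNAL#${bucket}#DEVICE#${deviceId}`,
  };
}
```

With this layout, a `Query` with `begins_with(sk, "SIGNAL#2024-05-01T09")` pulls one hour of signals for an account without touching older items.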

Cost Optimization Techniques

  • Cache high-churn risk features in Redis with short TTLs.
  • Store full decision evidence in S3, not DynamoDB, when the payload is large.
  • Use Athena over partitioned S3 logs for retrospective analysis instead of always-on analytics clusters.
  • Limit synchronous external calls to high-risk transactions only.
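
The "large payloads to S3" rule can be a simple size gate. The 64 KB cutoff below is an assumed tuning point (DynamoDB's hard item limit is 400 KB), and the evidence key scheme is illustrative:

```javascript
const INLINE_LIMIT_BYTES = 64 * 1024; // assumed cutoff, well under DynamoDB's 400 KB item cap

// Sketch: keep small evidence inline in the decision item; point large
// evidence at an S3 object and store only the key in DynamoDB.
function evidenceLocation(decisionId, evidence) {
  const body = JSON.stringify(evidence);
  if (Buffer.byteLength(body, 'utf8') <= INLINE_LIMIT_BYTES) {
    return { inline: body };
  }
  return { s3Key: `evidence/${decisionId}.json` };
}
```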

Security Best Practices

  • Encrypt all decision artifacts and counters with KMS-backed services.
  • Enforce strict IAM boundaries between scoring, case management, and analytics roles.
  • Tokenize sensitive payment identifiers before broad event fan-out.
  • Restrict VPC-enabled Lambdas to functions that truly need private network access, keeping the network blast radius small.

Failure Handling and Resilience

  • Define fail-open or fail-closed behavior explicitly per transaction type.
  • Apply timeouts and circuit breakers on optional checks.
  • Use idempotency tokens for repeated decision requests.
  • Route failed async enrichments to DLQs and preserve raw requests in S3 for replay.
  • Alarm on p95 and p99 latency, not just error rate, because fraud systems can fail by becoming too slow.
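
The timeout bullet above can be sketched as a race between the optional check and a fallback; which fallback value you pass is exactly the fail-open versus fail-closed choice. `thirdPartyScore` in the usage note is hypothetical:

```javascript
// Sketch: run an optional check under a hard timeout. If the check does
// not settle in time, resolve with the caller-supplied fallback instead.
async function withTimeout(check, ms, onTimeout) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(onTimeout), ms);
  });
  try {
    return await Promise.race([check, timeout]);
  } finally {
    clearTimeout(timer); // don't keep the event loop alive after the race settles
  }
}
```

For example, `withTimeout(thirdPartyScore(tx), 150, { action: 'APPROVE', degraded: true })` fails open for low-risk transaction types, while a fail-closed policy would fall back to `{ action: 'REVIEW' }` instead.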

Trade-offs and Alternatives

This architecture balances real-time decisions with asynchronous depth, but it is operationally more complex than embedding rules in one service. If fraud logic is simple, a single ECS service plus Aurora may be enough. At scale, event-driven separation improves agility and auditability.

Real-World Use Case

An Uber-style rider and driver risk engine can use this model to detect promo abuse, account takeovers, and payment anomalies in near real time.

Key Interview Insights

  • State the latency budget and design around it.
  • Explain why some fraud checks must be synchronous while others can be eventual.
  • Discuss fail-open versus fail-closed in business terms, not just technical terms.
  • Mention explainability and audit trails as system requirements, not afterthoughts.


Recommended Reading

Designing Data-Intensive Applications — The essential book for understanding distributed systems, databases, and the infrastructure behind architectures like these.

System Design Interview Vol. 2 — Covers many of the architectures in this post in interview format with trade-off analysis.


