
Checkout systems fail in subtle ways: partial payment capture, inventory drift, regional outages, and duplicate orders under retry storms. An online marketplace needs a checkout architecture that survives AZ failure, tolerates partial service degradation, and can fail over across Regions without losing orders or charging customers twice.
TL;DR: Use ECS on Fargate with Aurora PostgreSQL for transactional state, Step Functions for the checkout saga, EventBridge for domain events, and SQS for every slow external side effect.
Why Naive Solutions Break
A synchronous checkout flow that directly calls payment, inventory, shipping, and notification services in sequence creates a fragile distributed transaction. One timeout can leave payment captured without an order confirmation, or inventory reserved without shipment creation. A single-Region database also becomes the blast radius for a regional outage.
Architecture Overview
Use CloudFront and API Gateway for the customer edge, run core checkout services on ECS across multiple AZs, persist transactional order state in Aurora PostgreSQL, publish domain events through EventBridge, coordinate compensation with Step Functions, and isolate side effects behind SQS queues. Replicate data cross-Region with Aurora Global Database for disaster recovery.
Architecture Diagram

Service-by-Service Breakdown
- CloudFront: Accelerates storefront and API traffic globally.
- API Gateway: Entry point for cart, checkout, and order APIs with auth and throttling.
- ECS on Fargate: Runs checkout, cart, pricing, and order services without managing EC2 fleets.
- Aurora PostgreSQL: Strong transactional store for orders, payments ledger references, and inventory reservations.
- Aurora Global Database: Cross-Region replication for low RPO and faster regional recovery.
- Step Functions: Orchestrates checkout saga steps such as reserve inventory, authorize payment, confirm order, or compensate.
- EventBridge: Broadcasts OrderPlaced, PaymentAuthorized, and ShipmentRequested to downstream systems.
- SQS: Buffers calls to shipping integrations, email, and analytics sinks.
- ElastiCache Redis: Session cache, cart cache, and read acceleration for pricing snapshots.
- S3: Stores invoices, exports, and immutable order documents.
- CloudWatch and X-Ray: Per-hop observability, structured logging, alarms, and distributed traces.
Request Lifecycle and Data Flow
- The client submits checkout through CloudFront and API Gateway.
- The checkout service on ECS validates the cart and loads hot cart state from Redis.
- The service writes a pending order record in Aurora inside a local transaction.
- Step Functions starts the checkout saga.
- Payment authorization, inventory reservation, fraud checks, and tax calculation run as separate steps.
- If all steps succeed, the order is committed as confirmed and OrderPlaced is emitted on EventBridge.
- Downstream services consume the event asynchronously for shipping, loyalty, email, and analytics.
- If a step fails, Step Functions triggers compensating actions such as release inventory or void authorization.
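The compensation behavior in the lifecycle above can be sketched in plain Python. This is a minimal in-memory illustration of the saga idea, not the Step Functions implementation: `SagaStep`, `run_saga`, and the four step functions are hypothetical names, and the `card_declined` flag only exists to simulate a failure.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], None]      # forward step; raises on failure
    compensate: Callable[[dict], None]  # undoes the effect of the forward step


@dataclass
class SagaResult:
    status: str  # "CONFIRMED" or "FAILED"
    completed: List[str] = field(default_factory=list)
    compensated: List[str] = field(default_factory=list)


def run_saga(steps: List[SagaStep], order: dict) -> SagaResult:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    result = SagaResult(status="CONFIRMED")
    done: List[SagaStep] = []
    for step in steps:
        try:
            step.action(order)
        except Exception:
            for prior in reversed(done):
                prior.compensate(order)
                result.compensated.append(prior.name)
            result.status = "FAILED"
            return result
        done.append(step)
        result.completed.append(step.name)
    return result


# Hypothetical step implementations that only mutate the in-memory order.
def reserve_inventory(order: dict) -> None:
    order["inventory_reserved"] = True


def release_inventory(order: dict) -> None:
    order["inventory_reserved"] = False


def authorize_payment(order: dict) -> None:
    if order.get("card_declined"):  # simulated failure for the example
        raise RuntimeError("payment declined")
    order["payment_authorized"] = True


def void_authorization(order: dict) -> None:
    order["payment_authorized"] = False


CHECKOUT_STEPS = [
    SagaStep("reserve_inventory", reserve_inventory, release_inventory),
    SagaStep("authorize_payment", authorize_payment, void_authorization),
]
```

Calling `run_saga(CHECKOUT_STEPS, {"order_id": "ord-1", "card_declined": True})` reserves inventory, fails at payment, and then releases the reservation. That is the same shape as the state machine's route from a failed authorization to ReleaseInventory.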
Production Code Patterns
Step Functions saga for checkout orchestration
{
  "StartAt": "ReserveInventory",
  "States": {
    "ReserveInventory": {
      "Type": "Task",
      "Resource": "arn:aws:states:::ecs:runTask.sync",
      "Next": "AuthorizePayment",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "FailCheckout" }]
    },
    "AuthorizePayment": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Next": "ConfirmOrder",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "ReleaseInventory" }]
    },
    "ConfirmOrder": { "Type": "Succeed" },
    "ReleaseInventory": { "Type": "Task", "Resource": "arn:aws:states:::ecs:runTask.sync", "Next": "FailCheckout" },
    "FailCheckout": { "Type": "Fail" }
  }
}
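One way to keep client retries from starting a second saga for the same order is to derive the execution name from the order ID. The sketch below assumes a Standard workflow, where Step Functions refuses to start a new execution under a name that was recently used in the same state machine. `build_start_execution_args`, `start_checkout_saga`, and the `checkout-` prefix are hypothetical. The only AWS call used is boto3's `start_execution`.

```python
import json


def build_start_execution_args(state_machine_arn: str, order_id: str, payload: dict) -> dict:
    """Build StartExecution arguments with a deterministic, per-order name."""
    return {
        "stateMachineArn": state_machine_arn,
        # Same order ID -> same execution name, so a retried submit cannot
        # start a second saga for the same order.
        "name": f"checkout-{order_id}",
        # sort_keys keeps the serialized input stable across retries.
        "input": json.dumps({"orderId": order_id, **payload}, sort_keys=True),
    }


def start_checkout_saga(state_machine_arn: str, order_id: str, payload: dict) -> dict:
    """Start the saga; requires AWS credentials and an existing state machine."""
    import boto3  # imported lazily so the builder above runs without AWS access

    client = boto3.client("stepfunctions")
    return client.start_execution(
        **build_start_execution_args(state_machine_arn, order_id, payload)
    )
```

The caller still has to decide how to treat an "execution already exists" error. In most checkout flows that means looking up the existing order rather than reporting a failure to the customer.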
Aurora transaction boundary for order creation
BEGIN;
INSERT INTO orders(order_id, customer_id, status, total_amount, created_at)
  VALUES (:order_id, :customer_id, 'PENDING', :total_amount, NOW());
INSERT INTO order_outbox(event_id, aggregate_id, event_type, payload)
  VALUES (:event_id, :order_id, 'OrderPending', :payload::jsonb);
COMMIT;
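The transaction above only writes the outbox row; something still has to relay it to the event bus. Here is a minimal sketch of that relay, using Python's built-in `sqlite3` as a local stand-in for Aurora PostgreSQL so it runs anywhere. The `published_at` column, the trimmed-down table definitions, and the function names are assumptions for illustration, and `publish` is any callable (in production it would wrap an EventBridge put).

```python
import json
import sqlite3
from typing import Callable


def open_demo_db() -> sqlite3.Connection:
    """In-memory stand-in for the orders and order_outbox tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders(
            order_id TEXT PRIMARY KEY, customer_id TEXT,
            status TEXT, total_amount INTEGER);
        CREATE TABLE order_outbox(
            event_id TEXT PRIMARY KEY, aggregate_id TEXT, event_type TEXT,
            payload TEXT, published_at TEXT);  -- published_at is an assumed column
        """
    )
    return conn


def create_order_with_outbox(conn, order_id, customer_id, total_amount, event_id):
    """Write the order and its outbox event atomically."""
    payload = json.dumps({"order_id": order_id, "total_amount": total_amount})
    with conn:  # commits on success, rolls back both inserts on any exception
        conn.execute(
            "INSERT INTO orders(order_id, customer_id, status, total_amount) "
            "VALUES (?, ?, 'PENDING', ?)",
            (order_id, customer_id, total_amount),
        )
        conn.execute(
            "INSERT INTO order_outbox(event_id, aggregate_id, event_type, payload) "
            "VALUES (?, ?, 'OrderPending', ?)",
            (event_id, order_id, payload),
        )


def relay_outbox(conn, publish: Callable[[dict], None], batch_size: int = 10) -> int:
    """Publish unpublished outbox rows, then mark each one as published."""
    rows = conn.execute(
        "SELECT event_id, aggregate_id, event_type, payload FROM order_outbox "
        "WHERE published_at IS NULL ORDER BY rowid LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, aggregate_id, event_type, payload in rows:
        publish({
            "event_id": event_id,
            "aggregate_id": aggregate_id,
            "event_type": event_type,
            "payload": json.loads(payload),
        })
        with conn:  # mark only after a successful publish
            conn.execute(
                "UPDATE order_outbox SET published_at = CURRENT_TIMESTAMP "
                "WHERE event_id = ?",
                (event_id,),
            )
    return len(rows)
```

Because the row is marked only after `publish` returns, a crash between the two steps causes the event to be sent again on the next pass. The relay is therefore at-least-once, and consumers need to deduplicate on `event_id`.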
Scaling Strategy
- Scale ECS services independently by function: cart read services, checkout write services, and worker pools.
- Use reader endpoints and cached projections to offload Aurora reads.
- Partition order IDs or tenant scopes logically if the platform is multi-merchant.
- Use SQS to smooth downstream spikes from flash sales.
- Fail over application traffic at the DNS or edge layer only after validating Aurora secondary promotion readiness.
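For the logical partitioning point above, one common approach is a stable hash of the tenant ID. This is a rough sketch under assumptions: `partition_for` is a hypothetical helper, and the partition count is whatever the platform picks. A plain modulo scheme like this remaps most tenants if that count ever changes.

```python
import hashlib


def partition_for(tenant_id: str, partitions: int) -> int:
    """Map a tenant ID to a partition index in [0, partitions)."""
    if partitions <= 0:
        raise ValueError("partitions must be positive")
    # hashlib gives the same digest in every process, unlike the built-in
    # hash(), which is randomized per interpreter run for strings.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions
```

Every service that calls `partition_for("merchant-42", 16)` gets the same index, so routing stays consistent across ECS tasks without a shared lookup table.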
Cost Optimization Techniques
- Use Fargate Spot for non-critical workers and back-office processing.
- Keep Aurora instance classes tuned separately for writer and readers.
- Cache product and pricing reads heavily to reduce database pressure.
- Archive old order analytics to S3 and query via Athena instead of keeping all reporting on Aurora.
Security Best Practices
- Separate PCI-adjacent components into isolated subnets and accounts.
- Enforce IAM task roles per ECS service.
- Use KMS encryption for Aurora, S3, SQS, and Step Functions data.
- Restrict east-west traffic with security groups and private subnets.
- Use Secrets Manager rotation for database credentials.
Failure Handling and Resilience
- Build the checkout flow as a saga, not a two-phase commit across services.
- Use idempotency keys for order submission and payment authorization.
- Add DLQs on all async integrations.
- Regularly rehearse Region failover for Aurora Global Database and application cutover.
- Store immutable event IDs so retries and replays do not create duplicate side effects.
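The two idempotency points above (keys on submission, stored event IDs on consumers) can be sketched as follows. Both classes are hypothetical and keep state in memory purely for illustration. A real deployment would more likely back them with a unique constraint in Aurora so the check survives restarts and concurrent requests.

```python
from typing import Callable, Dict, Set


class IdempotentOrderApi:
    """Returns the stored response when an idempotency key is replayed."""

    def __init__(self, create_order: Callable[[dict], dict]) -> None:
        self._create_order = create_order
        self._responses: Dict[str, dict] = {}

    def submit(self, idempotency_key: str, request: dict) -> dict:
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]  # retry: no second order
        response = self._create_order(request)
        self._responses[idempotency_key] = response
        return response


class DedupingConsumer:
    """Skips events whose event_id has already been handled."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()

    def handle(self, event: dict) -> bool:
        event_id = event["event_id"]
        if event_id in self._seen:
            return False  # duplicate delivery or replay; side effect skipped
        self._handler(event)
        self._seen.add(event_id)  # recorded only after the handler succeeds
        return True
```

A retried checkout request that carries the same key gets the original order back, and a redelivered SQS message or replayed EventBridge event with a known `event_id` is dropped before it can send a second email or create a second shipment.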
Trade-offs and Alternatives
Aurora simplifies relational invariants and financial consistency, but it demands more capacity planning than DynamoDB. A fully event-sourced design with DynamoDB can scale further, though it increases modeling complexity and eventual-consistency handling during checkout.
Real-World Use Case
An Amazon-style marketplace with flash sales, payment workflows, and downstream fulfillment integrations maps cleanly to this architecture.
Key Interview Insights
- Highlight why checkout is a saga problem, not a simple request-response chain.
- Explain the difference between application failover and database failover.
- Mention idempotency at every boundary: client, payment, message consumers, and event replay.
- Be ready to discuss when strong consistency matters more than raw scale.
