
IoT backends are not just ingest pipelines. They need to survive noisy telemetry, maintain a reliable digital twin of device state, and queue operator intent until the physical fleet is ready to receive it. A connected-device platform must ingest high-cardinality telemetry from millions of devices, trigger alerts in near real time, maintain each device's digital state, and deliver operator commands reliably even when devices are intermittently connected.
TL;DR: Use Kinesis for telemetry ingestion, Lambda for fast stream processing, DynamoDB for latest-known state, S3 plus Athena for history, and Step Functions with SQS for durable command workflows.
Why Naive Solutions Break
Writing every metric directly into a relational database overwhelms write throughput and storage budgets. Mixing telemetry ingestion, command handling, and alerting inside one service makes the system hard to evolve, and offline devices turn command delivery into a retry nightmare.
Architecture Overview
Use API-based ingestion or broker-fed ingestion into Kinesis, process telemetry with Lambda, store device state in DynamoDB, persist raw telemetry in S3, surface analytics through Glue and Athena, route alerts and commands through EventBridge plus SQS, and coordinate multi-step command workflows with Step Functions.
Architecture Diagram

Service-by-Service Breakdown
- Kinesis Data Streams: High-throughput ingestion for telemetry partitions.
- Lambda: Real-time transformation, threshold checks, and state updates.
- DynamoDB: Device registry, latest-known state, command status, and lease ownership.
- S3: Raw telemetry lake for compliance, replay, and ML use.
- Glue and Athena: Fleet analytics, trend analysis, and ad hoc operational SQL.
- EventBridge: Event routing for alerting, maintenance workflows, and external system integration.
- SQS: Durable queue for commands destined for intermittently connected devices.
- Step Functions: Multi-step workflows for firmware rollout, retry policies, and escalation.
- ElastiCache (Redis): Optional fast cache for fleet dashboards or high-frequency state lookups.
- CloudWatch and X-Ray: Telemetry lag, alert volume, and command workflow tracing.
Request Lifecycle and Data Flow
- Devices publish telemetry to the ingestion layer and into Kinesis.
- Lambda consumers validate payloads and normalize device events.
- Latest-known device state is updated in DynamoDB.
- Raw events are appended to S3 for durable history.
- Threshold breaches or anomaly events are emitted to EventBridge.
- Alerting, ticketing, or maintenance consumers react asynchronously.
- Commands are queued in SQS and executed through Step Functions-based workflows that handle retries and acknowledgment tracking.
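The threshold check in step five can be a pure function that either returns an EventBridge-style event or nothing. A minimal sketch, where the 85 °C limit, the `telemetry.processor` source name, and the `TelemetryThresholdBreached` detail-type are illustrative assumptions, not fixed parts of the design:

```python
TEMP_LIMIT_C = 85.0  # assumed fleet-wide threshold for illustration


def check_thresholds(normalized):
    """Return an EventBridge-style event dict if a threshold is breached, else None."""
    if normalized['temperature'] > TEMP_LIMIT_C:
        return {
            'Source': 'telemetry.processor',            # assumed source name
            'DetailType': 'TelemetryThresholdBreached',  # assumed detail-type
            'Detail': {
                'deviceId': normalized['deviceId'],
                'temperature': normalized['temperature'],
                'ts': normalized['ts'],
            },
        }
    return None
```

Keeping the check pure makes it trivially unit-testable; the Lambda then only needs a thin wrapper that forwards the returned event to EventBridge.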
Production Code Patterns
Telemetry normalization Lambda
import base64
import json

def handler(event, _context):
    for record in event['Records']:
        # Kinesis delivers the payload base64-encoded inside the record
        payload = json.loads(base64.b64decode(record['kinesis']['data']))
        normalized = {
            'deviceId': payload['device_id'],
            'temperature': round(float(payload['temp_c']), 2),
            'ts': payload['timestamp'],
        }
        write_latest_state(normalized)  # upsert the latest-known state in DynamoDB
        archive_raw(payload)            # append the raw event to the S3 telemetry lake
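Behind `write_latest_state`, the DynamoDB item can hold exactly one latest-state record per device. This builder is a sketch: the `DEVICE#<id>` / `STATE#LATEST` key scheme is an assumption mirroring the command-record keys used elsewhere in this design, not a fixed contract.

```python
def latest_state_item(normalized):
    """Build the DynamoDB item holding a device's latest-known state.

    One item per device, overwritten on every update, so reads of current
    state never have to scan event history.
    """
    return {
        'pk': f"DEVICE#{normalized['deviceId']}",  # assumed key scheme
        'sk': 'STATE#LATEST',                      # single latest-state slot
        'temperature': normalized['temperature'],
        'ts': normalized['ts'],
    }
```

Separating item construction from the actual `put_item` call keeps the key scheme testable without touching AWS.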
Desired state command record
{
  "pk": "DEVICE#veh-2391",
  "sk": "COMMAND#2026-04-10T10:15:00Z#cmd_8821",
  "commandType": "SET_SPEED_LIMIT",
  "status": "PENDING_DELIVERY",
  "desiredState": { "speedLimitKph": 60 },
  "expiresAt": 1775816100
}
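When a device reconnects, a delivery worker can decide whether a queued command record like the one above is still worth pushing by checking its status against the epoch-seconds `expiresAt`. A minimal sketch, assuming `PENDING_DELIVERY` is the only deliverable status:

```python
import time


def is_deliverable(command, now=None):
    """True if a queued command should still be pushed to the device."""
    now = time.time() if now is None else now
    return command['status'] == 'PENDING_DELIVERY' and now < command['expiresAt']
```

Expired commands should be marked as such rather than silently dropped, so operators can see which intent never reached the fleet.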
Scaling Strategy
- Partition streams by device group, customer, or gateway ID to spread load.
- Separate hot device state from immutable telemetry history.
- Use queue depth and lag metrics to autoscale workers independently.
- Maintain only the latest operational state in DynamoDB; keep long history in S3.
- Batch writes to S3 and downstream systems wherever ordering constraints allow.
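The first bullet's partitioning idea can be sketched as a stable partition-key function for Kinesis: events from one gateway are spread across a bounded number of key groups so a hot gateway cannot pin a single shard, while each device always hashes to the same key and keeps per-device ordering. The `shard_groups` bound of 64 is an illustrative choice.

```python
import hashlib


def partition_key(gateway_id, device_id, shard_groups=64):
    """Derive a stable Kinesis partition key of the form '<gateway>#<bucket>'."""
    digest = hashlib.sha256(device_id.encode()).hexdigest()
    bucket = int(digest, 16) % shard_groups   # same device -> same bucket, always
    return f"{gateway_id}#{bucket}"
```

This key would be passed as `PartitionKey` when putting records onto the stream.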
Cost Optimization Techniques
- Avoid indexing raw telemetry into OpenSearch unless specific operational searches justify it.
- Keep only recent hot telemetry in fast paths; use Athena for historical analysis.
- Compress and partition S3 telemetry by day, fleet, and event type.
- Use short TTLs on ephemeral command dedupe records and dashboard cache entries.
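The day/fleet/event-type partitioning above can be sketched as an S3 key builder using Hive-style `key=value` prefixes, which lets Athena prune partitions instead of scanning the whole bucket. The prefix names and `.json.gz` suffix are illustrative assumptions:

```python
def telemetry_object_key(fleet, event_type, ts, object_id):
    """Build a Hive-style partitioned S3 key for a compressed telemetry batch."""
    day = ts[:10]  # ISO-8601 timestamp, e.g. '2026-04-10T10:15:00Z' -> '2026-04-10'
    return f"telemetry/dt={day}/fleet={fleet}/event={event_type}/{object_id}.json.gz"
```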
Security Best Practices
- Authenticate every device and isolate per-device or per-tenant identities.
- Encrypt stream, queue, bucket, and table data with KMS.
- Use least-privilege IAM for telemetry processors and command workers.
- Protect command issuance APIs with stronger operator auth and approval workflows.
Failure Handling and Resilience
- Preserve telemetry first, enrich second, to keep replay available.
- Treat commands as idempotent operations with explicit command IDs.
- Use DLQs for malformed payloads and failed command workflows.
- Design for offline devices by storing desired state and applying on reconnect.
- Run all core services multi-AZ and keep analytics decoupled from real-time paths.
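The idempotency point can be sketched with an in-memory stand-in for a DynamoDB conditional put on the command-status record: a command ID is claimed exactly once, and duplicate deliveries are acknowledged without re-applying side effects. This is a sketch of the pattern, not the production data store.

```python
class CommandExecutor:
    """Idempotent command handling keyed by explicit command IDs."""

    def __init__(self):
        self._seen = set()   # stands in for a conditional write on the command record
        self.applied = []

    def execute(self, command_id, apply_fn):
        """Apply the command once; return False on duplicate delivery."""
        if command_id in self._seen:
            return False     # already applied: ack without repeating side effects
        self._seen.add(command_id)
        apply_fn()
        self.applied.append(command_id)
        return True
```

In production, the `_seen` check would be a DynamoDB `put_item` with a condition expression so that claiming an ID is atomic across workers.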
Trade-offs and Alternatives
This design scales well for append-heavy telemetry and intermittent connectivity, but it adds multiple asynchronous components. A simpler ECS plus Aurora stack may work for smaller fleets, though it usually becomes cost-inefficient for time-series-heavy ingestion.
Real-World Use Case
An industrial monitoring platform or connected-vehicle backend can use this architecture for sensor ingestion, alerting, and remote command execution.
Key Interview Insights
- Separate latest state from event history.
- Explain why commands need durable intent tracking, not just best-effort delivery.
- Discuss ordering and deduplication per device.
- Show how replayability and device intermittency shape the design.
Recommended Reading
→ Designing Data-Intensive Applications — The essential book for understanding distributed systems, databases, and the infrastructure behind architectures like these.
→ System Design Interview Vol. 2 — Covers many of the architectures in this post in interview format with trade-off analysis.
