
IoT backends are not just ingest pipelines. They need to survive noisy telemetry, maintain a reliable digital twin of device state, and queue operator intent until the physical fleet is ready to receive it. A connected-device platform must ingest high-cardinality telemetry from millions of devices, trigger alerts in near real time, maintain each device's digital state, and deliver operator commands reliably even when devices are intermittently connected.
TL;DR: Use Kinesis for telemetry ingestion, Lambda for fast stream processing, DynamoDB for latest-known state, S3 plus Athena for history, and Step Functions with SQS for durable command workflows.
Why Naive Solutions Break
Writing every metric directly into a relational database overwhelms write throughput and storage budgets. Mixing telemetry ingestion, command handling, and alerting inside one service makes the system hard to evolve, and offline devices turn command delivery into a retry nightmare.
Architecture Overview
Use API-based ingestion or broker-fed ingestion into Kinesis, process telemetry with Lambda, store device state in DynamoDB, persist raw telemetry in S3, surface analytics through Glue and Athena, route alerts and commands through EventBridge plus SQS, and coordinate multi-step command workflows with Step Functions.
Architecture Diagram

Service-by-Service Breakdown
- Kinesis Data Streams: High-throughput ingestion for telemetry partitions.
- Lambda: Real-time transformation, threshold checks, and state updates.
- DynamoDB: Device registry, latest-known state, command status, and lease ownership.
- S3: Raw telemetry lake for compliance, replay, and ML use.
- Glue and Athena: Fleet analytics, trend analysis, and ad hoc operational SQL.
- EventBridge: Event routing for alerting, maintenance workflows, and external system integration.
- SQS: Durable queue for commands destined for intermittently connected devices.
- Step Functions: Multi-step workflows for firmware rollout, retry policies, and escalation.
- ElastiCache (Redis): Optional fast cache for fleet dashboards or high-frequency state lookups.
- CloudWatch and X-Ray: Telemetry lag, alert volume, and command workflow tracing.
Request Lifecycle and Data Flow
- Devices publish telemetry to the ingestion layer and into Kinesis.
- Lambda consumers validate payloads and normalize device events.
- Latest-known device state is updated in DynamoDB.
- Raw events are appended to S3 for durable history.
- Threshold breaches or anomaly events are emitted to EventBridge.
- Alerting, ticketing, or maintenance consumers react asynchronously.
- Commands are queued in SQS and executed through Step Functions-based workflows that handle retries and acknowledgment tracking.
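The threshold check in step five can be a pure function that either returns an EventBridge-style event or nothing. A minimal sketch, where the 85 °C limit, the `telemetry.processor` source name, and the `TelemetryThresholdBreached` detail-type are illustrative assumptions, not fixed parts of the design:

```python
TEMP_LIMIT_C = 85.0  # assumed fleet-wide threshold for illustration


def check_thresholds(normalized):
    """Return an EventBridge-style event dict if a threshold is breached, else None."""
    if normalized['temperature'] > TEMP_LIMIT_C:
        return {
            'Source': 'telemetry.processor',            # assumed source name
            'DetailType': 'TelemetryThresholdBreached',  # assumed detail-type
            'Detail': {
                'deviceId': normalized['deviceId'],
                'temperature': normalized['temperature'],
                'ts': normalized['ts'],
            },
        }
    return None
```

Keeping the check pure makes it trivially unit-testable; the Lambda then only needs a thin wrapper that forwards the returned event to EventBridge.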
Production Code Patterns
Telemetry normalization Lambda
import base64
import json

def handler(event, _context):
    for record in event['Records']:
        # Kinesis delivers the payload base64-encoded inside the record
        payload = json.loads(base64.b64decode(record['kinesis']['data']))
        normalized = {
            'deviceId': payload['device_id'],
            'temperature': round(float(payload['temp_c']), 2),
            'ts': payload['timestamp'],
        }
        write_latest_state(normalized)  # upsert the latest-known state in DynamoDB
        archive_raw(payload)            # append the raw event to the S3 telemetry lake
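Behind `write_latest_state`, the DynamoDB item can hold exactly one latest-state record per device. This builder is a sketch: the `DEVICE#<id>` / `STATE#LATEST` key scheme is an assumption mirroring the command-record keys used elsewhere in this design, not a fixed contract.

```python
def latest_state_item(normalized):
    """Build the DynamoDB item holding a device's latest-known state.

    One item per device, overwritten on every update, so reads of current
    state never have to scan event history.
    """
    return {
        'pk': f"DEVICE#{normalized['deviceId']}",  # assumed key scheme
        'sk': 'STATE#LATEST',                      # single latest-state slot
        'temperature': normalized['temperature'],
        'ts': normalized['ts'],
    }
```

Separating item construction from the actual `put_item` call keeps the key scheme testable without touching AWS.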
Desired state command record
{
  "pk": "DEVICE#veh-2391",
  "sk": "COMMAND#2026-04-10T10:15:00Z#cmd_8821",
  "commandType": "SET_SPEED_LIMIT",
  "status": "PENDING_DELIVERY",
  "desiredState": { "speedLimitKph": 60 },
  "expiresAt": 1775816100
}
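When a device reconnects, a delivery worker can decide whether a queued command record like the one above is still worth pushing by checking its status against the epoch-seconds `expiresAt`. A minimal sketch, assuming `PENDING_DELIVERY` is the only deliverable status:

```python
import time


def is_deliverable(command, now=None):
    """True if a queued command should still be pushed to the device."""
    now = time.time() if now is None else now
    return command['status'] == 'PENDING_DELIVERY' and now < command['expiresAt']
```

Expired commands should be marked as such rather than silently dropped, so operators can see which intent never reached the fleet.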
Scaling Strategy
- Partition streams by device group, customer, or gateway ID to spread load.
- Separate hot device state from immutable telemetry history.
- Use queue depth and lag metrics to autoscale workers independently.
- Maintain only the latest operational state in DynamoDB; keep long history in S3.
- Batch writes to S3 and downstream systems wherever ordering constraints allow.
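The first bullet's partitioning idea can be sketched as a stable partition-key function for Kinesis: events from one gateway are spread across a bounded number of key groups so a hot gateway cannot pin a single shard, while each device always hashes to the same key and keeps per-device ordering. The `shard_groups` bound of 64 is an illustrative choice.

```python
import hashlib


def partition_key(gateway_id, device_id, shard_groups=64):
    """Derive a stable Kinesis partition key of the form '<gateway>#<bucket>'."""
    digest = hashlib.sha256(device_id.encode()).hexdigest()
    bucket = int(digest, 16) % shard_groups   # same device -> same bucket, always
    return f"{gateway_id}#{bucket}"
```

This key would be passed as `PartitionKey` when putting records onto the stream.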
Cost Optimization Techniques
- Avoid indexing raw telemetry into OpenSearch unless specific operational searches justify it.
- Keep only recent hot telemetry in fast paths; use Athena for historical analysis.
- Compress and partition S3 telemetry by day, fleet, and event type.
- Use short TTLs on ephemeral command dedupe records and dashboard cache entries.
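The day/fleet/event-type partitioning above can be sketched as an S3 key builder using Hive-style `key=value` prefixes, which lets Athena prune partitions instead of scanning the whole bucket. The prefix names and `.json.gz` suffix are illustrative assumptions:

```python
def telemetry_object_key(fleet, event_type, ts, object_id):
    """Build a Hive-style partitioned S3 key for a compressed telemetry batch."""
    day = ts[:10]  # ISO-8601 timestamp, e.g. '2026-04-10T10:15:00Z' -> '2026-04-10'
    return f"telemetry/dt={day}/fleet={fleet}/event={event_type}/{object_id}.json.gz"
```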
Security Best Practices
- Authenticate every device and isolate per-device or per-tenant identities.
- Encrypt stream, queue, bucket, and table data with KMS.
- Use least-privilege IAM for telemetry processors and command workers.
- Protect command issuance APIs with stronger operator auth and approval workflows.
Failure Handling and Resilience
- Preserve telemetry first, enrich second, to keep replay available.
- Treat commands as idempotent operations with explicit command IDs.
- Use DLQs for malformed payloads and failed command workflows.
- Design for offline devices by storing desired state and applying on reconnect.
- Run all core services multi-AZ and keep analytics decoupled from real-time paths.
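The idempotency point can be sketched with an in-memory stand-in for a DynamoDB conditional put on the command-status record: a command ID is claimed exactly once, and duplicate deliveries are acknowledged without re-applying side effects. This is a sketch of the pattern, not the production data store.

```python
class CommandExecutor:
    """Idempotent command handling keyed by explicit command IDs."""

    def __init__(self):
        self._seen = set()   # stands in for a conditional write on the command record
        self.applied = []

    def execute(self, command_id, apply_fn):
        """Apply the command once; return False on duplicate delivery."""
        if command_id in self._seen:
            return False     # already applied: ack without repeating side effects
        self._seen.add(command_id)
        apply_fn()
        self.applied.append(command_id)
        return True
```

In production, the `_seen` check would be a DynamoDB `put_item` with a condition expression so that claiming an ID is atomic across workers.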
Trade-offs and Alternatives
This design scales well for append-heavy telemetry and intermittent connectivity, but it adds multiple asynchronous components. A simpler ECS plus Aurora stack may work for smaller fleets, though it usually becomes cost-inefficient for time-series-heavy ingestion.
Real-World Use Case
An industrial monitoring platform or connected-vehicle backend can use this architecture for sensor ingestion, alerting, and remote command execution.
Key Interview Insights
- Separate latest state from event history.
- Explain why commands need durable intent tracking, not just best-effort delivery.
- Discuss ordering and deduplication per device.
- Show how replayability and device intermittency shape the design.
Recommended Reading
→ Designing Data-Intensive Applications — The essential book for understanding distributed systems, databases, and the infrastructure behind architectures like these.
→ System Design Interview Vol. 2 — Covers many of the architectures in this post in interview format with trade-off analysis.
