Telemetry and Fleet Management System

A non-confidential, case-study-style overview. Specifics are generalized.

Problem

Operating a growing device fleet meant answering basic questions (which devices are healthy, what firmware they are running, whether a subset can be reconfigured safely) that the existing tooling couldn’t answer without manual work. At the same time, raw telemetry volume was growing faster than it could be stored and queried cost-effectively.

Decisions

Built a telemetry pipeline that separated hot-path ingestion from analytical storage, aggregating and down-sampling before long-term persistence.
Introduced a device-state model — health, configuration, and lifecycle — kept current from device-reported events rather than polled on demand.
Designed remote operations (configuration, staged updates) around idempotent, acknowledged commands so the fleet could be changed safely in batches.
Made per-device and per-cohort observability a built-in capability rather than something assembled by hand during an incident.
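
To make the device-state decision concrete, here is a minimal sketch of state kept current by folding device-reported events, with command acknowledgements applied idempotently. Every name in it (DeviceState, DeviceEvent, applyEvent, the field names and the timestamp rule) is a hypothetical chosen for illustration; the source does not describe the actual schema or ordering rules.

```typescript
// Hypothetical shapes: the real system's schema is not described in the source.
type Health = "ok" | "degraded" | "offline";
type Lifecycle = "provisioned" | "active" | "retired";

interface DeviceState {
  deviceId: string;
  health: Health;
  lifecycle: Lifecycle;
  configVersion: number;
  lastEventAt: number;       // timestamp of the newest event applied so far
  appliedCommands: string[]; // ids of commands the device has acknowledged
}

type DeviceEvent =
  | { kind: "health"; at: number; health: Health }
  | { kind: "lifecycle"; at: number; lifecycle: Lifecycle }
  | { kind: "commandAck"; at: number; commandId: string; configVersion: number };

// Folds one device-reported event into the current state.
function applyEvent(state: DeviceState, event: DeviceEvent): DeviceState {
  if (event.kind === "commandAck") {
    // Idempotent: a redelivered acknowledgement leaves the state unchanged.
    if (state.appliedCommands.includes(event.commandId)) return state;
    return {
      ...state,
      configVersion: Math.max(state.configVersion, event.configVersion),
      appliedCommands: [...state.appliedCommands, event.commandId],
      lastEventAt: Math.max(state.lastEventAt, event.at),
    };
  }
  // Assumed rule: a late-arriving status event never overwrites newer state.
  if (event.at < state.lastEventAt) return state;
  if (event.kind === "health") {
    return { ...state, health: event.health, lastEventAt: event.at };
  }
  return { ...state, lifecycle: event.lifecycle, lastEventAt: event.at };
}

const initial: DeviceState = {
  deviceId: "device-1",
  health: "ok",
  lifecycle: "active",
  configVersion: 1,
  lastEventAt: 0,
  appliedCommands: [],
};

const events: DeviceEvent[] = [
  { kind: "health", at: 100, health: "degraded" },
  { kind: "commandAck", at: 110, commandId: "cmd-1", configVersion: 2 },
  { kind: "commandAck", at: 120, commandId: "cmd-1", configVersion: 2 }, // redelivery
  { kind: "health", at: 90, health: "ok" },                              // stale
];

const current = events.reduce(applyEvent, initial);
console.log(current);
```

Because the state is a pure fold over events, a read returns whatever has been applied so far without contacting the device, which is the eventually-consistent behavior the model accepts.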

Tradeoffs

Aggregating at ingestion reduced storage and query cost but meant deciding early how much fidelity to keep. That decision can be revisited, but not for free.
An eventually-consistent device-state model favored availability and throughput over always-current reads, which fit the operational use cases.
Staged, acknowledged rollouts were slower than fire-and-forget but eliminated whole classes of fleet-wide mistakes.
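
The rollout tradeoff can be sketched as a loop that sends one batch, waits for every acknowledgement, and halts before the next batch if too many devices failed. This is an illustrative sketch only: stagedRollout, SendCommand, fakeSend, the batch size and the failure-rate threshold are assumptions, not the system's actual API or policy.

```typescript
// Hypothetical types: the source does not describe the real command transport.
type Ack = { deviceId: string; ok: boolean };
type SendCommand = (deviceId: string, commandId: string) => Promise<Ack>;

interface RolloutResult {
  completed: string[];
  failed: string[];
  skipped: string[]; // devices never touched because the rollout halted
  halted: boolean;
}

async function stagedRollout(
  deviceIds: string[],
  commandId: string, // reused across retries so a device can deduplicate it
  send: SendCommand,
  batchSize: number,
  maxFailureRate: number,
): Promise<RolloutResult> {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const result: RolloutResult = { completed: [], failed: [], skipped: [], halted: false };

  for (let i = 0; i < deviceIds.length; i += batchSize) {
    const batch = deviceIds.slice(i, i + batchSize);
    // Wait for every acknowledgement; a transport error counts as a failure.
    const acks = await Promise.all(
      batch.map((id) =>
        send(id, commandId).catch((): Ack => ({ deviceId: id, ok: false })),
      ),
    );
    for (const ack of acks) {
      (ack.ok ? result.completed : result.failed).push(ack.deviceId);
    }
    const failures = acks.filter((ack) => !ack.ok).length;
    if (failures / batch.length > maxFailureRate) {
      // Stop here: the blast radius is limited to the batches already sent.
      result.halted = true;
      result.skipped = deviceIds.slice(i + batchSize);
      break;
    }
  }
  return result;
}

// Stand-in transport for the example: ids starting with "bad" fail.
const fakeSend: SendCommand = async (deviceId) => ({
  deviceId,
  ok: !deviceId.startsWith("bad"),
});

stagedRollout(["d1", "d2", "bad3", "bad4", "d5", "d6"], "cmd-42", fakeSend, 2, 0.5)
  .then((result) => console.log(result));
```

Waiting on each batch is what makes this slower than fire-and-forget. In exchange, a bad change stops after the batch that exposed it instead of reaching the whole fleet.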

Outcome

Operators gained a current, queryable view of fleet health and the ability to make changes in controlled batches with a known blast radius. Telemetry costs became predictable as volume grew, and incident response shifted from manual investigation to dashboards and targeted action.

Technologies

AWS IoT Core · telemetry pipeline · stream processing · device-state modeling · staged remote operations · TypeScript.