Telemetry and Fleet Management System
A system for managing device fleets at scale — health, configuration, and remote operations — backed by a high-throughput telemetry pipeline.
A non-confidential, case-study-style overview. Specifics are generalized.
Problem
Operating a growing device fleet meant answering basic questions (which devices are healthy, what firmware they run, whether a subset can be reconfigured safely) that the existing tooling couldn’t answer without manual work. At the same time, raw telemetry volume was outpacing the ability to store and query it cost-effectively.
Decisions
- Built a telemetry pipeline that separated hot-path ingestion from analytical storage, aggregating and down-sampling before long-term persistence.
- Introduced a device-state model — health, configuration, and lifecycle — kept current from device-reported events rather than polled on demand.
- Designed remote operations (configuration, staged updates) around idempotent, acknowledged commands so the fleet could be changed safely in batches.
- Made per-device and per-cohort observability a built-in capability.
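
To make the device-state and acknowledged-command decisions above more concrete, here is a minimal TypeScript sketch. It is not the system's actual code: the event shapes, the per-device sequence number `seq`, and the names `DeviceState`, `applyEvent`, and `needsResend` are all assumptions introduced for illustration. It shows state being folded from device-reported events, with duplicate or replayed events treated as no-ops so that retrying a command is safe.

```typescript
// Hypothetical event shapes; none of these names come from the original system.
type DeviceEvent =
  | { kind: "heartbeat"; deviceId: string; seq: number; healthy: boolean }
  | {
      kind: "configApplied"; // the device's acknowledgement of a config command
      deviceId: string;
      seq: number;
      commandId: string;
      configVersion: number;
    };

interface DeviceState {
  deviceId: string;
  lastSeq: number; // highest event sequence number folded in so far
  healthy: boolean;
  configVersion: number;
  ackedCommands: Set<string>; // IDs of commands the device has acknowledged
}

function initialState(deviceId: string): DeviceState {
  return {
    deviceId,
    lastSeq: -1,
    healthy: false,
    configVersion: 0,
    ackedCommands: new Set<string>(),
  };
}

// Fold one device-reported event into the state.
// Events at or below lastSeq are ignored, so duplicates and replays change nothing.
// Simplification: a late, out-of-order event is dropped rather than merged.
function applyEvent(state: DeviceState, event: DeviceEvent): DeviceState {
  if (event.seq <= state.lastSeq) return state;
  switch (event.kind) {
    case "heartbeat":
      return { ...state, lastSeq: event.seq, healthy: event.healthy };
    case "configApplied":
      return {
        ...state,
        lastSeq: event.seq,
        configVersion: event.configVersion,
        ackedCommands: new Set(state.ackedCommands).add(event.commandId),
      };
  }
}

// A command is re-sent until the device has acknowledged it.
function needsResend(state: DeviceState, commandId: string): boolean {
  return !state.ackedCommands.has(commandId);
}
```

In this sketch a read returns whatever the last folded event produced, which may lag the device slightly. That matches an eventually-consistent model rather than polling the device on demand.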
Tradeoffs
- Aggregating at ingestion reduced storage and query cost, but it meant deciding early which fidelity to keep. That decision could be revisited later, though not for free.
- An eventually-consistent device-state model favored availability and throughput over always-current reads, a trade that fit the operational use cases.
- Staged, acknowledged rollouts were slower than fire-and-forget but eliminated whole classes of fleet-wide mistakes.
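
The staged-rollout tradeoff can be illustrated with a short TypeScript sketch. Everything here is hypothetical: `stagedRollout`, `SendCommand`, `batchSize`, and `maxFailureRatio` are names invented for this example, and the transport is a synchronous callback only to keep the sketch short (a real transport would be asynchronous). The idea shown is that each batch must be acknowledged before the next one starts, and the rollout halts once a batch fails too often.

```typescript
// Hypothetical transport: returns true if the device acknowledged the command.
type SendCommand = (deviceId: string, commandId: string) => boolean;

interface RolloutResult {
  acked: string[]; // devices that acknowledged the command
  failed: string[]; // devices that did not acknowledge
  untouched: string[]; // devices never contacted because the rollout halted
  halted: boolean; // true if a batch exceeded the failure threshold
}

function stagedRollout(
  deviceIds: string[],
  commandId: string,
  send: SendCommand,
  batchSize: number,
  maxFailureRatio: number,
): RolloutResult {
  if (batchSize < 1) throw new RangeError("batchSize must be at least 1");
  const acked: string[] = [];
  const failed: string[] = [];

  for (let i = 0; i < deviceIds.length; i += batchSize) {
    const batch = deviceIds.slice(i, i + batchSize);
    let batchFailures = 0;

    for (const id of batch) {
      if (send(id, commandId)) {
        acked.push(id);
      } else {
        failed.push(id);
        batchFailures++;
      }
    }

    // Stop before the next batch if this one failed too often.
    if (batchFailures / batch.length > maxFailureRatio) {
      return { acked, failed, untouched: deviceIds.slice(i + batchSize), halted: true };
    }
  }
  return { acked, failed, untouched: [], halted: false };
}
```

With six devices, a batch size of two, and a threshold of 0.5, a second batch that fails entirely stops the rollout with two devices affected and two never contacted. That bounded set is the blast radius in this sketch. Because failed devices may be retried with the same `commandId`, this approach assumes commands are idempotent.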
Outcome
Operators gained a current, queryable view of fleet health and the ability to make changes in controlled batches with a known blast radius. Telemetry costs became predictable as volume grew, and incident response shifted from manual investigation to dashboards and targeted action.
Technologies
AWS IoT Core · telemetry pipeline · stream processing · device-state modeling · staged remote operations · TypeScript.