Why IoT Devices Fail Even When They Meet All Specifications
- Srihari Maddula
- Feb 22
- 4 min read
Many IoT devices fail without ever violating a single specification. They pass certification, meet datasheet limits, conform to protocol requirements, and behave exactly as they were designed to. Yet months or years after deployment, reliability degrades, data quality drops, or behaviour becomes unpredictable. Nothing dramatic breaks. Nothing obviously violates a requirement. The system simply stops being trustworthy.
This is one of the most uncomfortable failure modes in real deployments because it leaves teams with no clear culprit. The design is compliant. The components are qualified. The firmware behaves as specified. And yet the system fails anyway.

The reason is simple and rarely acknowledged: specifications describe components in isolation, while failures emerge at the system level, over time.
Specifications describe correctness, not survival
Most specifications are written to validate correctness at a point in time. A sensor meets its accuracy range. A radio meets sensitivity and output power limits. A power converter meets efficiency targets. A protocol meets timing and retry rules. Each part behaves as documented.
What specifications do not describe is how these parts interact after thousands of operating hours, under fluctuating power, changing RF environments, temperature cycles, firmware evolution, and partial failures. They do not model drift, aging, or the compounding effect of small degradations across subsystems.
A system can be perfectly compliant and still be fragile, because no specification captures how margins erode together.
Hidden assumptions are where failures begin
Every specification carries implicit assumptions. Assumptions about clean power, stable clocks, predictable temperature, benign RF environments, and periodic human intervention. These assumptions are rarely written down because they are considered “normal operating conditions.”

In practice, these assumptions decay gradually:
- Power quality degrades as batteries age and regulators heat
- Clock accuracy drifts with temperature and time
- RF environments change as infrastructure evolves
- Firmware grows more complex than originally planned
In the lab, these assumptions usually hold. In the field, they quietly stop being true.
Failures begin not when a limit is crossed, but when assumptions decay.
Lab validation does not model time
Most systems are validated in controlled environments. Devices are restarted frequently. Power is clean. RF conditions are stable. Temperature is fixed. Logs are actively monitored. Tests focus on functional correctness and edge cases defined by specifications.
Real deployments look nothing like this. Devices are expected to run continuously for years, often unattended. Power quality varies. Temperature cycles daily and seasonally. RF environments evolve as equipment is added or removed. Connectivity drops for long periods.
The system may behave correctly for thousands of hours before a rare interaction surfaces. By then, the failure is hard to reproduce and even harder to attribute to a single cause.
Timing failures rarely violate limits
A large class of field failures begins with timing drift rather than outright faults. Clock inaccuracies accumulate. Task scheduling shifts under load. Interrupt latency increases as firmware grows. Retry patterns unintentionally align across nodes.
Individually, these effects remain within specification. Together, they cause:
- Missed timing windows
- Subtle state-machine desynchronisation
- Increased retries and latency
- Reduced overall reliability

The system still runs, but behaviour drifts far from what was originally validated.
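For scale: a garden-variety ±20 ppm crystal, comfortably within its datasheet, drifts by roughly 1.7 seconds per day, so any behaviour that relies on nodes staying naturally staggered will eventually see them converge. One inexpensive defence against retry alignment is to make retry timing deliberately non-deterministic. The sketch below is a minimal illustration in C; `radio_send`, `sleep_ms`, and `rand_u32` are hypothetical platform hooks standing in for whatever the real driver and RNG provide.

```c
#include <stdint.h>

/* Hypothetical platform hooks -- replace with your radio driver and RNG. */
extern int      radio_send(const uint8_t *buf, uint16_t len); /* 0 on ACK */
extern void     sleep_ms(uint32_t ms);
extern uint32_t rand_u32(void);                                /* seeded per node */

#define BASE_BACKOFF_MS   100u
#define MAX_BACKOFF_MS  16000u
#define MAX_ATTEMPTS        6u

/* Retry with exponential backoff plus random jitter, so nodes that fail at
 * the same moment (e.g. after a shared interference burst) do not retransmit
 * in lockstep and collide again. */
int send_with_backoff(const uint8_t *buf, uint16_t len)
{
    uint32_t backoff = BASE_BACKOFF_MS;

    for (uint8_t attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
        if (radio_send(buf, len) == 0)
            return 0;                        /* acknowledged */

        /* Sleep somewhere in backoff/2 .. backoff, then widen the window. */
        uint32_t jitter = backoff / 2u + (rand_u32() % (backoff / 2u + 1u));
        sleep_ms(jitter);

        backoff = (backoff * 2u > MAX_BACKOFF_MS) ? MAX_BACKOFF_MS : backoff * 2u;
    }
    return -1;                               /* give up, report upstream */
}
```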
Power behaviour changes over a lifecycle
Power specifications usually describe steady-state behaviour. They do not describe how systems behave as batteries age, regulators heat, leakage increases, or brownout thresholds are crossed more frequently.
Over time, marginal power events become common. Flash writes fail intermittently. Sensors reset unexpectedly. Radios retransmit more often. None of this violates a datasheet. All of it affects system behaviour.
Power-related failures are often misdiagnosed because the system “meets spec” when measured under ideal conditions.
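One way to stop treating marginal power events as invisible is to check headroom before the operations that suffer most. The sketch below is a minimal illustration, assuming a hypothetical `adc_read_supply_mv()` rail measurement and a `flash_write()` driver; the 3.0 V threshold is illustrative and would come from the real brownout characteristics of the design.

```c
#include <stdint.h>

/* Hypothetical platform hooks -- substitute your ADC and flash drivers. */
extern uint16_t adc_read_supply_mv(void);                  /* battery/rail in mV */
extern int      flash_write(uint32_t addr, const void *p, uint32_t len);

#define FLASH_WRITE_MIN_MV  3000u   /* margin above the brownout threshold */

static uint32_t deferred_writes;    /* how often the margin was not there   */
static uint16_t lowest_seen_mv = UINT16_MAX;

/* Only commit to flash when the supply has enough headroom; otherwise defer
 * and record the event. The counters are worth reporting upstream: a rising
 * deferral rate is an early sign of battery or regulator ageing, long before
 * anything violates the datasheet. */
int flash_write_guarded(uint32_t addr, const void *data, uint32_t len)
{
    uint16_t mv = adc_read_supply_mv();
    if (mv < lowest_seen_mv)
        lowest_seen_mv = mv;

    if (mv < FLASH_WRITE_MIN_MV) {
        deferred_writes++;
        return -1;                  /* caller retries after the load subsides */
    }
    return flash_write(addr, data, len);
}
```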
RF compliance is not RF robustness
Radio specifications focus on measurable parameters: sensitivity, output power, modulation accuracy, spectral masks. They do not account for real-world RF ecosystems.
In real deployments:
- Antennas detune due to enclosure effects and aging
- Multipath dominates in industrial and urban environments
- Interference sources appear years after deployment
- Regulatory constraints evolve
A radio can remain compliant while link quality steadily degrades. Retries increase, latency rises, energy consumption grows, and nodes become unreliable, even though nothing is technically out of spec.
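Because nothing here trips a limit, the only way to see it is to trend it. A minimal sketch of that idea: keep a smoothed per-packet retry count and flag the link when the average creeps past a level chosen for the deployment. The smoothing factor and threshold below are illustrative, not prescriptive.

```c
#include <stdint.h>
#include <stdbool.h>

#define EWMA_DIV          16        /* smoothing: each sample has weight 1/16  */
#define RETRY_ALERT_X256 (2 * 256)  /* flag the link if it averages >2 retries */

static int32_t retry_ewma_x256;     /* moving average in x256 fixed point      */

/* Call once per transmitted packet with the number of retries it needed.
 * Returns true when the smoothed retry rate crosses the alert level -- the
 * slow creep that a one-off compliance measurement will never show. */
bool link_record_packet(uint8_t retries)
{
    int32_t sample = (int32_t)retries * 256;
    retry_ewma_x256 += (sample - retry_ewma_x256) / EWMA_DIV;
    return retry_ewma_x256 > RETRY_ALERT_X256;
}
```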
Firmware evolves, assumptions don’t
Most specifications implicitly assume static firmware. Real products never stay static.
Features are added. Workarounds accumulate. Debug paths remain. Timing paths lengthen. Memory usage grows. Each change is reasonable in isolation. Together, they alter system dynamics.
The device still meets functional requirements, but the original assumptions about timing, power, and interaction no longer hold. Failures emerge not from a single bug, but from accumulated complexity.
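One way to keep those assumptions honest is to measure them release after release. The sketch below, with hypothetical `ticks_now_us()`, task, and logging hooks, tracks how much of a periodic task's original budget each build actually consumes, so a shrinking timing margin shows up in logs long before a deadline is missed.

```c
#include <stdint.h>

/* Hypothetical hooks -- substitute your timer, tasks, and logger. */
extern uint32_t ticks_now_us(void);
extern void     do_sensor_read(void);
extern void     do_radio_service(void);
extern void     log_warn(const char *fmt, ...);

#define TASK_BUDGET_US   5000u      /* the slot this task was designed for    */
#define MARGIN_WARN_PCT    20u      /* complain when <20% of the slot is left */

static uint32_t worst_case_us;      /* high-water mark since boot             */

/* Wrap the periodic work and track how much of its budget it really uses.
 * Each firmware release nudges this number; watching the high-water mark is
 * how accumulated complexity becomes visible before deadlines are missed. */
void periodic_task(void)
{
    uint32_t start = ticks_now_us();

    do_sensor_read();               /* the actual work, grown over releases   */
    do_radio_service();

    uint32_t used = ticks_now_us() - start;
    if (used > worst_case_us)
        worst_case_us = used;

    uint32_t warn_level = TASK_BUDGET_US - (TASK_BUDGET_US * MARGIN_WARN_PCT) / 100u;
    if (used > warn_level)
        log_warn("timing margin low: %lu of %lu us used",
                 (unsigned long)used, (unsigned long)TASK_BUDGET_US);
}
```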
Observability is rarely specified
Specifications focus on outputs, not introspection. As long as values remain within acceptable ranges, the system is considered healthy.
This leads to silent failures:
- Sensors drift but stay within limits
- Timing margins shrink without alarms
- Power instability increases invisibly
By the time a failure is noticed, the system has often been unhealthy for a long time.
Specifications do not demand observability. Survivable systems do.
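What survivable systems tend to add is a health record that travels with the data: not the outputs the specification cares about, but the margins behind them. A minimal sketch, with illustrative field names and units, drawing on the kinds of counters described above:

```c
#include <stdint.h>

/* A health record sent alongside normal sensor data. None of these fields
 * are required by any specification; all of them are what makes a slowly
 * degrading device distinguishable from a healthy one. */
struct device_health {
    uint32_t uptime_s;              /* time since last reset                  */
    uint8_t  last_reset_cause;      /* power-on, brownout, watchdog, ...      */
    uint16_t brownout_count;        /* marginal power events since boot       */
    uint16_t min_supply_mv;         /* lowest rail voltage observed           */
    int16_t  clock_error_ppm;       /* drift estimate vs. network time        */
    uint16_t retry_ewma_x256;       /* smoothed radio retries per packet      */
    uint16_t worst_task_us;         /* high-water mark of the main loop       */
    int8_t   temperature_c;         /* board temperature at sampling time     */
};

/* Populate from the counters the rest of the firmware maintains and append
 * the record to the next uplink. Trending these fields server-side is what
 * turns "the device went quiet last night" into "its margins had been
 * shrinking for three months". */
void health_snapshot(struct device_health *out);
```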
Meeting specs is not the same as being resilient
Specifications optimise for correctness at design time. Real deployments demand resilience over time.
Resilience requires anticipating degradation, modelling interactions, and designing systems that can detect when their own assumptions are failing. It requires accepting that specifications describe only a narrow slice of reality.
A system designed only to meet specifications will eventually fail. A system designed to survive assumption failure will degrade more gracefully and predictably.
The EurthTech perspective
At EurthTech, many of the failures we encounter occur in systems that are fully compliant. The issue is not poor engineering. It is a structural gap between specification-driven design and real-world operation.
We design systems by explicitly asking where assumptions will decay, how margins will erode, and what happens when components remain compliant but interactions change. Specifications remain important, but they are treated as a starting point, not a guarantee.
Systems that survive real deployments are not the ones that meet specifications most precisely. They are the ones designed with the expectation that specifications will eventually stop describing reality—and that the system must still function when they do.