My Code Works… Until the Device Is Deployed
- Srihari Maddula
- 1 day ago
Author: Srihari Maddula
Reading Time: 10-12 mins
Tags: Debugging, Reliability, Memory Leaks, Watchdogs, Firmware Engineering

Time is the enemy of firmware. What works for an hour might fail in a month. (Photo by Aron Visuals on Unsplash)
The "One Week" Rule
Here is a scary truth about embedded systems: Most bugs don't show up immediately.
You test your device on the bench for an hour. It passes every test. You ship 100 units to a customer. One week later, the phone calls start. "The screen froze." "It stopped sending data." "The red LED is stuck on."
Why? Because you tested logic, but you didn't test time. In firmware, time allows tiny errors to accumulate until they become catastrophic.
1. The Silent Killer: Memory Fragmentation
In high-level languages like Python or Java, you don't worry about creating strings or objects. The Garbage Collector handles it.
In C/C++ on a microcontroller, using malloc() and free() (dynamic memory allocation) is playing with fire.
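The Scenario: You allocate a buffer for a JSON packet. You process it. You free it. Repeat that thousands of times with packets of different sizes, interleaved with other allocations, and your heap memory gets Swiss-cheesed. You have 10 KB of free RAM in total, but no contiguous block larger than 100 bytes.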
The Crash: On Day 7, a slightly larger JSON packet arrives. malloc() returns NULL. Your code doesn't check for NULL. Hard Fault.
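The Pro Rule: Avoid dynamic allocation. Use static allocation wherever possible. If you need buffers, pre-allocate them at startup. If you truly must allocate at runtime, use a dedicated fixed-size block allocator (a memory pool) instead of the general-purpose heap, and always check the result for NULL.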
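A fixed-size block pool could look something like the sketch below. The block size, block count, and the pool_* names are assumptions made for illustration, not taken from any particular library. Because every block is the same size and the storage is reserved at link time, the pool cannot fragment: it either has a free block or it does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCK_SIZE  256u  /* assumed worst-case packet size, in bytes */
#define POOL_BLOCK_COUNT 4u    /* assumed maximum number of packets in flight */

/* A block is either handed out as storage or linked into the free list. */
typedef union pool_block {
    union pool_block *next;
    uint8_t           data[POOL_BLOCK_SIZE];
} pool_block_t;

static pool_block_t  pool_storage[POOL_BLOCK_COUNT]; /* reserved at link time */
static pool_block_t *pool_free_list;

/* Call once at startup: chain every block into the free list. */
void pool_init(void)
{
    pool_free_list = NULL;
    for (size_t i = 0; i < POOL_BLOCK_COUNT; i++) {
        pool_storage[i].next = pool_free_list;
        pool_free_list = &pool_storage[i];
    }
}

/* Returns a POOL_BLOCK_SIZE-byte buffer, or NULL when the pool is empty. */
void *pool_alloc(void)
{
    pool_block_t *block = pool_free_list;
    if (block != NULL) {
        pool_free_list = block->next;
    }
    return block;
}

/* Returns a buffer previously obtained from pool_alloc() to the pool. */
void pool_free(void *ptr)
{
    pool_block_t *block = ptr;
    /* Catch pointers that did not come from this pool. */
    assert(block >= &pool_storage[0] &&
           block <= &pool_storage[POOL_BLOCK_COUNT - 1u]);
    block->next = pool_free_list;
    pool_free_list = block;
}
```

The caller still has to handle a NULL return, but exhaustion now means "more than four packets in flight", which is a limit you can reason about at design time. This sketch is not interrupt-safe; if an ISR allocates or frees blocks, the list updates would need a critical section.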
2. The Integer Overflow (The 49-Day Bug)
This is a classic. On Arduino, millis() returns the number of milliseconds since startup as an unsigned long (32-bit).
Math Time: 2^32 milliseconds = 4,294,967,296 ms ≈ 49.7 days.
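If your code says: if (millis() > last_time + 1000)
After 49.7 days, millis() wraps around to 0, and the addition last_time + 1000 can itself overflow to a small number. Depending on the timing, the check either fires on every pass or stops firing until the next rollover. Your logic breaks. Your device stops updating. The customer reboots it, and it works... for another 49 days.
The Fix: Measure elapsed time by subtraction: if (millis() - last_time >= 1000). Unsigned arithmetic wraps modulo 2^32, so (small_number - huge_number) still yields the true elapsed time across the rollover.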
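Because unsigned arithmetic wraps modulo 2^32, computing elapsed time by subtraction gives the right answer across the rollover, while adding an interval to a timestamp does not. The sketch below shows both forms side by side. The function names are made up for illustration, and the tick values are passed in as parameters so the behavior can be checked on a host without waiting 49 days.

```c
#include <stdbool.h>
#include <stdint.h>

/* True once at least interval_ms have passed since start_ms.
 * The unsigned subtraction wraps modulo 2^32, so the result stays correct
 * across the rollover as long as the real elapsed time is under ~49.7 days. */
bool interval_elapsed(uint32_t now_ms, uint32_t start_ms, uint32_t interval_ms)
{
    return (uint32_t)(now_ms - start_ms) >= interval_ms;
}

/* The fragile form, for comparison: start_ms + interval_ms can overflow
 * to a small number, which makes the comparison true far too early. */
bool interval_elapsed_fragile(uint32_t now_ms, uint32_t start_ms, uint32_t interval_ms)
{
    return now_ms > (uint32_t)(start_ms + interval_ms);
}
```

For example, with a start time 512 ms before the rollover and a 1000 ms interval, the fragile version reports "elapsed" after only 100 ms, while the subtraction version waits until the counter has wrapped and reached 488.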
3. The "Heisenbug": Race Conditions
A race condition happens when two parts of your code try to modify the same variable at the same time.
Example:
1. Main loop starts reading a multi-byte variable, counter.
2. An interrupt fires partway through the read, which takes several instructions on a small MCU.
3. The interrupt increments counter.
4. The interrupt returns.
5. The main loop finishes reading... and ends up with the old value (or a corrupted mix of old and new bytes).
These bugs are notoriously hard to reproduce because they depend on microsecond timing. They might happen once a week.
The Fix: Use Atomic Access or disable interrupts briefly (Critical Sections) when reading shared variables.
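A critical section around a shared read might be sketched like this. The irq_save_and_disable() and irq_restore() hooks are hypothetical placeholders: on a real target they would save the interrupt mask, disable interrupts, and later restore the saved state using whatever your MCU vendor's headers provide. Here they only track nesting so the example compiles on a host.

```c
#include <stdint.h>

/* Hypothetical port hooks. On hardware these would mask interrupts and
 * restore the previous mask; here they only record the nesting depth. */
static volatile uint32_t irq_lock_depth;

static uint32_t irq_save_and_disable(void) { return irq_lock_depth++; }
static void     irq_restore(uint32_t saved) { irq_lock_depth = saved; }

/* Shared between the ISR and the main loop. volatile forces a real memory
 * access on every use; it does not make a multi-word access atomic. */
static volatile uint64_t event_counter;

/* Called from the (hypothetical) interrupt handler. */
void event_isr(void)
{
    event_counter++;
}

/* Called from the main loop: copy the value inside a short critical section
 * so the ISR cannot run between the partial reads of the 64-bit variable. */
uint64_t event_counter_read(void)
{
    uint32_t saved = irq_save_and_disable();
    uint64_t snapshot = event_counter;
    irq_restore(saved);
    return snapshot;
}
```

Saving and restoring the previous state, rather than unconditionally re-enabling interrupts, keeps the function safe to call from code that already holds a critical section. Keep the protected region as short as a single copy so interrupt latency stays low.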

Finding a race condition is like finding a specific raindrop in a storm. (Photo by Markus Spiske on Unsplash)
4. The Last Line of Defense: The Watchdog
No code is perfect. Eventually, a cosmic ray will flip a bit in your RAM, or a brownout will confuse your program counter.
The Independent Watchdog Timer (IWDG) is a hardware counter that counts down. If it reaches zero, it resets the MCU.
Your Job: "Kick" (reset) the watchdog periodically in your main loop.
The Mistake: Kicking the watchdog inside a timer interrupt. Why it's bad: Your main loop could be completely frozen (deadlocked), but the interrupt keeps firing and kicking the dog. The device appears "alive" to the hardware, but is "dead" to the user.
The Right Way: Kick the dog only when all critical tasks have reported "I'm OK."
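One way to sketch this "everyone reports in" pattern is with one flag bit per task. The task names and the hw_watchdog_kick() hook below are hypothetical; on real hardware the hook would reload the watchdog counter, while here it only counts kicks so the logic can be exercised on a host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task IDs, one bit each. */
#define TASK_SENSOR (1u << 0)
#define TASK_RADIO  (1u << 1)
#define TASK_UI     (1u << 2)
#define TASK_ALL    (TASK_SENSOR | TASK_RADIO | TASK_UI)

static uint32_t checkins;    /* bits set by tasks since the last kick */
static uint32_t kick_count;  /* host-side stand-in for the hardware reload */

/* Hypothetical hardware hook: would reload the watchdog counter on target. */
static void hw_watchdog_kick(void)
{
    kick_count++;
}

/* Each task calls this after completing one healthy iteration. */
void watchdog_checkin(uint32_t task_bit)
{
    checkins |= task_bit;
}

/* Called from the main loop. Kicks the watchdog only if every task has
 * checked in since the previous kick, then clears the flags so each task
 * has to prove itself again. Returns true if the watchdog was kicked. */
bool watchdog_service(void)
{
    if ((checkins & TASK_ALL) != TASK_ALL) {
        return false;  /* someone is silent: let the hardware timer run down */
    }
    checkins = 0;
    hw_watchdog_kick();
    return true;
}
```

If a single task hangs, its bit never gets set, watchdog_service() stops kicking, and the hardware resets the MCU. If tasks check in from interrupts or separate RTOS threads, the updates to checkins would need the same protection as any other shared variable.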
Summary: Code for the Long Haul
Writing firmware that runs for 5 minutes is easy. Writing firmware that runs for 5 years is engineering.
No Malloc: Static is safe.
Handle Overflows: Use subtraction for time comparisons, not addition.
Protect Shared Data: Use volatile and critical sections.
Smart Watchdogs: Ensure the system is actually healthy before kicking.
At EurthTech, we define success not by "it works now," but by "it's still working next year."
Recommended Resources
Better Embedded System Software: Phil Koopman's blog on critical safety.
Barr Group Coding Standard: Rules to prevent these exact bugs.
Memfault: Tools for debugging devices remotely after deployment.



