Imagine your product is a spacecraft orbiting Mars, 140 million miles from home. And it keeps resetting, wasting millions of dollars of mission funding. You’ve identified the problem and now you have to update the code to fix it–in flight–140 million miles away.
Don’t brick it!
That was the problem facing the Mars Pathfinder team in 1997 (see also additional information from the Pathfinder software team lead).
This is an extreme example of a common problem: the need to update fielded devices, including those that are misbehaving.
The solution is Over The Air update (OTA), sometimes referred to as Firmware Over The Air update (FOTA). It takes advantage of modern connectivity for devices. Even for devices that aren’t normally connected to communication networks, it’s often possible to connect them to a computer or mobile device that is connected to the network and update them.
OTA isn’t just useful for fixing bugs. It’s also become a common part of development and release strategy. You can release a product with initial capability, then add more features over time via OTA.
OTA can get surprisingly complex in a number of ways. It involves a lot of moving parts and potential failure points. There are a number of options for implementing it, depending on MCU architecture, board design, operational environment, and use case.
Relationship To Bootloader
Closely related to OTA is the bootloader. This is the initial code that starts running when you boot a device. It runs the embedded system application image that is the actual functional code of the device.
One of the jobs of a bootloader is to load updated code if it finds it present on the device, or if it detects that an update is being attempted in real time as it boots the device. There are a number of ways of implementing bootloaders and their update steps. OTA is just one possible way to get an image to the bootloader.
The overall system consists of device components, server components, and communications infrastructure.
The devices contain components to receive and apply the OTA. The servers contain components to send the OTA, and must manage the OTAs to all the target devices being updated. The communications infrastructure provides the media and bandwidth for transporting all the OTA data between servers and devices. All three include security elements.
The path between the servers and the devices may be over public or private data networks and will involve a variety of equipment and technologies, such as commercial Internet, cellular, and satellite communications. It typically involves multiple third-party links, creating dependencies on external services. There may be multiple communications paths available at the device level, for instance via WiFi, cellular data connection, or other RF connection. While OTA typically refers to wireless connections (the “air” in OTA, even across the vacuum of space!), it may also include wired connections.
This matches the overall model for IoT (Internet Of Things). OTA is a common feature of IoT devices.
OTA has three general phases on a device:
- Download: receive the OTA data from the server and store it on the device.
- Verification: verify the integrity and authenticity of the received data.
- Apply update: apply the OTA to the device to make it operational.
Security operations can be applied at various points in the phases. For example, if the data is encrypted, it may be decrypted on receipt and stored in that form. Or it may be stored in encrypted form and decrypted by the bootloader. Either way, decryption requires proper key management.
The download phase may be managed in several ways. One is a trickle-download, where small amounts of data are downloaded a bit at a time. This allows the download to occur as a background operation during normal device operation without affecting it. Another is a high-speed download, where all the data is downloaded as fast as possible. That may require taking the device out of service for the duration of the download, but ensures a timely update. This may be necessary if communication windows are limited, such as orbital position for a spacecraft relative to ground stations.
Verification includes both verifying that the stored contents of the download are error-free, for instance via a checksum such as CRC or Fletcher, and verifying that the data comes from an authorized provider, for instance via digital signature.
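The two checks can be sketched as follows. This is a minimal illustration, not a production design: the function names are hypothetical, and HMAC-SHA256 with a shared key stands in for a real digital signature, which would normally use an asymmetric scheme from a cryptographic library so that the device only holds a public key.

```python
import hashlib
import hmac
import zlib


def integrity_ok(image: bytes, expected_crc32: int) -> bool:
    """Detect accidental corruption of the stored download."""
    return zlib.crc32(image) == expected_crc32


def authenticity_ok(image: bytes, tag: bytes, key: bytes) -> bool:
    """Check that the image was tagged by a holder of `key`.

    ASSUMPTION: HMAC-SHA256 is a symmetric stand-in for a digital signature.
    """
    expected = hmac.new(key, image, hashlib.sha256).digest()
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, tag)


def verify_update(image: bytes, expected_crc32: int, tag: bytes, key: bytes) -> bool:
    """Run both checks; the update is only usable if both pass."""
    return integrity_ok(image, expected_crc32) and authenticity_ok(image, tag, key)
```

Note that the CRC only guards against accidental errors. An attacker who can alter the image can also recompute the CRC, which is why the authenticity check is needed as well.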
Once the OTA has been verified, it’s ready to be applied to the device. This may happen automatically, or if there is a user involved, at the user’s discretion. Applying the update typically takes the device out of service briefly, since it may involve storage manipulation and resetting the device.
Support for OTA may be built into the application, the bootloader, or both.
OTA needs to be rock-solid. It needs to tolerate a number of failure scenarios.
If OTA doesn’t work, the device can’t be updated remotely. Worse, the device may no longer function at all. This is known as “bricking”: the device has all the technical functionality of a brick.
The communications medium is the main challenge. Data communications are inherently unreliable. The medium may be subject to interference or a high error rate during a connection. The connection may be lost entirely.
Communication protocols have a variety of mechanisms to make them more reliable. This involves retrying failed operations, for instance re-connecting a lost connection, or resending a failed transmission. Unless the communications path has been permanently damaged, such as a cable cut, the OTA will eventually complete. However, that may take longer than expected. The retry traffic may impose a heavier load on the communications infrastructure than expected, potentially doubling or tripling the total bandwidth consumption.
One important capability is resumable download. The download shouldn’t have to restart from scratch each time the connection fails. The download should be able to resume from where the last stored data left off; depending on communications protocol and how resumption is managed, the download may have to deal with portions arriving out of order. That way, no matter how poor the connectivity, each connection will make some forward progress and the download will eventually complete.
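One way to picture resumable download is a per-chunk bookkeeping map. The sketch below assumes fixed-size chunks addressed by index; the class name is hypothetical and a `bytearray` stands in for flash storage. On a real device the received map would also have to be kept in persistent storage so that it survives a reset.

```python
class ResumableDownload:
    """Tracks which fixed-size chunks of an image have been stored."""

    def __init__(self, total_size: int, chunk_size: int):
        self.total_size = total_size
        self.chunk_size = chunk_size
        self.buffer = bytearray(total_size)  # stands in for flash storage
        self.num_chunks = (total_size + chunk_size - 1) // chunk_size
        self.received = [False] * self.num_chunks

    def store(self, index: int, data: bytes) -> None:
        """Store one chunk; chunks may arrive in any order, or more than once."""
        if not 0 <= index < self.num_chunks:
            raise IndexError("chunk index out of range")
        start = index * self.chunk_size
        expected_len = min(self.chunk_size, self.total_size - start)
        if len(data) != expected_len:
            raise ValueError("unexpected chunk length")
        self.buffer[start:start + len(data)] = data
        self.received[index] = True

    def missing(self) -> list:
        """Chunk indices still needed; this is what to request on reconnect."""
        return [i for i, ok in enumerate(self.received) if not ok]

    def complete(self) -> bool:
        return all(self.received)
```

After a dropped connection, the device only asks for the indices returned by `missing()`, so every connection makes forward progress.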
The download should include error detection for each chunk of data sent, commonly implemented by a checksum such as CRC or Fletcher. This allows chunks to be retried individually on error rather than repeating the whole download. Then the integrity check of the entire download verifies that all the data was received and processed correctly.
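As an illustration of per-chunk error detection, here is a sketch that frames each chunk with a CRC-32 and retries a chunk individually when the check fails. The frame layout (4-byte index, 2-byte length, payload, 4-byte CRC) and the function names are assumptions made for this example, not a standard format.

```python
import struct
import zlib

HEADER = struct.Struct(">IH")   # chunk index, payload length (assumed layout)
TRAILER = struct.Struct(">I")   # CRC-32 over header + payload


def frame_chunk(index: int, payload: bytes) -> bytes:
    """Sender side: wrap a payload with its index, length, and CRC."""
    header = HEADER.pack(index, len(payload))
    return header + payload + TRAILER.pack(zlib.crc32(header + payload))


def parse_chunk(frame: bytes):
    """Receiver side: return (index, payload), or None if the frame is damaged."""
    if len(frame) < HEADER.size + TRAILER.size:
        return None
    body = frame[:-TRAILER.size]
    (crc,) = TRAILER.unpack(frame[-TRAILER.size:])
    if zlib.crc32(body) != crc:
        return None
    index, length = HEADER.unpack(body[:HEADER.size])
    payload = body[HEADER.size:]
    return (index, payload) if len(payload) == length else None


def fetch_with_retry(fetch, index: int, max_attempts: int = 3) -> bytes:
    """Request one chunk via `fetch(index)`, retrying only that chunk on error."""
    for _ in range(max_attempts):
        result = parse_chunk(fetch(index))
        if result is not None and result[0] == index:
            return result[1]
    raise OSError(f"chunk {index} failed after {max_attempts} attempts")
```

Here `fetch` is whatever callable the transport provides. A damaged frame costs one chunk of retransmission rather than the whole download.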
The next challenge is applying the update. Once the update has been committed and the device has been reset to run it, what if the device crashes? You want to avoid a “boot loop”, where the device applies an update and then continuously reboots.
Power is another consideration. There may be critical points during the download or while applying the update where power loss can corrupt the OTA.
Various other issues can interfere with OTA. Storing or reading back the download may fail, or other glitches or bugs may occur.
Protecting Against Failure
If sufficient storage space is available, one way to protect against failed updates is to keep a known good version available at all times for fallback. This could be the previous version (sometimes known as A/B versions), or some “golden version” used for factory reset. The system needs to have a mechanism for detecting a boot loop or other condition to trigger fallback.
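A common way to detect a boot loop is a boot-attempt counter that only the new application can clear. The sketch below assumes two slots, a retry threshold of three, and hypothetical names throughout. In a real bootloader this state would live in persistent storage and be updated in a power-safe way.

```python
from dataclasses import dataclass

MAX_BOOT_ATTEMPTS = 3  # assumed threshold before falling back


@dataclass
class BootState:
    active: str = "A"      # slot the bootloader should try to run
    fallback: str = "B"    # slot holding the known good image
    pending: bool = False  # True until the new image confirms itself
    attempts: int = 0      # boots tried since the update was staged


def stage_update(state: BootState) -> None:
    """Called after a verified image is written to the inactive slot."""
    state.active, state.fallback = state.fallback, state.active
    state.pending = True
    state.attempts = 0


def select_slot(state: BootState) -> str:
    """Called by the bootloader on every boot to pick the slot to run."""
    if state.pending:
        if state.attempts >= MAX_BOOT_ATTEMPTS:
            # The new image never confirmed itself: revert to known good.
            state.active, state.fallback = state.fallback, state.active
            state.pending = False
            state.attempts = 0
        else:
            state.attempts += 1
    return state.active


def confirm(state: BootState) -> None:
    """Called by the application once it is running and healthy."""
    state.pending = False
    state.attempts = 0
```

If the new image crashes before calling `confirm`, the counter keeps climbing across resets and the bootloader eventually returns to the previous slot instead of rebooting forever.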
The Consequences Of Failure
Failed OTA can turn into an expensive situation. For a remote spacecraft or other inaccessible devices, the mission may have to be abandoned. For a consumer product, it may require an expensive recall or return, possibly for millions of devices.
Communication networks are the modern Wild West. Connecting any device or server immediately exposes it to all kinds of threats, even if the communications medium is supposedly secure. The three main concerns are data breaches, hijacks, and DoS (Denial Of Service).
Data breaches during OTA risk exposing user data or your IP (Intellectual Property). Hijacks risk putting unauthorized code on your devices, taking over control of them for other uses. DoS risks interfering with the OTA process or corrupting data so that updates fail, possibly bricking devices.
Both the devices and the servers are vulnerable. Data needs to be protected both in-flight and at-rest. In addition, the devices and the servers need to be able to identify and authenticate each other. Allowing an unauthorized device to download data from the server risks theft of IP, or data gathering that may lead to other attacks. Allowing a device to download from an unauthorized server risks bricking or hijacking it.
Devices should be individually keyed and authorized so that breach of a single device doesn’t undermine the entire system.
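One possible way to give each device its own key, sketched here under stated assumptions, is to derive it from a master secret and the device identifier. The function name and the use of HMAC-SHA256 as the derivation function are choices made for this illustration. Many products instead provision a unique keypair per device at manufacture.

```python
import hashlib
import hmac


def derive_device_key(master_secret: bytes, device_id: str) -> bytes:
    """Derive a 32-byte key unique to one device.

    ASSUMPTION: the master secret never leaves the server side; each device
    is provisioned only with its own derived key.
    """
    return hmac.new(master_secret, device_id.encode("utf-8"), hashlib.sha256).digest()
```

With this arrangement, extracting the key from one device reveals nothing useful about the keys of other devices, as long as the master secret itself stays protected.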
Update data should be encrypted so that only the device can use it. It should be digitally signed so that it can be verified as coming from an authorized provider (this protects against unauthorized updates that have been placed on authorized servers).
The scope of an OTA refers to how much of the system is updated:
- Full OTA: Updates all components of the system. The bootloader (or an initial bootloader stage) may be excluded, to ensure it always remains functional.
- Selective OTA: Updates only selected components.
- Delta OTA: Updates only changed regions of components.
- Patch OTA: Updates only specific values in components. This could be as small as a single byte of data.
Each of these involves progressively smaller amounts of data. This may be motivated by the communications medium. If it’s expensive or error-prone to send data, it’s generally better to minimize the scope. For instance, it’s difficult and expensive to beam large amounts of data to Mars. For Pathfinder, the fix was a patch that changed a setting in the image.
More practically, a medium such as a cellular network data connection may have a byte-metered cost (i.e. you have to pay for data usage). Large amounts of data scaled out to large numbers of devices can get expensive. (More about data, usage and updating schemes)
This presents a tradeoff with the overall complexity of the process. Delta update is more complex than full update because the delta data is specific to a particular version change, and the update capability needs to manage applying it. This can be like fixing the engine of your car while you’re driving it.
While delta update is similar to patch update, a patch is typically a simple overwrite of a small amount of data with data of the same size (subject to flash page management requirements): for instance, a few bytes or words of data, possibly applied via scripted CLI operations (more about Command Line Interfaces). Depending on specific capabilities, a delta can potentially replace data with a different amount of data, and can be much larger. For instance, a delta update may replace 50% of the image, and add another 10% more data. The process of building, packaging, sending, buffering, and applying delta updates can therefore be more complex than patches.
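The difference can be illustrated with two small functions. This is a sketch only: the operation format for the delta (`"copy"` and `"insert"` tuples) is an assumption made for this example, and real delta formats are considerably more elaborate.

```python
def apply_patch(image: bytes, offset: int, new_bytes: bytes) -> bytes:
    """Patch: overwrite bytes in place; the image size never changes."""
    if offset < 0 or offset + len(new_bytes) > len(image):
        raise ValueError("patch falls outside the image")
    return image[:offset] + new_bytes + image[offset + len(new_bytes):]


def apply_delta(old: bytes, ops) -> bytes:
    """Delta: rebuild the new image from the old one plus new data.

    Each op is ("copy", offset, length) to reuse bytes of the old image,
    or ("insert", data) to add bytes that the old image did not contain.
    """
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            if offset < 0 or length < 0 or offset + length > len(old):
                raise ValueError("copy falls outside the old image")
            out += old[offset:offset + length]
        elif op[0] == "insert":
            out += op[1]
        else:
            raise ValueError(f"unknown op {op[0]!r}")
    return bytes(out)
```

For example, `apply_patch(b"ABCDEFGH", 2, b"xy")` gives `b"ABxyEFGH"`, the same length as before, while a delta can produce an image of a different size. The delta only works if the device really holds the exact old image it was built against, which is where the extra complexity comes from.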
Another consideration is whether there are multiple MCUs (MicroController Units) that are updateable. In a multi-MCU design, one MCU will typically provide the communications path for updating the others. Each MCU may have a different update strategy.
Performance is important on both the device and server sides.
For the device, performance affects how long the download takes, how long it takes to apply the update, and how much the process affects normal operation. These are factors in how long the device is out of service.
The download depends heavily on the communications medium. It needs to complete in a reasonable amount of time for the product, but that might be days for a low-bandwidth connection doing a trickle download.
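A rough estimate shows how a trickle download stretches into days. The helper names and the numbers in the usage note are purely illustrative assumptions, and the estimate ignores protocol overhead and retries.

```python
import math


def trickle_download_seconds(image_bytes: int, bytes_per_burst: int,
                             burst_interval_s: float) -> float:
    """Time to move an image when one small burst is sent per interval."""
    bursts = math.ceil(image_bytes / bytes_per_burst)
    return bursts * burst_interval_s


def bulk_download_seconds(image_bytes: int, link_bits_per_s: float) -> float:
    """Time to move an image when the link is used continuously."""
    return image_bytes * 8 / link_bits_per_s
```

Under these assumptions, a 1 MiB image sent 100 bytes at a time, once a minute, takes `trickle_download_seconds(1024 * 1024, 100, 60)` = 629,160 seconds, or about 7.3 days. The same image over a dedicated 1 Mbit/s link takes under 10 seconds.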
For the server, performance and resource consumption affect the server sizing. It needs to maintain a network connection to each device performing a download. This may result in many concurrent long-duration connections.
The device has to have guaranteed persistent storage available to receive the update data. It’s possible to support multiple updates, with a selection process identifying the version to run at any time.
Most modern devices use flash memory as their persistent storage. This may be managed as raw blocks and pages, or as a virtual disk drive (an SSD, Solid State Drive). Flash memory may be built into the device MCU, or may be accessible over a bus such as SPI.
Flash memory has its own quirks, such as how much space can be erased or written at a time, how many times it can be updated, and a map of bad blocks identified during manufacturer testing. It may be managed as a single addressable region, or partitioned into multiple regions. Using raw flash, OTA data may be stored in raw blocks, or may be stored using an FFS (Flash File System) that manages some of the quirks.
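The erase-granularity quirk can be sketched as a small calculation. The function name and the 4096-byte sector size in the usage note are assumptions. Actual erase sizes depend on the flash part.

```python
def sectors_to_erase(address: int, length: int, sector_size: int) -> range:
    """Sector indices that must be erased before writing [address, address + length)."""
    if length <= 0:
        return range(0)
    first = address // sector_size
    last = (address + length - 1) // sector_size
    return range(first, last + 1)
```

With 4096-byte sectors, a 200-byte write at address 4000 crosses a sector boundary, so `list(sectors_to_erase(4000, 200, 4096))` is `[0, 1]`. Because erasing clears an entire sector, any other data in those sectors has to be saved and rewritten, which is one reason OTA storage regions are usually aligned to erase boundaries.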
The servers and their communications infrastructure need to be able to support OTA to multiple devices, possibly in the millions. They need to handle concurrent downloads, possibly with scheduling of groups of devices to spread the system and communications load over time. That may include canary updates: rolling an update out to a small group of devices first to verify successful operation before proceeding to wider deployment.
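One simple way to split a fleet into rollout groups, sketched here as an assumption rather than a recommendation, is to hash each device identifier into a stable bucket and raise the enabled percentage over time. The function names and the choice of 100 buckets are hypothetical.

```python
import hashlib


def rollout_bucket(device_id: str, buckets: int = 100) -> int:
    """Map a device to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % buckets


def eligible(device_id: str, rollout_percent: int) -> bool:
    """True if this device is inside the currently enabled share of the fleet."""
    return rollout_bucket(device_id) < rollout_percent
```

Starting with a small `rollout_percent` gives a canary group. Because the bucket is derived from the identifier, every device enabled at an early stage stays enabled as the percentage grows.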
The servers also need to handle multiple fielded versions. The devices that need to be updated aren’t necessarily all on the same version. This can be particularly true of consumer devices, where users haven’t necessarily applied all previous updates. That can complicate delta OTA, possibly requiring full OTA of devices that are too far out-of-date.
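The server-side decision might look something like the following sketch. It assumes the server keeps a set of the (from, to) version pairs for which it has built delta packages. The function name and the version strings in the usage note are hypothetical.

```python
def choose_update(device_version: str, target_version: str, deltas: set) -> str:
    """Pick the kind of OTA to send to a device reporting `device_version`.

    `deltas` holds the (from_version, to_version) pairs that have a delta built.
    """
    if device_version == target_version:
        return "none"
    if (device_version, target_version) in deltas:
        return "delta"
    # No delta exists for this starting point: fall back to the full image.
    return "full"
```

For example, with `deltas = {("1.3", "1.4")}`, a device on 1.3 gets a delta to 1.4, while a device still on 1.0 gets the full image.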
Updates to some devices may be problematic due to poor communications, resulting in significant retry load. Inevitably in large deployments, there will be some devices that are never updated. They may no longer have power or connectivity, or they may have been damaged by external events.
If the scale of the device fleet grows over time, which is typical for a consumer product, system characteristics will be constantly changing. Each 10x step in growth can produce significantly different behavioral effects. What wasn’t a problem at 1,000 devices can become a problem at 100,000. Server instance scaleout can address this to some extent in modern cloud systems, but it may be necessary to re-architect the servers over time for cost effectiveness.
The download represents the bulk of communications for OTA, but the control messaging is important. The overall protocol needs to be able to inform the server of the current version of data on the device and handle connection initiation and download resumption, all in a secure manner.
Communications may require flow control in order to pace out data transfer across multiple devices. The protocol must also handle failed downloads that start but never complete.
When To Implement OTA
Because of the complexity and potential cost of failure, it’s best to implement OTA as soon as possible in a project and then make full use of it for the duration. That will offer more opportunities to test it under real-world conditions. Dojo Five can work with you to implement and test it.
Like security, OTA is not something that can be bolted on at the end. It needs to be architected into the system from the beginning. It has ripple effects throughout the system. It needs to be an engineered solution end to end, trading off a number of different factors.
The need for security and servers means that there’s a significant operational component to managing OTA. It’s not simply a development task. Organizations need to be prepared for that operational side. That also implies long-term support.
Need help designing firmware that supports secure and efficient OTA updates such as partitioning, version control, update integrity, rollback mechanisms, and multiple hardware compatibility configurations? Dojo Five can help. Drop us a line–we’d love to hear about your project!