Dynamic loadable modules in lightweight embedded systems

The advantages of independent modules that can be loaded individually and on demand have long been recognized and widely exploited on larger computer systems. The approach provides a high degree of flexibility in the range of functionality a particular system can offer. Instead of requiring all functionality to be included in one monolithic code base, the system has access to a large set of modules kept in storage that is optimized for low-cost, long-term use. By loading modules from this storage into premium program space only when needed, the system can offer extended functionality while keeping the cost of premium resources under control. The resulting programmability and flexibility are limited only by the number of modules offered and the amount of long-term storage available to hold them. Mainstream operating systems, from mainframes to PCs and smartphones, are built on this principle.

Although this approach is widely used on larger systems, lightweight embedded systems typically lack the resources and the level of hardware support required to implement this type of architecture. Most industry-standard implementations that support loadable modules rely heavily on large amounts of code and data memory and on built-in hardware support such as an MMU (Memory Management Unit) to provide virtual addressing. Although processor costs continue to fall while capabilities increase, the tradeoff between higher-end processors and lightweight solutions will persist: cost, power consumption and size remain key factors in design decisions for products that are always expected to do more with less money, less battery power and smaller form factors, as in IoT and wearable devices.

As a result, most embedded systems that are designed around lightweight processors offer very limited and linear functionality. Lacking the capabilities described above, the firmware is implemented as one large monolithic system with functionality dedicated to a few main tasks. Services cannot be extended or changed without reprogramming the entire code base, and then only when the device is actually able to update its firmware. The technology detailed in this document targets these lightweight systems and describes methods to implement programmability through loadable modules without the need for resources and capabilities that are only present on larger processor systems.

The ELF Executable and Linkable Format

Although each processor manufacturer provides proprietary solutions, development environments and additional tools to aid in the development of embedded devices around its products, nearly all of them rely heavily on, or at least support, the GNU toolchain to compile and link code into binary images suitable for execution on the target system. A large variety of toolchain builds is available, targeting many processor architectures and many different types of operating systems. The most common output format produced by the toolchain is ELF (Executable and Linkable Format). Although an ELF file may contain all the information needed to load and relocate a program image into memory, the format itself is neither designed for nor well suited to lightweight systems.

Since most of the toolchain code is available as open source, one approach to producing a better-suited binary image would be to modify the toolchain so that it produces an alternative binary format. However, this approach is highly architecture dependent: it would require changes to rather complex source code for many different architectures and involve producing separate toolchain builds. Aside from the development effort involved, this approach would likely face logistical challenges in maintaining and distributing the code alongside the conventional builds. In addition, some vendors provide modified versions of the toolchain that include proprietary elements which are not available as open source. It is worth noting, however, that if the new format were to become more widely accepted as an alternative to ELF, implementing it directly in the toolchain could become a more viable option in the future.

The current approach takes advantage of the fact that, although each architecture may produce different binaries, they all produce an ELF file whose format is common and reasonably architecture independent. The solution described in this document uses the ELF file as the basis for conversion into a proprietary FLM (Flash Loadable Module) file by means of a proprietary tool named the Flash Linker.


When a binary image is built, standard linkers will typically target the image to be loaded at a particular code and data address. Unless a true position independent image is produced as described below, loading the image at any other location without additional processing will prevent it from executing properly. Depending on memory usage and characteristics, however, these physical locations may not be available at module load time. On larger systems this discrepancy can often be resolved simply by mapping actual physical addresses to the expected locations through the virtual memory mapping capabilities of the MMU. For various reasons beyond the scope of this document, this may not always be a suitable solution even when an MMU is present. In that case the actual content of the code and/or data segments of the image has to be modified to adjust for the address changes, a process called relocation. When relocation is required, the ELF file contains all the information needed to perform it. The figure below gives a highly simplified depiction of the ELF file.

Larger systems will typically load the entire code and data segments directly into RAM, then enumerate the relocation sections to determine how and where address fix-ups need to be performed and make the corresponding changes to the code and data locations in RAM. Once relocation has completed, the RAM section containing the code is configured for code execution if applicable, and control is passed to the entry location in the image as specified in the ELF file. For optimization purposes, certain systems may not load the entire image at once but rather page by page, as needed. The basic relocation process remains the same.

[Figure: simplified depiction of an ELF file]
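The load-then-relocate sequence used on larger systems can be sketched in a few lines of C. This is a minimal illustration, not a real ELF loader: the `reloc_entry` type and the `relocate_in_ram` function are hypothetical names, and every fix-up is assumed to be a plain 32-bit absolute address, whereas real ELF relocation entries also carry a relocation type and a symbol reference.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified relocation entry: the offset of a 32-bit
 * absolute address inside the loaded image. Real ELF entries also
 * carry a relocation type and a symbol index. */
typedef struct {
    uint32_t offset;
} reloc_entry;

/* Apply every fix-up to an image that is already fully loaded in RAM.
 * delta = actual load address minus the address the image was linked for.
 * Returns the number of entries applied, or -1 on an out-of-range entry. */
int relocate_in_ram(uint8_t *image, size_t image_size,
                    const reloc_entry *relocs, size_t count,
                    uint32_t delta)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t off = relocs[i].offset;
        uint32_t value;

        if (image_size < sizeof value || off > image_size - sizeof value)
            return -1;

        /* memcpy avoids unaligned access on strict-alignment targets. */
        memcpy(&value, image + off, sizeof value);
        value += delta;
        memcpy(image + off, &value, sizeof value);
    }
    return (int)count;
}
```

Because the whole image sits in RAM, the entries can be processed in any order, which is exactly the property that no longer holds once the target is flash.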

Flash Loadable Modules

As mentioned above, lightweight systems are often required to produce a single monolithic image that is persistent and able to include all required functionality. This may be one reason why small systems often have a relatively large amount of flash memory compared to RAM. In order to produce a solution suitable for this environment, the technology needs to be able to take full advantage of this available code space and offer the ability to load modules into flash, hence the name Flash Loadable Modules.

Given that code and data space is often at a premium in lightweight systems, it is of utmost importance that the FLM technology makes efficient use of the available flash and RAM space. The ability to load binary modules freely at any memory location, regardless of the target addresses used at build time, is therefore an essential capability. Because the amount of memory is relatively small, lightweight systems often do not offer an MMU, which rules out the relatively straightforward virtual-to-physical address mapping as a way to compensate for location changes. Instead, the FLM loader has to perform relocation processing in order to load a binary image at a different address.
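As a rough illustration of what relocation has to compute without an MMU, the sketch below translates a build-time address into its load-time equivalent. The `region_map` type and `translate_address` function are hypothetical, and the rule that addresses outside both regions (for example peripheral registers) stay unchanged is an assumption made for this example.

```c
#include <stdint.h>

/* Hypothetical description of a memory region: where the linker placed
 * it at build time and where the loader actually put it. */
typedef struct {
    uint32_t link_base;  /* address assumed at build time */
    uint32_t load_base;  /* address chosen at load time   */
    uint32_t size;       /* region size in bytes          */
} region_map;

/* Translate a build-time address to its load-time equivalent.
 * The unsigned subtraction wraps for addresses below link_base, so a
 * single comparison checks link_base <= addr < link_base + size.
 * Addresses outside both regions are returned unchanged (assumption). */
uint32_t translate_address(uint32_t addr,
                           const region_map *code, const region_map *data)
{
    if (addr - code->link_base < code->size)
        return addr - code->link_base + code->load_base;
    if (addr - data->link_base < data->size)
        return addr - data->link_base + data->load_base;
    return addr;
}
```

Code and data get independent offsets here because, on a system that loads code into flash and data into RAM, the two segments generally move by different amounts.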

PIC - Position Independent Code

Most versions of the GNU toolchain provide options (-fpic and variants) to build Position Independent Code, generally referred to as PIC. Although this option sounds promising for building images that do not rely on a particular load address, it is mostly designed to simplify dynamic linking against shared libraries that are intended to be shared between multiple processes. For most architectures, a PIC binary image does not allow truly independent loading at any location without additional relocation processing. Moreover, most PIC implementations rely on a fixed relative offset between the code and data segments, which is entirely unsuitable for the lightweight embedded systems FLM is targeting. It is worth noting, however, that recent versions of the toolchain for the ARM architecture provide additional PIC options (-msingle-pic-base, -mpic-register and -mno-pic-data-is-text-relative) which not only eliminate the relative location dependency but also offer effective control of the data segment location at load time through a dedicated register, which is ideal for use with FLM. Other processors may offer alternative solutions, but each implementation will be highly architecture dependent. The FLM technology is compatible with any implementation and able to take full advantage of these features to streamline the relocation process.

ELF deficiencies

In order to produce an image that can be relocated, it is important that load-time relocation information is provided in the ELF file. If the toolchain for a particular architecture is able to produce a suitable PIC module, as for the ARM processor described above, the GNU linker will already include all the necessary relocation information as part of the normal PIC build process. For other types of modules, however, the GNU linker does not expect relocation information to be required and omits this data from the ELF file unless the --emit-relocs option is used. Assuming all the proper options are provided, the linker will produce an ELF file that contains the code and data segments and one or more sections with relocation information. Although all the essential information is available in the ELF file, the way the data is organized is not well suited for relocating code images into flash, for the following reasons:

  • ELF files often contain many additional sections with information that is irrelevant for FLM, which makes the files significantly larger than necessary. Not only does this reduce storage capacity on systems with very limited storage space, it also affects module load times and bandwidth requirements for modules downloaded from online sources.

  • ELF files are designed for systems where both code and data segments are loaded into and executed from RAM. This allows entire segments to be loaded into RAM first and relocated afterwards by enumerating the relocation sections in the ELF file and changing memory locations accordingly, as depicted below:

[Figure: segments loaded into RAM and relocated in place using the ELF relocation sections]

  • While writing to individual locations in RAM is a normal operation, flash memory is designed differently: writes are performed in pages, a page often has to be erased before it can be written as a whole, and the page size differs from vendor to vendor depending on flash size and technology. Writing a segment to flash first and changing individual locations afterwards is therefore a rather inefficient and lengthy process that would likely involve multiple writes to the same flash page, something that should be avoided because flash supports only a limited number of write operations. Copying a segment to RAM first, performing the relocation and then writing the result to flash would be an obvious approach, but on lightweight systems the amount of contiguous RAM required would be very significant. Since module loading is envisioned to take place on systems that are in full operation, reserving large sections of scarce RAM for relatively infrequent module loading operations would be highly inefficient and would likely have a significant impact on system performance and capabilities.

  • A more feasible approach would involve loading a code segment one block at a time where each block is loaded in RAM and relocated before writing to flash. Since the block size is much smaller than the entire segment, impact on RAM use is much more manageable. Unfortunately, not only are relocation entries in the ELF file often split up into multiple sections, entries within a section are not listed in any particular order. As a result, relocating each block would require scanning all entries in all sections for relocations matching the block. Lightweight systems will not have the RAM to load the entire ELF file or even the relocation sections which means excessive reads from the file will be required for each block being loaded as depicted below:

[Figure: repeated scans of the ELF relocation sections for each code block being loaded]
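The cost of applying fix-ups in ELF order can be made concrete with a small model. The hypothetical `count_page_writes` function below assumes a loader that patches code through a single page-sized buffer and has to write the buffered page back to flash whenever the next fix-up targets a different page. It only counts page writes; it does not touch real flash.

```c
#include <stddef.h>
#include <stdint.h>

/* Count how many flash page writes are needed when fix-ups are applied
 * in the given order through a single page-sized buffer: each time a
 * fix-up targets a page other than the buffered one, the buffered page
 * has to be written back first. page_size must be non-zero. */
size_t count_page_writes(const uint32_t *offsets, size_t count,
                         uint32_t page_size)
{
    size_t writes = 0;
    uint32_t current = 0;
    int have_page = 0;

    for (size_t i = 0; i < count; i++) {
        uint32_t page = offsets[i] / page_size;
        if (!have_page || page != current) {
            if (have_page)
                writes++;          /* flush the previously buffered page */
            current = page;
            have_page = 1;
        }
    }
    if (have_page)
        writes++;                  /* flush the last buffered page */
    return writes;
}
```

With a page size of 0x400, the offsets 0x10, 0x410, 0x20, 0x420 in that order cost four page writes, while the same offsets sorted by address cost two. This is the motivation for ordering relocation entries before the module ever reaches the device.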

Flash Linker

The flash linker converts a regular ELF file to an FLM file which is much more suitable for loading binary images into flash on embedded systems with limited resources. The program collects all the relevant relocation entries from all sections in the ELF file and orders them by relocation address. It subsequently breaks the code segment into blocks based on a configurable block size and groups each block with all relocation entries pertaining to that block. All groups are subsequently written to the FLM file, following an FLM header. The relocation entries for the data segment are also grouped together but the segment itself is not split into blocks since the data can be loaded directly into RAM and relocated in-place. The resulting FLM file is roughly formatted as follows:

[Figure: FLM file layout]
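The ordering and grouping step performed by the flash linker could look roughly like the sketch below. The `flm_reloc` layout and both function names are hypothetical and not taken from the actual tool. The sketch also assumes that fix-ups are 4-byte aligned and that the block size is a multiple of 4, so that no fix-up straddles a block boundary.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t offset;   /* location to patch, relative to segment start */
    uint32_t type;     /* relocation type copied from the ELF entry    */
} flm_reloc;           /* hypothetical name and layout                 */

static int compare_by_offset(const void *a, const void *b)
{
    uint32_t x = ((const flm_reloc *)a)->offset;
    uint32_t y = ((const flm_reloc *)b)->offset;
    return (x > y) - (x < y);
}

/* Sort all collected entries by address so that the entries of each
 * block become one contiguous run. */
void sort_relocs(flm_reloc *relocs, size_t count)
{
    qsort(relocs, count, sizeof *relocs, compare_by_offset);
}

/* For a sorted array, return how many entries starting at index `first`
 * fall inside the block [block_start, block_start + block_size). */
size_t relocs_in_block(const flm_reloc *relocs, size_t count, size_t first,
                       uint32_t block_start, uint32_t block_size)
{
    size_t n = 0;
    while (first + n < count &&
           relocs[first + n].offset - block_start < block_size)
        n++;
    return n;
}
```

Walking the blocks in address order and advancing `first` by each returned count visits every entry exactly once, so the device never has to search for the entries that belong to a block.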

The FLM header contains details about the ROM and RAM regions the binary image was originally built for while the segment headers indicate segment type, location, size and number of following blocks. The code segment header is followed by code sections where each section consists of the actual code block and code relocation entries. The flash linker will ensure that the combined size of a section does not exceed a specified size, making the code block smaller if more relocation entries are present. This allows lightweight embedded systems to allocate a RAM buffer of predefined fixed size, load each code section into the buffer and use the entries in the buffer to relocate the content of the code block. When completed, the modified code block can be written directly to a flash buffer and the embedded loader can move on and read in the next section. The same process is followed for the data section except that the RAM buffer is bypassed and the data block written directly to the target region in RAM and relocation is performed in-place.
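One way to honor a fixed section size is to grow the code block one word at a time and stop as soon as the block plus its relocation entries would no longer fit. The sketch below does this under stated assumptions: `plan_section` is a hypothetical name, fix-ups are 4-byte aligned, the code size is a multiple of 4, and `entry_size` is whatever an entry occupies in the file. The actual packing rule used by the flash linker may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Decide how many code bytes go into the next section.
 * offsets: sorted fix-up offsets for the whole code segment
 * first:   index of the first entry not yet assigned to a section
 * start:   segment offset where this section's code block begins
 * Returns the code block size in bytes and stores the number of
 * relocation entries that accompany it in *n_relocs. A return value
 * of 0 with code remaining means the section size is too small. */
uint32_t plan_section(const uint32_t *offsets, size_t count, size_t first,
                      uint32_t start, uint32_t remaining,
                      uint32_t section_size, uint32_t entry_size,
                      size_t *n_relocs)
{
    uint32_t bytes = 0;
    size_t n = 0;

    while (bytes + 4 <= remaining) {
        size_t extra = 0;

        /* Entries that patch the next candidate word. */
        while (first + n + extra < count &&
               offsets[first + n + extra] - (start + bytes) < 4)
            extra++;

        /* Stop if the word and its entries would overflow the section. */
        if (bytes + 4 + (n + extra) * entry_size > section_size)
            break;

        bytes += 4;
        n += extra;
    }
    *n_relocs = n;
    return bytes;
}
```

Dense relocation runs therefore yield short code blocks and sparse runs yield long ones, while the RAM buffer on the device can stay at one fixed size.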

Some embedded systems may prefer not to use an intermediate RAM buffer but rather work with a designated flash buffer directly. A flash buffer is a region of RAM that matches the exact size of the flash write page and is used to read, modify and write back flash content in order to update content that is not page aligned. For those implementations, it makes more sense to fix the block size rather than the section size, producing a format as shown below.

The flash linker is able to produce FLM files matching this layout as well. Assuming the code block size matches the size of the flash buffer, the embedded module loader can write the code block directly into the flash buffer and then read the subsequent code relocation entries to perform relocation directly in this buffer. Once relocation is complete, the buffer content can be written to flash, after which the loader can load the next code block from the FLM file.
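The per-block loop on the device side might be sketched as follows. Everything here is illustrative: the record layout (an entry count, then the block, then one 32-bit offset per entry, all in host byte order), the `flm_stream` reader and the tiny `FLASH_PAGE_SIZE` are assumptions made for the example rather than the actual FLM format, and a `memcpy` stands in for the vendor's erase-and-program routine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FLASH_PAGE_SIZE 16u  /* tiny page so the example stays readable */

/* Hypothetical in-memory stand-in for the FLM file being read. */
typedef struct {
    const uint8_t *data;
    size_t size;
    size_t pos;
} flm_stream;

int stream_read(flm_stream *s, void *dst, size_t n)
{
    if (n > s->size - s->pos)
        return -1;
    memcpy(dst, s->data + s->pos, n);
    s->pos += n;
    return 0;
}

/* Load one fixed-size code block: read it into the page buffer, apply
 * its relocation entries one at a time, then program the page.
 * Assumed record layout (not the actual FLM format):
 *   uint32_t reloc_count; uint8_t block[FLASH_PAGE_SIZE];
 *   uint32_t offsets[reloc_count];
 * Returns 0 on success, -1 on a short read or an out-of-range offset. */
int load_code_block(flm_stream *s, uint8_t *flash_page, uint32_t delta)
{
    uint8_t page_buf[FLASH_PAGE_SIZE];
    uint32_t reloc_count;

    if (stream_read(s, &reloc_count, sizeof reloc_count) != 0 ||
        stream_read(s, page_buf, sizeof page_buf) != 0)
        return -1;

    for (uint32_t i = 0; i < reloc_count; i++) {
        uint32_t off, value;

        if (stream_read(s, &off, sizeof off) != 0 ||
            off > FLASH_PAGE_SIZE - sizeof value)
            return -1;
        memcpy(&value, page_buf + off, sizeof value);
        value += delta;
        memcpy(page_buf + off, &value, sizeof value);
    }

    /* Stand-in for the vendor's erase-and-program page routine. */
    memcpy(flash_page, page_buf, sizeof page_buf);
    return 0;
}
```

The only RAM required beyond the page buffer is a single relocation entry at a time, and each flash page is written exactly once.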

FLM compatible OS for embedded systems

Although FLM is a powerful technology that allows lightweight systems to incorporate dynamically loaded modules, it is most powerful when combined with an embedded OS that is fully integrated with FLM and able to take full advantage of the flexibility this technology provides.
