Implementing Continuous Integration for Hardware Systems

Continuous Integration (CI) and Continuous Deployment (CD) are engineering process methodologies that automate code compilation, deployment, and testing. Extensively used in software development, they directly benefit engineering and product organizations by automating the merging of work from various contributors into a common code base, compiling the resulting code, automatically running tests on it, and providing feedback on the results. This routine merging and testing directly contributes to product quality by providing more rapid feedback on product regressions or defects, while simultaneously providing confidence in product updates. While hardware and software development share many similarities, extending CI/CD to support hardware products can pose challenges, especially when working with complex products that have many dependencies and heterogeneous code bases whose compile times may be measured in days. This article discusses our experiences at Per Vices in implementing a CI/CD framework to support our Software Defined Radio products, including the design and technology choices we made to support a product that spans five completely separate repositories, and to ensure product quality across a number of client host systems. It aims to benefit and support engineers at organizations whose products include a hardware component, and who are looking to implement their own CI systems.

Introduction
Continuous integration and deployment are engineering processes that aim to automate the technical and organizational methodology by which changes to code are propagated or made available to customers. In doing so, they aim to automate not only the compilation of software into binary form and its packaging, but also its testing and deployment to customers (see Figure 1). Implementing continuous integration and deployment systems has been demonstrated to dramatically reduce the time, cost, and organizational complexity associated with deploying updates to users.

Figure 1. Continuous Integration process overview.

However, while many software organizations have realized the benefits of implementing CI processes, hardware companies often face unique challenges in doing so. In particular, ensuring device functionality often requires measuring product inputs and outputs, or supporting heterogeneous computing environments.

This article discusses some of the challenges we faced in implementing CI infrastructure for the Crimson Software Defined Radio product. The focus is on the organizational and technological implementation that we used to support a heterogeneous computing environment whose compile times, when it came to RTL code, were sometimes measured in days. We aim to include the details that we wish we had known before starting, and which would have saved us considerable time.

Motivation
Before discussing the details, it is useful to consider our two primary motivations in implementing CI infrastructure: to reduce the time between code contributions and product updates, and to improve Quality Assurance by automating product testing and reporting after updates. After successfully implementing our CI infrastructure, we also recognized additional benefits, including a reduction in technical debt through formalizing the build and update process, and reductions in "mean time to resolution" for sales requests, bug fixes, and feature deployments.

In addition, the transparency of the process greatly supported technical contributors and leadership by providing objective, near real-time feedback on the current technical state of products, bugs, and features. More importantly, technical contributions can be verified with real-world hardware, which allows both hardware and software engineers to make informed and effective design choices that streamline the entire process. Because the system allows testing to start earlier, cost is further reduced in the testing phase. It allows us to test and validate not only the hardware design but also the software implementation and the whole system. In some cases, engineers can start designing for the target system bring-up, debug, and testing even before final hardware is available, saving weeks if not months of development time.

Ultimately, the implementation allowed us to release well-tested product updates within a couple of days of the code being deployed, a timeline that was effectively limited by the multi-day compilation time associated with some of our product elements. The automated test suite and deployment structure also led to a reduction in engineering overhead associated with bug fixes and customer problems.

Standardization of deployment and product updates also allowed us to automate a subset of self-tests to allow us to run comprehensive automated product testing on all assembled products. This led to the near elimination of product returns due to failures, and the fully automated generation of product self-testing reports. It also provided a mechanism for sales people to automatically record and collect product measurements based on customer feedback or requests.

Architecture and Environment
Working with a sophisticated, bleeding-edge Software-Defined Radio (SDR) platform is not the same as developing an application designed to run on servers based on the common x86 architecture. Not only does the underlying hardware constantly evolve, but it also requires coordinated development of customer-facing APIs, FPGA firmware, and various embedded and microcontroller environments. These challenges are unique to adapting the software CI concept to programmable RF hardware like an SDR platform.

It is worth stating that we benefited from our ability to use our own product within our comprehensive testing framework. Having said that, we believe this implementation can be easily extended to other radio products. Doing so would allow other organizations to apply the proven practices and principles of traditional software development to SDR, accelerating software and hardware integration during development, improving product quality, and reducing both the time to feedback and development risk.

Architectural Principles
We realized very early that developing a reliable, but extendable, product requires a flexible testing framework. Traditionally, hardware reaches its final state after shipping to the customer, but this is simply not the case for complex products and integrations: ongoing and consistent product maintenance, support, and updates are often a large part of the requirements. As a result, we focused on an implementation that supported rapid feedback at different stages in the development cycle: design, integration, and testing.

To understand why this is so critical, it's worth reviewing our product architecture, shown in Figure 2. Our initial target Device Under Test (DUT) was our Crimson SDR product, which comprises dedicated radio front end boards controlled by an MCU, which are connected to a Firmware and an FPGA, and configured by users through an external API (libUHD). Our goal in implementing a CI system was to ensure that bugs within each component, as well as bugs in the operation or interfaces between components, would be detected during testing.

Figure 2. Block diagram illustrating heterogeneous product architecture, with multiple code bases and hardware environments.

Code Compilation
Technical contributions to any of the components mentioned above could result in various defects, and the speed with which we can identify and fix these defects depends on where they are located. In addition, the time required to update a component also plays a role in how quickly we are likely to detect a defect in it: components with long compilation times were previously subject to fewer gated tests than components that were faster to deploy, which were subject to more frequent and continuous testing during development.

MCU
The MCU code is compiled using a Makefile, and the Atmel and ARM toolchain. This is used to generate the bootloader and application binaries used to program the MCUs, and also compiles the flashing utility, which runs on the ARM processor and distributes the initial burning tool. As updates to the MCU are supported over the internal UART bus, this utility needs to be cross-compiled to run on the ARM processor. This component takes less than a minute to compile and deploy, and is responsible for ensuring the configuration of the Radio Front End. The flashing process is fairly consistent, but binary verification is required after writing to the MCU.

Firmware
The firmware code is compiled using autoconf tools, and contains the server responsible for interfacing between our application API (hosted on the customer computer), and our product. The server interfaces to all the components including MCUs, FPGA, web GUI and host system through UHD. It runs within an ARM environment, only requires the ARM toolchain to compile (normally taking 30-60 seconds), and is easy to install.

FPGA
The FPGA is the heart of the digital signal processing engine. This code runs on programmable logic, and communicates with all the high speed converters. It also interfaces with remote applications using multiple optical fiber connections. Compiling this code requires a proprietary compiler, which is subject to license management limits, and takes between 6 and 36 hours, depending on the specific variant. Prior to use, the compiled bit stream needs to be verified to determine whether it meets timing.

UHD
The UHD API is a user-space library that runs on a host system and communicates with and controls all of the Per Vices SDR platforms including Crimson and Cyan. UHD provides the necessary software API used to transport user waveform samples to and from our Software Defined Radio platforms, as well as configure the various parameters (e.g. sampling rate, center frequency, gains, etc) of the target radio devices. It normally takes around 5 to 6 minutes to compile and install the UHD library.

OS
The operating system (OS) we use on all our radio platforms is a highly customized embedded OS based on the Yocto Project. It allows us to develop and fine-tune many features that do not exist on mainstream kernels, some of which are very important to an embedded OS. Since we build the Linux OS from source, the build process takes around 1 to 2 hours, depending on the number of packages included in the final image. The OS itself changes relatively rarely.

Implementation
Our final implementation uses Jenkins, an open source CI platform, to implement the process flow described in Figure 3. In addition, we used a physical testing framework comprising a device under test (DUT), and a known good product (test generator) to generate test inputs into the DUT, as shown in Figure 4. The measured results can then be compared to the expected results, and provide the basis for some of our functional tests.

Figure 3. Continuous Integration System process flow. Note that our implementation deliberately separates software and firmware compilation, packaging, deployment, and testing.

Figure 4. Completed Testing Implementation. This implementation separates the testing and integration infrastructure.

Automated Compilation and Testing
When one component is updated, the build process compiles the code, packages it, deploys it to the DUT, and then runs the test sequence. For example, a contribution to the FPGA code base triggers some automated test benches, followed by the compilation workflow. Next, the synthesized FPGA image can be programmed to a Crimson unit, alongside the latest passing binaries from the other components. Meanwhile, the latest software (e.g. Firmware, MCU, UHD, and OS) can be fetched, compiled, and deployed. At this point, the entire system can be fully validated and tested, using a fully independent test generator to send and receive programmed commands. This allows us to compare the performance of the system before and after making changes.
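The pairing of a freshly built component with the latest passing binaries of the others can be illustrated with a short sketch. This is not our Jenkins configuration; the `Build` record, the `COMPONENTS` tuple, and the `assemble_package` function are hypothetical names chosen for illustration, and the sketch assumes build identifiers increase over time within each component.

```python
from dataclasses import dataclass

# Hypothetical component names, mirroring the code bases described above.
COMPONENTS = ("mcu", "firmware", "fpga", "uhd", "os")


@dataclass(frozen=True)
class Build:
    component: str
    build_id: int   # assumed to increase monotonically per component
    passed: bool    # True once the build has passed its tests


def assemble_package(history, candidate):
    """Pair the candidate build with the newest passing build of every other component."""
    package = {candidate.component: candidate}
    for name in COMPONENTS:
        if name == candidate.component:
            continue
        passing = [b for b in history if b.component == name and b.passed]
        if not passing:
            raise LookupError(f"no passing build available for {name}")
        package[name] = max(passing, key=lambda b: b.build_id)
    return package
```

Testing a new build only against known-good versions of the other components means that a failure can reasonably be attributed to the component that changed.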

The complete transition to a Continuous Integration system was implemented in three phases, each culminating in a measurable outcome, allowing us to objectively monitor our progress towards a complete implementation.

Phase 1: Implement single command line builds and updates for all elements
The single purpose of this phase was to ensure a consistent build process for each component, and for product updates. This phase also laid the groundwork for subsequent stages. Ensuring code bases are compiled with a single command line (with whatever arguments are necessary), not only allows all developers to fully build the product, but it also helps ensure that build and product dependencies are fully specified.

Our build process also ensured that relevant compilation data (including build flags and versioning parameters) were stored within the resulting code, providing a mechanism we could later use to verify version control data. The resulting binaries also included a manifest that listed files along with their checksums, to help support binary integrity management.
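As a minimal sketch of what such a manifest might look like, the following generates and verifies a file list with checksums. The manifest layout, the choice of SHA-256, and the function names are assumptions made for illustration and do not describe our actual package format.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(package_dir: Path, build_info: dict) -> dict:
    """List every file under package_dir with its checksum, plus build metadata."""
    files = {
        p.relative_to(package_dir).as_posix(): sha256_of(p)
        for p in sorted(package_dir.rglob("*"))
        if p.is_file()
    }
    # build_info is a hypothetical dict, e.g. version string and build flags.
    return {"build": build_info, "files": files}


def verify_manifest(package_dir: Path, manifest: dict) -> list:
    """Return the relative paths whose file is missing or whose checksum differs."""
    bad = []
    for rel_path, expected in manifest["files"].items():
        target = package_dir / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel_path)
    return bad
```

Running the verification step on the device after an update gives a simple way to confirm that what was installed matches what was built.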

This did require us to also develop an update tool to automatically package and update the product. Ultimately, this was also extended by developers to allow for incremental updates directly to their test devices during development. This feature ended up being used quite widely by all developers, which helped ensure its ongoing maintenance.

OUTCOME: Organizational and technical requirement that all code bases be completely configured and compiled through a single command line, with appropriate flags as required.

Phase 2: Automate Builds and Product Updates
The purpose of this phase was to automate the building and deployment of updates whenever contributions were made. After trying a number of different tools, we found Jenkins to be the best suited for deploying to hardware, allowing us to easily introduce new build agents and nodes, and flexibly save and upload the various build artifacts. We used Jenkins CI to implement the workflow illustrated in Figure 3.

Within Jenkins, we extended the command line compilation programs to upload the build artifacts to an internal binary server. This allowed us to better store builds, and manage the various binaries included within a single build. This was especially useful when only building part of the code base, in which case we could easily reuse the existing components.

Build failures of any component resulted in automated alerting, though during initial tests we sent all alerts to a test account to avoid alarm fatigue during implementation. Successful builds resulted in a complete update package being saved, ready for manual or automated QA testing. In addition, we automatically pulled from all contributors' branches, and merged all contributions, on an ongoing basis, to a -testing branch. In doing so, we aimed to catch any interoperability or interface bugs as quickly as possible.

OUTCOME: Require builds or merges to mainline to successfully pass automated compilation, and packaging, prior to automatic merging into a common staging area. The resulting update packages should be stored and easily accessible for the purposes of deploying them to hardware for testing purposes.

Phase 3: Develop Initial CI Tests
In our last phase, we developed and implemented the CI testing functionality. Our initial tests were specifically designed to be very lightweight and were able to be run entirely on the DUT, without requiring an external test controller. They included a number of very simple tests, such as confirming the correct product version after an update. This was ideal, as they were simple to develop, fast to test, and helped validate the update process as well as the functional test set up. As they were intended to be run exclusively on the DUT (without a test generator), they did not require much co-ordination or synchronicity. Failure on any of these tests immediately resulted in overall test failure - this avoided running extensive or time consuming tests on incorrectly updated products.
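The fail-fast behaviour described above can be sketched as follows. The function names and the idea of passing checks as `(name, callable)` pairs are assumptions for illustration; how the version string is actually read back from the DUT is not shown.

```python
def run_smoke_tests(checks):
    """Run (name, callable) checks in order and stop at the first failure.

    Returns (passed, results), where results lists only the checks that ran.
    """
    results = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:  # a check that crashes counts as a failure
            ok = False
        results.append((name, ok))
        if not ok:
            return False, results
    return True, results


def version_matches(reported, expected):
    """Compare the version string read back from the DUT with the one just installed."""
    return reported.strip() == expected.strip()
```

Placing the version check first means that the longer functional tests are never started on a unit whose update did not apply correctly.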

We subsequently developed initial functional tests for core features. We first ran relative tests, whose success confirms operation relative to an initial test case. For example, when testing gain functionality, we apply a constant tone throughout the test sequence. We then configure the device for minimum gain, and record the mean amplitude. Next, we increase the gain incrementally, by a known amount, and confirm a corresponding increase in measured amplitude, relative to the initial run, on every iteration. Finally, after running the relative test sequence, we perform a single test run that compares the absolute amplitude for that tone to a valid range, which helps ensure that the device is, in fact, operating correctly.

This helped identify conditions where, for example, the operation of the attenuator was correct, but front end switch configuration was not. By separating out relative vs absolute test failures, we were able to better identify failure modes and mechanisms.
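A minimal sketch of the relative and absolute checks might look like the following. The tolerance, the decibel units, and all function names are assumptions for illustration; acquiring the amplitude measurements from the DUT and test generator is outside the scope of the sketch.

```python
def relative_gain_check(gains_db, amplitudes_db, tolerance_db=1.0):
    """Compare each measured amplitude step with the commanded gain step.

    Both are taken relative to the first (minimum gain) measurement.
    Returns the gain settings whose measured step is out of tolerance.
    """
    failures = []
    for gain, amplitude in zip(gains_db[1:], amplitudes_db[1:]):
        expected_delta = gain - gains_db[0]
        measured_delta = amplitude - amplitudes_db[0]
        if abs(measured_delta - expected_delta) > tolerance_db:
            failures.append(gain)
    return failures


def absolute_amplitude_check(amplitude_db, low_db, high_db):
    """Check a single absolute measurement against a valid range."""
    return low_db <= amplitude_db <= high_db


def classify(gains_db, amplitudes_db, absolute_db, low_db, high_db):
    """Report relative and absolute outcomes separately to help isolate the fault."""
    relative_ok = not relative_gain_check(gains_db, amplitudes_db)
    absolute_ok = absolute_amplitude_check(absolute_db, low_db, high_db)
    if relative_ok and absolute_ok:
        return "pass"
    if relative_ok:
        return "absolute-only failure"
    if absolute_ok:
        return "relative-only failure"
    return "relative and absolute failure"
```

In this sketch, gain steps that track correctly combined with an out-of-range absolute level would be reported as an absolute-only failure, which points away from the gain stage and toward something in the signal path, such as a switch configuration.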

On successful completion of these tests, we automatically deployed update packages. The idea here is to effectively automate the QA tests that may previously have been done manually. As a result, if the build package successfully passes these same steps in an automated manner, you should feel safe automatically releasing the build. We found, inevitably, that some bugs slipped through the test coverage. In such cases, we added additional functional tests, or extended coverage, to better detect the condition and automatically identify failures.

At least initially, we deployed passing packages to a scheduled queue, which waited a week prior to deploying the update. This allowed us to confirm correct operation of the functional tests, and the accompanying automated report included some performance charts. It also kept a human in the loop: the delay gave a person time to look over the automated report and, if necessary, delete or hold the build back from automatic release. The critical item here is that releases must be continuously deployed in the absence of manual intervention: the entire goal is to move away from manual releases, so any manual intervention that is required to push an update or release runs counter to our aims. On the other hand, manual interventions to withhold, or prevent, a build from being released may be acceptable, provided that the test sequence is updated to avoid that happening again.

COMPLETION OUTCOME: Automatic deployment of update packages on successful completion of tests.
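The queue logic can be sketched in a few lines. The `QueuedRelease` record, the `held` flag, and the `due_for_release` function are hypothetical names; the one week hold period is taken from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

HOLD_PERIOD = timedelta(days=7)  # the one week delay described above


@dataclass
class QueuedRelease:
    build_id: str
    queued_at: datetime
    held: bool = False  # set by a reviewer to withhold the build


def due_for_release(queue, now, hold_period=HOLD_PERIOD):
    """Return the builds that release automatically: past the delay and not held."""
    return [
        release.build_id
        for release in queue
        if not release.held and now - release.queued_at >= hold_period
    ]
```

Note that the default path requires no action: a build is released unless someone explicitly sets the hold, which matches the principle that manual intervention may only prevent a release, never enable one.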

Conclusion
The lines between software and hardware are vanishing thanks to the underlying architecture and the adoption of rapid prototyping methodologies where these two worlds are tightly coupled in SDR. Adopting a CI system proved to be challenging, but we realized considerable benefits when implementing the system. This was especially true as it allowed us to benefit from a very flexible product, and use that to help test and validate technical contributions and the build process, and automatically alert us to regressions.
