The Harness for High-Velocity Engineering: CI/QA as a Safety Rail

In the era of rapid-cycle software delivery, a reliable CI harness isn't a convenience — it's survival gear. The problem is that speed and isolation have always been at odds. Docker executors are fast but share the host kernel. Kubernetes pods add real isolation but then layer scheduling, image pulls, and pod startup — pushing the simple act of "start a build environment" into the 30 to 60 second range. Both approaches start from a clean slate: every dependency, every cache, fetched fresh as if the last thousand builds never happened.

Our Firecracker-based infrastructure breaks the trade-off. Hardware-level isolation, cold-start speed of a process, and a distributed P2P mesh that pre-warms dependencies before the VM comes online. It isn't tied to any single orchestrator — Jenkins, TeamCity, GitLab CI, or a standalone CLI all consume the same machine API. The system is a decoupled compute plane, not a plugin.

        +-------------------------------------------------------+
        |       Orchestrator (Jenkins, GitLab, TeamCity, CLI)   |
        |       (Consumes the Firecracker Machine API)          |
        +---------------------------+---------------------------+
                                    |
                           1. Request Provisioning
                                    |
          +-------------------------v-------------------------+
          |               Kubernetes / Machine API            |
          +-------------------------+-------------------------+
                                    |
                           2. Schedule to Node
                                    |
   +--------------------------------v-----------------------------------------+
   |                          Bare Metal Host                                 |
   |                                                                          |
   |  +-----------------------+           +--------------------------------+  |
   |  |  Firecracker Operator | <---------+  P2P Distribution Peer         |  |
   |  |  (Node Agent)         |           |  (Image & Cache Distribution)  |  |
   |  +-----------+-----------+           +---------------+----------------+  |
   |              |                                       ^                   |
   |      3. Provision VM                                 | 4. Hydrate Rootfs |
   |              |                                       |    & Cache Seeds  |
   |  +-----------v-----------+           +---------------v----------------+  |
   |  |    Firecracker VM     |           |        Mountpoint S3           |  |
   |  | +-------------------+ |           | (Fleet-wide Object Backing)    |  |
   |  | | Custom Init Binary| |           +---------------+----------------+  |
   |  | +-------------------+ |                           |                   |
   |  | |   Build Agent     | <---------------------------+                   |
   |  | +-------------------+ |           5. Virtio-FS Zero-Copy Mount        |
   |  +-----------------------+                                               |
   +--------------------------------------------------------------------------+

The 10ms Boot

Standard Linux distributions boot too slowly for ephemeral tasks — systemd, udev, getty spawning login terminals nobody uses. We wrote firecracker-agent-init, a custom Go binary that the kernel loads directly as PID 1. It mounts /proc, /sys, and /dev in a few microseconds, then bridges the Firecracker Vsock interface straight to the Jenkins agent.jar process. From kernel load to connected agent: under 10 milliseconds.

Firecracker loads the uncompressed Linux kernel ELF image directly, with no BIOS or bootloader stage, so we inject raw boot_args: init=/sbin/my-custom-init to skip the initramfs chain and hand control straight to our init, and pci=off to skip PCI bus probing. For the fastest cold starts, we use monolithic kernels with all necessary drivers baked in. For workloads that need XFS or encrypted volumes, we load kernel modules dynamically from the rootfs or a shared Virtio-FS mount.
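As a rough illustration of the boot arguments described above, here is a minimal Python sketch that assembles the JSON body for Firecracker's PUT /boot-source request. The kernel path and the helper name build_boot_source are hypothetical; only the kernel_image_path and boot_args fields come from Firecracker's API.

```python
import json


def build_boot_source(kernel_path, init_path, extra_args=()):
    """Return the JSON body for Firecracker's PUT /boot-source request."""
    # console=ttyS0: serial console; quiet: suppress most boot logging;
    # panic=1: reboot one second after a kernel panic; pci=off: skip PCI probing;
    # init=...: have the kernel exec our binary directly as PID 1.
    args = ["console=ttyS0", "quiet", "panic=1", "pci=off", f"init={init_path}"]
    args.extend(extra_args)
    return {"kernel_image_path": kernel_path, "boot_args": " ".join(args)}


# Hypothetical kernel path, for illustration only.
body = build_boot_source("/var/lib/fc/vmlinux-6.1", "/sbin/my-custom-init")
print(json.dumps(body, indent=2))
```

The body would be sent to the VM's API socket before the instance is started; extra_args leaves room for per-workload flags without touching the defaults.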

Zero-Copy Caching

A VM that boots in 10ms is useless if it then spends two minutes downloading dependencies. We map the host's cache directories — .m2, node_modules, .gradle — directly into the microVM's filesystem via virtiofsd and vhost-user-fs. Multiple concurrent builds on the same host share the exact same physical disk sectors. Not copies, not symlinks, the same sectors. No network round-trip, no disk duplication, near-native I/O throughput for every agent.
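One way to picture the mapping is as a table of shares that the host exports and the guest mounts. The sketch below is illustrative only: the tag names and host paths are assumptions, and the only real interface it relies on is the standard Linux form mount -t virtiofs <tag> <mountpoint>.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheShare:
    tag: str        # virtiofs tag exported by the host-side daemon (hypothetical)
    host_dir: str   # cache directory on the bare-metal host (hypothetical)
    guest_dir: str  # where the build agent expects to find the cache


SHARES = [
    CacheShare("m2", "/var/cache/ci/m2", "/root/.m2"),
    CacheShare("gradle", "/var/cache/ci/gradle", "/root/.gradle"),
]


def guest_mount_commands(shares):
    """Commands a guest init could run to attach each shared cache directory."""
    return [f"mount -t virtiofs {s.tag} {s.guest_dir}" for s in shares]


for command in guest_mount_commands(SHARES):
    print(command)
```

Because every guest mounts the same host directory, a dependency downloaded by one build is immediately visible to the others on that host.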

At fleet scale, the problem shifts from local sharing to cross-host distribution. We use AWS Mountpoint for S3 as the durable backing layer, with a P2P caching mesh running across the cluster. When a node needs an artifact, it asks its peers first. S3 is read only when no peer holds the artifact, so in the steady state each artifact is fetched from S3 once per cluster to hydrate the first peer. The fleet becomes its own self-healing cache.
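The peer-first lookup can be sketched with an in-memory model. This is a simplification under stated assumptions: the Peer and Mesh classes are hypothetical, a plain dict stands in for the S3-backed layer, and real peer discovery and transfer are omitted.

```python
class Peer:
    def __init__(self, name):
        self.name = name
        self.store = {}  # artifact key -> bytes held on this node


class Mesh:
    def __init__(self, peers, origin):
        self.peers = peers
        self.origin = origin      # stands in for the S3-backed layer
        self.origin_reads = 0     # how often we had to fall back to the origin

    def fetch(self, requester, key):
        # 1. Local hit: nothing to do.
        if key in requester.store:
            return requester.store[key]
        # 2. Ask the other peers before touching the origin.
        for peer in self.peers:
            if peer is not requester and key in peer.store:
                requester.store[key] = peer.store[key]
                return requester.store[key]
        # 3. No peer has it: hydrate from the origin and keep a copy.
        self.origin_reads += 1
        requester.store[key] = self.origin[key]
        return requester.store[key]


peers = [Peer(f"node-{i}") for i in range(3)]
mesh = Mesh(peers, origin={"maven/guava.jar": b"jar-bytes"})
for p in peers:
    mesh.fetch(p, "maven/guava.jar")
print(mesh.origin_reads)  # 1: only the first node went to the origin
```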

Multi-Kernel Build Matrix

The 10ms boot changes what's practical. Here is a Declarative Jenkins pipeline that builds and tests across 15 kernel variants in parallel, each in its own hardware-isolated microVM:

pipeline {
    agent none
    stages {
        stage('Multi-Kernel Test') {
            matrix {
                axes {
                    axis {
                        name 'KERNEL'
                        values 'linux-5.10', 'linux-5.15', 'linux-6.1',
                               'linux-6.6', 'linux-6.12',
                               'linux-rt-5.15', 'linux-rt-6.1', 'linux-rt-6.6',
                               'linux-cloud-5.15', 'linux-cloud-6.1',
                               'linux-arm64-6.6', 'linux-xfs-custom',
                               'linux-ebpf-6.1', 'linux-hardened-6.6',
                               'linux-lts-5.4'
                    }
                }
                stages {
                    stage('Build & Test') {
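                        agent {
                            // Each matrix cell gets its own hardware-isolated microVM.
                            firecracker {
                                cpus 4
                                memory '8192M'
                                image "kernel-test-${KERNEL}"
                                // Passed through to the guest kernel command line.
                                kernelArgs "console=ttyS0 quiet panic=1 pci=off"
                                // Capture memory and VM state if any step fails.
                                snapshotOnFailure true
                            }
                        }
                        steps {
                            sh 'make clean && make -j$(nproc)'
                            // Single quotes: the shell, not Groovy, expands KERNEL from the matrix environment.
                            sh './run_tests.sh --kernel-version "${KERNEL}"'
                        }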
                    }
                }
            }
        }
    }
}

All 15 VMs boot in parallel, so wall-clock time is bounded by the slowest kernel, not the sum of all 15. Every failure snapshots its VM state for later debugging. The pipeline DSL accepts kernelArgs and snapshotOnFailure natively. On failure, the runtime marks the agent and the driver passes the flags to firecracker-hostd, the daemon on the host. The daemon pauses the VM via Firecracker's API socket, writes the full memory and state snapshot into the lease's observability/ directory (synced to S3), and only then tears down networking, disks, and the process.

Time-Travel Debugging

When a build fails in a traditional CI system, the developer reads logs, guesses at the cause, pushes a fix, and waits for the pipeline to rerun. With snapshot support, the plugin issues PATCH /vm with the state set to Paused to freeze the microVM mid-failure, then PUT /snapshot/create to capture its exact guest memory and device state. A Firecracker snapshot does not include disk contents, so the VM's disk image has to be preserved with it.
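A minimal sketch of that pause-then-snapshot sequence against Firecracker's API socket follows. The request paths and body fields (state, snapshot_type, snapshot_path, mem_file_path) are Firecracker's; the helper names, the UnixHTTPConnection class, and any file paths passed in are assumptions made for illustration.

```python
import http.client
import json
import socket


def snapshot_requests(snapshot_path, mem_path):
    """Ordered (method, path, body) tuples: pause the VM, then take a full snapshot."""
    return [
        ("PATCH", "/vm", {"state": "Paused"}),
        ("PUT", "/snapshot/create", {
            "snapshot_type": "Full",
            "snapshot_path": snapshot_path,   # VM and device state file
            "mem_file_path": mem_path,        # guest memory file
        }),
    ]


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over a Unix domain socket, which is how Firecracker exposes its API."""

    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)


def send(socket_path, method, path, body):
    """Send one JSON request to the API socket and return the HTTP status code."""
    conn = UnixHTTPConnection(socket_path)
    conn.request(method, path, body=json.dumps(body),
                 headers={"Content-Type": "application/json"})
    status = conn.getresponse().status
    conn.close()
    return status
```

A caller would loop over snapshot_requests(...) and pass each tuple to send with the VM's socket path. Order matters: the VM has to be paused before the snapshot is taken.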

The Jenkins UI shows an "SSH into Failure" button on the failed build. Clicking it boots a clone of that snapshot in under a second. The developer lands in a shell where the failure just happened — environment variables, temporary files, open processes, all frozen exactly as they were. They attach gdb or delve, inspect stack frames and heap state, test a fix interactively, and resume the pipeline from the corrected state. CI becomes a live forensics console, not a log dumper.

Density and the Operator

Resource efficiency determines how many agents a single host can run. Firecracker's balloon device lets the host reclaim unused memory pages from running microVMs: a VM allocated 8 GB that only uses 1 GB can hand most of the remaining 7 GB back to the host. This lets us overprovision with far less risk of OOM kills, scheduling against actual usage rather than reservation.
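To make the density argument concrete, here is a small worked calculation. The 256 GiB host size and the 20% headroom factor are assumptions chosen for illustration, not figures from our fleet.

```python
def vm_capacity(host_mib, reserved_mib, used_mib, headroom=0.2):
    """Compare how many VMs fit when scheduling by reservation versus observed usage."""
    by_reservation = host_mib // reserved_mib
    # Keep a margin above observed usage so a VM that suddenly needs
    # more memory has room while its balloon deflates.
    per_vm_mib = int(used_mib * (1 + headroom))
    by_usage = host_mib // per_vm_mib
    return by_reservation, by_usage


# 256 GiB host, 8 GiB reserved per VM, about 1 GiB actually in use.
print(vm_capacity(host_mib=256 * 1024, reserved_mib=8192, used_mib=1024))
# (32, 213)
```

Under these assumptions the same host goes from 32 agents to a little over 200, which is why the headroom factor deserves conservative tuning.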

The original HTTP controller (firecracker-hostd) handled single-node setups. For cluster scheduling, we built a Kubernetes Operator around a FirecrackerMachine Custom Resource Definition. The CRD specifies the VM image, CPU, memory, kernel path, kernel arguments, cache mounts, and snapshot policy. Jenkins applies FirecrackerMachine resources to the cluster; node-local operators watch for new machines and instantiate Firecracker VMs on their host. The operator creates a companion Lease resource to track the VM lifecycle (provisioning, ready, running, failed, released) with an observability/ path that collects logs, snapshots, and a cleanup-proof manifest. Standard Kubernetes scheduling (taints, tolerations, affinities, resource requests) drives placement across the bare-metal fleet without the overhead of heavier isolation layers.
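The Lease lifecycle can be modelled as a small state machine. The five states are the ones listed above; the transition table is an assumption, since only the states themselves are specified here.

```python
from enum import Enum


class LeaseState(Enum):
    PROVISIONING = "provisioning"
    READY = "ready"
    RUNNING = "running"
    FAILED = "failed"
    RELEASED = "released"


# Assumed edges between states; RELEASED is terminal.
ALLOWED = {
    LeaseState.PROVISIONING: {LeaseState.READY, LeaseState.FAILED},
    LeaseState.READY: {LeaseState.RUNNING, LeaseState.FAILED, LeaseState.RELEASED},
    LeaseState.RUNNING: {LeaseState.FAILED, LeaseState.RELEASED},
    LeaseState.FAILED: {LeaseState.RELEASED},
    LeaseState.RELEASED: set(),
}


def advance(current, target):
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal lease transition: {current.value} -> {target.value}")
    return target


state = LeaseState.PROVISIONING
for nxt in (LeaseState.READY, LeaseState.RUNNING, LeaseState.FAILED, LeaseState.RELEASED):
    state = advance(state, nxt)
print(state.value)  # released
```

Routing failed through released in this sketch reflects the ordering described earlier: the snapshot is written first, and teardown happens only afterwards.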

REPL-Driven CI for AI Agents

For active development, we added a WebSocket endpoint to the operator that bypasses Jenkinsfiles. An AI agent sends a request — "Ubuntu 22.04, Java 21, PostgreSQL" — and gets back a shell session piped through Vsock in under 10ms. It streams git diff output into the VM, runs mvn test, reads the stack trace, patches the source, and reruns — all inside the same microVM without a single push to the remote. A typical session: provision the VM, write a failing test, watch it crash, apply the fix, confirm the pass, discard the environment. No branches, no PRs, no waiting.

Local Parity

CI failures that don't reproduce locally are a constant source of friction — a macOS or Windows laptop rarely matches the Linux environment where the pipeline runs. Since the plugin builds standard OCI images into Firecracker rootfs files, we ship a companion CLI: jenkins-fc. A developer runs jenkins-fc run --job "my-failing-job", which pulls the exact same immutable Firecracker base image used by the cluster, boots it locally (via the macOS Hypervisor framework or KVM on Linux), and executes the pipeline steps inside an identical microVM. Same kernel, same init, same cache mount layout — byte-for-byte identical to the cluster environment.

Security

Caching at scale introduces trust problems: an untrusted pull request must not poison the cache for the mainline. We enforce strict Trust Domains — caches produced in an untrusted context are cryptographically isolated from the trusted domain, and firecracker-hostd validates every cache export before promotion to the distribution mesh.
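One plausible way to implement this kind of separation, shown purely as a sketch, is to mix a per-domain secret into every cache key and to gate promotion on both the producing domain and a digest check. The function names, the manifest shape, and the use of HMAC-SHA256 are assumptions, not a description of what firecracker-hostd does internally.

```python
import hashlib
import hmac


def cache_key(domain_secret: bytes, lockfile_digest: str) -> str:
    """Derive a lookup key bound to one trust domain via HMAC-SHA256."""
    return hmac.new(domain_secret, lockfile_digest.encode(), hashlib.sha256).hexdigest()


def may_promote(producer_domain: str, manifest: dict, blobs: dict) -> bool:
    """Allow promotion only for trusted exports whose blobs match their recorded digests."""
    if producer_domain != "trusted":
        return False
    return all(
        name in blobs and hashlib.sha256(blobs[name]).hexdigest() == digest
        for name, digest in manifest.items()
    )


# Same dependency set, different domains: the keys never collide.
trusted = cache_key(b"trusted-domain-secret", "sha256:abc123")
untrusted = cache_key(b"untrusted-domain-secret", "sha256:abc123")
print(trusted != untrusted)  # True
```

With keys derived this way, a build in the untrusted domain cannot compute the key a mainline build will look up, so it has no way to overwrite that entry.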

For process-level isolation, each Firecracker instance runs inside jailer, enforcing cgroups (CPU and memory caps per VM) and namespaces (network, PID, and mount isolation). Secrets never touch environment variables; the agent queries the microVM Metadata Service (MMDS) at 169.254.169.254 for short-lived credentials scoped to that build. Arbitrary, untrusted code runs with hardware isolation and ephemeral, least-privilege access.
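From inside the guest, the credential lookup might look like the sketch below. It assumes MMDS version 2, which uses a session token obtained with a PUT before any GET; the credentials path is hypothetical, since the MMDS data store layout is whatever the host chooses to populate.

```python
import urllib.request

MMDS = "http://169.254.169.254"


def token_request(ttl_seconds=60):
    """MMDS v2 step 1: request a session token with a bounded lifetime."""
    return urllib.request.Request(
        f"{MMDS}/latest/api/token",
        method="PUT",
        headers={"X-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def credentials_request(token, path="/latest/meta-data/build-credentials"):
    """MMDS v2 step 2: read a data-store path, presenting the token (path is hypothetical)."""
    return urllib.request.Request(
        f"{MMDS}{path}",
        method="GET",
        headers={"X-metadata-token": token, "Accept": "application/json"},
    )


def fetch_credentials():
    """Run both steps; only works inside a guest whose network interface has MMDS enabled."""
    with urllib.request.urlopen(token_request(), timeout=1) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(credentials_request(token), timeout=1) as resp:
        return resp.read().decode()
```

Keeping the token lifetime short matches the least-privilege goal: a leaked token is useless soon after the build ends.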