A bare-metal Hardware-Software Co-Design capstone project demonstrating a 78x Architectural Speedup by offloading neural network inference from a soft-core RISC-V processor to a custom FPGA hardware accelerator.
This project explores the extreme performance differences between general-purpose sequential software execution and spatially unrolled parallel hardware computing.
An entire System-on-Chip (SoC) was generated from scratch and synthesized onto a Terasic DE1-SoC (Intel Cyclone V) FPGA. The system features a 50 MHz VexRiscv soft-core CPU orchestrating a custom-built neural network engine synthesized via hls4ml. Through a memory-mapped 32-bit Wishbone CSR bridge, the host CPU completely offloads the mathematical workload of an MNIST digit classifier to dedicated FPGA DSP blocks.

Key Achievements
100% Bare-Metal: Zero reliance on Linux or heavy operating systems. All firmware is written in pure C.
Flawless Silicon Execution: Achieved 94% accuracy on a 10,000-image dataset running natively on the Cyclone V fabric.
Massive Throughput: Accelerated inference from 116 FPS (CPU only) to over 9,000 FPS (Hardware Accelerator).
The system consists of a highly optimized Python-trained neural network that has been synthesized into generic Verilog Register-Transfer Level (RTL). It operates on a standard memory-mapped Control and Status Register (CSR) interface, making it CPU-agnostic.
- Input Layer: 14x14 Binary Image (196 pixels)
- Hidden Layer: 32 Dense Neurons
- Output Layer: 10 Output Classes
- Precision: `ap_fixed<16,8>` (16-bit fixed point: 8 integer bits, 8 fractional bits) for wide dynamic range without saturation.
- Output Bus: 320-bit parallel bus (10 classes × 32 bits).
- ASIC Ready: Generated as vendor-agnostic RTL that maps cleanly onto standard digital logic cells.
```mermaid
flowchart TD
    subgraph SoC ["Cyclone V FPGA (LiteX Soft SoC)"]
        direction TD
        %% Row 1: Memory and I/O
        SDRAM["SDRAM (64MB)<br/>[ Firmware in C ]<br/>[ 10k MNIST Data ]"]
        UART["UART Interface<br/>[ litex_term Profiling ]<br/>[ 115200 Baud Output ]"]
        %% Row 2: The Main System Bus
        WB["============================================================================<br/>32-bit WISHBONE SYSTEM BUS (50 MHz)<br/>============================================================================"]
        %% Row 3: Control & Bridge Components
        CPU["VexRiscv<br/>RISC-V CPU<br/>(No FPU)"]
        IO["Hardware I/O<br/>(Status LEDs)"]
        CSR["LiteX CSR Bridge<br/>(Memory-Mapped)"]
        TIMER["Timer0<br/>(32-bit)"]
        %% Row 4: Custom Hardware Accelerator
        HLS["hls4ml AI Engine<br/>-----------------<br/>In : 196 Nodes<br/>L1 : 32 Nodes<br/>Out: 10 Nodes<br/>16-bit Fixed-Pt"]
        %% Layout Connections
        SDRAM --> WB
        UART --> WB
        WB <--> CPU
        WB <--> IO
        WB <--> CSR
        WB <--> TIMER
        CSR --> HLS
    end
    %% Styling to keep it looking like a clean architectural diagram
    classDef default fill:#f9f9f9,stroke:#333,stroke-width:2px,color:#000;
    classDef bus fill:#e6e6fa,stroke:#333,stroke-width:3px,color:#000,font-weight:bold;
    classDef custom fill:#d5f5e3,stroke:#333,stroke-width:2px,color:#000;
    class WB bus;
    class HLS custom;
```
Because the AI accelerator computes at full FPGA clock speed while the firmware polls far more slowly, it uses an Avalon-style stall handshake to pass results back to the CPU without race conditions.
1. Load: The CPU writes the 14x14 image to the AI engine's input CSRs.
2. Stall & Start: The CPU asserts the stall wire and pulses start.
3. Compute: The AI core processes the image in pure hardware.
4. Hold: Once finished, the AI core asserts done. Because stall is active, the hardware holds done HIGH indefinitely.
5. Read: The CPU detects done, reads the 320-bit output CSRs, and calculates the highest logit.
6. Release: The CPU drops the stall wire, resetting the hardware for the next image.
Quick Start (FPGA Demo)

1. Generate the Hardware IP: Run the Makefile synthesis script to generate the generic Verilog files.

   ```sh
   make
   ```

2. Build the SoC (LiteX): Navigate to gateware/ and use LiteX to stitch the AI IP to the soft-core CPU and synthesize the bitstream (.sof).

   ```sh
   python3 soc.py --build --load
   ```

3. Build the Firmware: Navigate to firmware/ and compile the bare-metal C code. Ensure your board is connected via UART to view the inference results!

   ```sh
   cd firmware
   make
   ```

4. Program the FPGA: Flash the board with the generated .sof file.

   ```sh
   quartus_pgm -c 1 -m JTAG -o "p;build/gateware/terasic_de1soc.sof@2"
   ```

5. View the Output: Run litex_term to load the firmware and watch the boot-up sequence.

   ```sh
   litex_term --kernel firmware/firmware.bin --kernel-adr 0x40000000 --safe /dev/ttyUSB0
   ```
To validate the efficiency of the custom Verilog, the exact same neural network workload (6,592 MACs per image) was profiled under two methodologies, using a hardware cycle timer (Timer0):
Software Baseline: Implemented entirely with C arrays on the VexRiscv CPU.
Hardware Accelerator: Offloaded via the Wishbone bus to the hls4ml IP.
| Metric | Software Baseline (VexRiscv CPU) | Hardware Accelerator (hls4ml IP) | The Difference |
|---|---|---|---|
| Throughput (FPS) | 116 Frames/sec | 9,062 Frames/sec | 78x Faster |
| Latency per Image | 8,589 μs | 110 μs | Saved 8,479 μs per image |
| CPU Cycles per Image | 429,496 cycles | 5,517 cycles | 98.7% Reduction in CPU load |
| Compute Performance | 1.52 MOPS | 119.47 MOPS | Massive spatial parallelism |
This project bridges the gap between machine learning and bare-metal hardware by leveraging several incredible open-source tools and frameworks. If you are exploring this repository, I highly recommend checking out the documentation for the following projects:
- hls4ml: A Python package for machine learning inference in FPGAs. Used to translate the TensorFlow/Keras neural network into optimized, generic Verilog RTL.
- LiteX: A highly flexible framework for creating FPGA SoCs. Used to synthesize the gateware, generate the CSR bus, and stitch the AI accelerator to the softcore CPU.
- VexRiscv: A 32-bit RISC-V CPU implementation optimized for FPGAs. Acts as the host processor, running the embedded C firmware and driving the Avalon-style handshake.
- TensorFlow & Keras: The machine learning backend used to train the 16-bit, 32-neuron network.
- The MNIST Database: The classic dataset of handwritten digits used to train and validate the hardware accelerator.