RISC-V & hls4ml Hardware Neural Network Accelerator

A bare-metal hardware/software co-design capstone project demonstrating a 78x architectural speedup by offloading neural network inference from a soft-core RISC-V processor to a custom FPGA hardware accelerator.

Project Overview

This project quantifies the performance gap between general-purpose sequential software execution and spatially unrolled, fully parallel hardware computation.

An entire System-on-Chip (SoC) was generated from scratch and synthesized onto a Terasic DE1-SoC (Intel Cyclone V) FPGA. The system features a 50 MHz VexRiscV soft-core CPU orchestrating a custom-built neural network engine synthesized via hls4ml. Through a memory-mapped 32-bit Wishbone CSR bridge, the host CPU completely offloads the mathematical workload of an MNIST digit classifier to dedicated FPGA DSP blocks.

Key Achievements

100% Bare-Metal: Zero reliance on Linux or heavy operating systems. All firmware is written in pure C.

Validated Silicon Execution: Achieved 94% accuracy on the 10,000-image MNIST test set running natively on the Cyclone V fabric.

Massive Throughput: Accelerated inference from 116 FPS (CPU only) to over 9,000 FPS (Hardware Accelerator).

Architecture Overview

The system consists of a Python-trained neural network synthesized into generic Verilog Register-Transfer Level (RTL). The accelerator exposes a standard memory-mapped Control and Status Register (CSR) interface, making it CPU-agnostic.
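From the firmware's point of view, each CSR is just a memory-mapped word; LiteX emits a generated csr.h header with a read/write helper per register. A minimal sketch of that access pattern, with a placeholder address and register name (the real values come from the generated header, not this README):

```c
#include <stdint.h>

/* Placeholder address -- the LiteX-generated csr.h provides the real one
 * per register; shown only to illustrate the memory-mapped pattern. */
#define CSR_AI_DONE_ADDR 0xf0003808UL

static inline uint32_t ai_done_read(void)
{
    return *(volatile uint32_t *)CSR_AI_DONE_ADDR;   /* status read  */
}

static inline void ai_done_write(uint32_t v)
{
    *(volatile uint32_t *)CSR_AI_DONE_ADDR = v;      /* control write */
}
```

Because the bus side of the accelerator is nothing more than these registers, any CPU that can issue 32-bit loads and stores over Wishbone can drive it.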

Hardware Neural Network Specs

  • Input Layer: 14x14 Binary Image (196 pixels)
  • Hidden Layer: 32 Dense Neurons
  • Output Layer: 10 Output Classes
  • Precision: ap_fixed<16,8> (16 bits total: 8 integer bits including sign, 8 fractional bits), giving enough dynamic range to avoid saturation (see the conversion sketch below).
  • Output Bus: 320-bit parallel bus (10 classes × 32 bits).
  • ASIC Ready: The generated RTL is vendor-agnostic, so it maps cleanly onto standard digital logic cells.
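For reference, ap_fixed<16,8> is the Q8.8 format: a value x is stored as the 16-bit integer round(x * 256), covering roughly -128 to +127.996. A minimal conversion sketch in C (helper names are illustrative, not taken from the firmware):

```c
#include <math.h>
#include <stdint.h>

/* Convert a float to ap_fixed<16,8> (Q8.8): 8 integer bits incl. sign,
 * 8 fractional bits, representable range approx. [-128, +127.996]. */
static int16_t float_to_q8_8(float x)
{
    float scaled = x * 256.0f;                    /* shift by 8 fractional bits */
    if (scaled >  32767.0f) scaled =  32767.0f;   /* saturate instead of wrap   */
    if (scaled < -32768.0f) scaled = -32768.0f;
    return (int16_t)lrintf(scaled);
}

static float q8_8_to_float(int16_t q)
{
    return (float)q / 256.0f;                     /* undo the scaling */
}
```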

System Block Diagram

```mermaid
flowchart TD
    subgraph SoC ["Cyclone V FPGA (LiteX Soft SoC)"]
        direction TB

        %% Row 1: Memory and I/O
        SDRAM["SDRAM (64MB)<br/>[ Firmware in C ]<br/>[ 10k MNIST Data ]"]
        UART["UART Interface<br/>[ litex_term Profiling ]<br/>[ 115200 Baud Output ]"]

        %% Row 2: The Main System Bus
        WB["============================================================================<br/>32-bit WISHBONE SYSTEM BUS (50 MHz)<br/>============================================================================"]

        %% Row 3: Control & Bridge Components
        CPU["VexRiscV<br/>RISC-V CPU<br/>(No FPU)"]
        IO["Hardware I/O<br/>(Status LEDs)"]
        CSR["LiteX CSR Bridge<br/>(Memory-Mapped)"]
        TIMER["Timer0<br/>(32-bit)"]

        %% Row 4: Custom Hardware Accelerator
        HLS["hls4ml AI Engine<br/>-----------------<br/>In : 196 Nodes<br/>L1 :  32 Nodes<br/>Out:  10 Nodes<br/>16-bit Fixed-Pt"]

        %% Layout Connections
        SDRAM --> WB
        UART --> WB

        WB <--> CPU
        WB <--> IO
        WB <--> CSR
        WB <--> TIMER

        CSR --> HLS
    end

    %% Styling to keep it looking like a clean architectural diagram
    classDef default fill:#f9f9f9,stroke:#333,stroke-width:2px,color:#000;
    classDef bus fill:#e6e6fa,stroke:#333,stroke-width:3px,color:#000,font-weight:bold;
    classDef custom fill:#d5f5e3,stroke:#333,stroke-width:2px,color:#000;

    class WB bus;
    class HLS custom;
```

The Hardware Handshake

Because the AI accelerator finishes a computation in far fewer cycles than the CPU's software loop can poll it, it uses an Avalon-style stall handshake to safely pass data back to the slower software side without race conditions (a C sketch of the full sequence follows the steps below):

  1. Load: The CPU writes the 14x14 image to the AI's input CSRs.
  2. Stall & Start: The CPU asserts the stall wire and pulses start.
  3. Compute: The AI core processes the image in pure hardware.
  4. Hold: Once finished, the AI asserts done. Because stall is active, the hardware holds the done signal HIGH indefinitely.
  5. Read: The CPU detects done, reads the 320-bit output CSRs, and takes the argmax of the ten logits.
  6. Release: The CPU drops the stall wire, resetting the hardware for the next image.
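A minimal bare-metal sketch of this six-step loop. The MMIO map below is hypothetical (the real register names and addresses come from the LiteX-generated csr.h); it illustrates the handshake ordering only:

```c
#include <stdint.h>

/* Hypothetical MMIO map -- placeholders, not the project's real layout. */
#define AI_BASE      0xf0003000UL
#define AI_STALL     (*(volatile uint32_t *)(AI_BASE + 0x00))
#define AI_START     (*(volatile uint32_t *)(AI_BASE + 0x04))
#define AI_DONE      (*(volatile uint32_t *)(AI_BASE + 0x08))
#define AI_INPUT(i)  (*(volatile uint32_t *)(AI_BASE + 0x100 + 4 * (i)))
#define AI_LOGIT(c)  (*(volatile uint32_t *)(AI_BASE + 0x200 + 4 * (c)))

int classify(const uint32_t *image_words, int n_words)
{
    /* 1. Load: write the packed 14x14 binary image into the input CSRs. */
    for (int i = 0; i < n_words; i++)
        AI_INPUT(i) = image_words[i];

    /* 2. Stall & Start: freeze the result path, then pulse start. */
    AI_STALL = 1;
    AI_START = 1;
    AI_START = 0;

    /* 3-4. Compute & Hold: poll done; stall keeps it latched HIGH. */
    while (!(AI_DONE & 1u))
        ;

    /* 5. Read: fetch the 10 logits (320-bit result = 10 x 32-bit CSRs)
     *    and take the argmax as the predicted digit. */
    int best = 0;
    int32_t best_val = (int32_t)AI_LOGIT(0);
    for (int c = 1; c < 10; c++) {
        int32_t v = (int32_t)AI_LOGIT(c);
        if (v > best_val) { best_val = v; best = c; }
    }

    /* 6. Release: drop stall so the core resets for the next image. */
    AI_STALL = 0;
    return best;
}
```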

Quick Start (FPGA Demo)

  1. Generate the Hardware IP: Run the Makefile synthesis script to generate the generic Verilog files.

    make
  2. Build the SoC (LiteX): Navigate to gateware/ and use LiteX to stitch the AI IP to a softcore CPU and synthesize the bitstream (.sof).

    python3 soc.py --build --load
  3. Run the Firmware: Navigate to firmware/ and compile the bare-metal C code. Ensure your board is connected via UART to view the inference results!

    cd firmware
    make
  4. View the FPGA Output: Run this command to load the firmware over serial and view the boot-up sequence.

    litex_term --kernel firmware/firmware.bin --kernel-adr 0x40000000 --safe /dev/ttyUSB0
  5. Program the FPGA Board: Load the .sof file via JTAG.

    quartus_pgm -c 1 -m JTAG -o "p;build/gateware/terasic_de1soc.sof@2"

Performance Metrics & Benchmarks

To validate the efficiency of the custom Verilog, the same neural network workload (6,592 MACs per image) was profiled under two configurations using a hardware cycle timer (Timer0):

Software Baseline: Emulated strictly in C arrays on the VexRiscV CPU.

Hardware Accelerator: Offloaded via the Wishbone bus to the hls4ml IP.
| Metric | Software Baseline (VexRiscV CPU) | Hardware Accelerator (hls4ml IP) | The Difference |
| --- | --- | --- | --- |
| Throughput | 116 frames/sec | 9,062 frames/sec | 78x faster |
| Latency per Image | 8,589 μs | 110 μs | 8,479 μs saved per image |
| CPU Cycles per Image | 429,496 cycles | 5,517 cycles | 98.7% reduction in CPU load |
| Compute Performance | 1.52 MOPS | 119.47 MOPS | Massive spatial parallelism |
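The derived columns follow directly from the raw Timer0 cycle counts at the 50 MHz system clock, with each of the 6,592 MACs (196x32 + 32x10) counted as two operations (multiply + accumulate). A small sketch of that arithmetic; tiny rounding differences from the published table are expected:

```c
#include <stdio.h>

/* Derive the benchmark rows from raw Timer0 cycle counts. */
#define F_CLK_HZ      50000000.0        /* 50 MHz system clock          */
#define OPS_PER_IMAGE (2.0 * 6592.0)    /* 6,592 MACs, 2 ops per MAC    */

static void report(const char *name, double cycles)
{
    double latency_us = cycles / F_CLK_HZ * 1e6;
    double fps        = F_CLK_HZ / cycles;
    double mops       = OPS_PER_IMAGE * fps / 1e6;
    printf("%-22s %10.0f cyc  %8.1f us  %7.0f FPS  %7.2f MOPS\n",
           name, cycles, latency_us, fps, mops);
}

int main(void)
{
    report("Software (VexRiscV)", 429496.0);  /* -> ~8590 us, ~116 FPS,  ~1.5 MOPS   */
    report("Hardware (hls4ml)",     5517.0);  /* ->  ~110 us, ~9.1k FPS, ~119.5 MOPS */
    return 0;
}
```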

References & Acknowledgments

This project bridges the gap between machine learning and bare-metal hardware by leveraging several incredible open-source tools and frameworks. If you are exploring this repository, I highly recommend checking out the documentation for the following projects:

  • hls4ml: A Python package for machine learning inference in FPGAs. Used to translate the TensorFlow/Keras neural network into optimized, generic Verilog RTL.
  • LiteX: A highly flexible framework for creating FPGA SoCs. Used to synthesize the gateware, generate the CSR bus, and stitch the AI accelerator to the softcore CPU.
  • VexRiscv: A 32-bit RISC-V CPU architecture optimized for FPGAs. Acts as the host processor driving the Avalon handshake and embedded C firmware.
  • TensorFlow & Keras: The machine learning backend used to train the 16-bit, 32-neuron network.
  • The MNIST Database: The classic dataset of handwritten digits used to train and validate the hardware accelerator.
