A GEMM (General Matrix Multiply) Co-processor

This project is a matrix-multiplication co-processor integrated with a 3-stage pipelined RISC-V processor and a UART peripheral. Currently, the core implements the RV32IMZicsr ISA, based on User-Level ISA Version 2.0 and Privileged Architecture Version 1.11, and supports only M-mode. The key features of the project are listed below.

Key Features

32-bit RISC-V core that supports the base integer (I), multiplication and division (M), and Zicsr extensions (RV32IMZicsr), with a custom GEMM co-processor.
Supports 8-bit signed input matrix elements and 32-bit signed output matrix elements.
Dual-port data memory with 16 banks, synthesizable in Vivado.
Accepts input matrices through either the data memory or the UART.
Supports matrices of any size through the runtime tiling provided by GEMM.h.
Overlapping of configuration and computation stages in GEMM to hide the latency of the configuration.
Hardware performance monitoring through CSR counters.
Comprehensive random testing of GEMM with different matrix sizes and element values.

System Design Overview

The GEMM co-processor is loosely coupled and is configured through memory-mapped registers. As a result, the core's datapath and control logic do not need to change, as they would for a tightly coupled co-processor, and the GEMM can be integrated easily with existing cores. The dual-port data memory has a 32-bit interface to the RISC-V core and a 128-bit interface to the GEMM. The main unit of the GEMM is a 16x16 systolic array of MAC units. When a matrix has to be processed as multiple tiles, the GEMM overlaps the computations of those tiles. The architecture also overlaps the configuration and computation activities of the co-processor, which lets the performance gain approach the ideal 256x as the matrix size increases.
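The tiling idea can be sketched on the host side as follows. This is a minimal Python model, not the code in GEMM.h: the names `tiled_matmul` and `tile_count` are hypothetical, and the way the K dimension is split into 16-wide chunks is an assumption. Only the 16x16 tile size comes from the array size described above.

```python
TILE = 16  # the systolic array is 16x16, so one pass covers at most a 16x16 tile


def tiled_matmul(a, b, tile=TILE):
    """Multiply a (M x K) by b (K x N) one tile at a time.

    Hypothetical host-side model of runtime tiling; the real GEMM.h
    routine may order, split, or pad tiles differently.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):            # tile rows of C
        for j0 in range(0, n, tile):        # tile columns of C
            for k0 in range(0, k, tile):    # accumulate partial products over K
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


def tile_count(m, k, n, tile=TILE):
    """Number of tile-level multiplications for an (m x k) by (k x n) product."""
    def ceil_div(x):
        return -(-x // tile)
    return ceil_div(m) * ceil_div(k) * ceil_div(n)
```

Under these assumptions a 60x60 by 60x60 product needs `tile_count(60, 60, 60)` tile passes, while an 8x8 by 8x8 product fits in a single pass.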

The block diagram shows how the core connects to the memory, GEMM, and UART peripherals over the data bus. In this design, the instruction memory is a ROM built into the RISC-V core.

Memory Map

The memory map for the GEMM and UART is given in the following table.

Base Address	Description
0x8000_0000	UART
0x9000_0000	GEMM_A
0x9000_0004	GEMM_B
0x9000_0008	GEMM_C
0x9000_000C	GEMM_stride_A
0x9000_0010	GEMM_stride_B
0x9000_0014	GEMM_control
0x9000_0018	GEMM_DIM
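For host-side scripts or testbenches it can help to keep these addresses in one place. The sketch below simply restates the table in Python; the names `UART_BASE`, `GEMM_BASE`, and `GemmReg` are hypothetical and do not come from the repository.

```python
from enum import IntEnum

UART_BASE = 0x8000_0000
GEMM_BASE = 0x9000_0000


class GemmReg(IntEnum):
    """Addresses of the GEMM registers, copied from the memory map table."""
    A = GEMM_BASE + 0x00
    B = GEMM_BASE + 0x04
    C = GEMM_BASE + 0x08
    STRIDE_A = GEMM_BASE + 0x0C
    STRIDE_B = GEMM_BASE + 0x10
    CONTROL = GEMM_BASE + 0x14
    DIM = GEMM_BASE + 0x18


# Print the map in the same order as the table.
for reg in GemmReg:
    print(f"{reg.name:<9} 0x{reg.value:08X}")
```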

Operations Overview:

The following diagram shows all the components of the system and the operations performed at each stage.

Architectural Details:

Parameter	Value
Input size	8 bits (signed)
Result size	32 bits
DataFlow	Weight Stationary
Systolic Array Size	16 x 16
Core Size	8 x 8
Number of Cores	4
Data memory Banks (dual port)	16
Banks Port Widths	8 bits
Accumulator Size	16 x 32

Performance Comparison:

The table below compares the performance of the GEMM Co-processor with the RISC-V scalar core. The cycle counts for the RISC-V core are empirically calculated, assuming no control hazards, using the following formulas:

Loading matrices: MxK + KxN
Computation: 2xMxKxN
Storing the result matrix: MxN

A (M,K)	B (K,N)	GEMM cycles	Core Cycles (emp.)	Performance Gain
(60, 60)	(60, 60)	3119	442,800	142x
(60, 8)	(8, 60)	659	62,160	94x
(8, 60)	(60, 8)	248	8,704	35x
(8, 8)	(8, 8)	48	1,216	25x
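The core cycle estimates and gains in the table can be reproduced from the three formulas above. The short Python check below is only a sketch: the function name `core_cycles` is hypothetical, and the GEMM cycle counts are copied from the table rather than measured here.

```python
def core_cycles(m, k, n):
    """Estimated scalar-core cycles for an (m x k) by (k x n) product."""
    load = m * k + k * n        # loading matrices: MxK + KxN
    compute = 2 * m * k * n     # computation: 2xMxKxN
    store = m * n               # storing the result matrix: MxN
    return load + compute + store


# (M, K, N) -> GEMM cycles, as listed in the table.
gemm_cycles = {
    (60, 60, 60): 3119,
    (60, 8, 60): 659,
    (8, 60, 8): 248,
    (8, 8, 8): 48,
}

for (m, k, n), gemm in gemm_cycles.items():
    core = core_cycles(m, k, n)
    print(f"A({m},{k}) B({k},{n}): core={core:,} gemm={gemm} gain={core / gemm:.0f}x")
```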

Getting Started

The following programs are needed to use the accelerator:

Vivado, for running on hardware.
RISC-V toolchain, for compiling to binary.
GTKterm, for displaying UART transmissions.

Install the RISC-V toolchain. These tools can be built by following the instructions in the corresponding links, or installed directly by running the following command:

sudo apt-get install -y gcc-riscv64-unknown-elf gtkwave gtkterm

Check that these tools are installed correctly, by running riscv64-unknown-elf-gcc -v and gtkterm.

Using GEMM.h

Use the functions defined in [GEMM.h](./Script/src/gemm.h). MATMUL is the GEMM matrix multiplication function. It handles all of the hardware configuration of the GEMM, so you only need to pass it the two matrices to multiply and the matrix that stores the result. The function takes the dimensions and the addresses of the matrices as arguments:

MATMUL(uint32_t A_rows, uint32_t A_cols, uint32_t B_cols, int8_t A[A_rows][A_cols], int8_t B[A_cols][B_cols], int32_t C[A_rows][B_cols]);

Compile the C code with the following command (Linux):

./Script/make.sh

Compile the C code with the following command (Windows):

./Script/make.bat

This compiles the C code and generates ICACHE.mem and the memory files for the data memory banks, which are read by instruction_memory and data_memory.

You can use the MCYCLE counter to measure the hardware cycles taken by the function. The macros below are defined in timer.h.

Before calling the MATMUL function:

TIMER_START

After returning from the function:

TIMER_STOP

The cycle count can be read into a 32-bit unsigned variable by calling the function:

read_cycles();

Using UART.h

To display UART output in GTKterm, first set the following parameters:

Select the baud rate to be used. (Default: 9600)
Calculate the UART baud divisor and set it in the C code. (Default: 1301)
Confirm the RISC-V clock frequency used to calculate the baud rate and divisor.
Provide access to the port used for UART transmission for both Python and the FPGA.

First, we select the Baud rate to display. Then calculate the Baud-divisor by using the following formula:
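The formula itself is not reproduced in this text. One common convention, shown here purely as an assumption, is divisor = f_clk / baud - 1. With that convention, the default divisor of 1301 at 9600 baud corresponds to a clock of about 12.5 MHz. Both the convention and the 12.5 MHz figure are guesses that happen to match the defaults, so check them against the UART RTL and your clock settings before relying on them. The function name `baud_divisor` is hypothetical.

```python
def baud_divisor(clk_hz, baud):
    """Baud divisor under the ASSUMED convention divisor = clk_hz / baud - 1."""
    return clk_hz // baud - 1


ASSUMED_CLK_HZ = 12_500_000  # assumption, chosen because it reproduces the default
divisor = baud_divisor(ASSUMED_CLK_HZ, 9600)
actual_baud = ASSUMED_CLK_HZ / (divisor + 1)
print(divisor, round(actual_baud, 1))
```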

The baud divisor is then set in C-code by using the function:

Uetrv32_Uart_Init(uint32_t baud);

To print a string over UART, use:

Uetrv32_Uart_Print(string);

To print 32-bit integers use:

UART_Send_32bit_number(uint32_t Number);

To print a matrix over UART, use:

void display_result_matrix(int rows, int cols, int32_t matrix[rows][cols]);

To receive input from UART use:

uint8_t Uetrv32_Uart_Rx(void);

All of the above is to be used in C-code.

Using Python_uart to transmit to UART.

Warning: We have implemented this on Linux. Windows permissions for ports can be tricky.

The Python script python_uart transmits the matrices to the UART. First install pyserial:

pip install pyserial

Next, set the port and the baud rate you selected in the Python file. Then set the rows (M) and columns (K) of matrix A and the rows (K) and columns (N) of matrix B. The script generates matrices of the given dimensions with 8-bit signed integer elements and transmits them to the UART port. It also calculates the resultant matrix and prints it to the terminal.
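The generate-and-check part of that flow can be sketched without any hardware. The code below is not the repository's python_uart.py: all function names are hypothetical, and the byte layout (row by row, A followed by B, one two's-complement byte per element) is an assumption about what the firmware expects.

```python
import random


def random_int8_matrix(rows, cols, rng):
    """Matrix of random 8-bit signed integers in [-128, 127]."""
    return [[rng.randint(-128, 127) for _ in range(cols)] for _ in range(rows)]


def matmul_reference(a, b):
    """Plain software product, used to check the result coming back."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def to_wire_bytes(matrix):
    """Flatten row by row into two's-complement bytes (layout is an assumption)."""
    return bytes(v & 0xFF for row in matrix for v in row)


M, K, N = 4, 3, 5
rng = random.Random(0)          # fixed seed so a run can be repeated
A = random_int8_matrix(M, K, rng)
B = random_int8_matrix(K, N, rng)
C = matmul_reference(A, B)
payload = to_wire_bytes(A) + to_wire_bytes(B)
print(len(payload), "bytes to send; C is", len(C), "x", len(C[0]))
```

With pyserial, the payload could then be written with `serial.Serial(port, baudrate).write(payload)`, where the port name and baud rate are the values configured earlier.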

To run the Python script, use:

sudo -E python3 python_uart.py

This allows the Python script to access the UART port and still use the non-root environment for Python.

Displaying to GTKterm.

Open gtkterm as sudo:

sudo gtkterm

Set the port and baud rate in Configurations. Below is the resultant output in GTKterm.
