
Add janus_cube_pyc: 16×16 systolic array matrix multiplication accelerator #2

Merged
zhoubot merged 11 commits into LinxISA:main from fengzhazha:main
Feb 14, 2026


fengzhazha commented Feb 6, 2026

Summary

This PR adds a new hardware module, janus_cube_pyc: a 16×16 systolic array matrix multiplication accelerator implemented in pyCircuit.

Features

  • 16×16 Weight-Stationary Systolic Array: 256 Processing Elements (PEs) arranged in a 16×16 grid
  • 16-bit Inputs, 32-bit Accumulators: Supports 16-bit weight and activation values with 32-bit accumulation
  • Memory-Mapped Interface: Integrates with the CPU via memory-mapped registers at a configurable base address
  • FSM Control: 5-state FSM (IDLE → LOAD_WEIGHTS → COMPUTE → DRAIN → DONE)
  • C++ Testbench: Includes a testbench with identity and 2×2 matrix tests
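
The FSM sequence listed above can be sketched as a small cycle-level Python model. This is only an illustration, not the pyCircuit implementation: the per-state cycle counts are taken from the "1 load + 16 compute + 15 drain" breakdown in the first commit message, while the `start` pulse, the `step` and `run_once` helpers, and the behaviour of holding DONE until reset are assumptions made for the example.

```python
from enum import Enum, auto


class CubeState(Enum):
    IDLE = auto()
    LOAD_WEIGHTS = auto()
    COMPUTE = auto()
    DRAIN = auto()
    DONE = auto()


# Cycles spent in each timed state, per the commit-message breakdown.
DWELL = {
    CubeState.LOAD_WEIGHTS: 1,
    CubeState.COMPUTE: 16,
    CubeState.DRAIN: 15,
}

# Successor of each timed state.
NEXT = {
    CubeState.LOAD_WEIGHTS: CubeState.COMPUTE,
    CubeState.COMPUTE: CubeState.DRAIN,
    CubeState.DRAIN: CubeState.DONE,
}


def step(state, count, start):
    """Advance one clock cycle and return (next_state, next_count)."""
    if state is CubeState.IDLE:
        # Assumption: a start pulse launches the operation.
        return (CubeState.LOAD_WEIGHTS, 0) if start else (state, 0)
    if state is CubeState.DONE:
        # Assumption: DONE is held until an external reset.
        return state, 0
    if count + 1 == DWELL[state]:
        return NEXT[state], 0
    return state, count + 1


def run_once():
    """Pulse start for one cycle; return (visited states, busy cycles)."""
    state, count = step(CubeState.IDLE, 0, start=True)
    visited, busy = [CubeState.IDLE], 0
    while state is not CubeState.DONE:
        if visited[-1] is not state:
            visited.append(state)
        busy += 1
        state, count = step(state, count, start=False)
    visited.append(state)
    return visited, busy
```

Calling `run_once()` walks through all five states in order and counts the cycles spent between leaving IDLE and reaching DONE, which is a quick way to cross-check the stated latency against the per-state counts.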

Files Added/Modified

Source Files (janus/pyc/janus/cube/)

  • cube.py - Main module with FSM, PE array, and memory interface (357 lines)
  • cube_types.py - Dataclasses (CubeState, PERegs, FsmResult, MmioWriteResult)
  • cube_consts.py - Constants (states, addresses, array size)
  • util.py - Utility functions (Consts dataclass)
  • README.md - Documentation

Generated Files (janus/generated/janus_cube_pyc/)

  • janus_cube_pyc.v - Verilog RTL (~949 KB)
  • janus_cube_pyc_gen.hpp - C++ header (~1.1 MB)

Test Files

  • janus/tb/tb_janus_cube_pyc.cpp - C++ testbench
  • janus/tools/run_janus_cube_pyc_cpp.sh - Test runner script

Architecture

              Activations (16 × 16-bit)
                ↓    ↓    ↓   ...   ↓
             ┌────┬────┬────┬─────┬────┐
Weights    → │ PE │ PE │ PE │ ... │ PE │ → Results
(256×16-bit) ├────┼────┼────┼─────┼────┤
             │ PE │ PE │ PE │ ... │ PE │
             ├────┼────┼────┼─────┼────┤
             │ :  │ :  │ :  │     │ :  │
             ├────┼────┼────┼─────┼────┤
             │ PE │ PE │ PE │ ... │ PE │
             └────┴────┴────┴─────┴────┘
                 (16×16 = 256 PEs)
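
A functional reference model can help when checking testbench results against the array sketched above. The Python below is a sketch under stated assumptions, not the accelerator's implementation: it is not cycle-accurate, it treats operands as signed two's-complement values (the PR does not state signedness), and it models the intended multiply-accumulate rather than the addition placeholder listed under Known Limitations. The names `to_signed` and `matmul_reference` are hypothetical.

```python
N = 16  # array dimension from the PR description


def to_signed(value, bits):
    """Wrap an integer into the two's-complement range of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def matmul_reference(a, w, n=N):
    """Return C = A x W with 16-bit operands and 32-bit wrapping accumulators."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                # One multiply-accumulate step, as a single PE would perform.
                prod = to_signed(a[i][k], 16) * to_signed(w[k][j], 16)
                # Assumption: the accumulator wraps at 32 bits on overflow.
                acc = to_signed(acc + prod, 32)
            c[i][j] = acc
    return c
```

For example, `matmul_reference([[1, 2], [3, 4]], [[5, 6], [7, 8]], n=2)` returns `[[19, 22], [43, 50]]`, and multiplying any matrix by the identity returns that matrix unchanged, mirroring the two kinds of tests the C++ testbench is described as running.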

Implementation Highlights

The code follows pyCircuit JIT compilation patterns:

  1. Functions without @jit_inline execute at Python time (before JIT)
      • Used for register creation loops (_make_pe_regs, _make_weight_regs, etc.)
  2. Functions with @jit_inline compile to hardware
      • Used for combinational logic (_build_pe, _build_fsm)
  3. Dataclasses for return values avoid tuple unpacking (not supported in JIT)
      • FsmResult(load_weight, compute, done)
      • MmioWriteResult(start, reset_cube)
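
The dataclass-return pattern from item 3 can be illustrated in plain Python. This is a minimal sketch, not code from cube.py: the pyCircuit decorator is omitted because its API is not shown in this PR, the fields are modelled as `bool` where the real module would carry hardware signals, and the state encodings and the `decode_fsm` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FsmResult:
    """Named bundle of FSM outputs, used instead of a returned tuple."""
    load_weight: bool
    compute: bool
    done: bool


# Hypothetical state encodings, for illustration only.
ST_IDLE = 0
ST_LOAD_WEIGHTS = 1
ST_COMPUTE = 2
ST_DRAIN = 3
ST_DONE = 4


def decode_fsm(state: int) -> FsmResult:
    """Derive the control flags from the current state."""
    load_weight = state == ST_LOAD_WEIGHTS
    compute = state == ST_COMPUTE
    done = state == ST_DONE
    # Single return at the top level of the function body,
    # not nested inside a `with` block.
    return FsmResult(load_weight=load_weight, compute=compute, done=done)


result = decode_fsm(ST_COMPUTE)
# Fields are read by name; no `a, b, c = decode_fsm(...)` unpacking.
compute_active = result.compute
```

Reading `result.compute` by name keeps call sites free of tuple unpacking, which is the construct the PR reports as unsupported in JIT-compiled functions.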

Testing

# Generate outputs
bash janus/update_generated.sh

# Run C++ testbench
bash janus/tools/run_janus_cube_pyc_cpp.sh

# With tracing
PYC_TRACE=1 PYC_VCD=1 bash janus/tools/run_janus_cube_pyc_cpp.sh

Known Limitations

  • Multiplication uses addition as a placeholder (pyCircuit limitation)
  • Fixed 16×16 array size
  • Simplified memory interface (no AXI/AHB)

Shulin Feng and others added 6 commits February 9, 2026 21:52
Implemented a weight-stationary systolic array for matrix multiplication:
- 16×16 PE array (256 processing elements)
- 16-bit integer inputs (weights and activations)
- 32-bit accumulator for overflow prevention
- Memory-mapped interface for CPU integration
- Complete documentation in README.md

Features:
- Weight-stationary dataflow
- 32-cycle latency (1 load + 16 compute + 15 drain)
- 512-byte input buffers (Matrix A and W)
- 1024-byte output buffer (Matrix C)
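
The figures above follow from the array dimensions. The short Python check below is a sketch that assumes the compute phase takes one cycle per row (16), the drain phase takes one cycle fewer (15), and the 512-byte figure applies to each input matrix separately; these interpretations are not stated explicitly in the commit message.

```python
N = 16           # array dimension
INPUT_BITS = 16  # weight and activation width
ACC_BITS = 32    # accumulator width

# Latency: 1 load + N compute + (N - 1) drain (assumed relation to N).
load_cycles = 1
compute_cycles = N
drain_cycles = N - 1
total_latency = load_cycles + compute_cycles + drain_cycles

# Buffer sizes in bytes for one N x N matrix.
input_buffer_bytes = N * N * INPUT_BITS // 8
output_buffer_bytes = N * N * ACC_BITS // 8

print(total_latency, input_buffer_bytes, output_buffer_bytes)  # 32 512 1024
```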

Files:
- cube.py: Top-level module with FSM and integration
- cube_pe.py: Processing element (MAC operation)
- cube_array.py: 16×16 systolic array instantiation
- cube_buffer.py: Input/output buffer management
- cube_control.py: Control FSM (reference, not used)
- cube_types.py: Register group dataclasses
- cube_consts.py: Constants and memory map
- README.md: Complete documentation

Successfully tested MLIR emission (653 KB output).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Relocated cube module to be alongside janus/bcc for better organization
- Updated all import paths from examples.linx_cpu_pyc.cube.* to janus.cube.*
- Updated README.md with new paths and references

The cube module now lives at janus/pyc/janus/cube/ alongside the
Janus BCC CPU implementation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Set the module name to "cube" for consistent naming in generated
MLIR/Verilog/C++ output.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Generate janus_cube_pyc.v and janus_cube_pyc_gen.hpp
- Update cube.py with correct __pycircuit_name__ = "janus_cube_pyc"
- Add janus_cube_pyc to update_generated.sh

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Replace m.in_wire() with m.input()
- Replace m.const_wire() with m.const()
- Update cube.py, cube_array.py, cube_control.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Reduce cube.py from 2779 to 357 lines using for loops
- Add util.py with Consts dataclass for common constants
- Add FsmResult and MmioWriteResult dataclasses to avoid tuple unpacking
- Remove redundant files: cube_array.py, cube_buffer.py, cube_control.py, cube_pe.py
- Add C++ testbench (tb_janus_cube_pyc.cpp) with identity and 2x2 tests
- Add run script (run_janus_cube_pyc_cpp.sh)
- Update README with new file structure and JIT patterns

Key JIT patterns applied:
- Functions without @jit_inline execute at Python time
- @jit_inline functions compile to hardware
- Dataclasses for return values (JIT doesn't support tuple unpacking)
- return statements must be at top-level (not inside with blocks)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Shulin Feng and others added 5 commits February 10, 2026 20:15
- Add ARCHITECTURE.md with 15 detailed technical diagrams
- Add VISUAL_GUIDE.md with intuitive visual explanations
- Add IMPROVEMENT_PLAN.md for future development roadmap
- Add SystemVerilog testbench (tb_janus_cube_pyc.sv)
- Add run scripts for Icarus Verilog and Verilator
- Update README.md with documentation references

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Implement cube_v2.py with 4-stage pipelined architecture
- Add 64-entry L0A, L0B, ACC buffers with 64-bit MMIO interface
- Add 64-entry issue queue with out-of-order execution support
- Add MATMUL block instruction decoder
- Create CUBE_V2_SPEC.md with complete architecture documentation
- Generate PDF specification using reportlab
- Update README.md and ARCHITECTURE.md with v2 details
- Add Verilog testbench tb_cube_v2.v

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add PE module with @module decorator (cube_v2_pe.py)
- Add L0 Entry module with @module decorator (cube_v2_l0_entry.py)
- Add systolic array using m.instance() for 256 PEs (cube_v2_systolic_reuse.py)
- Add L0 buffer using m.instance() for 128 entries (cube_v2_l0_reuse.py)
- Add top-level module with module reuse (cube_v2_reuse.py)
- Update CUBE_V2_SPEC.md with optimization results
- Fix PrunePortsPass.cpp for LLVM 21 compatibility

Generated module structure:
- janus_cube_pyc (top)
  - L0Entry × 128 (64 L0A + 64 L0B)
  - CubePE × 256 (4 clusters × 64 PEs)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@zhoubot zhoubot merged commit b489698 into LinxISA:main Feb 14, 2026
0 of 2 checks passed