WARNING: this project is in an ALPHA stage and is not advertised yet. It might already be worth looking at, but the progress status statements are not reliable, and the APIs are not stable yet.
DOC + LIB + VIP + PERI
This project provides the following parts:
- Tightly Coupled Bus documentation,
- reference interconnect library,
- reference verification library (VIP),
- reference peripherals.
TCB (tightly coupled bus) is named after TCM (tightly coupled memory), since it generalizes the SRAM (synchronous static RAM) interface to support peripherals. The purpose of TCB is to fill a niche for a low complexity system bus without unnecessary limitations on throughput.
TCB is a good fit for small microcontrollers containing processors with a simple short in-order pipeline and combinational request signals (address, write data, ...).
TCB is not a good fit for interconnects with large access delays (external memory controllers, CDC in the interconnect, ...), or for interconnects where out-of-order accesses can provide a performance advantage.
NOTE: Except for the handshake and response delay (see below), all features are optional. To avoid competing non-standard solutions, the reference implementation attempts to document and implement as many features as possible. Still, the standard should be flexible enough that it is possible to interface with any custom implementation with minimal glue logic.
The SRAM interface is designed to allow a read or write access every clock cycle. The read data output (response) is available in the clock period after the address input (request) is sampled with the rising edge of the clock, i.e. a request-to-response delay of one clock period. There are no idle clock cycles when switching between read and write accesses.
Control signals use the standard VALID/READY handshake, which allows developers familiar with AMBA AXI to understand it immediately without reading any documentation.
Since the request-to-response delay is fixed, if the response can't be provided early enough, the request must be stalled using backpressure (READY).
While the SRAM interface request-to-response delay is usually one clock cycle,
this can be generalized to any integer delay DLY (values 0, 1, 2 would be common).
- DLY=0 is used for peripherals (explained below),
- DLY=1 is used for TCM SRAM, and therefore by the CPU (a matched delay maximizes throughput),
- DLY=2 would be used where the read data clock-to-output delay is high enough to require an additional pipeline stage.
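As an illustration of the request/response structure, here is a minimal sketch of a TCB-style SystemVerilog interface. The signal and parameter names (vld, rdy, wen, adr, wdt, rdt, err, ABW, DBW, DLY) are assumptions for this example, not the actual tcb_if definition from this repository.

```systemverilog
// Minimal sketch of a TCB-style interface (hypothetical signal names,
// not the actual tcb_if definition from this repository).
interface tcb_sketch_if #(
  parameter int unsigned ABW = 32,  // address bus width
  parameter int unsigned DBW = 32,  // data bus width
  parameter int unsigned DLY = 1    // request-to-response delay in clock periods
)(
  input logic clk,  // clock
  input logic rst   // reset
);
  // request channel (VALID/READY handshake)
  logic           vld;  // handshake valid
  logic           rdy;  // handshake ready (backpressure from the subordinate)
  logic           wen;  // write enable (1: write, 0: read)
  logic [ABW-1:0] adr;  // address
  logic [DBW-1:0] wdt;  // write data
  // response (valid DLY clock periods after the request handshake)
  logic [DBW-1:0] rdt;  // read data
  logic           err;  // error response

  // manager drives the request, subordinate drives ready and the response
  modport man (output vld, wen, adr, wdt, input  rdy, rdt, err);
  modport sub (input  vld, wen, adr, wdt, output rdy, rdt, err);
endinterface : tcb_sketch_if
```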
Typical ASIC/FPGA SRAM has a low setup time for request inputs (control, address, write data); most of the combinational delay is seen as the clock-to-data delay on the response outputs (read data).
While not strictly part of the protocol, it is recommended to write the TCB interconnect with similar timing characteristics. This allows for a good timing balance without the need for register retiming during synthesis, thus avoiding undesired effects of retiming on signal toggling and consequent power consumption impact.
This can be achieved by writing peripherals with DLY=0,
where write accesses are synchronous, but reads are combinational.
A DLY=1 TCB interface is then achieved by placing a pipeline stage at the request input side.
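A minimal sketch of such a DLY=0 peripheral register (hypothetical module and signal names, following the interface sketch above) could look like this:

```systemverilog
// Sketch of a DLY=0 peripheral register (hypothetical names, for illustration):
// writes are synchronous, reads are combinational, so read data is available
// in the same clock period as the request.
module periph_reg_sketch #(
  parameter int unsigned DBW = 32  // data bus width
)(
  input  logic           clk,
  input  logic           rst,
  // request
  input  logic           vld,  // request valid
  input  logic           wen,  // write enable
  input  logic [DBW-1:0] wdt,  // write data
  // response (DLY=0: combinational)
  output logic [DBW-1:0] rdt   // read data
);
  logic [DBW-1:0] storage;

  // synchronous write
  always_ff @(posedge clk)
  if (rst)            storage <= '0;
  else if (vld & wen) storage <= wdt;

  // combinational read
  assign rdt = storage;
endmodule
```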
IMAGE block diagram
TODO: manager timing (the above describes the subordinate side). There is a combinational path from the response to the request address/data; wondering whether some paths can be declared false, e.g. a peripheral response that would not be used in the calculation of a new address. This can't be checked with an assertion, since it depends on the running program.
Instead of the common approach of using byte enable signals to position data smaller than the bus width, TCB can use the logarithmic size of the data (the same encoding as in the RISC-V ISA).
Byte enable signals and the related data byte multiplexer (mapping bytes between their position on the memory bus and their position in CPU registers) are necessary for interfacing between a CPU and SRAM. But the data byte multiplexer is not necessary for accessing peripherals, since they are usually only accessed with the full bus width. Skipping the data byte multiplexer on the path between the CPU and peripherals can improve timing and dynamic power consumption.
When using this approach, the CPU LSU uses the logarithmic size mode. The data byte multiplexer becomes a part of the system interconnect.
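For illustration, assuming the RISC-V encoding of the logarithmic size (0: byte, 1: half-word, 2: word) and hypothetical signal names, the interconnect could derive byte enables for a 32-bit bus roughly as follows:

```systemverilog
// Sketch: deriving 32-bit bus byte enables from the logarithmic transfer size
// and the low address bits (hypothetical names, for illustration only).
module log_size_to_byte_enable_sketch (
  input  logic [1:0] siz,  // logarithmic size (0: byte, 1: half-word, 2: word)
  input  logic [1:0] adr,  // low address bits
  output logic [3:0] ben   // byte enables
);
  always_comb begin
    case (siz)
      2'd0:    ben = 4'b0001 <<  adr;             // 1-byte access
      2'd1:    ben = 4'b0011 << {adr[1], 1'b0};   // 2-byte access
      default: ben = 4'b1111;                     // 4-byte (full width) access
    endcase
  end
endmodule
```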
IMAGE block diagram.
TCB can perform RISC-V AMO instruction accesses to peripherals with DLY=0
in a single cycle within the interconnect.
This can be used to implement thread-safe access to peripherals.
For example, in a GPIO peripheral individual data output and output enable register bits can be set/cleared without necessitating a read-modify-write sequence (not thread safe) or custom approaches (for example Xilinx SoC/MPSoC).
The same can't be done with SRAM. For this case TCB provides an arbiter locking signal, which prevents the arbiter from switching away from the current manager until a non-locking access is performed.
The same locking approach can be used to implement atomic accesses for data wider than the bus data width, for example RISC-V load/store pair instructions (accessing 64-bit counters on a 32-bit data bus, ...). Another example would be burst transactions.
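A sketch of how an arbiter could honor such a locking signal (hypothetical names and simplified handshake handling; the reference tcb_lib_arbiter may differ):

```systemverilog
// Sketch: fixed priority arbiter that holds its grant while the granted
// manager performs locking accesses (hypothetical names, simplified).
module arbiter_lock_sketch #(
  parameter int unsigned MN = 2  // number of managers
)(
  input  logic          clk,
  input  logic          rst,
  input  logic [MN-1:0] req,  // request valid, one bit per manager
  input  logic [MN-1:0] lck,  // lock: keep the grant for the next access
  output logic [MN-1:0] grt   // one-hot grant
);
  logic [MN-1:0] grt_pri;  // plain priority grant
  logic [MN-1:0] grt_hld;  // grant of the previous access
  logic          lck_hld;  // the previous access was a locking access

  // fixed priority arbitration: the lowest index wins
  always_comb begin
    grt_pri = '0;
    for (int i = MN-1; i >= 0; i--) begin
      if (req[i]) begin
        grt_pri    = '0;
        grt_pri[i] = 1'b1;
      end
    end
  end

  // while the previous access was locking, the grant stays with the same manager
  assign grt = lck_hld ? grt_hld : grt_pri;

  // remember the grant and lock of the previous access
  // (a complete implementation would update only on a completed handshake)
  always_ff @(posedge clk)
  if (rst) begin
    grt_hld <= '0;
    lck_hld <= 1'b0;
  end else if (|grt) begin
    grt_hld <= grt;
    lck_hld <= |(grt & lck);
  end
endmodule
```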
TCB utilizes several approaches to reduce dynamic power consumption, the main ones being:
- Placing pipeline stage registers at the peripheral module inputs prevents the propagation of address and write data signal toggling through the peripheral control logic when it is not accessed, thus minimizing dynamic power consumption in address decoders, read data multiplexers, ...
- A request signal explicitly indicating a repeated access to the same address can be used to skip SRAM reads, since common ASIC/FPGA SRAM holds the last read data until the next access. This can be used during a stalled instruction fetch while avoiding a register holding the last fetched instruction (see the sketch after this list).
- Splitting the read and write addresses prevents toggling in the read data multiplexer while performing a write, and vice versa.
- Only active data bytes should toggle during a read/write access. Read data multiplexers have separate select signals for each byte; select signals for inactive bytes do not change from the previous access. This is not ideal when the last access was to a volatile location like a counter; further research is needed to handle this use case. An SRAM implementation where the read logic can be enabled for individual bytes would be desirable.
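As a sketch of the repeated-access read skip mentioned above (a behavioral model with hypothetical names; a real SRAM macro would hold its output when the read enable is gated, which is where the power saving comes from):

```systemverilog
// Sketch: skipping an SRAM read when the request indicates a repeated access
// to the same address (hypothetical names; behavioral model of an SRAM whose
// output register holds the last read data).
module sram_read_skip_sketch #(
  parameter int unsigned ABW = 10,  // address width
  parameter int unsigned DBW = 32   // data width
)(
  input  logic           clk,
  input  logic           vld,  // request valid (handshake assumed always ready)
  input  logic           rpt,  // repeated access to the same address as before
  input  logic           wen,  // write enable
  input  logic [ABW-1:0] adr,  // address
  input  logic [DBW-1:0] wdt,  // write data
  output logic [DBW-1:0] rdt   // read data (registered output, DLY=1)
);
  logic [DBW-1:0] mem [0:2**ABW-1];

  // the read enable is suppressed for a repeated read; the output register
  // still holds the data from the previous read of the same address
  logic ren;
  assign ren = vld & ~wen & ~rpt;

  always_ff @(posedge clk) begin
    if (vld & wen) mem[adr] <= wdt;       // write access
    if (ren)       rdt      <= mem[adr];  // read access, skipped when repeated
  end
endmodule
```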
TODO: Write address decoder
The following features are not strictly part of the standard, but are provided by the reference library implementation:
- Support for little/big endian.
- Support for misaligned access (including single cycle).
- Support for zero overhead sequential misaligned accesses (RISC-V instruction fetch with C extension) (research in progress).
- Support for data bursts (research in progress).
While having similar complexity to APB, TCB provides double the throughput and additional features.
Compared to AHB and Wishbone, it has a more straightforward handshake and no wasted cycles when switching the access direction.
Compared to OBI1 (RISC-V Ibex CPU, and other PULP Platform designs), the response delay is fixed, thus making the response path logic simpler. If a peripheral is unable to respond immediately, OBI1 would delay the response channel ready signal, while TCB would apply backpressure to the request handshake. Since neither provides out-of-order support, the throughput should be the same (I state this without providing a proper proof).
Compared to AXI-Lite, a single channel handshake is just much simpler. AXI4-Lite also does not provide a mechanism for requesting a read narrower than the data bus width (AXI5-Lite provides this feature). While the multiple channels in AXI-Lite allow for dynamic power reduction by using only the necessary channels, a similar result can be achieved with TCB. The main advantage of AXI-Lite would be throughput in CDC scenarios.
None of the standard busses known to me provides a logarithmic size mode.
The RTL and VIP library are written in SystemVerilog, extensively utilizing interfaces and parameterization. Parameters are validated at compile time, checking for compatibility between modules and interfaces (bus width, protocol modes and features, address space size, ...). VHDL might be an option in the future.
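As an example of the general idea (hypothetical parameter names, not the library's actual checks), an elaboration-time parameter check could look like this:

```systemverilog
// Sketch: elaboration-time parameter validation (hypothetical parameter
// names; the library's actual checks may differ).
module tcb_param_check_sketch #(
  parameter int unsigned MAN_DBW = 32,  // manager side data bus width
  parameter int unsigned SUB_DBW = 32   // subordinate side data bus width
) ();
  // elaboration system task: fails at elaboration time on a mismatch
  if (MAN_DBW != SUB_DBW) begin : param_check
    $error("TCB parameter mismatch: MAN_DBW=%0d != SUB_DBW=%0d", MAN_DBW, SUB_DBW);
  end
endmodule
```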
Many features have been validated on two RISC-V processors and on GPIO and UART peripherals, but many features have not been fully validated yet.
Overall the documentation and reference implementation are a work in progress, and getting feedback would speed up this process.
While some aspects of the protocol have been validated, some require further research:
- Handling errors. Should response error codes be standardized? Should some errors be detected during the request phase (misalignment)?
- Burst semantics.
- Handling of endianness in bursts, especially when the burst traverses one or more changes in data bus width.
- A proper definition for the read value hold feature.
- Minimizing dynamic power consumption in the write/read logic: minimizing address changes to reduce decoder and multiplexer power, with the additional requirement of preventing volatile data (counters, GPIO inputs, ...) from propagating while the bus is idle.
- A systematic approach to parameter validation.
- Assertions for checking features like read hold.
- Validation of features for repeated address access, incremental address access, partially overlapping incremental access (RISC-V instruction fetch with C extension), ...
- ...
| status | description |
|---|---|
| planned | The component should be a part of the project, but there is no implementation yet. |
| VIP | (Work In Progress) The implementation is partially done, or it might require |
The interface and package are shared by the rest of the RTL and verification code.
| module | status | description |
|---|---|---|
| tcb_if  | done | SystemVerilog interface. |
| tcb_pkg | done | SystemVerilog package. |
| module | status | description |
|---|---|---|
| tcb_vip_pkg    | VIP | Package containing manager/monitor/subordinate code. |
| tcb_vip_memory | VIP | Multi-port memory model. |
| tcb_vip_tb     | VIP | Testbench for core VIP functionality. |
| module | status | description |
|---|---|---|
| tcb_lib_passthrough   | done    | Trivial passthrough. |
| tcb_lib_register      | planned | Register for request/response paths. |
| tcb_lib_connector     | planned | Interface connector with automatic handling of parameter differences. |
| tcb_lib_arbiter       | done    | Priority arbiter. |
| tcb_lib_multiplexer   | done    | Multiplexer of multiple managers. |
| tcb_lib_decoder       | done    | Address decoder. |
| tcb_lib_demultiplexer | done    | Demultiplexer of multiple subordinates. |
| tcb_lib_error         | done    | Error response leaf subordinate. |
| module | status | description |
|---|---|---|
| tcb_gpio | WIP | GPIO controller. |
| tcb_uart | WIP | UART controller. |
- QMEM specification and IP
- Open Core Protocol on Wikipedia
- [CoreConnect from IBM (archived)](https://web.archive.org/web/20090129183058/http://www-01.ibm.com/chips/techlib/techlib.nsf/products/CoreConnect_Bus_Architecture)
- Ibex RISC-V core load/store bus interface
- NeoRV32 RISC-V core bus interface
- AMBA on Wikipedia
- AMBA on ARM
- Wishbone B4
- Pulp Platform Snitch Reqrsp Interface
- OpenHW Group OBI (OpenBus Interface)
- TileLink 1.8.0, 1.8.1
- OpenTitan TileLink IP
- [PULP Platform OBI repository](https://github.com/pulp-platform/obi)
- [OpenHW Group OBI v1.6.0 specification (PDF)](https://github.com/openhwgroup/obi/blob/072d9173c1f2d79471d6f2a10eae59ee387d4c6f/OBI-v1.6.0.pdf)
- Implementation of a PMP unit in a RISC-V commercial core (PMP/PMA unit)