ventus-pytorch is a Ventus-enabled PyTorch fork for running small-model inference on the Ventus software stack. This repository includes the PyTorch-side integration, Ventus kernel assets, and a minimal QEMU runtime image for reproducible execution.
Do not commit the runtime image into git. Publish it through GitHub Releases instead.
Recommended release assets:
- `ventus-runtime/ventus-runtime.qcow2`
- `ventus-runtime/Makefile`
Users should download the runtime image from GitHub Releases and place it at `ventus-runtime/ventus-runtime.qcow2` so that `make qemu` can find it directly.
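A quick sanity check that the image sits where `make qemu` expects it (a sketch; the release download itself must be done manually from the Releases page):

```shell
# Check the expected location before running make qemu.
if [ -f ventus-runtime/ventus-runtime.qcow2 ]; then
    echo "runtime image found"
else
    echo "runtime image missing: download it from GitHub Releases" >&2
fi
```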
The qcow2 image is prepared as a minimal runtime VM. On boot it:
- logs in automatically on the serial console as `root`
- enters `/opt/ventus`
- sources `./env.sh`
For stable execution, allocate at least:
- 10 CPU cores
- 8G memory
rtlsim was built from a Verilator configuration using 8 threads, so undersized CPU allocation is not recommended.
The provided Makefile defaults are higher than the minimum.
From `ventus-runtime/`:

```shell
make qemu
```

Or specify resources explicitly:

```shell
make qemu VM_CPUS=10 VM_MEMORY=8G
```

After boot, the shell is already in `/opt/ventus` with `env.sh` loaded.
The runtime image keeps the execution environment small:
- `/opt/ventus/install/`
- `/opt/ventus/ventus-pytorch/torch/`
- `/opt/ventus/models/gpt2/run_generate.py`
- `/opt/ventus/models/pythia-410m-deduped/run_generate.py`
- `/opt/ventus/models/qwen2.5-0.5b-instruct/run_generate.py`
Model weights are intentionally not bundled into the qcow2 image. Before
running a model, download its Hugging Face checkpoint files into the matching
directory under /opt/ventus/models/.
Examples:
- `/opt/ventus/models/gpt2/`
- `/opt/ventus/models/pythia-410m-deduped/`
- `/opt/ventus/models/qwen2.5-0.5b-instruct/`

Each directory should contain the model files plus the packaged `run_generate.py`.
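Before launching, it can help to confirm a model directory actually contains weights alongside the packaged script. A minimal sketch (the exact required file names depend on the checkpoint; `config.json` here is an assumption, only `run_generate.py` is guaranteed by the layout above):

```python
from pathlib import Path

def missing_model_files(model_dir, required=("run_generate.py", "config.json")):
    """Return the names from `required` that are absent from model_dir."""
    d = Path(model_dir)
    return [name for name in required if not (d / name).exists()]

# Example usage against the layout described above:
# missing = missing_model_files("/opt/ventus/models/gpt2")
# if missing:
#     raise SystemExit(f"download the checkpoint first; missing: {missing}")
```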
Native Ventus inference has been validated in three precisions:
- GPT-2: TF32
- Pythia-410m-deduped: FP16
- Qwen2.5-0.5B-Instruct: BF16
Current reference results:
- GPT-2 remains the minimal text-generation smoke test.
- Qwen2.5-0.5B-Instruct on `rtlsim` generated one token from "Hello, my name is" and produced "Hello, my name is Alex" in about 7:00:56.18.
- Pythia-410m-deduped on `rtlsim` generated one token from "Hello, my name is" and produced "Hello, my name is John" in about 4:54:40.50.
Pythia-410m-deduped currently runs in FP16, but stability is not yet sufficient for KV-cache-based decoding; keep `use_cache=False` for this model. With the KV cache enabled, small numerical errors can accumulate and propagate into NaN values.
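The failure mode can be illustrated with a toy float16 computation (an illustrative sketch, not the actual Ventus kernel path): once an intermediate value overflows the float16 range it becomes inf, and inf minus inf yields NaN, which then propagates through every later step.

```python
import numpy as np

# float16 tops out at ~65504, so a modestly large intermediate overflows.
x = np.float16(60000.0)
y = x * np.float16(2.0)          # overflows to inf
print(np.isinf(y))               # True

# inf - inf is NaN, and NaN propagates through all later arithmetic.
z = y - y
print(np.isnan(z))               # True
print(np.isnan(z + np.float16(1.0)))  # True
```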
Inside the VM:

```shell
export VENTUS_BACKEND=spike
python models/gpt2/run_generate.py
python models/pythia-410m-deduped/run_generate.py
python models/qwen2.5-0.5b-instruct/run_generate.py
```

For rtlsim, switch the backend before launching:

```shell
export VENTUS_BACKEND=rtlsim
python models/qwen2.5-0.5b-instruct/run_generate.py
```
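The backend switch above is just an environment variable read. A minimal sketch of the convention (the actual lookup inside ventus-pytorch may differ; the function name here is hypothetical):

```python
import os

def ventus_backend(default="spike"):
    """Return the requested Ventus backend, validating against known values.

    Sketch of the VENTUS_BACKEND convention described above.
    """
    backend = os.environ.get("VENTUS_BACKEND", default)
    if backend not in ("spike", "rtlsim"):
        raise ValueError(f"unknown VENTUS_BACKEND: {backend!r}")
    return backend

os.environ["VENTUS_BACKEND"] = "rtlsim"
print(ventus_backend())  # rtlsim
```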