One-kernel LoRA inference for linear layers: compute y = x·Wᵀ + (α/r)·x·Bᵀ·Aᵀ in a single pass to cut memory traffic and kernel launches. Drop-in `nn.Linear` replacement with a fallback path, an optional Triton kernel, and a Hugging Face PEFT patch script.
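For reference, here is the unfused math the kernel replaces, as a minimal PyTorch sketch (shapes follow the formula above: `W` is `(out, in)`, `B` is `(r, in)`, `A` is `(out, r)`; the function name is illustrative):

```python
import torch

def lora_linear_reference(x, W, A, B, bias=None, alpha=1.0, r=8):
    # x: (batch, in_features)
    base = x @ W.T                   # x·Wᵀ                     -> (batch, out)
    delta = (x @ B.T) @ A.T          # x·Bᵀ·Aᵀ, materializes a (batch, r) temp
    y = base + (alpha / r) * delta
    if bias is not None:
        y = y + bias
    return y
```

The fused kernel is meant to produce the same result without the separate GEMMs or the `(batch, r)` temporary.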
Status: MVP. CPU/naive path ready; CUDA/Triton kernel stubs included. PRs welcome.
- Fewer launches: avoid separate GEMMs for x·Wᵀ and x·Bᵀ·Aᵀ.
- Less memory traffic: no intermediate (x·Bᵀ) allocation.
- Adapter-heavy setups: bigger wins when swapping adapters at runtime (no pre-merge needed).
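For context, a sketch of the pre-merge alternative that the fused path makes unnecessary (names and shapes are illustrative, matching the formula above):

```python
import torch

def merge_adapter(W, A, B, alpha, r):
    # Folding the adapter into the base weight gives a single GEMM per call,
    # but the merge must be redone every time the active adapter changes.
    return W + (alpha / r) * (A @ B)   # (out, r) @ (r, in) -> (out, in)

# With merging, swapping adapters means rebuilding the full (out, in) weight;
# with the fused kernel, only the small A/B tensors are swapped.
```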
- `FusedLoRALinear(in_features, out_features, r, alpha=1.0, bias=True, impl="auto")`: registers multiple adapters and switches them at runtime (usage sketch after this list).
- CUDA/Triton stubs to implement a fused kernel; CPU/PyTorch fallback works now.
- Bench + correctness tests.
- Example script to patch a Hugging Face model after applying PEFT LoRA.
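A usage sketch of the drop-in module. The constructor signature comes from the list above; the import path and the adapter-switching method names are assumptions, not confirmed API:

```python
import torch
from fused_lora_linear import FusedLoRALinear   # import path assumed

# Constructor signature from the feature list; impl="auto" is expected to pick
# the Triton/CUDA kernel when available and fall back to plain PyTorch otherwise.
layer = FusedLoRALinear(in_features=4096, out_features=4096,
                        r=16, alpha=32.0, bias=True, impl="auto")

x = torch.randn(8, 4096)
y = layer(x)            # (8, 4096), same shape contract as nn.Linear

# Hypothetical multi-adapter API (method names are guesses, not from the README):
# layer.add_adapter("my_adapter", A=A_weights, B=B_weights)
# layer.set_adapter("my_adapter")
```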
Requires Python 3.9+, PyTorch 2.x, and a CUDA toolkit (only needed for the GPU path).
```bash
git clone https://github.com/knk38/fused-lora-linear
cd fused-lora-linear
pip install -e .

# (Optional) build CUDA extension
python setup.py develop
```
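After installing, a quick smoke test of the CPU/PyTorch fallback (the module name and import path are assumptions about the package layout):

```python
import torch
from fused_lora_linear import FusedLoRALinear   # import path assumed

layer = FusedLoRALinear(64, 32, r=4, alpha=8.0, bias=True, impl="auto")
out = layer(torch.randn(2, 64))
print(out.shape)        # expected: torch.Size([2, 32])
```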