# go-bitnet-cli

Go bindings for BitNet 1-bit LLM inference via cgo, with a native LangChainGo provider.
Run 1-bit quantized LLMs (BitNet b1.58) on CPU with no GPU required. Models use ternary weights ({-1, 0, 1}), enabling fast inference on commodity hardware.
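To see why ternary weights make CPU inference cheap: a dot product against weights in {-1, 0, 1} needs no multiplications at all, since each weight either adds, subtracts, or skips an activation. The scalar sketch below illustrates only that idea; it is not the packed, vectorized i2_s kernel that bitnet.cpp actually uses.

```go
package main

import "fmt"

// ternaryDot computes a dot product where every weight is -1, 0, or 1,
// so multiplication reduces to add/subtract/skip. bitnet.cpp's i2_s
// kernels pack such weights into 2 bits each and vectorize the same
// computation; this is just the scalar illustration.
func ternaryDot(weights []int8, activations []float32) float32 {
    var sum float32
    for i, w := range weights {
        switch w {
        case 1:
            sum += activations[i]
        case -1:
            sum -= activations[i]
        }
    }
    return sum
}

func main() {
    fmt.Println(ternaryDot([]int8{1, 0, -1, 1}, []float32{0.5, 2.0, 1.5, 0.25}))
    // 0.5 - 1.5 + 0.25 = -0.75
}
```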
## Features

- Low-level API -- Load models, tokenize, decode, sample, and generate completions directly
- LangChainGo integration -- Drop-in `llms.Model` implementation for use in LangChain pipelines
- Streaming -- Token-by-token streaming with stop string detection and context cancellation
- Multi-turn chat -- ChatML template formatting with conversation history (see the template sketch after this list)
- CLI tool -- Interactive chat and benchmarking modes
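For context, ChatML (the template the multi-turn chat feature refers to) delimits each turn with `<|im_start|>`/`<|im_end|>` markers. A two-turn prompt formatted this way looks roughly like the following (the conversation content is invented for illustration):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is BitNet?<|im_end|>
<|im_start|>assistant
```

The trailing `<|im_start|>assistant` header is what cues the model to generate the next reply.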
## Supported models

Any GGUF model quantized with BitNet's i2_s format. Tested with:
| Model | Parameters | Size on disk |
|---|---|---|
| microsoft/bitnet-b1.58-2B-4T | 2B | ~500 MB |
| Falcon3-3B-Instruct | 3B | ~750 MB |
| Falcon3-7B-Instruct | 7B | ~1.5 GB |
| Falcon3-10B-Instruct | 10B | ~2.2 GB |
| Llama3-8B-1.58bit | 8B | ~1.7 GB |
## Requirements

- Go 1.22+
- CMake >= 3.22
- Clang >= 18
- Python >= 3.9 (for kernel codegen)
- `huggingface_hub` Python package (`pip install huggingface_hub`)
## Quick start

```sh
# Clone with submodules
git clone --recurse-submodules https://github.com/hodizoda/go-bitnet-cli.git
cd go-bitnet-cli
# Build static libraries (bitnet.cpp + llama.cpp)
make libs
# Download a model
make models # downloads bitnet-b1.58-2B-4T (~500 MB)
# Run the CLI chat
go run ./cmd/bitnet-cli --model models/bitnet-b1.58-2B-4T/ggml-model-i2_s.gguf
```

## Usage

### Low-level API

```go
package main

import (
    "fmt"

    bitnet "github.com/hodizoda/go-bitnet-cli"
)

func main() {
    bitnet.Init()
    defer bitnet.Free()

    model, _ := bitnet.LoadModel("models/bitnet-b1.58-2B-4T/ggml-model-i2_s.gguf")
    defer model.Close()

    ctx, _ := model.NewContext(
        bitnet.WithContextSize(2048),
        bitnet.WithTemperature(0.7),
    )
    defer ctx.Close()

    output, _ := ctx.Complete("The capital of Sweden is", 64)
    fmt.Println(output)
}
```

### LangChainGo provider

```go
package main

import (
    "context"
    "fmt"

    bitnet "github.com/hodizoda/go-bitnet-cli"
    "github.com/tmc/langchaingo/llms"
)

func main() {
    llm, _ := bitnet.New("models/bitnet-b1.58-2B-4T/ggml-model-i2_s.gguf",
        bitnet.WithMaxTokens(128),
        bitnet.WithLLMTemperature(0.7),
    )
    defer llm.Close()

    // Simple call
    resp, _ := llm.Call(context.Background(), "Explain quantum computing briefly.")
    fmt.Println(resp)

    // Streaming
    llm.Call(context.Background(), "Count to ten.",
        llms.WithStreamingFunc(func(ctx context.Context, chunk []byte) error {
            fmt.Print(string(chunk))
            return nil
        }),
    )
}
```
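Since the provider is a full `llms.Model`, multi-turn conversations go through LangChainGo's standard `GenerateContent` call with explicit message history. The `llms` helpers below are stock langchaingo; treat the flow as a sketch, with the conversation content invented for illustration:

```go
package main

import (
    "context"
    "fmt"
    "log"

    bitnet "github.com/hodizoda/go-bitnet-cli"
    "github.com/tmc/langchaingo/llms"
)

func main() {
    llm, err := bitnet.New("models/bitnet-b1.58-2B-4T/ggml-model-i2_s.gguf")
    if err != nil {
        log.Fatal(err)
    }
    defer llm.Close()

    // Prior turns are passed as explicit history; the provider renders
    // them with its ChatML template.
    history := []llms.MessageContent{
        llms.TextParts(llms.ChatMessageTypeSystem, "You are a terse assistant."),
        llms.TextParts(llms.ChatMessageTypeHuman, "Name a 1-bit LLM."),
        llms.TextParts(llms.ChatMessageTypeAI, "BitNet b1.58."),
        llms.TextParts(llms.ChatMessageTypeHuman, "Who released it?"),
    }

    resp, err := llm.GenerateContent(context.Background(), history)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(resp.Choices[0].Content)
}
```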
## CLI

```sh
# Interactive chat
go run ./cmd/bitnet-cli --model path/to/model.gguf
# Benchmark
go run ./cmd/bitnet-cli --mode bench --model path/to/model.gguf --prompt "Explain gravity." --max-tokens 256
```

CLI flags:

| Flag | Default | Description |
|---|---|---|
| `--model` | `models/bitnet-b1.58-2B-4T/ggml-model-i2_s.gguf` | Path to GGUF model |
| `--mode` | `chat` | `chat` or `bench` |
| `--prompt` | `Explain quantum computing in one paragraph.` | Prompt for bench mode |
| `--max-tokens` | `128` | Maximum tokens to generate |
| `--temp` | `0.7` | Sampling temperature |

## Make targets

```sh
make libs # Build bitnet.cpp static libraries
make models # Download default model (bitnet-b1.58-2B-4T)
make models/falcon3-10b # Download Falcon3-10B
make models/falcon3-7b # Download Falcon3-7B
make models/falcon3-3b # Download Falcon3-3B
make models/llama3-8b # Download Llama3-8B
make test # Build + download model + run tests
make bench MODEL=path/to/model.gguf # Benchmark a model
make clean # Remove build artifacts
```

## Project layout

```
bitnet.go           CGO bindings wrapping llama.h from bitnet.cpp
options.go          Functional options for model/context/LLM configuration
langchain.go        LangChainGo llms.Model adapter
cmd/bitnet-cli/     Interactive chat + benchmark CLI
3rdparty/BitNet     git submodule (microsoft/BitNet fork of llama.cpp)
build/              CMake output (gitignored)
models/             GGUF model files (gitignored)
patches/            Const-correctness patch for bitnet.cpp
```
## Thread safety

- `Model` is safe to share across goroutines
- `Context` is single-goroutine only (holds KV cache state)
- Create multiple contexts from one model for parallel inference (see the sketch below)
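A minimal sketch of that pattern, reusing only the `LoadModel`/`NewContext`/`Complete` calls from the quick start; error handling is elided, and the prompts and token counts are made up for illustration:

```go
package main

import (
    "fmt"
    "sync"

    bitnet "github.com/hodizoda/go-bitnet-cli"
)

func main() {
    bitnet.Init()
    defer bitnet.Free()

    model, _ := bitnet.LoadModel("models/bitnet-b1.58-2B-4T/ggml-model-i2_s.gguf")
    defer model.Close()

    prompts := []string{"The capital of Sweden is", "The capital of Norway is"}

    var wg sync.WaitGroup
    for _, p := range prompts {
        wg.Add(1)
        go func(prompt string) {
            defer wg.Done()
            // One Context per goroutine; the shared Model is goroutine-safe.
            ctx, _ := model.NewContext(bitnet.WithContextSize(2048))
            defer ctx.Close()

            out, _ := ctx.Complete(prompt, 32)
            fmt.Println(out)
        }(p)
    }
    wg.Wait()
}
```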
## Performance

Measured on an HP ProDesk 400 (i5-9500T, 24 GB RAM, Ubuntu 24.04):
| Model | Tokens/sec |
|---|---|
| Falcon3-10B-Instruct | ~5.4 |
## License

MIT