Native llama.cpp bindings for Bare.
Run LLM inference directly in your Bare JavaScript applications with GPU acceleration support.
## Requirements

- CMake 3.25+
- C/C++ compiler (clang, gcc, or MSVC)
- Node.js (for npm/cmake-bare)
- Bare runtime
## Installation

Clone with submodules:

```sh
git clone --recursive https://github.com/CameronTofer/bare-llama.cpp
cd bare-llama.cpp
```

Or if already cloned:
```sh
git submodule update --init --recursive
```

Install dependencies and build:
```sh
npm install
```

Or manually:
```sh
bare-make generate
bare-make build
bare-make install
```

This creates `prebuilds/<platform>-<arch>/bare-llama.bare`.
For a debug build:
```sh
bare-make generate -- -D CMAKE_BUILD_TYPE=Debug
bare-make build
```

To disable GPU acceleration:
```sh
bare-make generate -- -D GGML_METAL=OFF -D GGML_CUDA=OFF
bare-make build
```

## Text generation

```js
const { LlamaModel, LlamaContext, LlamaSampler, generate } = require('bare-llama')
// Load model (GGUF format)
const model = new LlamaModel('./model.gguf', {
  nGpuLayers: 99 // Offload layers to GPU (0 = CPU only)
})

// Create context
const ctx = new LlamaContext(model, {
  contextSize: 2048, // Max context length
  batchSize: 512 // Batch size for prompt processing
})

// Create sampler
const sampler = new LlamaSampler(model, {
  temp: 0.7, // Temperature (0 = greedy)
  topK: 40, // Top-K sampling
  topP: 0.95 // Top-P (nucleus) sampling
})

// Generate text
const output = generate(model, ctx, sampler, 'The meaning of life is', 128)
console.log(output)

// Cleanup
sampler.free()
ctx.free()
model.free()
```

## Embeddings

```js
const { LlamaModel, LlamaContext, setQuiet } = require('bare-llama')
setQuiet(true)
const model = new LlamaModel('./embedding-model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, {
  contextSize: 512,
  embeddings: true,
  poolingType: 1 // -1=unspecified, 0=none, 1=mean, 2=cls, 3=last, 4=rank
})
const tokens = model.tokenize('Hello world', true)
ctx.decode(tokens)
const embedding = ctx.getEmbeddings(-1) // Float32Array
// Reuse context for multiple embeddings
ctx.clearMemory()
const tokens2 = model.tokenize('Another text', true)
ctx.decode(tokens2)
const embedding2 = ctx.getEmbeddings(-1)
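// Optional (a sketch, not part of the API): compare the two embeddings with
// cosine similarity, as examples/cosine-similarity.js does. cosineSimilarity
// below is a plain helper defined here, not something exported by bare-llama.
function cosineSimilarity (a, b) {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
console.log('similarity:', cosineSimilarity(embedding, embedding2))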
ctx.free()
model.free()
```

## Reranking

Cross-encoder reranking scores how relevant a document is to a query. Use a reranker model (e.g. a BGE reranker) with `poolingType: 4` (rank).
**Important:** you must call `ctx.clearMemory()` before each scoring call to clear the KV cache. Without this, stale context from previous pairs corrupts the scores.

```js
const { LlamaModel, LlamaContext, setQuiet } = require('bare-llama')

setQuiet(true)

const model = new LlamaModel('./bge-reranker-v2-m3.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, {
  contextSize: 512,
  embeddings: true,
  poolingType: 4 // rank pooling (required for rerankers)
})

function rerank (query, document) {
  ctx.clearMemory() // critical: clear KV cache before each pair
  const tokens = model.tokenize(query + '\n' + document, true)
  ctx.decode(tokens)
  return ctx.getEmbeddings(0)[0] // single float score
}

const query = 'What is machine learning?'
const docs = [
  'Machine learning is a branch of AI that learns from data.',
  'The recipe calls for two cups of flour and one egg.'
]

const scored = docs
  .map((doc, i) => ({ i, score: rerank(query, doc) }))
  .sort((a, b) => b.score - a.score)

for (const { i, score } of scored) {
  console.log(`[${score.toFixed(4)}] ${docs[i]}`)
}

ctx.free()
model.free()
```

## Constrained generation

```js
const { LlamaModel, LlamaContext, LlamaSampler, generate, setQuiet } = require('bare-llama')
setQuiet(true)
const model = new LlamaModel('./model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, { contextSize: 2048 })
// JSON schema constraint (requires llguidance)
const schema = JSON.stringify({
  type: 'object',
  properties: { name: { type: 'string' }, age: { type: 'integer' } },
  required: ['name', 'age']
})
const sampler = new LlamaSampler(model, { temp: 0, json: schema })
// Lark grammar constraint
const sampler2 = new LlamaSampler(model, { temp: 0, lark: 'start: "yes" | "no"' })

// Use a constrained sampler with generate() as usual
const output = generate(model, ctx, sampler, 'Describe a person as JSON: ', 128)
console.log(output)

sampler.free()
sampler2.free()
ctx.free()
model.free()
```

## Examples

| Example | Description |
|---|---|
| `examples/text-generation.js` | High-level `generate()` API |
| `examples/token-by-token.js` | Manual tokenize/sample/decode loop |
| `examples/cosine-similarity.js` | Embeddings + semantic similarity |
| `examples/json-constrained-output.js` | JSON schema constrained generation |
| `examples/lark-constrained-output.js` | Lark grammar constrained generation |
| `examples/tool-use-agent.js` | Multi-turn agentic tool calling |
Run examples with:
```sh
bare examples/text-generation.js -- /path/to/model.gguf
```

## Tests

Tests use `brittle` and skip gracefully when models aren't available.
```sh
npm test
```

Model-dependent tests require Ollama models installed locally:
```sh
ollama pull llama3.2:1b # generation tests
ollama pull nomic-embed-text # embedding tests
ollama pull qllama/bge-reranker-v2-m3 # reranking tests
```

## Benchmarks

```sh
npm run bench
```

Results are saved to `bench/results/` as JSON with full metadata (llama.cpp version, system info, platform). History is tracked in JSONL files for comparison across runs.
## API

### `new LlamaModel(path, options?)`

| Option | Type | Default | Description |
|---|---|---|---|
| `nGpuLayers` | number | 0 | Number of layers to offload to GPU |
Properties:
- `name` - Model name from metadata
- `embeddingDimension` - Embedding vector size
- `trainingContextSize` - Training context length
Methods:
- `tokenize(text, addBos?)` - Convert text to tokens (Int32Array)
- `detokenize(tokens)` - Convert tokens back to text
- `isEogToken(token)` - Check if token is end-of-generation
- `getMeta(key)` - Get model metadata by key
- `free()` - Release model resources
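For example, a minimal sketch of the tokenizer round trip and model properties (the key passed to `getMeta` is only an illustrative guess; use whatever GGUF metadata key you need):

```js
const { LlamaModel } = require('bare-llama')

const model = new LlamaModel('./model.gguf')
console.log(model.name, model.embeddingDimension, model.trainingContextSize)

const tokens = model.tokenize('Hello world', true) // addBos = true
console.log(tokens) // Int32Array of token ids
console.log(model.detokenize(tokens)) // back to text

console.log(model.getMeta('general.architecture')) // assumed example key
model.free()
```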
### `new LlamaContext(model, options?)`

| Option | Type | Default | Description |
|---|---|---|---|
| `contextSize` | number | 512 | Maximum context length |
| `batchSize` | number | 512 | Batch size for processing |
| `embeddings` | boolean | false | Enable embedding mode |
| `poolingType` | number | -1 | Pooling strategy (-1=unspecified, 0=none, 1=mean, 2=cls, 3=last, 4=rank) |
Properties:
- `contextSize` - Actual context size
Methods:
- `decode(tokens)` - Process tokens through the model
- `getEmbeddings(idx)` - Get embedding vector (Float32Array)
- `clearMemory()` - Clear context for reuse (faster than creating a new context)
- `free()` - Release context resources
### `new LlamaSampler(model, options?)`

| Option | Type | Default | Description |
|---|---|---|---|
| `temp` | number | 0 | Temperature (0 = greedy sampling) |
| `topK` | number | 40 | Top-K sampling parameter |
| `topP` | number | 0.95 | Top-P (nucleus) sampling parameter |
| `json` | string | - | JSON schema constraint (requires llguidance) |
| `lark` | string | - | Lark grammar constraint (requires llguidance) |
Methods:
- `sample(ctx, idx)` - Sample next token (-1 for last position)
- `accept(token)` - Accept token into sampler state
- `free()` - Release sampler resources
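These are the pieces behind the manual loop in `examples/token-by-token.js`. A hedged sketch of that loop (it assumes `decode()` accepts a single-token `Int32Array` when extending the context, mirroring what `tokenize()` returns):

```js
const { LlamaModel, LlamaContext, LlamaSampler } = require('bare-llama')

const model = new LlamaModel('./model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, { contextSize: 2048 })
const sampler = new LlamaSampler(model, { temp: 0.7 })

// Prime the context with the prompt
ctx.decode(model.tokenize('Once upon a time', true))

let text = ''
for (let i = 0; i < 64; i++) {
  const token = sampler.sample(ctx, -1) // sample at the last position
  if (model.isEogToken(token)) break // stop at end of generation
  sampler.accept(token) // update sampler state
  text += model.detokenize(new Int32Array([token]))
  ctx.decode(new Int32Array([token])) // feed the new token back into the context
}
console.log(text)

sampler.free()
ctx.free()
model.free()
```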
### `generate(model, ctx, sampler, prompt, maxTokens?)`

Convenience function for simple text generation. Returns the generated text (not including the prompt).
### Utilities

- `setQuiet(quiet?)` - Suppress llama.cpp output
- `setLogLevel(level)` - Set log level (0=off, 1=errors, 2=all)
- `readGgufMeta(path, key)` - Read GGUF metadata without loading the model
- `getModelName(path)` - Get model name from GGUF file
- `systemInfo()` - Get hardware/instruction set info (AVX, NEON, Metal, CUDA)
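For instance, to inspect a GGUF file and the host's capabilities without loading the model (a minimal sketch; the metadata key below is only an illustrative guess):

```js
const { readGgufMeta, getModelName, systemInfo, setLogLevel } = require('bare-llama')

setLogLevel(1) // errors only

console.log(systemInfo()) // hardware / instruction set info
console.log(getModelName('./model.gguf')) // model name from GGUF metadata
console.log(readGgufMeta('./model.gguf', 'general.architecture')) // assumed example key
```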
## Project structure

```
index.js                 Main module
binding.cpp              C++ native bindings
lib/
  ollama-models.js       Ollama model discovery
  ollama.js              GGUF metadata + Jinja chat templates
test/                    Brittle test suite
bench/                   Benchmark system
examples/                Usage examples
tools/
  ollama-hyperdrive.js   P2P model distribution (standalone CLI)
```
## Models

This addon works with GGUF format models. You can use models from Ollama (auto-detected from `~/.ollama/models`) or download GGUF files directly from Hugging Face.
## Platform support

| Platform | Architecture | GPU Support |
|---|---|---|
| macOS | arm64, x64 | Metal |
| Linux | x64, arm64 | CUDA (if available) |
| Windows | x64, arm64 | CUDA (if available) |
| iOS | arm64 | Metal |
| Android | arm64, arm, x64, ia32 | - |
## llguidance

JSON schema and Lark grammar constraints require llguidance, which is built from Rust source. It is enabled automatically on native (non-cross-compiled) builds. Cross-compiled targets (iOS, Android, Windows arm64) do not include llguidance, so constrained generation is unavailable on those platforms.

**Note:** Lark grammar constraints are currently not working correctly; llguidance does not appear to enforce token constraints as expected (e.g. it allows "Yes" when the grammar only permits "yes"). JSON schema constraints work fine.
## License

MIT