bare-llama.cpp

Native llama.cpp bindings for Bare.

Run LLM inference directly in your Bare JavaScript applications with GPU acceleration support.

Requirements

  • CMake 3.25+
  • C/C++ compiler (clang, gcc, or MSVC)
  • Node.js (for npm/cmake-bare)
  • Bare runtime

Building

Clone with submodules:

git clone --recursive https://github.com/CameronTofer/bare-llama.cpp
cd bare-llama.cpp

Or if already cloned:

git submodule update --init --recursive

Install dependencies and build:

npm install

Or manually:

bare-make generate
bare-make build
bare-make install

This creates prebuilds/<platform>-<arch>/bare-llama.bare.

Build Options

For a debug build:

bare-make generate -- -D CMAKE_BUILD_TYPE=Debug
bare-make build

To disable GPU acceleration:

bare-make generate -- -D GGML_METAL=OFF -D GGML_CUDA=OFF
bare-make build

Usage

const { LlamaModel, LlamaContext, LlamaSampler, generate } = require('bare-llama')

// Load model (GGUF format)
const model = new LlamaModel('./model.gguf', {
  nGpuLayers: 99  // Offload layers to GPU (0 = CPU only)
})

// Create context
const ctx = new LlamaContext(model, {
  contextSize: 2048,  // Max context length
  batchSize: 512      // Batch size for prompt processing
})

// Create sampler
const sampler = new LlamaSampler(model, {
  temp: 0.7,    // Temperature (0 = greedy)
  topK: 40,     // Top-K sampling
  topP: 0.95    // Top-P (nucleus) sampling
})

// Generate text
const output = generate(model, ctx, sampler, 'The meaning of life is', 128)
console.log(output)

// Cleanup
sampler.free()
ctx.free()
model.free()

Embeddings

const { LlamaModel, LlamaContext, setQuiet } = require('bare-llama')

setQuiet(true)

const model = new LlamaModel('./embedding-model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, {
  contextSize: 512,
  embeddings: true,
  poolingType: 1  // -1=unspecified, 0=none, 1=mean, 2=cls, 3=last, 4=rank
})

const tokens = model.tokenize('Hello world', true)
ctx.decode(tokens)
const embedding = ctx.getEmbeddings(-1)  // Float32Array

// Reuse context for multiple embeddings
ctx.clearMemory()
const tokens2 = model.tokenize('Another text', true)
ctx.decode(tokens2)
const embedding2 = ctx.getEmbeddings(-1)

ctx.free()
model.free()

Reranking

Cross-encoder reranking scores how relevant a document is to a query. Use a reranker model (e.g. BGE reranker) with poolingType: 4 (rank).

Important: You must call ctx.clearMemory() before each scoring to clear the KV cache. Without this, stale context from previous pairs corrupts the scores.

const { LlamaModel, LlamaContext, setQuiet } = require('bare-llama')

setQuiet(true)

const model = new LlamaModel('./bge-reranker-v2-m3.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, {
  contextSize: 512,
  embeddings: true,
  poolingType: 4  // rank pooling (required for rerankers)
})

function rerank (query, document) {
  ctx.clearMemory()  // critical: clear KV cache before each pair
  const tokens = model.tokenize(query + '\n' + document, true)
  ctx.decode(tokens)
  return ctx.getEmbeddings(0)[0]  // single float score
}

const query = 'What is machine learning?'
const docs = [
  'Machine learning is a branch of AI that learns from data.',
  'The recipe calls for two cups of flour and one egg.'
]

const scored = docs
  .map((doc, i) => ({ i, score: rerank(query, doc) }))
  .sort((a, b) => b.score - a.score)

for (const { i, score } of scored) {
  console.log(`[${score.toFixed(4)}] ${docs[i]}`)
}

ctx.free()
model.free()

Constrained Generation

const { LlamaModel, LlamaContext, LlamaSampler, generate, setQuiet } = require('bare-llama')

setQuiet(true)

const model = new LlamaModel('./model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, { contextSize: 2048 })

// JSON schema constraint (requires llguidance)
const schema = JSON.stringify({
  type: 'object',
  properties: { name: { type: 'string' }, age: { type: 'integer' } },
  required: ['name', 'age']
})
const sampler = new LlamaSampler(model, { temp: 0, json: schema })

// Lark grammar constraint
const sampler2 = new LlamaSampler(model, { temp: 0, lark: 'start: "yes" | "no"' })

Examples

| Example | Description |
| --- | --- |
| examples/text-generation.js | High-level generate() API |
| examples/token-by-token.js | Manual tokenize/sample/decode loop |
| examples/cosine-similarity.js | Embeddings + semantic similarity |
| examples/json-constrained-output.js | JSON schema constrained generation |
| examples/lark-constrained-output.js | Lark grammar constrained generation |
| examples/tool-use-agent.js | Multi-turn agentic tool calling |

Run examples with:

bare examples/text-generation.js -- /path/to/model.gguf

Testing

Tests use brittle and skip gracefully when models aren't available.

npm test

Model-dependent tests require Ollama models installed locally:

ollama pull llama3.2:1b        # generation tests
ollama pull nomic-embed-text   # embedding tests
ollama pull qllama/bge-reranker-v2-m3  # reranking tests

Benchmarks

npm run bench

Results are saved to bench/results/ as JSON with full metadata (llama.cpp version, system info, platform). History is tracked in JSONL files for comparison across runs.

API Reference

LlamaModel

new LlamaModel(path, options?)
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| nGpuLayers | number | 0 | Number of layers to offload to GPU |

Properties:

  • name - Model name from metadata
  • embeddingDimension - Embedding vector size
  • trainingContextSize - Training context length

Methods:

  • tokenize(text, addBos?) - Convert text to tokens (Int32Array)
  • detokenize(tokens) - Convert tokens back to text
  • isEogToken(token) - Check if token is end-of-generation
  • getMeta(key) - Get model metadata by key
  • free() - Release model resources

LlamaContext

new LlamaContext(model, options?)
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| contextSize | number | 512 | Maximum context length |
| batchSize | number | 512 | Batch size for processing |
| embeddings | boolean | false | Enable embedding mode |
| poolingType | number | -1 | Pooling strategy (-1=unspecified, 0=none, 1=mean, 2=cls, 3=last, 4=rank) |

Properties:

  • contextSize - Actual context size

Methods:

  • decode(tokens) - Process tokens through the model
  • getEmbeddings(idx) - Get embedding vector (Float32Array)
  • clearMemory() - Clear context for reuse (faster than creating new context)
  • free() - Release context resources
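The vectors returned by getEmbeddings() are plain Float32Arrays and can be compared directly. A minimal cosine-similarity helper in the spirit of examples/cosine-similarity.js (the helper itself is illustrative, not part of the API):

```javascript
// Cosine similarity between two embedding vectors
// (Float32Array or number[] of equal length).
function cosineSimilarity (a, b) {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Usage with a bare-llama embedding context (setup omitted):
//   const e1 = ctx.getEmbeddings(-1)
//   ctx.clearMemory()
//   ...decode the second text...
//   const e2 = ctx.getEmbeddings(-1)
//   console.log(cosineSimilarity(e1, e2))
```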

LlamaSampler

new LlamaSampler(model, options?)
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| temp | number | 0 | Temperature (0 = greedy sampling) |
| topK | number | 40 | Top-K sampling parameter |
| topP | number | 0.95 | Top-P (nucleus) sampling parameter |
| json | string | - | JSON schema constraint (requires llguidance) |
| lark | string | - | Lark grammar constraint (requires llguidance) |

Methods:

  • sample(ctx, idx) - Sample next token (-1 for last position)
  • accept(token) - Accept token into sampler state
  • free() - Release sampler resources
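sample() and accept() support a manual generation loop like the one in examples/token-by-token.js. A sketch of the core loop; the exact control flow here is an assumption, so see that example for the canonical version:

```javascript
// Manual tokenize/decode/sample loop using the LlamaSampler API.
// Returns only the generated text, not the prompt.
function generateTokenByToken (model, ctx, sampler, prompt, maxTokens) {
  let tokens = model.tokenize(prompt, true) // addBos = true
  const out = []
  for (let i = 0; i < maxTokens; i++) {
    ctx.decode(tokens)                   // feed pending tokens to the model
    const next = sampler.sample(ctx, -1) // sample at the last position
    if (model.isEogToken(next)) break    // stop at end-of-generation
    sampler.accept(next)                 // update sampler state
    out.push(next)
    tokens = new Int32Array([next])      // only the new token next round
  }
  return model.detokenize(new Int32Array(out))
}
```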

generate()

generate(model, ctx, sampler, prompt, maxTokens?)

Convenience function for simple text generation. Returns the generated text (not including the prompt).

Utility Functions

  • setQuiet(quiet?) - Suppress llama.cpp output
  • setLogLevel(level) - Set log level (0=off, 1=errors, 2=all)
  • readGgufMeta(path, key) - Read GGUF metadata without loading the model
  • getModelName(path) - Get model name from GGUF file
  • systemInfo() - Get hardware/instruction set info (AVX, NEON, Metal, CUDA)

Project Structure

index.js              Main module
binding.cpp           C++ native bindings
lib/
  ollama-models.js    Ollama model discovery
  ollama.js           GGUF metadata + Jinja chat templates
test/                 Brittle test suite
bench/                Benchmark system
examples/             Usage examples
tools/
  ollama-hyperdrive.js  P2P model distribution (standalone CLI)

Models

This addon works with GGUF format models. You can use models from Ollama (auto-detected from ~/.ollama/models) or download GGUF files directly from Hugging Face.

Platform Support

| Platform | Architecture | GPU Support |
| --- | --- | --- |
| macOS | arm64, x64 | Metal |
| Linux | x64, arm64 | CUDA (if available) |
| Windows | x64, arm64 | CUDA (if available) |
| iOS | arm64 | Metal |
| Android | arm64, arm, x64, ia32 | - |

Constrained generation (llguidance)

JSON schema and Lark grammar constraints require llguidance, which is built from Rust source. This is enabled automatically on native (non-cross-compiled) builds. Cross-compiled targets (iOS, Android, Windows arm64) do not include llguidance — constrained generation is unavailable on those platforms.

Note: Lark grammar constraints are currently not working correctly — llguidance does not appear to enforce token constraints as expected (e.g. allowing "Yes" when the grammar only permits "yes"). JSON schema constraints work fine.

License

MIT
