bare-llama.cpp

Native llama.cpp bindings for Bare.

Run LLM inference directly in your Bare JavaScript applications with GPU acceleration support.

Requirements

  • CMake 3.25+
  • C/C++ compiler (clang, gcc, or MSVC)
  • Node.js (for npm/cmake-bare)
  • Bare runtime

Building

Clone with submodules:

git clone --recursive https://github.com/CameronTofer/bare-llama.cpp
cd bare-llama.cpp

Or if already cloned:

git submodule update --init --recursive

Install dependencies and build:

npm install

Or manually:

bare-make generate
bare-make build
bare-make install

This creates prebuilds/<platform>-<arch>/bare-llama.bare.

Build Options

For a debug build:

bare-make generate -- -D CMAKE_BUILD_TYPE=Debug
bare-make build

To disable GPU acceleration:

bare-make generate -- -D GGML_METAL=OFF -D GGML_CUDA=OFF
bare-make build

Usage

const { LlamaModel, LlamaContext, LlamaSampler, generate } = require('bare-llama')

// Load model (GGUF format)
const model = new LlamaModel('./model.gguf', {
  nGpuLayers: 99  // Offload layers to GPU (0 = CPU only)
})

// Create context
const ctx = new LlamaContext(model, {
  contextSize: 2048,  // Max context length
  batchSize: 512      // Batch size for prompt processing
})

// Create sampler
const sampler = new LlamaSampler(model, {
  temp: 0.7,    // Temperature (0 = greedy)
  topK: 40,     // Top-K sampling
  topP: 0.95    // Top-P (nucleus) sampling
})

// Generate text
const output = generate(model, ctx, sampler, 'The meaning of life is', 128)
console.log(output)

// Cleanup
sampler.free()
ctx.free()
model.free()

Embeddings

const { LlamaModel, LlamaContext, setQuiet } = require('bare-llama')

setQuiet(true)

const model = new LlamaModel('./embedding-model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, {
  contextSize: 512,
  embeddings: true,
  poolingType: 1  // -1=unspecified, 0=none, 1=mean, 2=cls, 3=last, 4=rank
})

const tokens = model.tokenize('Hello world', true)
ctx.decode(tokens)
const embedding = ctx.getEmbeddings(-1)  // Float32Array

// Reuse context for multiple embeddings
ctx.clearMemory()
const tokens2 = model.tokenize('Another text', true)
ctx.decode(tokens2)
const embedding2 = ctx.getEmbeddings(-1)

ctx.free()
model.free()

Reranking

Cross-encoder reranking scores how relevant a document is to a query. Use a reranker model (e.g. BGE reranker) with poolingType: 4 (rank).

Important: You must call ctx.clearMemory() before each scoring to clear the KV cache. Without this, stale context from previous pairs corrupts the scores.

const { LlamaModel, LlamaContext, setQuiet } = require('bare-llama')

setQuiet(true)

const model = new LlamaModel('./bge-reranker-v2-m3.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, {
  contextSize: 512,
  embeddings: true,
  poolingType: 4  // rank pooling (required for rerankers)
})

function rerank (query, document) {
  ctx.clearMemory()  // critical: clear KV cache before each pair
  const tokens = model.tokenize(query + '\n' + document, true)
  ctx.decode(tokens)
  return ctx.getEmbeddings(0)[0]  // single float score
}

const query = 'What is machine learning?'
const docs = [
  'Machine learning is a branch of AI that learns from data.',
  'The recipe calls for two cups of flour and one egg.'
]

const scored = docs
  .map((doc, i) => ({ i, score: rerank(query, doc) }))
  .sort((a, b) => b.score - a.score)

for (const { i, score } of scored) {
  console.log(`[${score.toFixed(4)}] ${docs[i]}`)
}

ctx.free()
model.free()

Constrained Generation

const { LlamaModel, LlamaContext, LlamaSampler, generate, setQuiet } = require('bare-llama')

setQuiet(true)

const model = new LlamaModel('./model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, { contextSize: 2048 })

// JSON schema constraint (requires llguidance)
const schema = JSON.stringify({
  type: 'object',
  properties: { name: { type: 'string' }, age: { type: 'integer' } },
  required: ['name', 'age']
})
const sampler = new LlamaSampler(model, { temp: 0, json: schema })

// Lark grammar constraint
const sampler2 = new LlamaSampler(model, { temp: 0, lark: 'start: "yes" | "no"' })

Examples

| Example | Description |
| --- | --- |
| examples/text-generation.js | High-level generate() API |
| examples/token-by-token.js | Manual tokenize/sample/decode loop |
| examples/cosine-similarity.js | Embeddings + semantic similarity |
| examples/json-constrained-output.js | JSON schema constrained generation |
| examples/lark-constrained-output.js | Lark grammar constrained generation |
| examples/tool-use-agent.js | Multi-turn agentic tool calling |

Run examples with:

bare examples/text-generation.js -- /path/to/model.gguf

Testing

Tests use brittle and skip gracefully when models aren't available.

npm test

Model-dependent tests require Ollama models installed locally:

ollama pull llama3.2:1b        # generation tests
ollama pull nomic-embed-text   # embedding tests
ollama pull qllama/bge-reranker-v2-m3  # reranking tests

Benchmarks

npm run bench

Results are saved to bench/results/ as JSON with full metadata (llama.cpp version, system info, platform). History is tracked in JSONL files for comparison across runs.

API Reference

LlamaModel

new LlamaModel(path, options?)
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| nGpuLayers | number | 0 | Number of layers to offload to GPU |

Properties:

  • name - Model name from metadata
  • embeddingDimension - Embedding vector size
  • trainingContextSize - Training context length

Methods:

  • tokenize(text, addBos?) - Convert text to tokens (Int32Array)
  • detokenize(tokens) - Convert tokens back to text
  • isEogToken(token) - Check if token is end-of-generation
  • getMeta(key) - Get model metadata by key
  • free() - Release model resources

LlamaContext

new LlamaContext(model, options?)
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| contextSize | number | 512 | Maximum context length |
| batchSize | number | 512 | Batch size for processing |
| embeddings | boolean | false | Enable embedding mode |
| poolingType | number | -1 | Pooling strategy (-1=unspecified, 0=none, 1=mean, 2=cls, 3=last, 4=rank) |

Properties:

  • contextSize - Actual context size

Methods:

  • decode(tokens) - Process tokens through the model
  • getEmbeddings(idx) - Get embedding vector (Float32Array)
  • clearMemory() - Clear context for reuse (faster than creating new context)
  • free() - Release context resources
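The vectors returned by getEmbeddings() are plain Float32Arrays and can be compared directly. A minimal cosine-similarity helper in the spirit of examples/cosine-similarity.js (the helper itself is illustrative, not part of the API):

```javascript
// Cosine similarity between two embedding vectors
// (Float32Array or number[] of equal length).
function cosineSimilarity (a, b) {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Usage with a bare-llama embedding context (setup omitted):
//   const e1 = ctx.getEmbeddings(-1)
//   ctx.clearMemory()
//   ...decode the second text...
//   const e2 = ctx.getEmbeddings(-1)
//   console.log(cosineSimilarity(e1, e2))
```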

LlamaSampler

new LlamaSampler(model, options?)
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| temp | number | 0 | Temperature (0 = greedy sampling) |
| topK | number | 40 | Top-K sampling parameter |
| topP | number | 0.95 | Top-P (nucleus) sampling parameter |
| json | string | - | JSON schema constraint (requires llguidance) |
| lark | string | - | Lark grammar constraint (requires llguidance) |

Methods:

  • sample(ctx, idx) - Sample next token (-1 for last position)
  • accept(token) - Accept token into sampler state
  • free() - Release sampler resources
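sample() and accept() support a manual generation loop like the one in examples/token-by-token.js. A sketch of the core loop; the exact control flow here is an assumption, so see that example for the canonical version:

```javascript
// Manual tokenize/decode/sample loop using the LlamaSampler API.
// Returns only the generated text, not the prompt.
function generateTokenByToken (model, ctx, sampler, prompt, maxTokens) {
  let tokens = model.tokenize(prompt, true) // addBos = true
  const out = []
  for (let i = 0; i < maxTokens; i++) {
    ctx.decode(tokens)                   // feed pending tokens to the model
    const next = sampler.sample(ctx, -1) // sample at the last position
    if (model.isEogToken(next)) break    // stop at end-of-generation
    sampler.accept(next)                 // update sampler state
    out.push(next)
    tokens = new Int32Array([next])      // only the new token next round
  }
  return model.detokenize(new Int32Array(out))
}
```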

generate()

generate(model, ctx, sampler, prompt, maxTokens?)

Convenience function for simple text generation. Returns the generated text (not including the prompt).

Utility Functions

  • setQuiet(quiet?) - Suppress llama.cpp output
  • setLogLevel(level) - Set log level (0=off, 1=errors, 2=all)
  • readGgufMeta(path, key) - Read GGUF metadata without loading the model
  • getModelName(path) - Get model name from GGUF file
  • systemInfo() - Get hardware/instruction set info (AVX, NEON, Metal, CUDA)

Project Structure

index.js              Main module
binding.cpp           C++ native bindings
lib/
  ollama-models.js    Ollama model discovery
  ollama.js           GGUF metadata + Jinja chat templates
test/                 Brittle test suite
bench/                Benchmark system
examples/             Usage examples
tools/
  ollama-hyperdrive.js  P2P model distribution (standalone CLI)

Models

This addon works with GGUF format models. You can use models from Ollama (auto-detected from ~/.ollama/models) or download GGUF files directly from Hugging Face.

Platform Support

| Platform | Architecture | GPU Support |
| --- | --- | --- |
| macOS | arm64, x64 | Metal |
| Linux | x64, arm64 | CUDA (if available) |
| Windows | x64, arm64 | CUDA (if available) |
| iOS | arm64 | Metal |
| Android | arm64, arm, x64, ia32 | - |

Constrained generation (llguidance)

JSON schema and Lark grammar constraints require llguidance, which is built from Rust source. This is enabled automatically on native (non-cross-compiled) builds. Cross-compiled targets (iOS, Android, Windows arm64) do not include llguidance — constrained generation is unavailable on those platforms.

Note: Lark grammar constraints are currently not working correctly — llguidance does not appear to enforce token constraints as expected (e.g. allowing "Yes" when the grammar only permits "yes"). JSON schema constraints work fine.

License

MIT
