End-to-end llama.cpp toolkit in Rust.
API client, HuggingFace Hub, server orchestration, 5-test benchmark suite.
- LlamaCppClient: Async OpenAI-compatible API client (chat/completion/embeddings)
- HubClient: Pure Rust HuggingFace Hub downloads with progress callbacks
- Server orchestration: Programmatic llama-server lifecycle management
- Benchmark suite: 5-test triage (throughput, tool calls, codegen, reasoning)
```shell
lancor pull <repo> [file]    # Download GGUF from HF Hub
lancor list                  # List cached models
lancor search <query>        # Search HF Hub
lancor rm <repo> <file>      # Delete cached model
lancor bench <model|--all>   # Run benchmark suite
```

```toml
[dependencies]
lancor = "0.1.0"
anyhow = "1.0"   # the quick-start example below returns anyhow::Result
tokio = { version = "1.0", features = ["full"] }
```

```rust
use lancor::{LlamaCppClient, ChatCompletionRequest, Message};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = LlamaCppClient::new("http://localhost:8080")?;

    let request = ChatCompletionRequest::new("model-name")
        .message(Message::system("You are a helpful assistant."))
        .message(Message::user("What is Rust?"))
        .max_tokens(100);

    let response = client.chat_completion(request).await?;
    println!("{}", response.choices[0].message.content);
    Ok(())
}
```

OpenAI-compatible client for llama.cpp server (`/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`).
```rust
use futures::StreamExt; // needed for `.next()` on the stream; add `futures` to dependencies
use lancor::{LlamaCppClient, ChatCompletionRequest, CompletionRequest, EmbeddingRequest, Message};

// Create a client
let client = LlamaCppClient::new("http://localhost:8080")?;
let client = LlamaCppClient::with_api_key("http://localhost:8080", "sk-...")?;
let client = LlamaCppClient::default()?; // localhost:8080

// Chat completion (non-streaming)
let request = ChatCompletionRequest::new("model")
    .message(Message::user("Explain quantum computing"))
    .temperature(0.7)
    .max_tokens(200);
let response = client.chat_completion(request).await?;

// Streaming chat completion
let request = ChatCompletionRequest::new("model")
    .message(Message::user("Write a short poem"))
    .stream(true)
    .max_tokens(100);
let mut stream = client.chat_completion_stream(request).await?;
while let Some(chunk) = stream.next().await {
    if let Some(content) = &chunk.choices[0].delta.content {
        print!("{}", content);
    }
}

// Text completion
let request = CompletionRequest::new("model", "Once upon a time")
    .max_tokens(50)
    .temperature(0.8);
let response = client.completion(request).await?;

// Embeddings
let request = EmbeddingRequest::new("model", "Hello, world!");
let response = client.embedding(request).await?;
let embedding = &response.data[0].embedding;
```

Request builders support: `temperature`, `max_tokens`, `top_p`, `stream`, `stop`, `chat_template_kwargs` (for chat) and `prompt` (for completion).
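The remaining builder options compose the same way. A sketch (the exact argument types for `stop` and `chat_template_kwargs` are assumptions, mirroring the llama.cpp server API; `serde_json` is assumed for the JSON map):

```rust
let request = ChatCompletionRequest::new("model")
    .message(Message::user("List three Rust crates."))
    .top_p(0.9)
    .stop(vec!["\n\n".to_string()])           // assumed: list of stop strings
    .chat_template_kwargs(serde_json::json!({ // assumed: JSON map passed through to the template
        "enable_thinking": false
    }))
    .max_tokens(64);
```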
Download and manage GGUF models directly from HuggingFace Hub.

```rust
use lancor::hub::{HubClient, ProgressFn};

// Create a client (auto-detects HF_TOKEN or ~/.cache/huggingface/token)
let hub = HubClient::new()?;

// Search models
let results = hub.search("qwen3.5 gguf", 10).await?;
for r in results {
    println!("{} (downloads: {})", r.repo_id, r.downloads);
}

// List GGUF files in a repo
let files = hub.list_gguf("unsloth/Qwen3.5-35B-A3B-GGUF").await?;
for f in files {
    let size_mb = f.size.unwrap_or(0) as f64 / 1_048_576.0;
    println!("{} ({:.1} MB)", f.filename, size_mb);
}

// Download with a progress callback
let progress: ProgressFn = Box::new(|downloaded, total| {
    let pct = (downloaded as f64 / total as f64) * 100.0;
    eprint!("\r{:.1}%", pct);
});
let path = hub
    .download("unsloth/Qwen3.5-35B-A3B-GGUF", "model-Q4_K_M.gguf", Some(progress))
    .await?;
println!("Saved: {}", path.display());

// List cached models
let cached = hub.list_cached()?;
for m in cached {
    println!("{}: {} ({:.2} GB)", m.repo_id, m.filename, m.size as f64 / 1_073_741_824.0);
}

// Delete a cached model
hub.delete("unsloth/Qwen3.5-35B-A3B-GGUF", "model-Q4_K_M.gguf").await?;
```

Cache directory: `~/.cache/lancor/models/` (configurable via `HubClient::with_cache_dir(path)`).
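The `lancor list` output in the CLI section suggests the cache flattens each repo id into a directory name by replacing `/` with `--`. A small sketch of that convention (the helper name is hypothetical, not part of the lancor API):

```rust
// Hypothetical helper illustrating the on-disk cache layout:
// "unsloth/Qwen3.5-35B-A3B-GGUF" -> "unsloth--Qwen3.5-35B-A3B-GGUF"
fn cache_dir_name(repo_id: &str) -> String {
    repo_id.replace('/', "--")
}

fn main() {
    println!("{}", cache_dir_name("unsloth/Qwen3.5-35B-A3B-GGUF"));
}
```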
Programmatic control over llama-server, llama-cli, llama-quantize, and llama-bench.

```rust
use lancor::server::{LlamaServer, ServerConfig};

// Configure the server
let config = ServerConfig::new("model-Q4_K_M.gguf")
    .host("127.0.0.1")
    .port(8080)
    .gpu_layers(99)      // Offload layers to GPU
    .ctx_size(8192)      // Context length
    .parallel(1)         // Parallel sequences
    .threads(4)          // CPU threads
    .batch_size(512)     // Batch size for prompt processing
    .flash_attn(true)    // Enable flash attention
    .mlock(true)         // Lock model in RAM
    .api_key("sk-...")   // Require an API key
    .arg("--some-flag"); // Extra raw args

// Start the server
let mut server = LlamaServer::start(&config)?;
server.wait_healthy(60).await?;
println!("Server ready at: {}", server.base_url());

// Use with the client
let client = lancor::LlamaCppClient::new(server.base_url())?;
// ... make requests ...

// Stop the server
server.stop()?;
```

`ServerConfig` defaults: host=127.0.0.1, port=8080, n_gpu_layers=99, ctx_size=8192, n_parallel=1, cont_batching=true, metrics=true.
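Because those defaults already describe a typical local setup, a minimal configuration can rely on them entirely. A sketch (the model path is a placeholder):

```rust
use lancor::server::{LlamaServer, ServerConfig};

// Everything except the model path falls back to the defaults listed above
// (127.0.0.1:8080, 99 GPU layers, 8192 context, continuous batching on).
let config = ServerConfig::new("model-Q4_K_M.gguf");
let mut server = LlamaServer::start(&config)?;
server.wait_healthy(60).await?;
// ... use the server ...
server.stop()?;
```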
Run inference with llama-cli (captures stdout):

```rust
use lancor::server::CliConfig;

let config = CliConfig::new("model-Q4_K_M.gguf")
    .prompt("What is Rust?")
    .predict(100)
    .temperature(0.7)
    .interactive(); // Enable interactive mode

let output = lancor::server::run_cli(&config)?;
println!("{}", output);
```

Quantize a GGUF model (wraps llama-quantize):

```rust
use lancor::server::{quantize, QuantType};

quantize(
    "model-f32.gguf",
    "model-Q4_K_M.gguf",
    QuantType::Q4_K_M,
)?;
```

Supported `QuantType` values: Q4_0, Q4_1, Q4_K_S, Q4_K_M, Q5_0, Q5_1, Q5_K_S, Q5_K_M, Q6_K, Q8_0, IQ2_XXS, IQ2_XS, IQ3_XXS, IQ3_S, IQ4_NL, IQ4_XS, F16, F32.

Run llama-bench:

```rust
use lancor::server::bench;

let output = bench("model.gguf", 99, 8192)?;
println!("{}", output);
```

5-test triage for comparing model quantizations and sizes.
```rust
use lancor::bench::{run_suite_managed, BenchConfig, print_table};
use lancor::server::ServerConfig;

// Single model (auto-starts/stops the server)
let result = run_suite_managed(
    std::path::Path::new("model-Q4_K_M.gguf"),
    "Qwen3.5-35B-Q4_K_M",
    ServerConfig::new("model-Q4_K_M.gguf")
        .gpu_layers(99)
        .ctx_size(8192),
).await?;

// Against an existing server
let cfg = BenchConfig::new("my-model", "model.gguf")
    .base_url("http://localhost:8080");
let result = lancor::bench::run_suite(&cfg).await?;

// Compare multiple models (path1/path2 are placeholder paths to the GGUF files)
let models = vec![
    ("Q4_K_M", path1, ServerConfig::new(&path1).gpu_layers(99)),
    ("Q8_0", path2, ServerConfig::new(&path2).gpu_layers(99)),
];
let results = lancor::bench::compare(models).await?;
print_table(&results);
```

Benchmark tests:
- Throughput: tokens/s for prompt processing and generation
- Tool call: single function call accuracy
- Multi-tool: parallel tool invocation (min 5 tools)
- Codegen: fizzbuzz implementation (score 0-4)
- Reasoning: logic puzzle correctness
Output example:

```
┌──────────────────┬───────┬──────────┬──────────┬──────┬───────┬──────┬───────────┐
│ Model            │ Size  │ PP tok/s │ TG tok/s │ Tool │ Multi │ Code │ Reasoning │
├──────────────────┼───────┼──────────┼──────────┼──────┼───────┼──────┼───────────┤
│ Qwen3.5-35B-Q4_K │ 20.1G │ 45.2     │ 128.7    │ ✓    │ 5/5   │ 4/4  │ ✓         │
└──────────────────┴───────┴──────────┴──────────┴──────┴───────┴──────┴───────────┘
```

JSON export: `lancor::bench::to_json(results)`.
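For example, persisting a run to disk. A sketch (whether `to_json` takes the results by value and returns a `String` directly is an assumption; `results` comes from a `compare` call as above):

```rust
use std::fs;

let results = lancor::bench::compare(models).await?;
let json = lancor::bench::to_json(results); // assumed to return a JSON String
fs::write("bench-results.json", json)?;
```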
Download a GGUF model from HuggingFace Hub.

```shell
# List available GGUF files in a repo
lancor pull unsloth/Qwen3.5-35B-A3B-GGUF

# Download a specific file
lancor pull unsloth/Qwen3.5-35B-A3B-GGUF model-Q4_K_M.gguf
```

List all cached models.

```shell
lancor list
# Output:
# unsloth/Qwen3.5-35B-A3B-GGUF: model-Q4_K_M.gguf (20.12 GB)
#   /home/user/.cache/lancor/models/unsloth--Qwen3.5-35B-A3B-GGUF/model-Q4_K_M.gguf
```

Search HuggingFace Hub for models.
```shell
lancor search "qwen3.5 gguf"
# Output:
# unsloth/Qwen3.5-35B-A3B-GGUF  downloads=12345
# ...
```

Delete a cached model file.

```shell
lancor rm unsloth/Qwen3.5-35B-A3B-GGUF model-Q4_K_M.gguf
```

Run the benchmark suite.

```shell
# Benchmark a single model (auto-manages the server)
lancor bench model-Q4_K_M.gguf --label "MyModel-Q4" --ngl 99 --ctx 8192

# Benchmark all cached models
lancor bench --all --ngl 99 --port 8081

# Benchmark against an existing server
lancor bench --url http://localhost:8080 --label "Remote" model.gguf

# JSON output
lancor bench model.gguf --json > results.json
```

Benchmark options:
- `--label NAME`: model label for the results table
- `--port PORT`: server port for auto-managed servers (default: 8080)
- `--ngl LAYERS`: GPU layers (default: 99)
- `--ctx SIZE`: context size (default: 8192)
- `--url URL`: use an existing server instead of starting one
- `--all`: benchmark all cached GGUF models
- `--json`: output JSON instead of a table
- Rust 1.91+
- llama.cpp binaries on PATH: `llama-server`, `llama-cli`, `llama-quantize`, `llama-bench`
- For HubClient: network access to huggingface.co
```shell
./server -m model.gguf --port 8080 --api-key sk-... --metrics --cont-batching
```

Then use `LlamaCppClient` to interact with it.
| Project | What |
|---|---|
| ares | Agentic AI server — uses lancor for local llama.cpp inference |
| pawan | Self-healing CLI coding agent |
| daedra | Web search MCP server |
| thulp | Execution context engineering |
Built by DIRMACS.
GPL-3.0