End-to-end llama.cpp toolkit in Rust.
API client, HuggingFace Hub, server orchestration, 5-test benchmark suite.
- LlamaCppClient: Async OpenAI-compatible API client (chat/completion/embeddings)
- HubClient: Pure Rust HuggingFace Hub downloads with progress callbacks
- Server orchestration: Programmatic llama-server lifecycle management
- Benchmark suite: 5-test triage (throughput, tool calls, codegen, reasoning)
```shell
lancor pull <repo> [file]    # Download GGUF from HF Hub
lancor list                  # List cached models
lancor search <query>        # Search HF Hub
lancor rm <repo> <file>      # Delete cached model
lancor bench <model|--all>   # Run benchmark suite
```

```toml
[dependencies]
lancor = "0.1.0"
anyhow = "1.0"   # the quick-start example below returns anyhow::Result
tokio = { version = "1.0", features = ["full"] }
```

```rust
use lancor::{LlamaCppClient, ChatCompletionRequest, Message};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = LlamaCppClient::new("http://localhost:8080")?;

    let request = ChatCompletionRequest::new("model-name")
        .message(Message::system("You are a helpful assistant."))
        .message(Message::user("What is Rust?"))
        .max_tokens(100);

    let response = client.chat_completion(request).await?;
    println!("{}", response.choices[0].message.content);
    Ok(())
}
```

OpenAI-compatible client for llama.cpp server (`/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`).
```rust
use futures::StreamExt; // needed for `.next()` on the stream; add `futures` to dependencies
use lancor::{LlamaCppClient, ChatCompletionRequest, CompletionRequest, EmbeddingRequest, Message};

// Create a client
let client = LlamaCppClient::new("http://localhost:8080")?;
let client = LlamaCppClient::with_api_key("http://localhost:8080", "sk-...")?;
let client = LlamaCppClient::default()?; // localhost:8080

// Chat completion (non-streaming)
let request = ChatCompletionRequest::new("model")
    .message(Message::user("Explain quantum computing"))
    .temperature(0.7)
    .max_tokens(200);
let response = client.chat_completion(request).await?;

// Streaming chat completion
let request = ChatCompletionRequest::new("model")
    .message(Message::user("Write a short poem"))
    .stream(true)
    .max_tokens(100);
let mut stream = client.chat_completion_stream(request).await?;
while let Some(chunk) = stream.next().await {
    if let Some(content) = &chunk.choices[0].delta.content {
        print!("{}", content);
    }
}

// Text completion
let request = CompletionRequest::new("model", "Once upon a time")
    .max_tokens(50)
    .temperature(0.8);
let response = client.completion(request).await?;

// Embeddings
let request = EmbeddingRequest::new("model", "Hello, world!");
let response = client.embedding(request).await?;
let embedding = &response.data[0].embedding;
```

Request builders support: `temperature`, `max_tokens`, `top_p`, `stream`, `stop`, `chat_template_kwargs` (for chat) and `prompt` (for completion).
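The remaining builder options compose the same way. A sketch (the exact argument types for `stop` and `chat_template_kwargs` are assumptions, mirroring the llama.cpp server API; `serde_json` is assumed for the JSON map):

```rust
let request = ChatCompletionRequest::new("model")
    .message(Message::user("List three Rust crates."))
    .top_p(0.9)
    .stop(vec!["\n\n".to_string()])           // assumed: list of stop strings
    .chat_template_kwargs(serde_json::json!({ // assumed: JSON map passed through to the template
        "enable_thinking": false
    }))
    .max_tokens(64);
```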
Download and manage GGUF models directly from HuggingFace Hub.

```rust
use lancor::hub::{HubClient, ProgressFn};

// Create a client (auto-detects HF_TOKEN or ~/.cache/huggingface/token)
let hub = HubClient::new()?;

// Search models
let results = hub.search("qwen3.5 gguf", 10).await?;
for r in results {
    println!("{} (downloads: {})", r.repo_id, r.downloads);
}

// List GGUF files in a repo
let files = hub.list_gguf("unsloth/Qwen3.5-35B-A3B-GGUF").await?;
for f in files {
    let size_mb = f.size.unwrap_or(0) as f64 / 1_048_576.0;
    println!("{} ({:.1} MB)", f.filename, size_mb);
}

// Download with a progress callback
let progress: ProgressFn = Box::new(|downloaded, total| {
    let pct = (downloaded as f64 / total as f64) * 100.0;
    eprint!("\r{:.1}%", pct);
});
let path = hub
    .download("unsloth/Qwen3.5-35B-A3B-GGUF", "model-Q4_K_M.gguf", Some(progress))
    .await?;
println!("Saved: {}", path.display());

// List cached models
let cached = hub.list_cached()?;
for m in cached {
    println!("{}: {} ({:.2} GB)", m.repo_id, m.filename, m.size as f64 / 1_073_741_824.0);
}

// Delete a cached model
hub.delete("unsloth/Qwen3.5-35B-A3B-GGUF", "model-Q4_K_M.gguf").await?;
```

Cache directory: `~/.cache/lancor/models/` (configurable via `HubClient::with_cache_dir(path)`).
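The `lancor list` output in the CLI section suggests the cache flattens each repo id into a directory name by replacing `/` with `--`. A small sketch of that convention (the helper name is hypothetical, not part of the lancor API):

```rust
// Hypothetical helper illustrating the on-disk cache layout:
// "unsloth/Qwen3.5-35B-A3B-GGUF" -> "unsloth--Qwen3.5-35B-A3B-GGUF"
fn cache_dir_name(repo_id: &str) -> String {
    repo_id.replace('/', "--")
}

fn main() {
    println!("{}", cache_dir_name("unsloth/Qwen3.5-35B-A3B-GGUF"));
}
```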
Programmatic control over llama-server, llama-cli, llama-quantize, and llama-bench.

```rust
use lancor::server::{LlamaServer, ServerConfig};

// Configure the server
let config = ServerConfig::new("model-Q4_K_M.gguf")
    .host("127.0.0.1")
    .port(8080)
    .gpu_layers(99)      // Offload layers to GPU
    .ctx_size(8192)      // Context length
    .parallel(1)         // Parallel sequences
    .threads(4)          // CPU threads
    .batch_size(512)     // Batch size for prompt processing
    .flash_attn(true)    // Enable flash attention
    .mlock(true)         // Lock model in RAM
    .api_key("sk-...")   // Require an API key
    .arg("--some-flag"); // Extra raw args

// Start the server
let mut server = LlamaServer::start(&config)?;
server.wait_healthy(60).await?;
println!("Server ready at: {}", server.base_url());

// Use with the client
let client = lancor::LlamaCppClient::new(server.base_url())?;
// ... make requests ...

// Stop the server
server.stop()?;
```

`ServerConfig` defaults: host=127.0.0.1, port=8080, n_gpu_layers=99, ctx_size=8192, n_parallel=1, cont_batching=true, metrics=true.
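Because those defaults already describe a typical local setup, a minimal configuration can rely on them entirely. A sketch (the model path is a placeholder):

```rust
use lancor::server::{LlamaServer, ServerConfig};

// Everything except the model path falls back to the defaults listed above
// (127.0.0.1:8080, 99 GPU layers, 8192 context, continuous batching on).
let config = ServerConfig::new("model-Q4_K_M.gguf");
let mut server = LlamaServer::start(&config)?;
server.wait_healthy(60).await?;
// ... use the server ...
server.stop()?;
```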
Run inference with llama-cli (captures stdout):

```rust
use lancor::server::CliConfig;

let config = CliConfig::new("model-Q4_K_M.gguf")
    .prompt("What is Rust?")
    .predict(100)
    .temperature(0.7)
    .interactive(); // Enable interactive mode

let output = lancor::server::run_cli(&config)?;
println!("{}", output);
```

Quantize a GGUF model (wraps llama-quantize):

```rust
use lancor::server::{quantize, QuantType};

quantize(
    "model-f32.gguf",
    "model-Q4_K_M.gguf",
    QuantType::Q4_K_M,
)?;
```

Supported `QuantType` values: Q4_0, Q4_1, Q4_K_S, Q4_K_M, Q5_0, Q5_1, Q5_K_S, Q5_K_M, Q6_K, Q8_0, IQ2_XXS, IQ2_XS, IQ3_XXS, IQ3_S, IQ4_NL, IQ4_XS, F16, F32.

Run llama-bench:

```rust
use lancor::server::bench;

let output = bench("model.gguf", 99, 8192)?;
println!("{}", output);
```

5-test triage for comparing model quantizations and sizes.
```rust
use lancor::bench::{run_suite_managed, BenchConfig, print_table};
use lancor::server::ServerConfig;

// Single model (auto-starts/stops the server)
let result = run_suite_managed(
    std::path::Path::new("model-Q4_K_M.gguf"),
    "Qwen3.5-35B-Q4_K_M",
    ServerConfig::new("model-Q4_K_M.gguf")
        .gpu_layers(99)
        .ctx_size(8192),
).await?;

// Against an existing server
let cfg = BenchConfig::new("my-model", "model.gguf")
    .base_url("http://localhost:8080");
let result = lancor::bench::run_suite(&cfg).await?;

// Compare multiple models (path1/path2 are placeholder paths to the GGUF files)
let models = vec![
    ("Q4_K_M", path1, ServerConfig::new(&path1).gpu_layers(99)),
    ("Q8_0", path2, ServerConfig::new(&path2).gpu_layers(99)),
];
let results = lancor::bench::compare(models).await?;
print_table(&results);
```

Benchmark tests:
- Throughput: tokens/s for prompt processing and generation
- Tool call: single function call accuracy
- Multi-tool: parallel tool invocation (min 5 tools)
- Codegen: fizzbuzz implementation (score 0-4)
- Reasoning: logic puzzle correctness
Output example:

```
┌──────────────────┬───────┬──────────┬──────────┬──────┬───────┬──────┬───────────┐
│ Model            │ Size  │ PP tok/s │ TG tok/s │ Tool │ Multi │ Code │ Reasoning │
├──────────────────┼───────┼──────────┼──────────┼──────┼───────┼──────┼───────────┤
│ Qwen3.5-35B-Q4_K │ 20.1G │ 45.2     │ 128.7    │ ✓    │ 5/5   │ 4/4  │ ✓         │
└──────────────────┴───────┴──────────┴──────────┴──────┴───────┴──────┴───────────┘
```

JSON export: `lancor::bench::to_json(results)`.
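For example, persisting a run to disk. A sketch (whether `to_json` takes the results by value and returns a `String` directly is an assumption; `results` comes from a `compare` call as above):

```rust
use std::fs;

let results = lancor::bench::compare(models).await?;
let json = lancor::bench::to_json(results); // assumed to return a JSON String
fs::write("bench-results.json", json)?;
```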
Download a GGUF model from HuggingFace Hub.

```shell
# List available GGUF files in a repo
lancor pull unsloth/Qwen3.5-35B-A3B-GGUF

# Download a specific file
lancor pull unsloth/Qwen3.5-35B-A3B-GGUF model-Q4_K_M.gguf
```

List all cached models.

```shell
lancor list
# Output:
# unsloth/Qwen3.5-35B-A3B-GGUF: model-Q4_K_M.gguf (20.12 GB)
#   /home/user/.cache/lancor/models/unsloth--Qwen3.5-35B-A3B-GGUF/model-Q4_K_M.gguf
```

Search HuggingFace Hub for models.
```shell
lancor search "qwen3.5 gguf"
# Output:
# unsloth/Qwen3.5-35B-A3B-GGUF  downloads=12345
# ...
```

Delete a cached model file.

```shell
lancor rm unsloth/Qwen3.5-35B-A3B-GGUF model-Q4_K_M.gguf
```

Run the benchmark suite.

```shell
# Benchmark a single model (auto-manages the server)
lancor bench model-Q4_K_M.gguf --label "MyModel-Q4" --ngl 99 --ctx 8192

# Benchmark all cached models
lancor bench --all --ngl 99 --port 8081

# Benchmark against an existing server
lancor bench --url http://localhost:8080 --label "Remote" model.gguf

# JSON output
lancor bench model.gguf --json > results.json
```

Benchmark options:
- `--label NAME`: model label for the results table
- `--port PORT`: server port for auto-managed servers (default: 8080)
- `--ngl LAYERS`: GPU layers (default: 99)
- `--ctx SIZE`: context size (default: 8192)
- `--url URL`: use an existing server instead of starting one
- `--all`: benchmark all cached GGUF models
- `--json`: output JSON instead of a table
- Rust 1.91+
- llama.cpp binaries on PATH: `llama-server`, `llama-cli`, `llama-quantize`, `llama-bench`
- For HubClient: network access to huggingface.co
```shell
./server -m model.gguf --port 8080 --api-key sk-... --metrics --cont-batching
```

Then use `LlamaCppClient` to interact with it.
| Project | What |
|---|---|
| ares | Agentic AI server — uses lancor for local llama.cpp inference |
| pawan | Self-healing CLI coding agent |
| daedra | Web search MCP server |
| thulp | Execution context engineering |
Built by DIRMACS.
GPL-3.0