Llama Bro SDK

Run a full AI model in your pocket. On your terms. No servers. No subscriptions. No data leaving your phone.

The Problem with Cloud AI

Every time you send a message to a cloud LLM, that message travels to a datacenter. It's logged, processed, and potentially used to train the next model. Your health questions, your legal queries, your private relationship advice — all of it leaves your device.

Llama Bro is the answer to that.

We wrap llama.cpp in a clean, idiomatic Kotlin SDK so you can run state-of-the-art models — Llama 3, Gemma, DeepSeek-R1, Qwen 2.5 — directly on the device. No API keys. No usage limits. No data residency concerns. Your model, your hardware, your rules.

See it in Action

[Video: real-time token streaming on Snapdragon 8 Elite. No cloud. No lag.]

What's New — Declarative Inference Pipeline

The headline feature of the most recent architectural refactor is the Declarative Inference Pipeline — a fully reactive, allocation-optimized token processing engine that maps raw native output directly to your UI without a single blocking call.

┌─────────────────────────────────────────────────────────────────┐
│  1. USER PROMPT                                                 │
│     chat.completion(ChatEvent.UserEvent("Hello", think = true)) │
└─────────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  2. PROMPT FORMATTER                                            │
│     Wraps the message in model-specific chat markers:           │
│     <|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n  │
└─────────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  3. NATIVE GENERATOR                                            │
│     llama_decode() → channelFlow { send(token) }                │
│     On Dispatchers.IO; channelFlow permits cross-context send.  │
└─────────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  4. DFA LEXER (AllocationOptimizedScanner)                      │
│     Scans the raw token stream character-by-character.          │
│     Detects: text | <think>...</think> |                        │
│              <tool_call>...</tool_call>                         │
│     Uses StringBuilder, not String concat, for low GC pressure. │
└─────────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  5. SEMANTIC CHUNKING                                           │
│     Emits typed chunks: TextChunk | ThinkingChunk | ToolChunk   │
│     Assembled into AssistantEvent.Part objects.                 │
└─────────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  6. COMPLETION SNAPSHOT                                         │
│     Each emission: { message, tokensPerSecond, isComplete }     │
│     Your UI collects this — full content, always cumulative.    │
└─────────────────────────────────────────────────────────────────┘

How Pipeline Composition Works

LlamaEngine.createFlow(modelDefinition)          // Load model → ResourceState<LlamaEngine>
    .flatMapResource { engine ->                  // When loaded, create session
        engine.createSessionFlow(sessionConfig)
    }
    .flatMapResource { session ->                 // When session ready, create chat
        session.createChatSessionFlow(systemPrompt)
    }
    .filterSuccess()                              // Extract the chat session
    .flatMapLatest { chat ->                      // On each user turn
        chat.completion(userEvent)
    }
    .collect { snapshot ->                        // UI-ready snapshot, on every token
        updateTextView(snapshot.message.text)
        if (snapshot.isComplete) saveToDb(snapshot)
    }

No threading code. No callbacks. No lifecycle leaks. Cancelling the collecting coroutine cancels the whole chain, and the Flow API cleans up automatically.

Features

Zero-Allocation Streaming — DFA-based scanner (AllocationOptimizedScanner) uses StringBuilder internally and avoids per-token heap allocations, keeping the UI thread smooth
Thinking Block Extraction — First-class support for <think>...</think> in reasoning models (DeepSeek-R1, QwQ, MiniMax). Thinking text and response text are separated automatically
Declarative Flow API — ResourceState<T> ADT with flatMapResource, filterSuccess, onEachLoading, and fold operators for composing resource loads declaratively
Prompt Format Library — 6 built-in chat templates (Gemma, Llama 3, ChatML, DeepSeek-R1, Mistral, Nemotron) + QWEN_2_5 alias + support for fully custom formats, including "turn-start" injection for forcing thinking
Overflow Management — 3 strategies for handling full KV caches: Halt, ClearHistory, RollingWindow — configurable per session
Type-Safe Errors — LlamaError sealed class maps every native failure to a named subtype. No raw exceptions from the JNI boundary
History Replay — feedHistory(List<ChatEvent>) pre-populates the KV cache with a prior conversation, so follow-up generations are contextual

Built-In Prompt Formats

Template	Protocol	Best For
`GEMMA`	`<start_of_turn>` / `<end_of_turn>`	Google Gemma / Gemma 2 / Gemma 3n
`LLAMA_3`	`<|start_header_id|>` / `<|eot_id|>`	Llama 3 / 3.1 / 3.2 / 3.3
`CHAT_ML`	`<|im_start|>` / `<|im_end|>`	SmolLM2, Qwen 2.5, Yi, Hermes
`QWEN_2_5`	alias for `CHAT_ML`	Qwen 2.5 (convenient alias)
`DEEPSEEK_R1`	`<｜begin of sentence｜>` / `<｜end of sentence｜>`	DeepSeek-R1 / R1-Distill family
`MISTRAL`	`[INST]` / `[/INST]`	Mistral 7B, Mixtral 8x7B
`NEMOTRON`	`<extra_id_0>` / `<extra_id_1>`	NVIDIA Nemotron-Mini

Installation

1. Add JitPack to your repositories

// settings.gradle.kts
dependencyResolutionManagement {
    repositories {
        google()
        mavenCentral()
        maven { url = uri("https://jitpack.io") }
    }
}

2. Add the dependency

// build.gradle.kts (app)
dependencies {
    implementation("com.github.whyisitworking:llama-bro:<LATEST_VERSION>")
}

Check the JitPack badge above for the latest version.

Prerequisites

Download a GGUF Model

Grab a GGUF-quantized model from Hugging Face.

Recommended starting points:

Model	Size	Format	Recommended Source
Llama 3.2 1B	~600 MB	`LLAMA_3`	bartowski/Llama-3.2-1B-Instruct-GGUF
Gemma 3n 2B	~3 GB	`GEMMA`	unsloth/gemma-3n-E2B-it-GGUF
Qwen 2.5 0.5B	~400 MB	`QWEN_2_5`	bartowski/Qwen2.5-0.5B-Instruct-GGUF
DeepSeek-R1 7B	~5 GB	`DEEPSEEK_R1`	bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF

Quantization guide: Q4_K_M is the mobile sweet spot — best quality-to-speed tradeoff. Go Q3_K_M for RAM-constrained devices. Go Q5_K_M for maximum quality (larger, slower).

Quick Start

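import androidx.lifecycle.lifecycleScope
import com.suhel.llamabro.sdk.LlamaEngine
import com.suhel.llamabro.sdk.config.*
import com.suhel.llamabro.sdk.models.*
import com.suhel.llamabro.sdk.format.PromptFormats
import com.suhel.llamabro.sdk.model.*
// flatMapLatest is marked @ExperimentalCoroutinesApi; opt in where you call it.
import kotlinx.coroutines.flow.flatMapLatest
import kotlinx.coroutines.launch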

lifecycleScope.launch {
    LlamaEngine.createFlow(
        ModelDefinition(
            loadConfig = ModelLoadConfig(path = "/path/to/model.gguf"),
            promptFormat = PromptFormats.CHAT_ML,
        )
    )
    .onEachLoading { progress ->
        progressBar.progress = ((progress ?: 0f) * 100).toInt()
    }
    .flatMapResource { engine ->
        engine.createSessionFlow(
            SessionConfig(
                contextSize = 4096,
                overflowStrategy = OverflowStrategy.RollingWindow(dropTokens = 500),
                inferenceConfig = InferenceConfig(temperature = 0.7f, minP = 0.1f)
            )
        )
    }
    .flatMapResource { session ->
        session.createChatSessionFlow("You are a helpful assistant.")
    }
    .filterSuccess()
    .flatMapLatest { chat ->
        chat.completion(ChatEvent.UserEvent("Explain coroutines.", think = false))
    }
    .collect { snapshot ->
        textView.text = snapshot.message.text
        if (snapshot.isComplete) {
            speedLabel.text = "${snapshot.tokensPerSecond} tok/s"
        }
    }
}

API Overview

The SDK is layered. Each tier adds abstraction. Use what your use case demands.

`LlamaEngine` — The Model Loader

Loads the GGUF file and manages model weights. Creates sessions on demand. Keep one engine per model across the app.

// Recommended: Flow-based (auto-cleanup on coroutine cancellation)
LlamaEngine.createFlow(modelDefinition)
    .onEachLoading { progress -> showProgress(progress) }
    .flatMapResource { engine -> /* use engine */ }

// Manual: You manage the lifecycle
val engine = LlamaEngine.create(modelDefinition) { progress -> true /* return false to cancel */ }
val session = engine.createSession(sessionConfig)
engine.close() // Releases native memory

`LlamaSession` — The Token Engine

Manages the KV cache, token encoding, and sampling. Mutex-serialized for thread safety.

// Use the Flow API for standard sampling
session.generateFlow().collect { result ->
    print(result.token ?: "")
    // End the output line once the final result arrives.
    if (result.isComplete) println()
}

When to use directly: Implementing custom sampling loops, tool injection, or token-level diagnostics. Most apps should use LlamaChatSession instead.

`LlamaChatSession` — The Chat API

Handles prompt formatting, stop-token detection, thinking-block extraction, and metrics. This is where 95% of integrations start and end.

chat.completion(ChatEvent.UserEvent("Hello!", think = true)).collect { snapshot ->
    // snapshot.message.text          → Visible response
    // snapshot.message.thinkingText  → Hidden reasoning
    // snapshot.tokensPerSecond       → Generation speed
    // snapshot.isComplete            → True when done
}
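
To illustrate what "cumulative" means for these snapshots, here is a minimal sketch in plain Kotlin that does not use the SDK. SketchSnapshot and cumulativeSnapshots are hypothetical names invented for this example; the real snapshot type carries more fields (such as tokensPerSecond) and arrives through a Flow.

```kotlin
// Hypothetical, simplified stand-in for the SDK's completion snapshot.
data class SketchSnapshot(val text: String, val isComplete: Boolean)

// Turns a list of generated tokens into cumulative snapshots:
// each snapshot holds all text produced so far, not just the newest token.
fun cumulativeSnapshots(tokens: List<String>): List<SketchSnapshot> =
    tokens
        .runningFold("") { soFar, token -> soFar + token } // "", then growing text
        .drop(1)                                            // skip the empty seed
        .mapIndexed { index, text ->
            SketchSnapshot(text, isComplete = index == tokens.lastIndex)
        }
```

Because every snapshot already contains the full text, a UI can assign it directly to a text view instead of appending deltas.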

Configuration Reference

`ModelDefinition`

The root configuration object. Bundles load settings with the prompt format.

ModelDefinition(
    loadConfig = ModelLoadConfig(
        path = "/data/user/0/com.myapp/files/model.gguf",
        threads = 8,         // Match your device's performance core count
        useMMap = true,      // Memory-map the file (recommended)
        useMLock = false     // Lock in RAM (prevent OS swap — high-memory devices only)
    ),
    promptFormat = PromptFormats.LLAMA_3,
    features = listOf(ThinkingMarker)  // Enable thinking injection for reasoning models
)

`SessionConfig`

Option	Default	Notes
`contextSize`	`2048`	Token budget for the entire conversation (prompt + response)
`overflowStrategy`	`RollingWindow(500)`	What happens when the KV cache fills up
`inferenceConfig`	See below	Sampling parameters
`decodeConfig`	See below	I/O batch sizes for performance tuning
`seed`	`-1` (random)	Set an integer for reproducible outputs

`InferenceConfig` — Sampling

Option	Default	Range	Effect
`temperature`	`0.8f`	`0.0–2.0`	Randomness. `0.0` = deterministic greedy, `1.0` = neutral.
`repeatPenalty`	`1.0f`	`1.0–2.0`	Discourages the model from repeating recent tokens.
`presencePenalty`	`0.0f`	`0.0–2.0`	Penalizes all tokens that have appeared, not just recent ones.
`minP`	`0.1f`	`0.0–1.0`	Min-probability filter. Drops tokens whose probability is below `minP` times that of the most likely token, trimming the low-probability tail.
`topP`	`null`	`0.0–1.0`	Nucleus sampling. `null` = disabled.
`topK`	`null`	`1–∞`	Top-K sampling. `null` = disabled.
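
As a rough illustration of what a min-p filter does, the sketch below applies one to a small, made-up probability table. It is plain Kotlin and independent of the SDK; minPFilter is a hypothetical helper, and the real sampler runs natively inside llama.cpp on logits rather than on a Kotlin map.

```kotlin
// Keeps only tokens whose probability is at least minP times the probability
// of the most likely token, then renormalizes so the result sums to 1.
fun minPFilter(probs: Map<String, Double>, minP: Double): Map<String, Double> {
    val top = probs.values.maxOrNull() ?: return emptyMap()
    val threshold = minP * top
    val kept = probs.filterValues { it >= threshold }
    val total = kept.values.sum()
    return kept.mapValues { it.value / total }
}
```

With probabilities 0.6, 0.3, 0.05, and 0.05 and minP = 0.1, the threshold is 0.06, so only the first two tokens survive and are rescaled.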

`DecodeConfig` — Performance

Option	Default	Notes
`batchSize`	`2048`	Max tokens processed per decode step.
`microBatchSize`	`512`	Internal chunking granularity. Lower = less RAM.

Increase batchSize to 4096 for faster long-prompt prefill. Reduce it on RAM-constrained devices.

Overflow Strategies

Strategy	Behavior	Best For
`Halt`	Throws `LlamaError.ContextOverflow`	Strict determinism, batch processing
`ClearHistory`	Wipes context, reloads system prompt, continues	Short-session apps
`RollingWindow(n)`	Evicts oldest `n` tokens, keeps chatting	Long conversational flows (recommended)
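
The three behaviors in the table can be sketched on a plain list of token IDs. This is an illustration only, not the SDK's implementation: SketchOverflow, SketchContextOverflow, and makeRoom are hypothetical names, and the real work happens on the native KV cache. In particular, whether the SDK pins the system prompt during a rolling eviction is not stated here; this sketch simply drops the oldest tokens.

```kotlin
// Hypothetical mirror of the three strategies in the table above.
sealed interface SketchOverflow {
    object Halt : SketchOverflow
    object ClearHistory : SketchOverflow
    data class RollingWindow(val dropTokens: Int) : SketchOverflow
}

// Hypothetical stand-in for LlamaError.ContextOverflow.
class SketchContextOverflow : RuntimeException("context is full")

// Returns the tokens that remain in the context once room has been made.
// systemTokenCount is the length of the system prompt at the front of the list.
fun makeRoom(
    tokens: List<Int>,
    systemTokenCount: Int,
    contextSize: Int,
    strategy: SketchOverflow,
): List<Int> {
    if (tokens.size < contextSize) return tokens // still room, nothing to do
    return when (strategy) {
        SketchOverflow.Halt -> throw SketchContextOverflow()
        // Keep only the system prompt; the rest of the conversation is discarded.
        SketchOverflow.ClearHistory -> tokens.take(systemTokenCount)
        // Evict the oldest n tokens and keep everything after them.
        is SketchOverflow.RollingWindow -> tokens.drop(strategy.dropTokens)
    }
}
```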

Thinking Blocks & Reasoning Models

Reasoning models like DeepSeek-R1 and QwQ expose their internal chain-of-thought inside <think>...</think> tags. Llama Bro automatically extracts these into a separate part of the AssistantEvent.

// Set think = true to inject the opening <think> tag,
// forcing the model into reasoning mode.
val userEvent = ChatEvent.UserEvent(
    content = "What is 17 × 23? Show your work.",
    think = true
)

chat.completion(userEvent).collect { snapshot ->
    // Display reasoning in a collapsible section
    val reasoning = snapshot.message.thinkingText   // "Let me calculate 17 × 23..."
    val answer    = snapshot.message.text            // "The answer is 391."

    if (snapshot.isComplete) {
        println("${snapshot.tokensPerSecond} tokens/sec")
    }
}

The think = true parameter only works on models with ThinkingMarker in their ModelDefinition.features. On non-thinking models, it is silently ignored — making the API safe to use unconditionally.
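
For readers curious how thinking text can be separated from response text while tokens stream in, here is a minimal state-machine sketch in plain Kotlin. It is not the SDK's AllocationOptimizedScanner: ThinkSplitter is a hypothetical class written for this example, it handles only the <think> tag pair, and it makes no allocation guarantees. The point it shows is that a tag may be split across token boundaries, so a partial match has to be held back until it either completes or fails.

```kotlin
// Hypothetical sketch: splits a streamed response into thinking and visible text.
class ThinkSplitter {
    val text = StringBuilder()       // visible response
    val thinking = StringBuilder()   // content found inside <think>...</think>
    private val pending = StringBuilder() // characters that may be the start of a tag
    private var inThink = false

    fun feed(chunk: String) {
        for (ch in chunk) {
            pending.append(ch)
            // Outside a block we look for the opening tag, inside for the closing one.
            val tag = if (inThink) "</think>" else "<think>"
            if (pending.length == tag.length && tag.startsWith(pending)) {
                inThink = !inThink   // full tag matched: switch state, emit nothing
                pending.clear()
                continue
            }
            // Flush leading characters until what remains could still begin the tag.
            while (pending.isNotEmpty() && !tag.startsWith(pending)) {
                sink().append(pending[0])
                pending.deleteCharAt(0)
            }
        }
    }

    // Call once generation ends to emit any held-back partial tag as plain content.
    fun finish() {
        sink().append(pending)
        pending.clear()
    }

    private fun sink(): StringBuilder = if (inThink) thinking else text
}
```

Feeding the chunks "<th", "ink>Let me", and " see</think>391 < 400" leaves "Let me see" in thinking and "391 < 400" in text, even though the opening tag arrived in two pieces.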

Error Handling

All native failures cross the JNI boundary as typed LlamaError subtypes:

sealed class LlamaError : Exception() {
    class ModelNotFound(val path: String) : LlamaError()
    class ModelLoadFailed(val path: String, cause: Throwable?) : LlamaError()
    class BackendLoadFailed(val backendName: String) : LlamaError()
    class ContextInitFailed(cause: Throwable?) : LlamaError()
    class ContextOverflow : LlamaError()
    class DecodeFailed(val code: Int) : LlamaError()
    class NativeException(val nativeMessage: String, cause: Throwable?) : LlamaError()
}

Compose error recovery into the same flow chain:

LlamaEngine.createFlow(modelDefinition)
    .catch { e ->
        when (e) {
            is LlamaError.ModelNotFound   -> showModelPickerUI()
            is LlamaError.ContextOverflow -> onContextFull()
            else                          -> logAndRethrow(e)
        }
    }
    .collect { /* ... */ }

Conversation History

Re-populate the KV cache with a prior conversation before the next turn:

val history: List<ChatEvent> = listOf(
    ChatEvent.UserEvent("What's Kotlin?", think = false),
    ChatEvent.AssistantEvent(listOf(
        ChatEvent.AssistantEvent.Part.TextPart("Kotlin is a JVM language by JetBrains.")
    )),
    ChatEvent.UserEvent("And coroutines?", think = false),
)

chat.feedHistory(history)
chat.completion(ChatEvent.UserEvent("Give me an example.", think = false))
    .collect { snapshot -> /* ... */ }

The session processes history tokens once — subsequent context is pre-warmed and generations are faster.

ResourceState Flow Operators

ResourceState<T> is the lifecycle ADT powering the entire SDK:

sealed class ResourceState<out T> {
    data class Loading(val progress: Float?) : ResourceState<Nothing>()
    data class Success<T>(val value: T) : ResourceState<T>()
    data class Failure(val error: Throwable) : ResourceState<Nothing>()
}

Compose resource flows declaratively using built-in operators:

Operator	Use
`flatMapResource { }`	Chain a resource-loading step onto an existing one
`filterSuccess()`	Strip the wrapper, emit only successful values as `Flow<T>`
`onEachLoading { }`	React to progress without leaving the chain
`onEachSuccess { }`	Side-effect on load completion
`mapSuccess { }`	Transform the inner value
`fold(onLoading, onSuccess, onFailure)`	Exhaustive pattern match
`getOrNull()`	Extract value or `null`
`getOrElse { }`	Extract value or a fallback
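
To make a few of these operators concrete, the sketch below defines value-level versions of fold, mapSuccess, and getOrNull on the ResourceState type shown above. The signatures are assumptions made for this example; the SDK may define its operators on Flow<ResourceState<T>> instead, or with different parameters.

```kotlin
// Copied from the definition above so the sketch compiles on its own.
sealed class ResourceState<out T> {
    data class Loading(val progress: Float?) : ResourceState<Nothing>()
    data class Success<T>(val value: T) : ResourceState<T>()
    data class Failure(val error: Throwable) : ResourceState<Nothing>()
}

// Assumed signature: exhaustive match that forces every state to be handled.
inline fun <T, R> ResourceState<T>.fold(
    onLoading: (Float?) -> R,
    onSuccess: (T) -> R,
    onFailure: (Throwable) -> R,
): R = when (this) {
    is ResourceState.Loading -> onLoading(progress)
    is ResourceState.Success -> onSuccess(value)
    is ResourceState.Failure -> onFailure(error)
}

// Assumed signature: transforms the value and passes Loading and Failure through.
inline fun <T, R> ResourceState<T>.mapSuccess(transform: (T) -> R): ResourceState<R> =
    when (this) {
        is ResourceState.Success -> ResourceState.Success(transform(value))
        is ResourceState.Loading -> this
        is ResourceState.Failure -> this
    }

// Assumed signature: returns the value on Success, otherwise null.
fun <T> ResourceState<T>.getOrNull(): T? =
    if (this is ResourceState.Success) value else null
```

Because the class is sealed, the when expressions need no else branch, and adding a new state would turn every unhandled call site into a compile error.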

Architecture

┌──────────────────────────────────┐
│  LlamaChatSession (Public API)   │  Formatting, stop tokens, metrics
├──────────────────────────────────┤
│  LlamaSession (Public API)       │  KV cache, mutex, token control
├──────────────────────────────────┤
│  LlamaEngine (Public API)        │  Model loading, session factory
├──────────────────────────────────┤
│  JNI Bridge (Internal)           │  C++ ↔ Kotlin, error mapping
├──────────────────────────────────┤
│  llama.cpp (Native C++)          │  GGML, SIMD (NEON, dotprod, i8mm)
└──────────────────────────────────┘

All concrete implementations are internal. The public surface is interface-based. Extensions and wrappers can depend on the interfaces without coupling to the implementation.

Custom Prompt Formats

Any model not in the built-in list can be supported with a custom PromptFormat:

val custom = PromptFormat(
    systemPrefix = "<<SYS>>\n",
    userPrefix = "[INST] ",
    assistantPrefix = "[/INST] ",
    endOfTurn = "</s>\n",
    emitAssistantPrefixOnGeneration = true
)

LlamaEngine.createFlow(
    ModelDefinition(
        loadConfig = ModelLoadConfig("/path/to/model.gguf"),
        promptFormat = custom
    )
)
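
To show how the fields of a prompt format could combine into the final prompt string, here is a minimal sketch in plain Kotlin. SketchPromptFormat and renderPrompt are hypothetical stand-ins written for this example; the field names mirror the PromptFormat example above, but the SDK's actual rendering logic is not shown in this document and may differ.

```kotlin
// Hypothetical stand-in for the SDK's PromptFormat, using the same field names.
data class SketchPromptFormat(
    val systemPrefix: String,
    val userPrefix: String,
    val assistantPrefix: String,
    val endOfTurn: String,
    val emitAssistantPrefixOnGeneration: Boolean,
)

// Renders an optional system prompt plus one user turn into a single string.
fun renderPrompt(format: SketchPromptFormat, system: String?, user: String): String =
    buildString {
        if (system != null) {
            append(format.systemPrefix).append(system).append(format.endOfTurn)
        }
        append(format.userPrefix).append(user).append(format.endOfTurn)
        // Leave the assistant turn open so the model continues from here.
        if (format.emitAssistantPrefixOnGeneration) append(format.assistantPrefix)
    }

// ChatML markers as listed in the built-in formats table.
val chatMlSketch = SketchPromptFormat(
    systemPrefix = "<|im_start|>system\n",
    userPrefix = "<|im_start|>user\n",
    assistantPrefix = "<|im_start|>assistant\n",
    endOfTurn = "<|im_end|>\n",
    emitAssistantPrefixOnGeneration = true,
)
```

Rendering "Hello" with chatMlSketch and no system prompt yields the same string as step 2 of the pipeline diagram.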

Roadmap

Streaming Grammar — Force structured JSON/function output from any model
Function Calling — Registered tools that models can invoke during generation
Multi-Model Sessions — Seamlessly switch models mid-conversation
GGUF Metadata — Auto-detect model type and recommended format from file headers

Contributing

Open an issue to discuss non-trivial changes first
Run tests before submitting: ./gradlew :sdk:testDebugUnitTest
Build the release AAR: ./gradlew :sdk:assembleRelease
Follow the Kotlin coding conventions
Keep native code minimal and its intent clear

See CLAUDE.md for an architecture deep-dive and build setup.

License

Apache 2.0

If Llama Bro saved you a weekend, give it a ⭐
Built with ❤️ for the Android + Local AI community.
