Hardware-agnostic, auto-optimizing AI assistant with RAG for codebases and optional cloud API acceleration.
Tangi is an AI assistant designed for developers who want fast, hardware-aware inference and code-aware answers. It automatically detects system capabilities (CPU, threads, memory, BLAS backend) and tunes itself for optimal performance.
It includes a built-in Retrieval-Augmented Generation (RAG) system that indexes your codebase, enabling accurate, context-grounded responses.
New in v1.1.0: Tangi now supports Online Mode with NVIDIA NIM API integration, offering cloud-accelerated responses (40 requests/minute free tier, no credit card required) alongside local LLM inference.
*(Screenshots: main widget; LLM inference demo)*
| Mode | Description | Best For |
|---|---|---|
| Offline (Local) | Run GGUF models on your hardware using llama-cpp-python | Privacy, air-gapped environments, no internet |
| Online (Cloud) | NVIDIA NIM API with OpenAI-compatible endpoints | Speed, complex reasoning, reduced local resource usage |
- Automatic detection of physical vs logical CPU cores
- NUMA-aware scheduling (multi-socket systems)
- OpenBLAS auto-configuration
- Dynamic batch sizing based on RAM
- Optional memory locking to prevent swapping
- Auto-unload local model when switching to online mode (frees 4-6GB RAM)
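Roughly, the auto-tuning above amounts to a heuristic like the following sketch. The names and thresholds here are illustrative assumptions, not Tangi's actual code:

```python
import os

def detect_settings():
    """Pick inference settings from the host hardware (illustrative heuristic)."""
    logical = os.cpu_count() or 1
    # Leave headroom on larger machines so the UI and BLAS threads don't starve.
    n_threads = logical if logical <= 4 else max(1, int(logical * 0.8))
    # Scale batch size with thread count (stand-in for a RAM-based rule).
    n_batch = min(512, 64 * max(1, n_threads // 2))
    return {'n_threads': n_threads, 'n_batch': n_batch}
```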
| Feature | Command | Use Case |
|---|---|---|
| Code Indexing | `/index /path` | Index a codebase for semantic search |
| Standard RAG | `/search "question"` | Fast, single-pass retrieval for direct questions |
| Deep Search | `/ds "question"` | Multi-step iterative search for complex, cross-file analysis |
- Semantic search across codebases
- Multi-project support
- Automatic ignore rules (venv, node_modules, build artifacts)
- Chunk preview before querying
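A minimal sketch of how indexing with ignore rules and chunking can work. The ignore set, chunk size, and function names are assumptions for illustration, not Tangi's internals:

```python
import os

IGNORED = {'venv', 'node_modules', 'build', '.git', '__pycache__'}

def iter_source_files(root):
    """Yield file paths under root, skipping common ignore directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in IGNORED]
        for name in filenames:
            yield os.path.join(dirpath, name)

def chunk_text(text, size=400, overlap=50):
    """Split text into overlapping chunks suitable for embedding."""
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text), 1), step):
        chunks.append(text[start:start + size])
    return chunks
```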
- Markdown and plain chat modes
- Session persistence
- Theme support (dark/light)
- KV cache for faster repeated queries
- Window transparency persistence
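A repeated-query cache of the kind hinted at above might look like this sketch; `QueryCache` is a hypothetical name, not Tangi's API:

```python
import hashlib

class QueryCache:
    """Tiny response cache keyed by a hash of the query (illustrative only)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(query):
        return hashlib.sha256(query.encode('utf-8')).hexdigest()

    def get(self, query):
        """Return the cached answer, or None on a miss."""
        return self._store.get(self._key(query))

    def put(self, query, answer):
        self._store[self._key(query)] = answer
```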
- OpenBLAS acceleration
- Thread coordination (avoids BLAS/LLM contention)
- Automatic token budgeting
- Context window management
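Automatic token budgeting can be sketched as greedy selection of retrieved chunks under a context budget; the whitespace word count below is a stand-in for a real tokenizer:

```python
def budget_chunks(chunks, max_tokens):
    """Keep the highest-ranked chunks that fit within the context budget."""
    selected, used = [], 0
    for chunk in chunks:  # assumed pre-sorted by relevance
        cost = len(chunk.split())  # stand-in for a real tokenizer
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected
```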
- NVIDIA NIM (free tier: 40 requests/minute, no credit card)
- OpenAI (GPT-4o, GPT-4o-mini)
- Together AI
- DeepSeek
- Any OpenAI-compatible endpoint
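All of these providers speak the same chat-completions protocol, so a single request builder covers them. A minimal sketch (the model name is just an example):

```python
import json

def build_chat_request(model, prompt, system=None):
    """Build an OpenAI-compatible /chat/completions request body."""
    messages = []
    if system:
        messages.append({'role': 'system', 'content': system})
    messages.append({'role': 'user', 'content': prompt})
    return json.dumps({'model': model, 'messages': messages})
```

To use it, POST the body to `<base-url>/chat/completions` with an `Authorization: Bearer <api-key>` header; only the base URL and key differ between providers.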
- Python 3.12+
- 8 GB RAM minimum (16 GB recommended)
- OpenBLAS (recommended)
- ~10 GB disk space for models
- Optional: NVIDIA API key for online mode
```shell
git clone https://github.com/mreinrt/Tangi.git
cd Tangi
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
```
Install dependencies:
```shell
# OpenBLAS
sudo apt install libopenblas-dev      # Debian/Ubuntu
# or
sudo emerge -av sci-libs/openblas     # Gentoo

pip install -r requirements.txt

# Rebuild llama-cpp-python with OpenBLAS
pip uninstall llama-cpp-python -y
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" \
  pip install llama-cpp-python==0.3.16 --no-cache-dir
```
```shell
python -m Tangi
```
- Load a model (File → Load Model or select a `.gguf` file)
- Download the embedding model (one-time setup): `/get-rag`
- Index your codebase: `/index /path/to/your/project`
- Select the indexed codebase (Manage Index → Select CodeBase)
- Ask questions!
- Get a free NVIDIA API key from build.nvidia.com
- Go to Preferences → NVIDIA NIM Online Mode
- Enter your API key and Base URL (default: `https://integrate.api.nvidia.com/v1`)
- Select a model (recommended: `mistralai/mistral-nemotron`)
- Click Test Connection to verify
- Toggle Online Mode in the status bar (bottom right)
- The local model auto-unloads to free RAM
- Chat with cloud acceleration!
Step 1: Download a local RAG (embedding) model with `/get-rag`, or place one in `~/.cache/huggingface/hub/`. You can also select an existing local RAG model via File → Manage Index → "Manage RAG Models".
`/index /home/user/projects/codebase`
- Open Manage Index (File → Manage Index)
- Select your indexed project
- Click Select CodeBase
- Direct, factual questions
- Single file lookups
- Finding specific functions or classes
- Simple queries
- Now supports online API for faster responses
- Complex, multi-file questions
- System architecture understanding
- Cross-module dependencies
- Troubleshooting complex issues
| Command | Description |
|---|---|
| /about | Application information |
| /help or /commands | Show all commands |
| Command | Description |
|---|---|
| /hf login | Authenticate |
| /hf download MODEL | Download model |
| /hf search QUERY | Search models |
| /hf info MODEL | Model details |
| /hf cache | Cache info |
| Command | Description |
|---|---|
| /index PATH | Index codebase |
| /search QUERY | Fast retrieval (uses online API if available) |
| /ds QUERY | Deep search |
| /get-rag | Download embeddings |
| /remove-index PATH | Remove index |
| /clear | Clear context |
| /rag-status | Status |
| /cache-info | Cache stats |
Standard RAG (offline): Query → Retrieve → Context → Local LLM → Answer

Standard RAG (online): Query → Retrieve → Context → Cloud API (NVIDIA NIM) → Answer

Deep Search: Query → Retrieve → Analyze → Refine → Retrieve → … → Answer
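The Deep Search flow can be illustrated with a toy word-overlap retriever; real semantic search uses embeddings, so this is a deliberate simplification with hypothetical names:

```python
def deep_search(query, corpus, max_steps=3):
    """Iteratively retrieve, accumulating context across steps (toy sketch)."""
    context = []
    for _ in range(max_steps):
        terms = set(query.lower().split())
        candidates = [d for d in corpus if d not in context]
        if not candidates:
            break
        # Retrieve the best-matching unseen document by word overlap.
        best = max(candidates, key=lambda d: len(terms & set(d.lower().split())))
        context.append(best)
        query = query + ' ' + best  # refine the query with what was found
    return context
```

Each round folds retrieved text back into the query, which is how cross-file chains (parser → lexer → tokens) get pulled in over multiple steps.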
Default memory budget: 85% of system RAM (adjustable in Preferences)
| Setting | Default | Description |
|---|---|---|
| API Base URL | `https://integrate.api.nvidia.com/v1` | Endpoint for the cloud API |
| Model | `mistralai/mistral-nemotron` | Best for coding (92.68% HumanEval) |
| API Key | User-provided | Get from build.nvidia.com |
```python
# All values are chosen automatically at startup:
optimal_settings = {
    'n_threads': ...,   # auto-detected from CPU topology
    'n_batch': ...,     # auto-optimized for available RAM
    'use_mlock': ...,   # enabled based on system RAM
}
```
| System | Threads | Batch | Context |
|---|---|---|---|
| 2–4 cores | = cores | 64–128 | 8K–16K |
| 4–8 cores | cores +25% | 128–256 | 16K–32K |
| 8+ cores | 80–100% of cores | 256–512 | 32K–128K |
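Encoded as a function, the table reads roughly as follows; where the table gives ranges, the exact values chosen here are assumptions:

```python
def tuning_for(cores):
    """Map core count to thread/batch/context tiers, per the table above."""
    if cores <= 4:
        return {'threads': cores, 'batch': 128, 'context': 16_384}
    if cores <= 8:
        return {'threads': int(cores * 1.25), 'batch': 256, 'context': 32_768}
    return {'threads': int(cores * 0.9), 'batch': 512, 'context': 131_072}
```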
| Provider | Free Tier | Speed | Best For |
|---|---|---|---|
| NVIDIA NIM | 40 requests/min | 1-3 seconds | Coding, general use |
| OpenAI | Requires payment method | 1-3 seconds | General purpose |
| Together AI | Free credits | 1-3 seconds | Various open models |
- NVIDIA NIM API integration with online/offline toggle
- Auto-unload local model when switching to online mode
- Window transparency persistence across sessions
- API Base URL configuration in preferences
- Universal OnlineAPIClient (supports any OpenAI-compatible endpoint)
- RAG search now uses online API when available (faster)
- Centered response settings buttons in preferences
- Fixed the missing `live_transparency_change` method
MIT License.
- llama-cpp-python
- sentence-transformers
- OpenBLAS
- NVIDIA NIM for free API access
GitHub Issues for bugs and requests.
BTC: 3GtCgHhMP7NTxsdNjcDs7TUNSBK6EXoAzz
ETH: 0x5f1ed610a96c648478a775644c9244bf4e78631e
Built by Michael Reinert
