NVIDIA NIM model latency benchmarker, written in Nim.
nimakai (నిమ్మకాయి) = lemon in Telugu. NIM + Nim = nimakai.
A focused, single-binary tool that continuously pings NVIDIA NIM models and reports latency metrics. Includes an 80-model tiered catalog, recommendation engine for oh-my-opencode routing, watch mode with alerts, CI health checks, live model discovery, and full sync mode. No bloat, no TUI framework, no telemetry. Just latency numbers.
Also includes nimaproxy — a Rust-based key-rotation proxy for production use.
- Latest — most recent round-trip time
- Avg — rolling average (ring buffer, last 100 samples)
- P50 — median latency
- P95 — 95th percentile (tail spikes)
- P99 — 99th percentile (worst case)
- Jitter — standard deviation (consistency)
- Stability — composite score 0-100 (P95 + jitter + spike rate + reliability)
- Health — UP / TIMEOUT / OVERLOADED / ERROR / NO_KEY / NOT_FOUND
- Verdict — Perfect / Normal / Slow / Spiky / Very Slow / Unstable / Not Active / Not Found
- Up% — uptime percentage
- Tier — S+ / S / A+ / A / A- / B+ / B / C (based on SWE-bench Verified scores)
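These are standard order statistics over the sample buffer. As a rough illustration of the math (not nimakai's actual Nim code in `metrics.nim` — the exact percentile method used there is an assumption here):

```python
import statistics

def latency_metrics(samples: list[float]) -> dict:
    """Compute core latency metrics from a ring buffer of round-trip times (ms)."""
    ordered = sorted(samples)

    def pct(p: float) -> float:
        # Nearest-rank percentile over the sorted samples.
        idx = min(len(ordered) - 1, int(p * len(ordered)))
        return ordered[idx]

    return {
        "avg": sum(samples) / len(samples),
        "p50": pct(0.50),
        "p95": pct(0.95),
        "p99": pct(0.99),
        "jitter": statistics.pstdev(samples),  # population std dev
    }

# One 3000 ms spike dominates the tail percentiles and the jitter:
print(latency_metrics([100, 120, 110, 3000, 115]))
```

The P95/P99 columns exist precisely because a single spike like the one above barely moves the average but is immediately visible in the tail.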
git clone https://github.com/dirmacs/nimakai.git
cd nimakai
nimble build

Requires Nim 2.0+ and OpenSSL.
export NVIDIA_API_KEY="nvapi-..."
# Continuous monitoring (S+ and S tier models by default)
nimakai
# Single round, then exit
nimakai --once
# Specific models only
nimakai -m qwen/qwen3.5-122b-a10b,qwen/qwen3.5-397b-a17b
# Filter by tier
nimakai --tier A --once
# Sort by stability score
nimakai --sort stability
# Benchmark models from opencode.json
nimakai --opencode --once
# JSON output
nimakai --once --json

nimakai Continuous benchmark (default)
nimakai catalog List the full 80-model catalog with tiers and metadata
nimakai recommend Benchmark and recommend routing changes
nimakai watch Monitor OMO-routed models with alerts
nimakai check CI health check with exit codes
nimakai discover Compare API models against catalog
nimakai history Show historical benchmark data
nimakai trends Show latency trend analysis (improving/degrading/stable)
nimakai opencode Show models from opencode.json + OMO routing
nimakai proxy start Start nimaproxy daemon (FFI integration)
nimakai proxy stop Stop nimaproxy daemon
nimakai proxy status Show nimaproxy live stats
nimakai can benchmark models and recommend optimal routing for oh-my-opencode categories:
# Advisory: show recommendations
nimakai recommend --rounds 3
# Full sync: backup -> diff -> apply to oh-my-opencode.json
nimakai recommend --rounds 5 --apply
# Rollback to previous config
nimakai recommend --rollback

Each OMO category is scored using weighted criteria:
| Category Need | SWE Weight | Speed Weight | Stability Weight |
|---|---|---|---|
| Speed (quick) | 0.15 | 0.55 | 0.20 |
| Quality (deep, artistry) | 0.45 | 0.10 | 0.20 |
| Reliability (ultrabrain) | 0.25 | 0.20 | 0.40 |
| Vision (visual-engineering) | 0.30 | 0.20 | 0.30 |
| Balance (writing, default) | 0.30 | 0.30 | 0.25 |
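A hypothetical sketch of how those weights might combine (the weight rows come from the table above; the [0, 1] normalization of each signal is assumed, and since the rows do not sum to 1, the remaining weight presumably goes to other signals such as uptime):

```python
# Category weights copied from the table above (remaining weight assumed
# to go to criteria not shown, e.g. uptime).
WEIGHTS = {
    "speed":       {"swe": 0.15, "speed": 0.55, "stability": 0.20},
    "quality":     {"swe": 0.45, "speed": 0.10, "stability": 0.20},
    "reliability": {"swe": 0.25, "speed": 0.20, "stability": 0.40},
    "vision":      {"swe": 0.30, "speed": 0.20, "stability": 0.30},
    "balance":     {"swe": 0.30, "speed": 0.30, "stability": 0.25},
}

def category_score(category: str, swe: float, speed: float, stability: float) -> float:
    """All three signals normalized to [0, 1]; higher is better."""
    w = WEIGHTS[category]
    return w["swe"] * swe + w["speed"] * speed + w["stability"] * stability
```

Under this sketch, a model with perfect scores on all three signals would top out at the row sum (e.g. 0.90 for the speed category), leaving headroom for the unlisted criteria.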
| Key | Action |
|---|---|
| A | Sort by average latency |
| P | Sort by P95 latency |
| S | Sort by stability score |
| T | Sort by tier |
| N | Sort by model name |
| U | Sort by uptime % |
| 1-9 | Toggle favorite on Nth model |
| Q | Quit |
nimakai v0.10.0 includes FFI integration with nimaproxy, allowing you to start/stop/query the Rust key-rotation proxy directly from the Nim CLI:
# Start the proxy daemon
nimakai proxy start --proxy-config /path/to/nimaproxy.toml --proxy-port 8080
# Check live status
nimakai proxy status
# Stop the daemon
nimakai proxy stop

Requirements:
- libnimaproxy.so must be in the same directory as the nimakai binary, or LD_LIBRARY_PATH must be set
- nimaproxy config file with API keys (see nimaproxy section below)
Status output shows:
- Overall health status
- Active key count
- Routing and racing configuration
- Per-key status (active/cooldown, key hint)
- Per-model latency stats (avg, P95, success rate, degradation)
| Flag | Short | Description | Default |
|---|---|---|---|
| --once | -1 | Single round, then exit | continuous |
| --models | -m | Comma-separated model IDs | S/S+ tier |
| --interval | -i | Ping interval in seconds | 5 |
| --timeout | -t | Request timeout in seconds | 15 |
| --json | -j | JSON output | table |
| --tier | | Filter by tier family (S, A, B, C) | |
| --sort | | Sort: avg, p95, stability, tier, name, uptime | avg |
| --opencode | | Use models from opencode.json | |
| --rounds | -r | Benchmark rounds for recommend | 3 |
| --apply | | Apply recommendations to oh-my-opencode.json | |
| --rollback | | Rollback oh-my-opencode.json from backup | |
| --quiet | -q | Suppress stderr status messages | |
| --no-history | | Don't write to history file | |
| --dry-run | | Preview recommend changes without applying | |
| --rec-history | | Show recommendation history | |
| --throughput | | Measure output token throughput | |
| --alert-threshold | | Alert threshold for watch mode | 50 |
| --fail-if-degraded | | Exit 1 if any model is degraded (check mode) | |
| --days | -d | Days of history to show | 7 |
| --profile | | Load named profile from config | |
| --help | -h | Show help | |
| --version | -v | Show version | |
Optional config at ~/.config/nimakai/config.json:
{
"interval": 5,
"timeout": 15,
"thresholds": {
"perfect_avg": 400,
"perfect_p95": 800,
"normal_avg": 1000,
"normal_p95": 2000,
"spike_ms": 3000
},
"profiles": {
"work": { "interval": 10, "tier_filter": "S", "rounds": 5 },
"fast": { "timeout": 5 }
},
"favorites": []
}

Use profiles with nimakai --profile work to load pre-configured settings.
Custom models can be added via ~/.config/nimakai/models.json to extend the built-in catalog.
History is persisted to ~/.local/share/nimakai/history.jsonl (30-day auto-prune).
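The append-then-prune cycle for JSONL history might look like the following Python sketch (nimakai's real logic is in `history.nim`; the record shape and function name here are assumptions):

```python
import json
import time

PRUNE_SECONDS = 30 * 24 * 3600  # 30-day retention, matching the auto-prune

def append_and_prune(path: str, record: dict) -> None:
    """Append one benchmark record, then rewrite the file without stale rows."""
    record = {"ts": time.time(), **record}  # caller-supplied "ts" wins, for tests
    try:
        with open(path) as f:
            rows = [json.loads(line) for line in f if line.strip()]
    except FileNotFoundError:
        rows = []
    rows.append(record)
    cutoff = time.time() - PRUNE_SECONDS
    with open(path, "w") as f:
        for row in rows:
            if row["ts"] >= cutoff:  # drop anything older than 30 days
                f.write(json.dumps(row) + "\n")
```

One JSON object per line keeps appends cheap and makes the prune a single linear rewrite.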
src/
nimakai.nim Entry point, main loop, SIGINT handler
nimakai/
types.nim Types, enums, constants
cli.nim CLI argument parsing with profiles
metrics.nim Pure metric functions (avg, p50, p95, p99, jitter, stability)
ping.nim HTTP ping + throughput measurement
catalog.nim 80-model catalog with SWE-bench tiers, O(1) index
display.nim Table/JSON rendering, ANSI helpers
config.nim Config file persistence + profile loading
history.nim JSONL history persistence + trend detection
opencode.nim OpenCode + oh-my-opencode integration
recommend.nim Recommendation engine (categories + agents + uptime)
rechistory.nim Recommendation history tracking (JSONL)
sync.nim Backup, apply, rollback for OMO config
watch.nim Watch mode alerting (down/recovered/degraded)
discovery.nim Live model discovery from NVIDIA API
tests/
test_types.nim 9 tests
test_metrics.nim 41 tests
test_display.nim 32 tests
test_ping.nim 15 tests
test_catalog.nim 22 tests
test_config.nim 12 tests
test_opencode.nim 5 tests
test_recommend.nim 34 tests
test_sync.nim 17 tests
test_history.nim 28 tests
test_rechistory.nim 9 tests
test_watch.nim 8 tests
test_integration.nim 12 tests
test_discovery.nim 9 tests
test_cli.nim 72 tests
nimaproxy/
Cargo.toml lib + bin + tests
nimaproxy.toml Config (NOT committed - contains API keys)
nimaproxy.toml.example Template for users
.gitignore Excludes nimaproxy.toml
src/
lib.rs Exports modules + AppState
main.rs Binary entry point
config.rs TOML config parsing
key_pool.rs Key rotation, rate-limit tracking
model_stats.rs Per-model latency tracking
model_router.rs Latency-aware model selection
proxy.rs HTTP handlers
tests/
integration.rs 12 integration tests
e2e_live.rs 6 E2E tests with real NVIDIA API
stress_test.rs 25-turn live stress test
Standalone Rust binary for production use. Provides OpenAI-compatible API with key rotation and latency-aware routing.
cd nimaproxy
cargo build --release
# Copy and edit config
cp nimaproxy.toml.example nimaproxy.toml
# Edit nimaproxy.toml with your NVIDIA API keys
# Run
./target/release/nimaproxy --config nimaproxy.toml

Endpoints:
- GET /health — Key pool status
- GET /stats — Per-model latency stats
- GET /v1/models — Passthrough to NVIDIA
- POST /v1/chat/completions — Proxy with key rotation
Features:
- Round-robin key rotation across multiple API keys
- Automatic 429 handling with per-key cooldown
- Latency-aware model routing ("model": "auto")
- Per-model stats tracking (TTFC, success rate, degradation detection)
- x-key-label response header: tracks which key was used, for rotation debugging
Model Routing (V2):
[routing]
strategy = "latency_aware"
spike_threshold_ms = 3000
models = [
"nvidia/meta/llama-3.3-70b-instruct",
"nvidia/qwen/qwen2.5-coder-32b-instruct",
"nvidia/moonshotai/kimi-k2-instruct",
"nvidia/mistralai/devstral-2-123b-instruct-2512",
]

When a request arrives with "model": "auto", the proxy picks the best model from this list. Untried models (< 3 samples) get priority. Degraded models (≥3 consecutive failures or avg > spike_threshold_ms) are skipped.
Model Racing (Speculative Execution):
[racing]
enabled = true
models = [
"moonshotai/kimi-k2",
"minimaxai/minimax-m2.5",
"qwen/qwen3-next-80b-a3b-thinking",
"stepfun-ai/step-3.5-flash",
"qwen/qwen3.5-122b-a10b",
"qwen/qwen3-coder-480b-a35b-instruct",
"google/gemma-3-27b-it",
"qwen/qwen2.5-coder-32b-instruct",
]
max_parallel = 8
timeout_ms = 15000
strategy = "complete"

Fires N parallel requests to N models and returns the first response, trading an N× token budget for min(P50 latency). Keys are pre-allocated per race task to avoid 429 rate-limit collisions. Models are selected in round-robin order via racing_cursor to prevent a single fast model from dominating and breaking inference loops. Dead models (≥20 consecutive failures or 0 samples) are filtered out automatically.
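The race itself is a scatter/gather pattern. nimaproxy implements it in Rust; a minimal asyncio sketch, with a stubbed request function standing in for the real proxied call, might look like:

```python
import asyncio
import random

async def call_model(model: str) -> str:
    """Stand-in for one proxied chat-completion request (real code does HTTP)."""
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return f"response from {model}"

async def race(models: list[str], timeout_s: float = 15.0) -> str:
    """Fire one request per model; return the first completed response."""
    tasks = [asyncio.create_task(call_model(m)) for m in models]
    done, pending = await asyncio.wait(
        tasks, timeout=timeout_s, return_when=asyncio.FIRST_COMPLETED
    )
    for t in pending:
        t.cancel()  # losers are cancelled so they stop consuming budget
    return next(iter(done)).result()

print(asyncio.run(race(["kimi-k2", "minimax-m2.5", "step-3.5-flash"])))
```

This sketch cancels the losing requests; note that with a streaming upstream the tokens already generated before cancellation are still billed, which is the N× budget trade-off described above.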
Model Compatibility (Developer Role Transformation):
[model_compat]
# Models that support the 'developer' role (don't need transformation)
# All models NOT in this list will have 'developer' role transformed to 'user'
supports_developer_role = []
# Models that support tool messages (don't need transformation)
# All models NOT in this list will have 'tool' role transformed to 'assistant'
supports_tool_messages = []

Transforms OpenAI-style developer and tool roles to user and assistant for models that don't support them. This fixes 400 "Unknown message role" errors when using OMP or other agents that send developer role messages. By default, all models have roles transformed (empty lists = transform all).
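A minimal Python sketch of that transformation (the set arguments mirror the two config lists above; the function name is hypothetical, and the real version is in nimaproxy's Rust code):

```python
def transform_roles(messages: list[dict], model: str,
                    supports_developer: frozenset = frozenset(),
                    supports_tool: frozenset = frozenset()) -> list[dict]:
    """Downgrade roles the target model can't accept (empty sets = transform all)."""
    out = []
    for msg in messages:
        role = msg["role"]
        if role == "developer" and model not in supports_developer:
            role = "user"
        elif role == "tool" and model not in supports_tool:
            role = "assistant"
        out.append({**msg, "role": role})  # copy; original messages untouched
    return out
```

With both allow-lists empty, every model gets the downgrade, which matches the default-transform-all behavior described above.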
MIT