228 changes: 219 additions & 9 deletions .env.example
@@ -10,20 +10,66 @@
# Default: databricks
MODEL_PROVIDER=ollama

# Default model to use (overrides provider-specific defaults)
# Without this, defaults to a Databricks Claude model regardless of MODEL_PROVIDER.
# Set this when using Ollama or other local providers to match your installed model.
# MODEL_DEFAULT=qwen2.5-coder:latest

# Force server-side model configuration, ignoring client requests
# When enabled, the server will always use MODEL_DEFAULT regardless of what model the client requests
# Useful when you want to enforce a specific model (e.g., qwen/qwen3-coder-next) despite clients asking for Claude models
# Default: false (respect client model preferences)
# ENFORCE_SERVER_MODEL=false
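# The override described above can be sketched as follows. This is an
# illustrative helper, not Lynkr's actual implementation:
#
```python
def resolve_model(client_model, model_default, enforce_server_model):
    """Pick the effective model for a request (illustrative sketch).

    When ENFORCE_SERVER_MODEL is enabled, the server-side default wins
    even if the client asked for something else.
    """
    if enforce_server_model:
        return model_default
    return client_model or model_default

# Client asks for a Claude model, but the server enforces its own default.
resolve_model("claude-sonnet-4", "qwen/qwen3-coder-next", True)
```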

# ==============================================================================
# Ollama Configuration (Hybrid Routing)
# ==============================================================================
#
# Three supported configurations:
#
# 1. LOCAL ONLY (free, runs on your machine):
# OLLAMA_ENDPOINT=http://localhost:11434
# OLLAMA_MODEL=qwen2.5-coder:latest
#
# 2. CLOUD ONLY (no local Ollama needed):
# OLLAMA_CLOUD_ENDPOINT=https://ollama.com
# OLLAMA_MODEL=glm-4.7:cloud
# OLLAMA_API_KEY=your-ollama-cloud-api-key
#
# 3. MIXED (local chat + cloud tools, or vice versa):
# OLLAMA_ENDPOINT=http://192.168.100.201:11434
# OLLAMA_MODEL=qwen3:0.6b
# OLLAMA_CLOUD_ENDPOINT=https://ollama.com
# OLLAMA_API_KEY=your-ollama-cloud-api-key
# TOOL_EXECUTION_MODEL=glm-4.7:cloud
#
# OLLAMA_MODEL is REQUIRED — there is no default.
# At least one of OLLAMA_ENDPOINT or OLLAMA_CLOUD_ENDPOINT is REQUIRED.
# Cloud models are identified by name: "model:cloud" or "model:tag-cloud".
# ==============================================================================
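# The name-based routing described above can be sketched like this
# (hypothetical helper names; the real routing lives in the server code):
#
```python
def is_cloud_model(name: str) -> bool:
    """Cloud models are named "model:cloud" or "model:tag-cloud"."""
    return name.endswith(":cloud") or name.endswith("-cloud")

def pick_endpoint(model, local_endpoint, cloud_endpoint):
    """Route a model name to the local or cloud Ollama endpoint."""
    if is_cloud_model(model):
        if not cloud_endpoint:
            raise ValueError("cloud model requested but OLLAMA_CLOUD_ENDPOINT is unset")
        return cloud_endpoint
    if not local_endpoint:
        raise ValueError("local model requested but OLLAMA_ENDPOINT is unset")
    return local_endpoint
```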

# Enable Ollama preference for simple requests
PREFER_OLLAMA=false

# Ollama model to use (must be compatible with tool calling)
# Options: qwen2.5-coder:latest, llama3.1, mistral-nemo, nemotron-3-nano:30b-cloud, etc.
# Ollama model to use (REQUIRED, no default)
# Options: qwen2.5-coder:latest, llama3.1, glm-4.7:cloud, etc.
OLLAMA_MODEL=qwen2.5-coder:latest

# Ollama endpoint (default: http://localhost:11434)
# Ollama local endpoint (omit for cloud-only setups)
OLLAMA_ENDPOINT=http://localhost:11434

# Ollama cloud endpoint for cloud-hosted models
# Models with ":cloud" or "-cloud" in the name route here with Authorization header.
# OLLAMA_CLOUD_ENDPOINT=https://ollama.com
# OLLAMA_API_KEY=your-ollama-cloud-api-key

# Ollama request timeout in milliseconds (default: 120000)
# OLLAMA_TIMEOUT_MS=120000

# Ollama keep_alive parameter - how long to keep model loaded in memory
# Values: -1 (forever), 0 (unload immediately), or duration string (e.g. "5m", "1h")
# OLLAMA_KEEP_ALIVE=-1
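# keep_alive is forwarded to Ollama's /api/generate request body. A minimal
# sketch of building such a payload (illustrative helper, not Lynkr's code):
#
```python
import json

def build_generate_payload(model, prompt, keep_alive="-1"):
    """Build an Ollama /api/generate request body.

    keep_alive is passed through as-is: -1 keeps the model loaded forever,
    0 unloads it immediately, "5m"/"1h" keep it loaded for that duration.
    """
    payload = {"model": model, "prompt": prompt}
    if keep_alive is not None:
        payload["keep_alive"] = keep_alive
    return json.dumps(payload)
```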

# Ollama embeddings configuration (for Cursor @Codebase semantic search)
# Embedding models for local, privacy-first semantic search
# Popular models:
@@ -203,7 +249,12 @@ WEB_SEARCH_ENDPOINT=http://localhost:8888/search
# Policy Configuration
POLICY_MAX_STEPS=20
POLICY_MAX_TOOL_CALLS=12
# Max tool calls the model can make in a single LLM request (not per turn).
# Prevents runaway parallel tool calling within one response.
POLICY_MAX_TOOL_CALLS_PER_REQUEST=12

# Max duration (ms) for a single agent loop turn. Increase for slow models.
POLICY_MAX_DURATION_MS=120000
# Tool loop guard - max tool results in conversation before force-terminating
# Prevents infinite tool loops. Set higher for complex multi-step tasks.
POLICY_TOOL_LOOP_THRESHOLD=10
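# The loop guard above amounts to counting tool results in the conversation
# and force-terminating past the threshold. A toy sketch (assumed message
# shape, not the actual implementation):
#
```python
def should_force_terminate(messages, threshold=10):
    """Count tool results in the conversation; terminate past the threshold."""
    tool_results = sum(1 for m in messages if m.get("role") == "tool")
    return tool_results >= threshold
```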
@@ -217,6 +268,66 @@ WORKSPACE_INDEX_ENABLED=true
# - client/passthrough: Return tool calls to CLI for local execution
TOOL_EXECUTION_MODE=server

# ==============================================================================
# Tool Execution Provider Configuration
# ==============================================================================

# Provider to use for tool calling decisions (optional)
# If set, tool-calling decisions will be routed to this provider
# while conversation stays with the main provider.
# This enables using cheap/fast/local models for chat while using
# reliable models (like Claude Sonnet) for tool calling.
#
# Options: databricks, azure-anthropic, openrouter, openai, bedrock, etc.
# Leave empty to use the main provider for both conversation and tools.
# TOOL_EXECUTION_PROVIDER=

# Model to use for tool execution (optional)
# If not set, uses the provider's default model.
# Examples:
# - anthropic/claude-sonnet-4 (for OpenRouter)
# - claude-3-5-sonnet-20241022 (for OpenAI-compatible APIs)
# TOOL_EXECUTION_MODEL=

# Enable comparison mode to call BOTH providers and compare their tool calls
# When enabled, both the conversation provider and tool execution provider
# will be called, and their tool calls will be compared. The better set
# of tool calls will be selected based on quality heuristics.
# Useful for evaluating which provider gives better tool calls.
# Default: false (only uses tool execution provider when configured)
# TOOL_EXECUTION_COMPARE_MODE=false
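# Lynkr's actual quality heuristics are not documented here; a toy scorer
# illustrating the idea of comparing two tool-call sets might look like:
#
```python
import json

def score_tool_calls(calls):
    """Toy quality score: +1 per call with a name and valid JSON arguments."""
    score = 0
    for call in calls:
        if not call.get("name"):
            continue
        try:
            json.loads(call.get("arguments", ""))
            score += 1
        except (TypeError, json.JSONDecodeError):
            pass
    return score

def pick_better(conversation_calls, tool_provider_calls):
    """Prefer the dedicated tool-execution provider on ties."""
    if score_tool_calls(conversation_calls) > score_tool_calls(tool_provider_calls):
        return conversation_calls
    return tool_provider_calls
```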

# ==============================================================================
# Tool Execution Provider Examples
# ==============================================================================

# Example 1: Use local Ollama for chat, Claude Sonnet for tool calling
# This provides fast, private conversation with reliable tool execution.
#
# MODEL_PROVIDER=ollama
# OLLAMA_MODEL=qwen3-coder-next
# TOOL_EXECUTION_PROVIDER=openrouter
# TOOL_EXECUTION_MODEL=anthropic/claude-sonnet-4
# OPENROUTER_API_KEY=your-key-here

# Example 2: Use qwen3-coder-next for conversation, compare with Claude for tools
# This lets you evaluate tool call quality differences between providers.
#
# MODEL_PROVIDER=ollama
# OLLAMA_MODEL=qwen3-coder-next
# TOOL_EXECUTION_PROVIDER=openrouter
# TOOL_EXECUTION_MODEL=anthropic/claude-sonnet-4
# TOOL_EXECUTION_COMPARE_MODE=true
# OPENROUTER_API_KEY=your-key-here

# Example 3: Use Databricks for chat, Bedrock Claude for tools
# This enables cost optimization between different cloud providers.
#
# MODEL_PROVIDER=databricks
# TOOL_EXECUTION_PROVIDER=bedrock
# TOOL_EXECUTION_MODEL=anthropic.claude-3-5-sonnet-20241022-v2:0
# AWS_BEDROCK_API_KEY=your-key-here

# Suggestion mode model override
# Controls which model handles suggestion mode (predicting next user input).
# Values:
@@ -225,10 +336,23 @@ TOOL_EXECUTION_MODE=server
# <model> - Use a specific model (e.g. "llama3.1" for a lighter model)
SUGGESTION_MODE_MODEL=default

# Topic detection model override
# Redirects topic classification to a lighter/faster model to reduce GPU load.
# Values:
# default - Use the main model (no change)
# skip/none - Skip topic detection entirely (recommended for Ollama to avoid GPU contention)
# <model> - Use a specific model (e.g. "llama3.2:1b" for a lightweight classifier)
TOPIC_DETECTION_MODEL=default

# Enable/disable automatic tool injection for local models
INJECT_TOOLS_LLAMACPP=true
INJECT_TOOLS_OLLAMA=true

# Aggressive tool patching: try ALL text-to-tool extraction strategies for any model
# When false (default), only model-specific strategies are used (e.g. GLM gets bullet-point
# and fenced-code-block extraction). When true, all strategies are tried for every model.
# AGGRESSIVE_TOOL_PATCHING=false

# ==============================================================================
# Semantic Response Cache
# ==============================================================================
@@ -276,6 +400,24 @@ MEMORY_INJECTION_FORMAT=system
# Enable automatic extraction
MEMORY_EXTRACTION_ENABLED=true

# ==============================================================================
# TOON Prompt Compression (JSON → TOON)
# ==============================================================================

# Enable TOON compression for large structured JSON contexts (opt-in)
# TOON provides 10-39% payload reduction for API calls with large tool outputs
# Safe fallback: if compression fails, original payload is sent
TOON_ENABLED=false

# Minimum payload size in bytes before attempting TOON compression
TOON_MIN_BYTES=4096

# Fail-open strategy: if compression fails, send the original (true) or fail the request (false)
TOON_FAIL_OPEN=true

# Log compression statistics (useful for monitoring)
TOON_LOG_STATS=true
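# The size gate plus fail-open behavior can be sketched as a wrapper. The
# `compress` argument stands in for the real JSON→TOON encoder (not shown):
#
```python
def maybe_compress(payload: str, compress, min_bytes=4096, fail_open=True):
    """Gate + fail-open wrapper around a TOON-style compressor (sketch)."""
    if len(payload.encode("utf-8")) < min_bytes:
        return payload          # too small to bother compressing
    try:
        return compress(payload)
    except Exception:
        if fail_open:
            return payload      # TOON_FAIL_OPEN=true: send the original
        raise
```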

# ==============================================================================
# Token Optimization Settings (60-80% Cost Reduction)
# ==============================================================================
@@ -294,12 +436,6 @@ TOKEN_BUDGET_WARNING=100000
TOKEN_BUDGET_MAX=180000
TOKEN_BUDGET_ENFORCEMENT=true

# TOON JSON->TOON prompt compression (opt-in; for large structured JSON context)
TOON_ENABLED=false
TOON_MIN_BYTES=4096
TOON_FAIL_OPEN=true
TOON_LOG_STATS=true

# ==============================================================================
# Smart Tool Selection (Advanced Token Optimization)
# ==============================================================================
@@ -310,6 +446,49 @@ SMART_TOOL_SELECTION_MODE=heuristic
# Maximum token budget for tools per request
SMART_TOOL_SELECTION_TOKEN_BUDGET=2500
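# One plausible way a token budget constrains tool selection is a greedy
# highest-priority-first pick. This is an illustrative sketch, not the
# heuristic Lynkr actually ships:
#
```python
def select_tools(tools, budget=2500):
    """Greedy pick: highest-priority tools first until the budget is spent.

    tools: list of (name, estimated_tokens, priority) tuples (assumed shape).
    """
    chosen, spent = [], 0
    for name, tokens, _priority in sorted(tools, key=lambda t: -t[2]):
        if spent + tokens <= budget:
            chosen.append(name)
            spent += tokens
    return chosen
```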

# ==============================================================================
# Tool Needs Classification (LLM-Based)
# ==============================================================================

# Enable LLM-based tool needs classification
# When enabled, classifies requests as tool-needed or conversational before routing
# This reduces token overhead and latency for simple questions and greetings
# Default: false (disabled)
# TOOL_NEEDS_CLASSIFICATION_ENABLED=true

# Model to use for classification (lightweight recommended)
# Options: qwen2.5:1b, llama3.2:1b, or "skip" to disable LLM (whitelist-only)
# Lightweight models are fast (~100-500ms) and accurate for yes/no classification
# Default: qwen2.5:1b
# TOOL_NEEDS_CLASSIFICATION_MODEL=qwen2.5:1b

# Path to default whitelist file with known patterns
# Whitelist provides fast-path matching for common requests (no LLM call)
# Default: ./config/tool-whitelist.json
# TOOL_NEEDS_CLASSIFICATION_WHITELIST=./config/tool-whitelist.json

# Path to user whitelist file (optional, extends default)
# Create your own patterns in a separate file to avoid conflicts on updates
# Example: ./config/tool-whitelist-user.json
# TOOL_NEEDS_CLASSIFICATION_USER_WHITELIST=./config/tool-whitelist-user.json

# Custom shell commands that need tools (comma-separated)
# Automatically adds both "command" and "command *" patterns to whitelist
# Example: bd,mycommand,anothercmd
# Use this for project-specific CLIs like bd (beads), make, cargo, etc.
# Default: empty
# TOOL_NEEDS_CLASSIFICATION_CUSTOM_COMMANDS=bd

# Enable result caching to avoid repeated classification
# Caches results by normalized message content
# Default: true
# TOOL_NEEDS_CLASSIFICATION_CACHE_ENABLED=true

# Enable LLM fallback when whitelist doesn't match
# If false, defaults to "needs tools" when whitelist misses
# Default: true
# TOOL_NEEDS_CLASSIFICATION_LLM_ENABLED=true
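# The flow described above — whitelist fast path, then cached LLM fallback,
# defaulting to "needs tools" when the LLM is disabled — can be sketched as
# (hypothetical helper; `classify_llm` stands in for the lightweight model):
#
```python
from fnmatch import fnmatch

def needs_tools(message, whitelist, cache, classify_llm=None, llm_enabled=True):
    """Classify a request as tool-needed (True) or conversational (False)."""
    # Cache by normalized message content.
    normalized = " ".join(message.lower().split())
    if normalized in cache:
        return cache[normalized]
    if any(fnmatch(normalized, pattern) for pattern in whitelist):
        result = True                      # fast path: no LLM call
    elif llm_enabled and classify_llm is not None:
        result = classify_llm(normalized)  # lightweight yes/no classifier
    else:
        result = True                      # whitelist miss, LLM off: assume tools
    cache[normalized] = result
    return result
```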

# ==============================================================================
# Performance & Security
# ==============================================================================
@@ -334,6 +513,37 @@ HOT_RELOAD_ENABLED=true
# Debounce delay in ms (prevents rapid reloads)
HOT_RELOAD_DEBOUNCE_MS=1000

# ==============================================================================
# LLM Audit Logging
# ==============================================================================

# Enable LLM audit logging (default: false)
# Logs all LLM requests/responses for debugging and analysis
LLM_AUDIT_ENABLED=false

# Audit log file path
# LLM_AUDIT_LOG_FILE=./logs/llm-audit.log

# Maximum content length per field (characters) - controls log file size
# LLM_AUDIT_MAX_CONTENT_LENGTH=5000
# LLM_AUDIT_MAX_SYSTEM_LENGTH=2000
# LLM_AUDIT_MAX_USER_LENGTH=3000
# LLM_AUDIT_MAX_RESPONSE_LENGTH=3000

# Log rotation settings
# LLM_AUDIT_MAX_FILES=30
# LLM_AUDIT_MAX_SIZE=100M

# Include annotation metadata in audit logs (default: true)
# LLM_AUDIT_ANNOTATIONS=true

# Deduplication - reduces log size by referencing repeated content
# LLM_AUDIT_DEDUP_ENABLED=true
# LLM_AUDIT_DEDUP_MIN_SIZE=500
# LLM_AUDIT_DEDUP_CACHE_SIZE=100
# LLM_AUDIT_DEDUP_SANITIZE=true
# LLM_AUDIT_DEDUP_SESSION_CACHE=true
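# The dedup idea — hash large repeated content and log a short reference
# instead of the full payload — can be sketched as (illustrative only):
#
```python
import hashlib

def dedup_entry(content, seen, min_size=500):
    """Replace repeated large content with a short hash reference."""
    if len(content) < min_size:
        return content  # small entries are logged verbatim
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
    if digest in seen:
        return f"[dedup-ref:{digest}]"
    seen.add(digest)
    return content      # first occurrence is logged in full
```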

# ==============================================================================
# Quick Start Examples
# ==============================================================================
71 changes: 71 additions & 0 deletions README.md
@@ -287,6 +287,7 @@ Lynkr supports [ClawdBot](https://github.com/openclaw/openclaw) via its OpenAI-c
- ✅ **Streaming Support** - Real-time token streaming for all providers
- ✅ **Memory System** - Titans-inspired long-term memory with surprise-based filtering
- ✅ **Tool Calling** - Full tool support with server and passthrough execution modes
- ✅ **Progress Reporting** - Real-time agent execution tracking with WebSocket broadcasting (port 8765)
- ✅ **Production Ready** - Battle-tested with 400+ tests, observability, and error resilience
- ✅ **Node 20-25 Support** - Works with latest Node.js versions including v25
- ✅ **Semantic Caching** - Cache responses for similar prompts (requires embeddings)
@@ -318,6 +319,76 @@ OLLAMA_EMBEDDINGS_ENDPOINT=http://localhost:11434/api/embeddings

---

## Progress Reporting

Lynkr emits **real-time progress events** throughout agent execution, enabling comprehensive monitoring of tool execution, model invocations, and reasoning steps. These events are:
- Emitted as Node.js events internally
- Automatically broadcast via WebSocket (port 8765) for external clients
- Logged for observability

**WebSocket Server (for External Clients):**
- **Port**: `8765`
- **Endpoint**: `ws://localhost:8765`
- **Required Dependency**: `ws` (auto-installed with `npm install`)

**Events Emitted:**
- `agent_loop_started` / `agentLoopCompleted` - Agent execution lifecycle
- `agent_loop_step_started` - Individual step in agent reasoning
- `model_invocation_started` / `modelInvocationCompleted` - LLM calls with provider info
- `tool_execution_started` / `toolExecutionCompleted` - Tool execution with request/response previews

**Built-in Progress Listener (Python):**

Lynkr includes a ready-to-use Python client that connects to the WebSocket server and displays formatted progress updates:

```bash
# Install Python dependencies (one-time)
pip install websockets

# Run the listener in one terminal
python tools/progress-listener.py

# In another terminal, run Lynkr and Claude Code
npm start
claude "Your prompt"
```

**Features:**
- 🎨 Color-coded output with timestamps
- 🔄 Real-time agent hierarchy tracking (shows parent/child agent relationships)
- ⏱️ Duration and token tracking for model invocations
- 🛠️ Tool execution details with request/response previews
- 🌐 Remote monitoring: `python tools/progress-listener.py --host 192.168.1.100`
- 🔧 Environment variables: `LYNKR_PROGRESS_HOST` and `LYNKR_PROGRESS_PORT`

**Custom Python Client Example:**
```python
import json
import asyncio
import websockets

async def monitor_progress():
    uri = "ws://localhost:8765"
    async with websockets.connect(uri) as websocket:
        while True:
            event = await websocket.recv()
            data = json.loads(event)
            print(f"Event: {data['type']}")
            print(f"Data: {json.dumps(data, indent=2)}")

asyncio.run(monitor_progress())
```

**Use Cases:**
- Monitor tool execution in real-time during Claude Code CLI runs
- Track agent reasoning steps and model invocations
- Build custom dashboards showing agent progress
- Debug multi-step agentic workflows
- Troubleshoot subagent spawning and routing
- Monitor remote Lynkr instances

---

## Architecture

```