228 changes: 219 additions & 9 deletions .env.example
@@ -10,20 +10,66 @@
# Default: databricks
MODEL_PROVIDER=ollama

# Default model to use (overrides provider-specific defaults)
# Without this, defaults to a Databricks Claude model regardless of MODEL_PROVIDER.
# Set this when using Ollama or other local providers to match your installed model.
# MODEL_DEFAULT=qwen2.5-coder:latest

# Force server-side model configuration, ignoring client requests
# When enabled, the server will always use MODEL_DEFAULT regardless of what model the client requests
# Useful when you want to enforce a specific model (e.g., qwen/qwen3-coder-next) despite clients asking for Claude models
# Default: false (respect client model preferences)
# ENFORCE_SERVER_MODEL=false
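# The override described above can be sketched as follows. This is an
# illustrative helper, not Lynkr's actual implementation:
#
```python
def resolve_model(client_model, model_default, enforce_server_model):
    """Pick the effective model for a request (illustrative sketch).

    When ENFORCE_SERVER_MODEL is enabled, the server-side default wins
    even if the client asked for something else.
    """
    if enforce_server_model:
        return model_default
    return client_model or model_default

# Client asks for a Claude model, but the server enforces its own default.
resolve_model("claude-sonnet-4", "qwen/qwen3-coder-next", True)
```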

# ==============================================================================
# Ollama Configuration (Hybrid Routing)
# ==============================================================================
#
# Three supported configurations:
#
# 1. LOCAL ONLY (free, runs on your machine):
# OLLAMA_ENDPOINT=http://localhost:11434
# OLLAMA_MODEL=qwen2.5-coder:latest
#
# 2. CLOUD ONLY (no local Ollama needed):
# OLLAMA_CLOUD_ENDPOINT=https://ollama.com
# OLLAMA_MODEL=glm-4.7:cloud
# OLLAMA_API_KEY=your-ollama-cloud-api-key
#
# 3. MIXED (local chat + cloud tools, or vice versa):
# OLLAMA_ENDPOINT=http://192.168.100.201:11434
# OLLAMA_MODEL=qwen3:0.6b
# OLLAMA_CLOUD_ENDPOINT=https://ollama.com
# OLLAMA_API_KEY=your-ollama-cloud-api-key
# TOOL_EXECUTION_MODEL=glm-4.7:cloud
#
# OLLAMA_MODEL is REQUIRED — there is no default.
# At least one of OLLAMA_ENDPOINT or OLLAMA_CLOUD_ENDPOINT is REQUIRED.
# Cloud models are identified by name: "model:cloud" or "model:tag-cloud".
# ==============================================================================
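# The name-based routing described above can be sketched like this
# (hypothetical helper names; the real routing lives in the server code):
#
```python
def is_cloud_model(name: str) -> bool:
    """Cloud models are named "model:cloud" or "model:tag-cloud"."""
    return name.endswith(":cloud") or name.endswith("-cloud")

def pick_endpoint(model, local_endpoint, cloud_endpoint):
    """Route a model name to the local or cloud Ollama endpoint."""
    if is_cloud_model(model):
        if not cloud_endpoint:
            raise ValueError("cloud model requested but OLLAMA_CLOUD_ENDPOINT is unset")
        return cloud_endpoint
    if not local_endpoint:
        raise ValueError("local model requested but OLLAMA_ENDPOINT is unset")
    return local_endpoint
```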

# Enable Ollama preference for simple requests
PREFER_OLLAMA=false

# Ollama model to use (must be compatible with tool calling)
# Options: qwen2.5-coder:latest, llama3.1, mistral-nemo, nemotron-3-nano:30b-cloud, etc.
# Ollama model to use (REQUIRED, no default)
# Options: qwen2.5-coder:latest, llama3.1, glm-4.7:cloud, etc.
OLLAMA_MODEL=qwen2.5-coder:latest

# Ollama endpoint (default: http://localhost:11434)
# Ollama local endpoint (omit for cloud-only setups)
OLLAMA_ENDPOINT=http://localhost:11434

# Ollama cloud endpoint for cloud-hosted models
# Models with ":cloud" or "-cloud" in the name route here with Authorization header.
# OLLAMA_CLOUD_ENDPOINT=https://ollama.com
# OLLAMA_API_KEY=your-ollama-cloud-api-key

# Ollama request timeout in milliseconds (default: 120000)
# OLLAMA_TIMEOUT_MS=120000

# Ollama keep_alive parameter - how long to keep model loaded in memory
# Values: -1 (forever), 0 (unload immediately), or duration string (e.g. "5m", "1h")
# OLLAMA_KEEP_ALIVE=-1
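# keep_alive is forwarded to Ollama's /api/generate request body. A minimal
# sketch of building such a payload (illustrative helper, not Lynkr's code):
#
```python
import json

def build_generate_payload(model, prompt, keep_alive="-1"):
    """Build an Ollama /api/generate request body.

    keep_alive is passed through as-is: -1 keeps the model loaded forever,
    0 unloads it immediately, "5m"/"1h" keep it loaded for that duration.
    """
    payload = {"model": model, "prompt": prompt}
    if keep_alive is not None:
        payload["keep_alive"] = keep_alive
    return json.dumps(payload)
```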

# Ollama embeddings configuration (for Cursor @Codebase semantic search)
# Embedding models for local, privacy-first semantic search
# Popular models:
@@ -203,7 +249,12 @@ WEB_SEARCH_ENDPOINT=http://localhost:8888/search
# Policy Configuration
POLICY_MAX_STEPS=20
POLICY_MAX_TOOL_CALLS=12
# Max tool calls the model can make in a single LLM request (not per turn).
# Prevents runaway parallel tool calling within one response.
POLICY_MAX_TOOL_CALLS_PER_REQUEST=12

# Max duration (ms) for a single agent loop turn. Increase for slow models.
POLICY_MAX_DURATION_MS=120000
# Tool loop guard - max tool results in conversation before force-terminating
# Prevents infinite tool loops. Set higher for complex multi-step tasks.
POLICY_TOOL_LOOP_THRESHOLD=10
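# The loop guard above amounts to counting tool results in the conversation
# and force-terminating past the threshold. A toy sketch (assumed message
# shape, not the actual implementation):
#
```python
def should_force_terminate(messages, threshold=10):
    """Count tool results in the conversation; terminate past the threshold."""
    tool_results = sum(1 for m in messages if m.get("role") == "tool")
    return tool_results >= threshold
```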
@@ -217,6 +268,66 @@ WORKSPACE_INDEX_ENABLED=true
# - client/passthrough: Return tool calls to CLI for local execution
TOOL_EXECUTION_MODE=server

# ==============================================================================
# Tool Execution Provider Configuration
# ==============================================================================

# Provider to use for tool calling decisions (optional)
# If set, tool-calling decisions will be routed to this provider
# while conversation stays with the main provider.
# This enables using cheap/fast/local models for chat while using
# reliable models (like Claude Sonnet) for tool calling.
#
# Options: databricks, azure-anthropic, openrouter, openai, bedrock, etc.
# Leave empty to use the main provider for both conversation and tools.
# TOOL_EXECUTION_PROVIDER=

# Model to use for tool execution (optional)
# If not set, uses the provider's default model.
# Examples:
# - anthropic/claude-sonnet-4 (for OpenRouter)
# - claude-3-5-sonnet-20241022 (for OpenAI-compatible APIs)
# TOOL_EXECUTION_MODEL=

# Enable comparison mode to call BOTH providers and compare their tool calls
# When enabled, both the conversation provider and tool execution provider
# will be called, and their tool calls will be compared. The better set
# of tool calls will be selected based on quality heuristics.
# Useful for evaluating which provider gives better tool calls.
# Default: false (only uses tool execution provider when configured)
# TOOL_EXECUTION_COMPARE_MODE=false
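# Lynkr's actual quality heuristics are not documented here; a toy scorer
# illustrating the idea of comparing two tool-call sets might look like:
#
```python
import json

def score_tool_calls(calls):
    """Toy quality score: +1 per call with a name and valid JSON arguments."""
    score = 0
    for call in calls:
        if not call.get("name"):
            continue
        try:
            json.loads(call.get("arguments", ""))
            score += 1
        except (TypeError, json.JSONDecodeError):
            pass
    return score

def pick_better(conversation_calls, tool_provider_calls):
    """Prefer the dedicated tool-execution provider on ties."""
    if score_tool_calls(conversation_calls) > score_tool_calls(tool_provider_calls):
        return conversation_calls
    return tool_provider_calls
```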

# ==============================================================================
# Tool Execution Provider Examples
# ==============================================================================

# Example 1: Use local Ollama for chat, Claude Sonnet for tool calling
# This provides fast, private conversation with reliable tool execution.
#
# MODEL_PROVIDER=ollama
# OLLAMA_MODEL=qwen3-coder-next
# TOOL_EXECUTION_PROVIDER=openrouter
# TOOL_EXECUTION_MODEL=anthropic/claude-sonnet-4
# OPENROUTER_API_KEY=your-key-here

# Example 2: Use qwen3-coder-next for conversation, compare with Claude for tools
# This lets you evaluate tool call quality differences between providers.
#
# MODEL_PROVIDER=ollama
# OLLAMA_MODEL=qwen3-coder-next
# TOOL_EXECUTION_PROVIDER=openrouter
# TOOL_EXECUTION_MODEL=anthropic/claude-sonnet-4
# TOOL_EXECUTION_COMPARE_MODE=true
# OPENROUTER_API_KEY=your-key-here

# Example 3: Use Databricks for chat, Bedrock Claude for tools
# This enables cost optimization between different cloud providers.
#
# MODEL_PROVIDER=databricks
# TOOL_EXECUTION_PROVIDER=bedrock
# TOOL_EXECUTION_MODEL=anthropic.claude-3-5-sonnet-20241022-v2:0
# AWS_BEDROCK_API_KEY=your-key-here

# Suggestion mode model override
# Controls which model handles suggestion mode (predicting next user input).
# Values:
@@ -225,10 +336,23 @@ TOOL_EXECUTION_MODE=server
# <model> - Use a specific model (e.g. "llama3.1" for a lighter model)
SUGGESTION_MODE_MODEL=default

# Topic detection model override
# Redirects topic classification to a lighter/faster model to reduce GPU load.
# Values:
# default - Use the main model (no change)
# skip/none - Skip topic detection entirely (recommended for Ollama to avoid GPU contention)
# <model> - Use a specific model (e.g. "llama3.2:1b" for a lightweight classifier)
TOPIC_DETECTION_MODEL=default

# Enable/disable automatic tool injection for local models
INJECT_TOOLS_LLAMACPP=true
INJECT_TOOLS_OLLAMA=true

# Aggressive tool patching: try ALL text-to-tool extraction strategies for any model
# When false (default), only model-specific strategies are used (e.g. GLM gets bullet-point
# and fenced-code-block extraction). When true, all strategies are tried for every model.
# AGGRESSIVE_TOOL_PATCHING=false

# ==============================================================================
# Semantic Response Cache
# ==============================================================================
@@ -276,6 +400,24 @@ MEMORY_INJECTION_FORMAT=system
# Enable automatic extraction
MEMORY_EXTRACTION_ENABLED=true

# ==============================================================================
# TOON Prompt Compression (JSON → TOON)
# ==============================================================================

# Enable TOON compression for large structured JSON contexts (opt-in)
# TOON provides 10-39% payload reduction for API calls with large tool outputs
# Safe fallback: if compression fails, original payload is sent
TOON_ENABLED=false

# Minimum payload size in bytes before attempting TOON compression
TOON_MIN_BYTES=4096

# Fail-open strategy: if compression fails, send the original (true) or fail the request (false)
TOON_FAIL_OPEN=true

# Log compression statistics (useful for monitoring)
TOON_LOG_STATS=true
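# The size gate plus fail-open behavior can be sketched as a wrapper. The
# `compress` argument stands in for the real JSON→TOON encoder (not shown):
#
```python
def maybe_compress(payload: str, compress, min_bytes=4096, fail_open=True):
    """Gate + fail-open wrapper around a TOON-style compressor (sketch)."""
    if len(payload.encode("utf-8")) < min_bytes:
        return payload          # too small to bother compressing
    try:
        return compress(payload)
    except Exception:
        if fail_open:
            return payload      # TOON_FAIL_OPEN=true: send the original
        raise
```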

# ==============================================================================
# Token Optimization Settings (60-80% Cost Reduction)
# ==============================================================================
@@ -294,12 +436,6 @@ TOKEN_BUDGET_WARNING=100000
TOKEN_BUDGET_MAX=180000
TOKEN_BUDGET_ENFORCEMENT=true

# TOON JSON->TOON prompt compression (opt-in; for large structured JSON context)
TOON_ENABLED=false
TOON_MIN_BYTES=4096
TOON_FAIL_OPEN=true
TOON_LOG_STATS=true

# ==============================================================================
# Smart Tool Selection (Advanced Token Optimization)
# ==============================================================================
@@ -310,6 +446,49 @@ SMART_TOOL_SELECTION_MODE=heuristic
# Maximum token budget for tools per request
SMART_TOOL_SELECTION_TOKEN_BUDGET=2500
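# One plausible way a token budget constrains tool selection is a greedy
# highest-priority-first pick. This is an illustrative sketch, not the
# heuristic Lynkr actually ships:
#
```python
def select_tools(tools, budget=2500):
    """Greedy pick: highest-priority tools first until the budget is spent.

    tools: list of (name, estimated_tokens, priority) tuples (assumed shape).
    """
    chosen, spent = [], 0
    for name, tokens, _priority in sorted(tools, key=lambda t: -t[2]):
        if spent + tokens <= budget:
            chosen.append(name)
            spent += tokens
    return chosen
```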

# ==============================================================================
# Tool Needs Classification (LLM-Based)
# ==============================================================================

# Enable LLM-based tool needs classification
# When enabled, classifies requests as tool-needed or conversational before routing
# This reduces token overhead and latency for simple questions and greetings
# Default: false (disabled)
# TOOL_NEEDS_CLASSIFICATION_ENABLED=true

# Model to use for classification (lightweight recommended)
# Options: qwen2.5:1b, llama3.2:1b, or "skip" to disable LLM (whitelist-only)
# Lightweight models are fast (~100-500ms) and accurate for yes/no classification
# Default: qwen2.5:1b
# TOOL_NEEDS_CLASSIFICATION_MODEL=qwen2.5:1b

# Path to default whitelist file with known patterns
# Whitelist provides fast-path matching for common requests (no LLM call)
# Default: ./config/tool-whitelist.json
# TOOL_NEEDS_CLASSIFICATION_WHITELIST=./config/tool-whitelist.json

# Path to user whitelist file (optional, extends default)
# Create your own patterns in a separate file to avoid conflicts on updates
# Example: ./config/tool-whitelist-user.json
# TOOL_NEEDS_CLASSIFICATION_USER_WHITELIST=./config/tool-whitelist-user.json

# Custom shell commands that need tools (comma-separated)
# Automatically adds both "command" and "command *" patterns to whitelist
# Example: bd,mycommand,anothercmd
# Use this for project-specific CLIs like bd (beads), make, cargo, etc.
# Default: empty
# TOOL_NEEDS_CLASSIFICATION_CUSTOM_COMMANDS=bd

# Enable result caching to avoid repeated classification
# Caches results by normalized message content
# Default: true
# TOOL_NEEDS_CLASSIFICATION_CACHE_ENABLED=true

# Enable LLM fallback when whitelist doesn't match
# If false, defaults to "needs tools" when whitelist misses
# Default: true
# TOOL_NEEDS_CLASSIFICATION_LLM_ENABLED=true
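# The flow described above — whitelist fast path, then cached LLM fallback,
# defaulting to "needs tools" when the LLM is disabled — can be sketched as
# (hypothetical helper; `classify_llm` stands in for the lightweight model):
#
```python
from fnmatch import fnmatch

def needs_tools(message, whitelist, cache, classify_llm=None, llm_enabled=True):
    """Classify a request as tool-needed (True) or conversational (False)."""
    # Cache by normalized message content.
    normalized = " ".join(message.lower().split())
    if normalized in cache:
        return cache[normalized]
    if any(fnmatch(normalized, pattern) for pattern in whitelist):
        result = True                      # fast path: no LLM call
    elif llm_enabled and classify_llm is not None:
        result = classify_llm(normalized)  # lightweight yes/no classifier
    else:
        result = True                      # whitelist miss, LLM off: assume tools
    cache[normalized] = result
    return result
```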

# ==============================================================================
# Performance & Security
# ==============================================================================
@@ -334,6 +513,37 @@ HOT_RELOAD_ENABLED=true
# Debounce delay in ms (prevents rapid reloads)
HOT_RELOAD_DEBOUNCE_MS=1000

# ==============================================================================
# LLM Audit Logging
# ==============================================================================

# Enable LLM audit logging (default: false)
# Logs all LLM requests/responses for debugging and analysis
LLM_AUDIT_ENABLED=false

# Audit log file path
# LLM_AUDIT_LOG_FILE=./logs/llm-audit.log

# Maximum content length per field (characters) - controls log file size
# LLM_AUDIT_MAX_CONTENT_LENGTH=5000
# LLM_AUDIT_MAX_SYSTEM_LENGTH=2000
# LLM_AUDIT_MAX_USER_LENGTH=3000
# LLM_AUDIT_MAX_RESPONSE_LENGTH=3000

# Log rotation settings
# LLM_AUDIT_MAX_FILES=30
# LLM_AUDIT_MAX_SIZE=100M

# Include annotation metadata in audit logs (default: true)
# LLM_AUDIT_ANNOTATIONS=true

# Deduplication - reduces log size by referencing repeated content
# LLM_AUDIT_DEDUP_ENABLED=true
# LLM_AUDIT_DEDUP_MIN_SIZE=500
# LLM_AUDIT_DEDUP_CACHE_SIZE=100
# LLM_AUDIT_DEDUP_SANITIZE=true
# LLM_AUDIT_DEDUP_SESSION_CACHE=true
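# The dedup idea — hash large repeated content and log a short reference
# instead of the full payload — can be sketched as (illustrative only):
#
```python
import hashlib

def dedup_entry(content, seen, min_size=500):
    """Replace repeated large content with a short hash reference."""
    if len(content) < min_size:
        return content  # small entries are logged verbatim
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
    if digest in seen:
        return f"[dedup-ref:{digest}]"
    seen.add(digest)
    return content      # first occurrence is logged in full
```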

# ==============================================================================
# Quick Start Examples
# ==============================================================================
71 changes: 71 additions & 0 deletions README.md
@@ -287,6 +287,7 @@ Lynkr supports [ClawdBot](https://github.com/openclaw/openclaw) via its OpenAI-c
- ✅ **Streaming Support** - Real-time token streaming for all providers
- ✅ **Memory System** - Titans-inspired long-term memory with surprise-based filtering
- ✅ **Tool Calling** - Full tool support with server and passthrough execution modes
- ✅ **Progress Reporting** - Real-time agent execution tracking with WebSocket broadcasting (port 8765)
- ✅ **Production Ready** - Battle-tested with 400+ tests, observability, and error resilience
- ✅ **Node 20-25 Support** - Works with latest Node.js versions including v25
- ✅ **Semantic Caching** - Cache responses for similar prompts (requires embeddings)
@@ -318,6 +319,76 @@ OLLAMA_EMBEDDINGS_ENDPOINT=http://localhost:11434/api/embeddings

---

## Progress Reporting

Lynkr emits **real-time progress events** throughout agent execution, enabling comprehensive monitoring of tool execution, model invocations, and reasoning steps. These events are:
- Emitted as Node.js events internally
- Automatically broadcast via WebSocket (port 8765) for external clients
- Logged for observability

**WebSocket Server (for External Clients):**
- **Port**: `8765`
- **Endpoint**: `ws://localhost:8765`
- **Required Dependency**: `ws` (auto-installed with `npm install`)

**Events Emitted:**
- `agent_loop_started` / `agentLoopCompleted` - Agent execution lifecycle
- `agent_loop_step_started` - Individual step in agent reasoning
- `model_invocation_started` / `modelInvocationCompleted` - LLM calls with provider info
- `tool_execution_started` / `toolExecutionCompleted` - Tool execution with request/response previews

**Built-in Progress Listener (Python):**

Lynkr includes a ready-to-use Python client that connects to the WebSocket server and displays formatted progress updates:

```bash
# Install Python dependencies (one-time)
pip install websockets

# Run the listener in one terminal
python tools/progress-listener.py

# In another terminal, run Lynkr and Claude Code
npm start
claude "Your prompt"
```

**Features:**
- 🎨 Color-coded output with timestamps
- 🔄 Real-time agent hierarchy tracking (shows parent/child agent relationships)
- ⏱️ Duration and token tracking for model invocations
- 🛠️ Tool execution details with request/response previews
- 🌐 Remote monitoring: `python tools/progress-listener.py --host 192.168.1.100`
- 🔧 Environment variables: `LYNKR_PROGRESS_HOST` and `LYNKR_PROGRESS_PORT`

**Custom Python Client Example:**
```python
import json
import asyncio
import websockets

async def monitor_progress():
    uri = "ws://localhost:8765"
    async with websockets.connect(uri) as websocket:
        while True:
            event = await websocket.recv()
            data = json.loads(event)
            print(f"Event: {data['type']}")
            print(f"Data: {json.dumps(data, indent=2)}")

asyncio.run(monitor_progress())
```

**Use Cases:**
- Monitor tool execution in real-time during Claude Code CLI runs
- Track agent reasoning steps and model invocations
- Build custom dashboards showing agent progress
- Debug multi-step agentic workflows
- Troubleshoot subagent spawning and routing
- Monitor remote Lynkr instances

---

## Architecture

```