GenAI / MCP / RAG improvement roadmap #5534

@renecannao

Description

Overview

This issue tracks planned improvements and next steps for ProxySQL's GenAI, MCP (Model Context Protocol), and RAG (Retrieval-Augmented Generation) subsystems. The current architecture provides a solid foundation — this roadmap outlines actionable work to bring each component to full production readiness.

Current Architecture Summary

The GenAI/MCP/RAG system is built around four main components:

| Component | Entry Point | Purpose | Maturity |
|---|---|---|---|
| GenAI_Threads_Handler | GloGATH | Async embedding & reranking worker pool | Production-ready |
| LLM_Bridge | via AI_Features_Manager | Multi-provider LLM access with caching | Partial (see below) |
| MCP_Threads_Handler | GloMCPH | HTTP/HTTPS server exposing 7 tool endpoints | Solid |
| Anomaly_Detector | via AI_Features_Manager | Multi-stage query threat detection | Partial (see below) |
| RAG_Tool_Handler | via MCP /mcp/rag | Hybrid FTS + vector search with RRF ranking | Solid |

MCP Endpoint Map

HTTP/HTTPS Server (libhttpserver, port 6071)
├── /mcp/config     → Config_Tool_Handler
├── /mcp/query      → Query_Tool_Handler (two-phase discovery)
├── /mcp/admin      → Admin_Tool_Handler
├── /mcp/cache      → Cache_Tool_Handler
├── /mcp/stats      → Stats_Tool_Handler
├── /mcp/ai         → AI_Tool_Handler (NL2SQL, LLM)
└── /mcp/rag        → RAG_Tool_Handler (FTS + vector search)
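All seven endpoints speak JSON over HTTP(S). As an illustration only (the exact request schema is defined by each tool handler, and the payload fields shown are hypothetical), a client call might look like:

```python
import json
import urllib.request

def call_mcp_tool(path: str, payload: dict,
                  host: str = "127.0.0.1", port: int = 6071) -> dict:
    """POST a JSON payload to an MCP tool endpoint and decode the JSON reply."""
    req = urllib.request.Request(
        f"http://{host}:{port}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())

# Hypothetical usage -- field names are not taken from the actual tool schema:
# call_mcp_tool("/mcp/rag", {"query": "connection pool settings", "k": 5})
```

With per-endpoint auth enabled (one of the mcp-* settings), the client would additionally send credentials in the request headers.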

Configuration Variables

The system exposes ~48 genai-* variables and ~15 mcp-* variables via the admin interface, covering:

  • Embedding/reranking service endpoints and timeouts
  • LLM provider configuration (OpenAI, Anthropic, Ollama-compatible)
  • Anomaly detection thresholds and rate limits
  • RAG search parameters (k_max, candidates, timeout)
  • MCP server settings (port, SSL, per-endpoint auth)
  • Hybrid routing preferences (local vs cloud, budget limits)

Roadmap Items

1. Complete LLM Semantic Cache

Priority: High | Effort: Medium

The LLM_Bridge has a semantic cache design where similar queries can return cached LLM responses instead of making redundant API calls. The infrastructure exists (vector DB via sqlite-vec, similarity threshold configuration, cache statistics counters) but the core operations need implementation:

  • check_cache() — vector similarity lookup against previous queries
  • store_in_cache() — store query embedding + response after successful LLM call
  • clear_cache() — cache eviction (by age, size, or manual)
  • get_cache_stats() — expose hit rate, size, avg lookup time

The genai-llm_cache_enabled and genai-llm_cache_similarity_threshold variables are already defined and wired through the admin interface.

Acceptance criteria:

  • Cache reduces duplicate LLM calls by >80% for repeated/similar queries
  • Cache hit/miss metrics visible in stats_genai_* tables
  • Cache can be cleared at runtime via admin command

2. Configurable Embedding Dimensions

Priority: High | Effort: Low

The vector dimension is currently set to 1536 (matching OpenAI text-embedding-3-small). Since ProxySQL supports multiple providers including Ollama and local models, the dimension should adapt to the model in use:

  • Ollama/llama models: typically 4096
  • OpenAI text-embedding-3-small: 1536
  • OpenAI text-embedding-3-large: 3072
  • Cohere: 1024
  • Various others

Proposed approach:

  • Auto-detect dimension from the first embedding response
  • Or allow explicit override via genai-vector_dimension (variable already exists, just needs validation)
  • Ensure sqlite-vec schema adapts when dimension changes

3. Anomaly Detection — Embedding-Based Threat Matching

Priority: Medium | Effort: Medium

The Anomaly_Detector currently has a solid multi-stage pipeline:

  1. SQL injection pattern detection (regex)
  2. Suspicious keyword detection
  3. Query normalization and fingerprinting
  4. Per-user rate limiting
  5. Statistical outlier detection

The planned embedding-based threat similarity stage (comparing query embeddings against a known-threats vector database) would add ML-based detection for novel attack patterns that bypass regex rules.

Implementation steps:

  • Build a threat embeddings database from known SQL injection payloads
  • On each query, compute embedding and check cosine similarity against threat DB
  • Integrate with the existing genai-anomaly_similarity_threshold variable
  • Track false positive rate to tune thresholds
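The core of the new stage is a nearest-neighbor check against the threat database. A sketch (function names are illustrative; in the real pipeline the lookup would be a sqlite-vec query, not a Python loop):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def matches_known_threat(query_emb: list[float],
                         threat_db: list[list[float]],
                         threshold: float) -> bool:
    """Sixth pipeline stage, sketched: flag the query if its embedding is
    close to any known-threat embedding. `threshold` corresponds to the
    existing genai-anomaly_similarity_threshold variable."""
    return any(cosine(query_emb, t) >= threshold for t in threat_db)
```

Because the check is similarity-based rather than pattern-based, an obfuscated payload that defeats the regex stage can still land near its un-obfuscated ancestor in embedding space.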

4. Circuit Breaker for External Services

Priority: Medium | Effort: Low

When external services (embedding endpoint, LLM provider, rerank service) are unavailable, every request currently waits for the full timeout before failing. Under load, this causes cascading latency.

Proposed approach:

  • Track consecutive failures per service endpoint
  • After N failures (configurable), enter "open" state — fast-fail for a cooldown period
  • After cooldown, allow one probe request ("half-open" state)
  • If probe succeeds, return to normal; if fails, extend cooldown
  • Expose circuit breaker state in stats tables
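The state machine above is compact enough to sketch in full. A Python version with an injectable clock for testability (the failure threshold and cooldown would become new configuration variables, which do not exist yet):

```python
import time

class CircuitBreaker:
    """Per-endpoint breaker sketch.

    States: "closed" (normal) -> "open" (fast-fail during cooldown)
    -> "half-open" (single probe allowed) -> back to "closed" or "open".
    """

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "half-open"  # cooldown elapsed: let one probe through
                return True
            return False  # fast-fail instead of waiting for the full timeout
        return True

    def record_success(self) -> None:
        self.state = "closed"
        self.failures = 0

    def record_failure(self) -> None:
        if self.state == "half-open":
            self.state = "open"          # probe failed: extend the cooldown
            self.opened_at = self.clock()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Exposing `state`, `failures`, and `opened_at` per endpoint is all the stats-table item requires.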

5. Observability Enhancements

Priority: Medium | Effort: Low

Add a stats_genai_summary view (or populate existing counters) showing:

| Metric | Source |
|---|---|
| LLM cache hit rate | llm_cache_hits / llm_total_requests |
| Avg LLM response time | llm_total_response_time_ms / llm_total_requests |
| Local vs cloud model distribution | llm_local_model_calls vs llm_cloud_model_calls |
| Daily cloud spend | daily_cloud_spend_usd |
| Anomaly block rate | anomaly_blocked_queries / anomaly_total_checks |
| RAG search latency (p50/p99) | From ToolUsageStats |
| Embedding service health | Circuit breaker state |

The AI_Features_Manager already has atomic counters for most of these — they just need to be exposed via admin tables and/or Prometheus metrics.
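Deriving the summary view is mostly division with divide-by-zero guards. A sketch using the counter names from the table above (the real counters are C++ atomics inside AI_Features_Manager; this only shows the arithmetic):

```python
def genai_summary(counters: dict[str, int]) -> dict[str, float]:
    """Compute derived metrics for the proposed stats_genai_summary view
    from raw counter snapshots, guarding against empty denominators."""
    total = counters.get("llm_total_requests", 0)
    checks = counters.get("anomaly_total_checks", 0)
    return {
        "llm_cache_hit_rate":
            counters.get("llm_cache_hits", 0) / total if total else 0.0,
        "avg_llm_response_time_ms":
            counters.get("llm_total_response_time_ms", 0) / total if total else 0.0,
        "anomaly_block_rate":
            counters.get("anomaly_blocked_queries", 0) / checks if checks else 0.0,
    }
```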


6. Schema-Aware NL2SQL

Priority: Medium | Effort: High

The ai.nl2sql_convert MCP tool converts natural language to SQL via the LLM_Bridge. Its accuracy depends heavily on schema context. ProxySQL is uniquely positioned to provide this because it already knows:

  • All backend servers (mysql_servers, pgsql_servers)
  • Schema information via monitor (if enabled)
  • Query patterns via stats_mysql_query_digest
  • Table relationships via observed JOINs

Proposed enhancement:

  • Pull live schema (tables, columns, types) from monitored backends
  • Include top query patterns as examples in the LLM prompt
  • Cache schema snapshots to avoid repeated introspection
  • Support multi-database context (user specifies which hostgroup/schema)
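Assembling that context into the LLM prompt is straightforward once the pieces exist. A sketch (the prompt structure is hypothetical; the real format is up to the LLM_Bridge, and schema/digest inputs would come from monitor and stats_mysql_query_digest respectively):

```python
def build_nl2sql_prompt(question: str,
                        schema: dict[str, list[str]],
                        example_digests: list[str]) -> str:
    """Assemble a schema-aware prompt for the nl2sql_convert tool.

    `schema` maps table name -> column names; `example_digests` carries
    representative query patterns observed on the backend.
    """
    lines = ["You translate natural language to SQL.", "", "Schema:"]
    for table, columns in schema.items():
        lines.append(f"  {table}({', '.join(columns)})")
    if example_digests:
        lines += ["", "Common query patterns on this backend:"]
        lines += [f"  {d}" for d in example_digests]
    lines += ["", f"Question: {question}", "SQL:"]
    return "\n".join(lines)
```

Caching the rendered schema section, rather than re-introspecting per request, covers the snapshot item above.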

7. RAG Timeout Tuning

Priority: Low | Effort: Low

The default genai-rag_timeout_ms=2000 (2 seconds) may be tight for hybrid searches that involve FTS + vector similarity + Reciprocal Rank Fusion merging, especially with large document collections.
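For reference, the RRF merge at the heart of the hybrid search scores each document as the sum, over the result lists it appears in, of 1 / (k + rank). A sketch (k = 60 is the conventional default from the RRF literature; the constant actually used by RAG_Tool_Handler may differ):

```python
def rrf_merge(fts_ranked: list[str], vec_ranked: list[str],
              k: int = 60) -> list[str]:
    """Merge FTS and vector result lists with Reciprocal Rank Fusion:
    score(doc) = sum over lists of 1 / (k + rank_in_list), ranks 1-based."""
    scores: dict[str, float] = {}
    for ranked in (fts_ranked, vec_ranked):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

The merge itself is cheap; the latency risk is in the two searches feeding it, which is why per-operation timeouts (below) are worth considering.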

Actions:

  • Benchmark RAG search latency at various collection sizes (1K, 10K, 100K documents)
  • Adjust default timeout based on findings
  • Consider adding per-operation timeouts (FTS timeout vs vector timeout) rather than a single global timeout
  • Document recommended timeout values for different deployment sizes

8. Multi-Model Routing Enhancements

Priority: Low | Effort: Medium

The LLM_Bridge currently selects between local and cloud models based on a simple latency heuristic (prefer local if <500ms). The genai-daily_budget_usd and genai-max_cloud_requests_per_hour variables exist for cost control.

Potential improvements:

  • Route by query complexity (simple queries → local, complex → cloud)
  • Route by model capability (embedding vs generation vs classification)
  • Enforce budget limits more strictly (current implementation tracks but may not hard-block)
  • Add model-specific latency tracking for smarter routing decisions
  • Support fallback chains (try local → if timeout → try cloud)
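The fallback-chain idea reduces to iterating an ordered provider list. A sketch (the provider callables and error types are placeholders; the real bridge would distinguish timeouts from hard provider errors):

```python
def call_with_fallback(prompt: str, providers: list, timeout_s: float = 2.0) -> str:
    """Try each provider in order (e.g. local model first, then cloud),
    moving to the next on timeout or connection failure."""
    errors: list[Exception] = []
    for provider in providers:
        try:
            return provider(prompt, timeout_s)
        except (TimeoutError, ConnectionError) as exc:
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")
```

Budget enforcement would slot in naturally here: a cloud provider is simply skipped (not tried) once genai-daily_budget_usd is exhausted.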

9. Test Coverage Expansion

Priority: Medium | Effort: Medium

Current test status:

  • 4 GenAI TAP tests (embedding, reranking, module loading, async operations)
  • 15+ MCP TAP tests (concurrency, stats, query rules, runtime variables)
  • Several integration tests (bash scripts for headless flows, RAG metrics)

Gaps to address:

  • LLM cache tests (blocked on implementation — roadmap item 1)
  • Anomaly detection accuracy tests (false positive/negative rates)
  • RAG search quality tests (precision/recall at various k values)
  • Multi-provider failover tests (what happens when OpenAI is down?)
  • Load tests for MCP server under concurrent connections
  • Tests that verify GenAI features don't break standard proxy behavior (regression)

Some tests currently require external API keys or a GenAI-enabled build. The CI infrastructure should support both modes: with mock endpoints for unit testing, and with real endpoints for integration testing.
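For the mock mode, a deterministic stand-in for genai-embedding_uri is enough for most unit tests. A sketch (the request/response shape is hypothetical, not the actual embedding service protocol):

```python
import hashlib
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def pseudo_embedding(text: str, dim: int) -> list[float]:
    """Deterministic pseudo-embedding: same input always yields the same
    vector, so cache and similarity tests are reproducible without a model."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

class MockEmbeddingHandler(BaseHTTPRequestHandler):
    """CI stand-in for the external embedding service."""
    DIM = 8  # tiny on purpose; real models return 1536/3072/4096 dimensions

    def do_POST(self):
        n = int(self.headers["Content-Length"])
        body = json.loads(self.rfile.read(n))
        out = json.dumps(
            {"embedding": pseudo_embedding(body.get("input", ""), self.DIM)}
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(out)))
        self.end_headers()
        self.wfile.write(out)

    def log_message(self, *args):  # keep CI logs quiet
        pass
```

Pointing genai-embedding_uri at such a server lets the cache, RAG, and anomaly paths run in CI without API keys; the integration mode then swaps in the real endpoints.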


10. Local Embedding Support

Priority: Low | Effort: High

Currently, all embedding operations require an external HTTP service (genai-embedding_uri). The sqlite-rembed extension is already linked into the build — it could potentially provide local embedding computation without an external service dependency.

Investigation needed:

  • Can sqlite-rembed generate embeddings with acceptable quality for the RAG use case?
  • What models does it support and what are the dimension outputs?
  • What's the performance profile (latency, memory) compared to an external service?
  • Could this be an opt-in fallback when no external embedding service is configured?
