GenAI / MCP / RAG improvement roadmap #5534

@renecannao

Description

Overview

This issue tracks planned improvements and next steps for ProxySQL's GenAI, MCP (Model Context Protocol), and RAG (Retrieval-Augmented Generation) subsystems. The current architecture provides a solid foundation — this roadmap outlines actionable work to bring each component to full production readiness.

Current Architecture Summary

The GenAI/MCP/RAG system is built around four main components:

| Component | Entry Point | Purpose | Maturity |
|---|---|---|---|
| GenAI_Threads_Handler | GloGATH | Async embedding & reranking worker pool | Production-ready |
| LLM_Bridge | via AI_Features_Manager | Multi-provider LLM access with caching | Partial (see below) |
| MCP_Threads_Handler | GloMCPH | HTTP/HTTPS server exposing 7 tool endpoints | Solid |
| Anomaly_Detector | via AI_Features_Manager | Multi-stage query threat detection | Partial (see below) |
| RAG_Tool_Handler | via MCP /mcp/rag | Hybrid FTS + vector search with RRF ranking | Solid |

MCP Endpoint Map

HTTP/HTTPS Server (libhttpserver, port 6071)
├── /mcp/config     → Config_Tool_Handler
├── /mcp/query      → Query_Tool_Handler (two-phase discovery)
├── /mcp/admin      → Admin_Tool_Handler
├── /mcp/cache      → Cache_Tool_Handler
├── /mcp/stats      → Stats_Tool_Handler
├── /mcp/ai         → AI_Tool_Handler (NL2SQL, LLM)
└── /mcp/rag        → RAG_Tool_Handler (FTS + vector search)
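All seven endpoints speak JSON over HTTP(S). As an illustration only (the exact request schema is defined by each tool handler, and the payload fields shown are hypothetical), a client call might look like:

```python
import json
import urllib.request

def call_mcp_tool(path: str, payload: dict,
                  host: str = "127.0.0.1", port: int = 6071) -> dict:
    """POST a JSON payload to an MCP tool endpoint and decode the JSON reply."""
    req = urllib.request.Request(
        f"http://{host}:{port}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())

# Hypothetical usage -- field names are not taken from the actual tool schema:
# call_mcp_tool("/mcp/rag", {"query": "connection pool settings", "k": 5})
```

With per-endpoint auth enabled (one of the mcp-* settings), the client would additionally send credentials in the request headers.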

Configuration Variables

The system exposes ~48 genai-* variables and ~15 mcp-* variables via the admin interface, covering:

  • Embedding/reranking service endpoints and timeouts
  • LLM provider configuration (OpenAI, Anthropic, Ollama-compatible)
  • Anomaly detection thresholds and rate limits
  • RAG search parameters (k_max, candidates, timeout)
  • MCP server settings (port, SSL, per-endpoint auth)
  • Hybrid routing preferences (local vs cloud, budget limits)

Roadmap Items

1. Complete LLM Semantic Cache

Priority: High | Effort: Medium

The LLM_Bridge has a semantic cache design where similar queries can return cached LLM responses instead of making redundant API calls. The infrastructure exists (vector DB via sqlite-vec, similarity threshold configuration, cache statistics counters) but the core operations need implementation:

  • check_cache() — vector similarity lookup against previous queries
  • store_in_cache() — store query embedding + response after successful LLM call
  • clear_cache() — cache eviction (by age, size, or manual)
  • get_cache_stats() — expose hit rate, size, avg lookup time

The genai-llm_cache_enabled and genai-llm_cache_similarity_threshold variables are already defined and wired through the admin interface.

Acceptance criteria:

  • Cache reduces duplicate LLM calls by >80% for repeated/similar queries
  • Cache hit/miss metrics visible in stats_genai_* tables
  • Cache can be cleared at runtime via admin command

2. Configurable Embedding Dimensions

Priority: High | Effort: Low

The vector dimension is currently set to 1536 (matching OpenAI text-embedding-3-small). Since ProxySQL supports multiple providers including Ollama and local models, the dimension should adapt to the model in use:

  • Ollama/llama models: typically 4096
  • OpenAI text-embedding-3-small: 1536
  • OpenAI text-embedding-3-large: 3072
  • Cohere: 1024
  • Various others

Proposed approach:

  • Auto-detect dimension from the first embedding response
  • Or allow explicit override via genai-vector_dimension (variable already exists, just needs validation)
  • Ensure sqlite-vec schema adapts when dimension changes

3. Anomaly Detection — Embedding-Based Threat Matching

Priority: Medium | Effort: Medium

The Anomaly_Detector currently has a solid multi-stage pipeline:

  1. SQL injection pattern detection (regex)
  2. Suspicious keyword detection
  3. Query normalization and fingerprinting
  4. Per-user rate limiting
  5. Statistical outlier detection

The planned embedding-based threat similarity stage (comparing query embeddings against a known-threats vector database) would add ML-based detection for novel attack patterns that bypass regex rules.

Implementation steps:

  • Build a threat embeddings database from known SQL injection payloads
  • On each query, compute embedding and check cosine similarity against threat DB
  • Integrate with the existing genai-anomaly_similarity_threshold variable
  • Track false positive rate to tune thresholds
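The core of the new stage is a nearest-neighbor check against the threat database. A sketch (function names are illustrative; in the real pipeline the lookup would be a sqlite-vec query, not a Python loop):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def matches_known_threat(query_emb: list[float],
                         threat_db: list[list[float]],
                         threshold: float) -> bool:
    """Sixth pipeline stage, sketched: flag the query if its embedding is
    close to any known-threat embedding. `threshold` corresponds to the
    existing genai-anomaly_similarity_threshold variable."""
    return any(cosine(query_emb, t) >= threshold for t in threat_db)
```

Because the check is similarity-based rather than pattern-based, an obfuscated payload that defeats the regex stage can still land near its un-obfuscated ancestor in embedding space.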

4. Circuit Breaker for External Services

Priority: Medium | Effort: Low

When external services (embedding endpoint, LLM provider, rerank service) are unavailable, every request currently waits for the full timeout before failing. Under load, this causes cascading latency.

Proposed approach:

  • Track consecutive failures per service endpoint
  • After N failures (configurable), enter "open" state — fast-fail for a cooldown period
  • After cooldown, allow one probe request ("half-open" state)
  • If probe succeeds, return to normal; if fails, extend cooldown
  • Expose circuit breaker state in stats tables
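The state machine above is compact enough to sketch in full. A Python version with an injectable clock for testability (the failure threshold and cooldown would become new configuration variables, which do not exist yet):

```python
import time

class CircuitBreaker:
    """Per-endpoint breaker sketch.

    States: "closed" (normal) -> "open" (fast-fail during cooldown)
    -> "half-open" (single probe allowed) -> back to "closed" or "open".
    """

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "half-open"  # cooldown elapsed: let one probe through
                return True
            return False  # fast-fail instead of waiting for the full timeout
        return True

    def record_success(self) -> None:
        self.state = "closed"
        self.failures = 0

    def record_failure(self) -> None:
        if self.state == "half-open":
            self.state = "open"          # probe failed: extend the cooldown
            self.opened_at = self.clock()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Exposing `state`, `failures`, and `opened_at` per endpoint is all the stats-table item requires.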

5. Observability Enhancements

Priority: Medium | Effort: Low

Add a stats_genai_summary view (or populate existing counters) showing:

| Metric | Source |
|---|---|
| LLM cache hit rate | llm_cache_hits / llm_total_requests |
| Avg LLM response time | llm_total_response_time_ms / llm_total_requests |
| Local vs cloud model distribution | llm_local_model_calls vs llm_cloud_model_calls |
| Daily cloud spend | daily_cloud_spend_usd |
| Anomaly block rate | anomaly_blocked_queries / anomaly_total_checks |
| RAG search latency (p50/p99) | From ToolUsageStats |
| Embedding service health | Circuit breaker state |

The AI_Features_Manager already has atomic counters for most of these — they just need to be exposed via admin tables and/or Prometheus metrics.
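Deriving the summary view is mostly division with divide-by-zero guards. A sketch using the counter names from the table above (the real counters are C++ atomics inside AI_Features_Manager; this only shows the arithmetic):

```python
def genai_summary(counters: dict[str, int]) -> dict[str, float]:
    """Compute derived metrics for the proposed stats_genai_summary view
    from raw counter snapshots, guarding against empty denominators."""
    total = counters.get("llm_total_requests", 0)
    checks = counters.get("anomaly_total_checks", 0)
    return {
        "llm_cache_hit_rate":
            counters.get("llm_cache_hits", 0) / total if total else 0.0,
        "avg_llm_response_time_ms":
            counters.get("llm_total_response_time_ms", 0) / total if total else 0.0,
        "anomaly_block_rate":
            counters.get("anomaly_blocked_queries", 0) / checks if checks else 0.0,
    }
```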


6. Schema-Aware NL2SQL

Priority: Medium | Effort: High

The ai.nl2sql_convert MCP tool converts natural language to SQL via the LLM_Bridge. Its accuracy depends heavily on schema context. ProxySQL is uniquely positioned to provide this because it already knows:

  • All backend servers (mysql_servers, pgsql_servers)
  • Schema information via monitor (if enabled)
  • Query patterns via stats_mysql_query_digest
  • Table relationships via observed JOINs

Proposed enhancement:

  • Pull live schema (tables, columns, types) from monitored backends
  • Include top query patterns as examples in the LLM prompt
  • Cache schema snapshots to avoid repeated introspection
  • Support multi-database context (user specifies which hostgroup/schema)
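Assembling that context into the LLM prompt is straightforward once the pieces exist. A sketch (the prompt structure is hypothetical; the real format is up to the LLM_Bridge, and schema/digest inputs would come from monitor and stats_mysql_query_digest respectively):

```python
def build_nl2sql_prompt(question: str,
                        schema: dict[str, list[str]],
                        example_digests: list[str]) -> str:
    """Assemble a schema-aware prompt for the nl2sql_convert tool.

    `schema` maps table name -> column names; `example_digests` carries
    representative query patterns observed on the backend.
    """
    lines = ["You translate natural language to SQL.", "", "Schema:"]
    for table, columns in schema.items():
        lines.append(f"  {table}({', '.join(columns)})")
    if example_digests:
        lines += ["", "Common query patterns on this backend:"]
        lines += [f"  {d}" for d in example_digests]
    lines += ["", f"Question: {question}", "SQL:"]
    return "\n".join(lines)
```

Caching the rendered schema section, rather than re-introspecting per request, covers the snapshot item above.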

7. RAG Timeout Tuning

Priority: Low | Effort: Low

The default genai-rag_timeout_ms=2000 (2 seconds) may be tight for hybrid searches that involve FTS + vector similarity + Reciprocal Rank Fusion merging, especially with large document collections.
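For reference, the RRF merge at the heart of the hybrid search scores each document as the sum, over the result lists it appears in, of 1 / (k + rank). A sketch (k = 60 is the conventional default from the RRF literature; the constant actually used by RAG_Tool_Handler may differ):

```python
def rrf_merge(fts_ranked: list[str], vec_ranked: list[str],
              k: int = 60) -> list[str]:
    """Merge FTS and vector result lists with Reciprocal Rank Fusion:
    score(doc) = sum over lists of 1 / (k + rank_in_list), ranks 1-based."""
    scores: dict[str, float] = {}
    for ranked in (fts_ranked, vec_ranked):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

The merge itself is cheap; the latency risk is in the two searches feeding it, which is why per-operation timeouts (below) are worth considering.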

Actions:

  • Benchmark RAG search latency at various collection sizes (1K, 10K, 100K documents)
  • Adjust default timeout based on findings
  • Consider adding per-operation timeouts (FTS timeout vs vector timeout) rather than a single global timeout
  • Document recommended timeout values for different deployment sizes

8. Multi-Model Routing Enhancements

Priority: Low | Effort: Medium

The LLM_Bridge currently selects between local and cloud models based on a simple latency heuristic (prefer local if <500ms). The genai-daily_budget_usd and genai-max_cloud_requests_per_hour variables exist for cost control.

Potential improvements:

  • Route by query complexity (simple queries → local, complex → cloud)
  • Route by model capability (embedding vs generation vs classification)
  • Enforce budget limits more strictly (current implementation tracks but may not hard-block)
  • Add model-specific latency tracking for smarter routing decisions
  • Support fallback chains (try local → if timeout → try cloud)
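The fallback-chain idea reduces to iterating an ordered provider list. A sketch (the provider callables and error types are placeholders; the real bridge would distinguish timeouts from hard provider errors):

```python
def call_with_fallback(prompt: str, providers: list, timeout_s: float = 2.0) -> str:
    """Try each provider in order (e.g. local model first, then cloud),
    moving to the next on timeout or connection failure."""
    errors: list[Exception] = []
    for provider in providers:
        try:
            return provider(prompt, timeout_s)
        except (TimeoutError, ConnectionError) as exc:
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")
```

Budget enforcement would slot in naturally here: a cloud provider is simply skipped (not tried) once genai-daily_budget_usd is exhausted.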

9. Test Coverage Expansion

Priority: Medium | Effort: Medium

Current test status:

  • 4 GenAI TAP tests (embedding, reranking, module loading, async operations)
  • 15+ MCP TAP tests (concurrency, stats, query rules, runtime variables)
  • Several integration tests (bash scripts for headless flows, RAG metrics)

Gaps to address:

  • LLM cache tests (blocked on implementation — roadmap item 1)
  • Anomaly detection accuracy tests (false positive/negative rates)
  • RAG search quality tests (precision/recall at various k values)
  • Multi-provider failover tests (what happens when OpenAI is down?)
  • Load tests for MCP server under concurrent connections
  • Tests that verify GenAI features don't break standard proxy behavior (regression)

Some tests currently require external API keys or a GenAI-enabled build. The CI infrastructure should support both modes: with mock endpoints for unit testing, and with real endpoints for integration testing.
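For the mock mode, a deterministic stand-in for genai-embedding_uri is enough for most unit tests. A sketch (the request/response shape is hypothetical, not the actual embedding service protocol):

```python
import hashlib
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def pseudo_embedding(text: str, dim: int) -> list[float]:
    """Deterministic pseudo-embedding: same input always yields the same
    vector, so cache and similarity tests are reproducible without a model."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

class MockEmbeddingHandler(BaseHTTPRequestHandler):
    """CI stand-in for the external embedding service."""
    DIM = 8  # tiny on purpose; real models return 1536/3072/4096 dimensions

    def do_POST(self):
        n = int(self.headers["Content-Length"])
        body = json.loads(self.rfile.read(n))
        out = json.dumps(
            {"embedding": pseudo_embedding(body.get("input", ""), self.DIM)}
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(out)))
        self.end_headers()
        self.wfile.write(out)

    def log_message(self, *args):  # keep CI logs quiet
        pass
```

Pointing genai-embedding_uri at such a server lets the cache, RAG, and anomaly paths run in CI without API keys; the integration mode then swaps in the real endpoints.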


10. Local Embedding Support

Priority: Low | Effort: High

Currently, all embedding operations require an external HTTP service (genai-embedding_uri). The sqlite-rembed extension is already linked into the build — it could potentially provide local embedding computation without an external service dependency.

Investigation needed:

  • Can sqlite-rembed generate embeddings with acceptable quality for the RAG use case?
  • What models does it support and what are the dimension outputs?
  • What's the performance profile (latency, memory) compared to an external service?
  • Could this be an opt-in fallback when no external embedding service is configured?
