An end-to-end, production-grade agentic RAG system that ingests 406 legal contracts, retrieves relevant clauses using hybrid search, reasons about contract risks using LLM agents, and evaluates itself against expert annotations — all deployable with a single `docker compose up`.
Type a legal question, and LegalLens will:
- Search 69,000+ contract chunks using hybrid retrieval (BM25 + dense embeddings + RRF fusion)
- Rerank results using a cross-encoder for precision
- Detect which specific contract you're asking about (contract-aware retrieval)
- Route your query to the right agent tool (clause extraction, risk analysis, or summarization)
- Generate structured JSON answers grounded in the retrieved context
Example:
Query: "What is the governing law in the Todos Medical agreement with Care G.B. Plus?"
Response:
{
  "clause_type": "Governing Law",
  "found": true,
  "clauses": [{
    "text": "This Agreement shall be governed by and construed in accordance with the laws of the State of Israel, and the courts of Tel-Aviv, Israel",
    "summary": "Governed by laws of Israel with Tel-Aviv courts jurisdiction"
  }]
}
┌─────────────────────────────────────────────────────────────────┐
│ CUAD Dataset │
│ 406 contracts · 69K chunks │
└──────────────┬──────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Data Pipeline (Phase 1) │
│ Download → Parse → Recursive Chunking → DVC Versioning │
└──────────┬───────────────────────────┬───────────────────────────┘
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────────┐
│ BM25 Sparse Index │ │ ChromaDB Dense Index │
│ rank-bm25 │ │ all-MiniLM-L6-v2 │
│ 1.9M tokens │ │ 69K vectors (384-dim) │
└─────────┬───────────┘ └──────────┬────────────────┘
│ │
└────────────┬─────────────┘
▼
┌──────────────────────────────────────────────────────────────────┐
│ Hybrid Retrieval (Phase 2) │
│ RRF Fusion → Contract Detection → Cross-Encoder Rerank │
│ ms-marco-MiniLM-L-6-v2 │
└──────────────────────────┬───────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Agentic RAG (Phase 3) Groq / Llama 3.3 70B │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────┐ │
│ │ Clause │ │ Risk │ │ Contract │ │
│ │ Extractor │ │ Analyzer │ │ Summarizer │ │
│ └─────────────┘ └──────────────┘ └────────────────────┘ │
│ LangGraph orchestration · Intent classification · JSON output │
└──────────────────────────┬───────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Evaluation (Phase 4) │
│ LLM-as-a-Judge: Faithfulness 4.6/5 · Relevance 4.0/5 │
│ Context Relevance 4.2/5 · Overall 4.27/5 │
│ MLflow experiment tracking · Retrieval metrics (MRR 0.29) │
└──────────────────────────┬───────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Serving (Phase 5) Docker Compose │
│ FastAPI · Streamlit · ChromaDB · MLflow · Prometheus · Grafana │
│ 6 services · One-command deployment · Health checks │
└──────────────────────────────────────────────────────────────────┘
# Clone and setup
git clone https://github.com/Ahb1104/legallens.git
cd legallens
python -m venv .venv && source .venv/bin/activate
make setup
# Add your Groq API key (free at https://console.groq.com)
# Edit .env and set LEGALLENS_GROQ_API_KEY=gsk_your_key_here
# Run the full ML pipeline
python -m scripts.run_phase1 --no-benchmark # ~4 min
make phase2-quick # ~30s
make phase3 # ~20s
make phase4-quick # ~50s
# Start the app
make phase5
# Streamlit UI: http://localhost:8501
# API docs: http://localhost:8000/docs

# Or deploy the full stack with Docker instead:
git clone https://github.com/Ahb1104/legallens.git
cd legallens
python -m venv .venv && source .venv/bin/activate
make setup
# Edit .env with your Groq API key
python -m scripts.run_phase1 --no-benchmark
# Launch all 6 services
docker compose up -d --build
# Streamlit UI: http://localhost:8501
# API docs: http://localhost:8000/docs
# Grafana: http://localhost:3000 (admin / legallens)
# MLflow: http://localhost:5001
# Prometheus: http://localhost:9090

Streamlit UI (http://localhost:8501)
Agent Mode — Type a natural language question:
| Query | Tool Used | What You Get |
|---|---|---|
| "Extract the non-compete clause" | Clause Extractor | Exact clause text + plain English summary |
| "What are the risks in termination provisions?" | Risk Analyzer | Risk list with severity + recommendations |
| "Summarize the key terms and obligations" | Summarizer | Executive summary, key terms, obligations |
| "What is the governing law in the Todos Medical agreement?" | Clause Extractor | Contract-specific clause with jurisdiction |
| "Is the distributor restricted from selling competing products?" | Clause Extractor | Cross-contract non-compete comparison |
Search Mode — Returns raw retrieved passages with relevance scores. Shows what the retriever finds before the LLM processes it.
REST API (http://localhost:8000/docs)
# Full agent pipeline
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"query": "Extract the non-compete clause", "top_k": 5}'
# Retrieval only (no LLM)
curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{"query": "indemnification liability", "top_k": 10}'
# Health check
curl http://localhost:8000/health

| Method | MRR | Recall@1 | Recall@5 | Recall@10 | Recall@20 |
|---|---|---|---|---|---|
| BM25 only | 0.171 | 0.049 | 0.317 | 0.512 | 0.683 |
| Dense only | 0.175 | 0.073 | 0.317 | 0.439 | 0.610 |
| Hybrid (RRF) | 0.217 | 0.098 | 0.341 | 0.463 | 0.659 |
| Hybrid + Rerank | 0.290 | 0.171 | 0.415 | 0.561 | 0.659 |
Each layer adds measurable value: Hybrid + Rerank leads on every metric except Recall@20, where BM25 alone (0.683) still edges out the fused pipelines (0.659).
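For reference, a minimal sketch of the MRR computation behind the table (the relevance-flag input format is an assumption, not the repo's evaluator interface):

```python
# Mean Reciprocal Rank: for each query, score 1/rank of the first
# relevant result, or 0 if nothing relevant is returned.
def mean_reciprocal_rank(results: list[list[bool]]) -> float:
    total = 0.0
    for relevance_flags in results:  # one rank-ordered boolean list per query
        for rank, is_relevant in enumerate(relevance_flags, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(results)

# Hits at rank 2, rank 1, and a miss: (0.5 + 1.0 + 0.0) / 3 = 0.5
# mean_reciprocal_rank([[False, True], [True], [False, False]])
```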
| Metric | Score |
|---|---|
| Faithfulness (grounded in context) | 4.60 / 5 |
| Answer Relevance (addresses the question) | 4.00 / 5 |
| Context Relevance (right passages retrieved) | 4.20 / 5 |
| Overall | 4.27 / 5 |
| Metric | Score |
|---|---|
| Answer Recall | 0.540 |
| Accuracy (Recall ≥ 0.5) | 55.0% |
| Answer F1 | 0.141 |
| Queries evaluated | 20 |
Top-performing clause types: governing law (93% recall), termination for convenience (88%), no-solicit of employees (86%), audit rights (79%).
All metrics measured against expert lawyer annotations — 22,450 ground truth clause extractions labeled by trained law students and reviewed by experienced attorneys.
Note: F1 is low by design — the system generates analytical paragraphs while the ground truth consists of short clause excerpts, so precision suffers. Answer Recall is the primary metric for this analytical use case.
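A hedged sketch of what token-level Answer Recall measures here (the repo's exact tokenization may differ):

```python
# Fraction of ground-truth tokens that appear in the generated answer.
# A long analytical answer can score high recall while F1 stays low,
# which is why recall is the headline metric for this use case.
def answer_recall(prediction: str, ground_truth: str) -> float:
    pred_tokens = set(prediction.lower().split())
    gold_tokens = ground_truth.lower().split()
    if not gold_tokens:
        return 0.0
    hits = sum(1 for tok in gold_tokens if tok in pred_tokens)
    return hits / len(gold_tokens)
```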
legallens/
├── config/
│ └── settings.py # Pydantic-settings config (all params via .env)
├── src/
│ ├── data/
│ │ ├── download.py # CUAD dataset downloader (GitHub → SQuAD JSON)
│ │ └── chunker.py # Recursive + semantic chunking with metadata
│ ├── indexing/
│ │ ├── bm25_index.py # BM25Okapi sparse index with legal stopwords
│ │ └── chroma_index.py # ChromaDB dense index (all-MiniLM-L6-v2)
│ ├── retrieval/
│ │ ├── hybrid.py # RRF fusion + contract-aware filtering
│ │ ├── reranker.py # Cross-encoder reranking (ms-marco-MiniLM)
│ │ └── evaluator.py # Retrieval eval with query extraction + token F1
│ ├── agent/
│ │ ├── tools.py # 3 tools with neighbor-chunk context
│ │ └── graph.py # LangGraph agent with intent classification
│ ├── evaluation/
│ │ ├── llm_judge.py # LLM-as-a-judge (faithfulness, relevance, context)
│ │ └── tracker.py # MLflow experiment logging
│ └── api/
│ ├── main.py # FastAPI with Prometheus metrics
│ └── streamlit_app.py # Streamlit UI
├── scripts/
│ ├── run_phase1.py # Data pipeline orchestrator
│ ├── run_phase2.py # Retrieval evaluation
│ ├── run_phase3.py # Agent demo
│ ├── run_phase4.py # LLM-as-judge evaluation
│ └── run_phase5.py # Local server launcher
├── tests/
│ └── test_chunker.py # Unit tests
├── monitoring/
│ └── prometheus.yml # Prometheus scrape config
├── data/ # DVC-tracked (not in git)
├── Dockerfile
├── docker-compose.yml # 6-service stack
├── Makefile
├── pyproject.toml
├── requirements.txt
└── README.md
| Layer | Tool | Purpose |
|---|---|---|
| LLM | Groq API (Llama 3.3 70B) | Agent reasoning + evaluation |
| Embeddings | all-MiniLM-L6-v2 | Dense vector embeddings (384-dim) |
| Reranker | ms-marco-MiniLM-L-6-v2 | Cross-encoder reranking |
| Vector DB | ChromaDB | Persistent dense vector storage |
| Sparse Search | rank-bm25 | BM25Okapi keyword retrieval |
| Agent | LangGraph | Tool orchestration + state management |
| Evaluation | LLM-as-a-Judge | Faithfulness, relevance, context scoring |
| Tracking | MLflow | Experiment parameter/metric logging |
| Data Versioning | DVC | Git-based data versioning |
| API | FastAPI | REST API with Prometheus instrumentation |
| UI | Streamlit | Interactive query interface |
| Monitoring | Prometheus + Grafana | Latency and throughput dashboards |
| Containers | Docker Compose | One-command 6-service deployment |
Total cost: $0 — all tools are free-tier services or run locally.
Hybrid retrieval — BM25 catches exact keyword matches (legal terms, party names). Dense embeddings catch semantic similarity ("covenant not to compete" ≈ "non-compete clause"). RRF fusion combines both without score normalization.
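A minimal sketch of the RRF computation (assuming ranked lists of chunk IDs; names are illustrative, not the repo's API):

```python
# Reciprocal Rank Fusion: merge two ranked lists using only ranks,
# so BM25 scores and cosine similarities never need normalization.
def rrf_fuse(bm25_ranking: list[str], dense_ranking: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (bm25_ranking, dense_ranking):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The k = 60 constant is the convention from the original RRF paper; a chunk ranked highly by either retriever floats to the top of the fused list.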
Contract-aware retrieval — When a query mentions a specific company name, the system detects this and filters retrieval to that contract's chunks only. Without this, generic clause queries return results from whichever contract has the strongest keyword match.
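An illustrative sketch of that filter (the metadata layout and helper names are assumptions):

```python
# Detect a contract mention in the query, then restrict candidates
# to that contract's chunks before fusion and reranking.
def detect_contract(query: str, known_contracts: list[str]) -> str | None:
    q = query.lower()
    return next((name for name in known_contracts if name.lower() in q), None)

def filter_to_contract(chunks: list[dict], contract: str | None) -> list[dict]:
    if contract is None:
        return chunks  # no mention detected: search across all contracts
    return [c for c in chunks if c["metadata"]["contract_name"] == contract]
```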
Neighbor chunk context — Legal clauses often span chunk boundaries. When the retriever finds a relevant chunk, the system includes the previous and next chunks from the same contract, eliminating sentence cutoff problems.
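A sketch of that expansion, assuming each chunk's metadata carries its contract name and a sequential chunk_index (both names are illustrative):

```python
# Expand a retrieved hit with its immediate neighbors from the same
# contract so clauses split across chunk boundaries stay intact.
def with_neighbors(hit: dict, index: dict[tuple[str, int], dict]) -> list[dict]:
    contract = hit["metadata"]["contract_name"]
    pos = hit["metadata"]["chunk_index"]
    window = []
    for i in (pos - 1, pos, pos + 1):
        neighbor = index.get((contract, i))
        if neighbor is not None:
            window.append(neighbor)
    return window
```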
LLM-as-a-Judge — Scores faithfulness, relevance, and context quality on a 1-5 scale with reasoning. Simpler and more stable than RAGAS, using the same Groq API.
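A hedged sketch of the judge-call shape: one scoring prompt per metric, returning a score plus reasoning. The prompt wording is illustrative, not the project's actual template:

```python
import json

# One LLM call per metric; the model returns a 1-5 score plus a short
# justification as JSON for easy aggregation across queries.
JUDGE_TEMPLATE = """Rate the following RAG answer for {metric} on a 1-5 scale.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Respond with JSON only: {{"score": <1-5>, "reasoning": "<one sentence>"}}"""

def judge(llm, metric: str, question: str, context: str, answer: str) -> dict:
    # `llm` is any text-in/text-out callable (e.g. a thin Groq client wrapper)
    prompt = JUDGE_TEMPLATE.format(metric=metric, question=question,
                                   context=context, answer=answer)
    return json.loads(llm(prompt))
```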
Token overlap evaluation — CUAD ground-truth answers average 400-600 chars while indexed chunks average ~320 chars, so exact-span matching would fail. Token-level F1 with a 0.3 match threshold absorbs these answer-chunk boundary mismatches (the same approach SQuAD benchmarks use).
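A minimal sketch of SQuAD-style token F1; a chunk counts as containing the answer when F1 clears the 0.3 threshold:

```python
from collections import Counter

# Token-level F1 between a retrieved chunk and a ground-truth span;
# order-insensitive, so it tolerates boundary mismatches.
def token_f1(prediction: str, ground_truth: str) -> float:
    pred = prediction.lower().split()
    gold = ground_truth.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```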
Query extraction from CUAD templates — CUAD questions like "Highlight the parts related to 'Non-Compete'" get transformed to "non-compete clause" so BM25 gets useful signal instead of matching boilerplate.
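An illustrative version of that transform (the regex and fallback are assumptions about the template format):

```python
import re

# Turn a CUAD template question into a keyword query so BM25 matches
# the clause label rather than the boilerplate wording.
def extract_query(cuad_question: str) -> str:
    match = re.search(r"[\"']([^\"']+)[\"']", cuad_question)
    if match:
        return f"{match.group(1).lower()} clause"
    return cuad_question

# extract_query('Highlight the parts related to "Non-Compete"')
# -> 'non-compete clause'
```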
CUAD (Contract Understanding Atticus Dataset) — NeurIPS 2021
- 406 unique contracts from SEC EDGAR filings
- 22,450 QA annotations by expert lawyers
- 41 clause types (termination, indemnification, non-compete, IP, governing law, etc.)
- 69,522 indexed chunks (512-char windows, 128-char overlap; see the chunking sketch below)
- Licensed under CC BY 4.0
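The repo's chunker.py does recursive chunking with metadata; this sketch shows only the fixed-window arithmetic implied by the stated parameters:

```python
# Slide a 512-char window across the text, keeping 128 chars of overlap
# between consecutive chunks (384 fresh characters per step).
def chunk_text(text: str, size: int = 512, overlap: int = 128) -> list[str]:
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```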
| Service | Port | Purpose |
|---|---|---|
| api | 8000 | FastAPI REST API |
| streamlit | 8501 | Web UI |
| chromadb | 8100 | Vector database |
| mlflow | 5001 | Experiment tracking |
| prometheus | 9090 | Metrics collection |
| grafana | 3000 | Monitoring dashboards (admin / legallens) |
docker compose up -d --build # Start all
docker compose ps # Check status
docker compose logs -f api # Follow API logs
docker compose down            # Stop all

make setup          # Install deps + init DVC
make phase1 # Data pipeline (download + chunk + index)
make phase2 # Retrieval evaluation
make phase3 # Agent demo (3 queries)
make phase4 # LLM-as-judge evaluation
make phase5 # Start FastAPI + Streamlit locally
make docker-up # Start Docker stack
make docker-down # Stop Docker stack
make test # Run unit tests
make lint           # Run ruff linter

- Contract selector UI — Dropdown to pick a specific contract before querying
- PDF upload — Allow users to upload their own contracts for analysis
- Multi-contract comparison — "Compare indemnification across all software agreements"
- Fine-tuned embeddings — Domain-specific legal embeddings for better retrieval
- Semantic chunking — Embedding similarity breakpoints instead of fixed-size splits
- Output guardrails — JSON validation and hallucination detection
@article{hendrycks2021cuad,
title={CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review},
author={Dan Hendrycks and Collin Burns and Anya Chen and Spencer Ball},
journal={NeurIPS},
year={2021}
}

MIT