An end-to-end, production-grade agentic RAG system that ingests 406 legal contracts, retrieves relevant clauses using hybrid search, reasons about contract risks using LLM agents, and evaluates itself against expert annotations — all deployable with a single `docker compose up`.
Type a legal question, and LegalLens will:
- Search 69,000+ contract chunks using hybrid retrieval (BM25 + dense embeddings + RRF fusion)
- Rerank results using a cross-encoder for precision
- Detect which specific contract you're asking about (contract-aware retrieval)
- Route your query to the right agent tool (clause extraction, risk analysis, or summarization)
- Generate structured JSON answers grounded in the retrieved context
Example:
Query: "What is the governing law in the Todos Medical agreement with Care G.B. Plus?"
Response:
{
  "clause_type": "Governing Law",
  "found": true,
  "clauses": [{
    "text": "This Agreement shall be governed by and construed in accordance with the laws of the State of Israel, and the courts of Tel-Aviv, Israel",
    "summary": "Governed by laws of Israel with Tel-Aviv courts jurisdiction"
  }]
}
┌─────────────────────────────────────────────────────────────────┐
│ CUAD Dataset │
│ 406 contracts · 69K chunks │
└──────────────┬──────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Data Pipeline (Phase 1) │
│ Download → Parse → Recursive Chunking → DVC Versioning │
└──────────┬───────────────────────────┬───────────────────────────┘
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────────┐
│ BM25 Sparse Index │ │ ChromaDB Dense Index │
│ rank-bm25 │ │ all-MiniLM-L6-v2 │
│ 1.9M tokens │ │ 69K vectors (384-dim) │
└─────────┬───────────┘ └──────────┬────────────────┘
│ │
└────────────┬─────────────┘
▼
┌──────────────────────────────────────────────────────────────────┐
│ Hybrid Retrieval (Phase 2) │
│ RRF Fusion → Contract Detection → Cross-Encoder Rerank │
│ ms-marco-MiniLM-L-6-v2 │
└──────────────────────────┬───────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Agentic RAG (Phase 3) Groq / Llama 3.3 70B │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────┐ │
│ │ Clause │ │ Risk │ │ Contract │ │
│ │ Extractor │ │ Analyzer │ │ Summarizer │ │
│ └─────────────┘ └──────────────┘ └────────────────────┘ │
│ LangGraph orchestration · Intent classification · JSON output │
└──────────────────────────┬───────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Evaluation (Phase 4) │
│ LLM-as-a-Judge: Faithfulness 4.6/5 · Relevance 4.0/5 │
│ Context Relevance 4.2/5 · Overall 4.27/5 │
│ MLflow experiment tracking · Retrieval metrics (MRR 0.29) │
└──────────────────────────┬───────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Serving (Phase 5) Docker Compose │
│ FastAPI · Streamlit · ChromaDB · MLflow · Prometheus · Grafana │
│ 6 services · One-command deployment · Health checks │
└──────────────────────────────────────────────────────────────────┘
# Clone and setup
git clone https://github.com/Ahb1104/legallens.git
cd legallens
python -m venv .venv && source .venv/bin/activate
make setup
# Add your Groq API key (free at https://console.groq.com)
# Edit .env and set LEGALLENS_GROQ_API_KEY=gsk_your_key_here
# Run the full ML pipeline
python -m scripts.run_phase1 --no-benchmark # ~4 min
make phase2-quick # ~30s
make phase3 # ~20s
make phase4-quick # ~50s
# Start the app
make phase5
# Streamlit UI: http://localhost:8501
# API docs: http://localhost:8000/docs

# Or deploy the full stack with Docker instead:
git clone https://github.com/Ahb1104/legallens.git
cd legallens
python -m venv .venv && source .venv/bin/activate
make setup
# Edit .env with your Groq API key
python -m scripts.run_phase1 --no-benchmark
# Launch all 6 services
docker compose up -d --build
# Streamlit UI: http://localhost:8501
# API docs: http://localhost:8000/docs
# Grafana: http://localhost:3000 (admin / legallens)
# MLflow: http://localhost:5001
# Prometheus: http://localhost:9090

Streamlit UI (http://localhost:8501)
Agent Mode — Type a natural language question:
| Query | Tool Used | What You Get |
|---|---|---|
| "Extract the non-compete clause" | Clause Extractor | Exact clause text + plain English summary |
| "What are the risks in termination provisions?" | Risk Analyzer | Risk list with severity + recommendations |
| "Summarize the key terms and obligations" | Summarizer | Executive summary, key terms, obligations |
| "What is the governing law in the Todos Medical agreement?" | Clause Extractor | Contract-specific clause with jurisdiction |
| "Is the distributor restricted from selling competing products?" | Clause Extractor | Cross-contract non-compete comparison |
Search Mode — Returns raw retrieved passages with relevance scores. Shows what the retriever finds before the LLM processes it.
REST API (http://localhost:8000/docs)
# Full agent pipeline
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"query": "Extract the non-compete clause", "top_k": 5}'
# Retrieval only (no LLM)
curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{"query": "indemnification liability", "top_k": 10}'
# Health check
curl http://localhost:8000/health

| Method | MRR | Recall@1 | Recall@5 | Recall@10 | Recall@20 |
|---|---|---|---|---|---|
| BM25 only | 0.171 | 0.049 | 0.317 | 0.512 | 0.683 |
| Dense only | 0.175 | 0.073 | 0.317 | 0.439 | 0.610 |
| Hybrid (RRF) | 0.217 | 0.098 | 0.341 | 0.463 | 0.659 |
| Hybrid + Rerank | 0.290 | 0.171 | 0.415 | 0.561 | 0.659 |
Each layer adds measurable value: Hybrid + Rerank leads on every metric except Recall@20, where BM25 alone (0.683) still edges out the fused pipelines (0.659).
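For reference, a minimal sketch of the MRR computation behind the table (the relevance-flag input format is an assumption, not the repo's evaluator interface):

```python
# Mean Reciprocal Rank: for each query, score 1/rank of the first
# relevant result, or 0 if nothing relevant is returned.
def mean_reciprocal_rank(results: list[list[bool]]) -> float:
    total = 0.0
    for relevance_flags in results:  # one rank-ordered boolean list per query
        for rank, is_relevant in enumerate(relevance_flags, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(results)

# Hits at rank 2, rank 1, and a miss: (0.5 + 1.0 + 0.0) / 3 = 0.5
# mean_reciprocal_rank([[False, True], [True], [False, False]])
```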
| Metric | Score |
|---|---|
| Faithfulness (grounded in context) | 4.60 / 5 |
| Answer Relevance (addresses the question) | 4.00 / 5 |
| Context Relevance (right passages retrieved) | 4.20 / 5 |
| Overall | 4.27 / 5 |
| Metric | Score |
|---|---|
| Answer Recall | 0.540 |
| Accuracy (Recall ≥ 0.5) | 55.0% |
| Answer F1 | 0.141 |
| Queries evaluated | 20 |
Top-performing clause types: governing law (93% recall), termination for convenience (88%), no-solicit of employees (86%), audit rights (79%).
All metrics measured against expert lawyer annotations — 22,450 ground truth clause extractions labeled by trained law students and reviewed by experienced attorneys.
Note: F1 is low by design — the system generates analytical paragraphs while the ground truth consists of short clause excerpts, so precision suffers. Answer Recall is the primary metric for this analytical use case.
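A hedged sketch of what token-level Answer Recall measures here (the repo's exact tokenization may differ):

```python
# Fraction of ground-truth tokens that appear in the generated answer.
# A long analytical answer can score high recall while F1 stays low,
# which is why recall is the headline metric for this use case.
def answer_recall(prediction: str, ground_truth: str) -> float:
    pred_tokens = set(prediction.lower().split())
    gold_tokens = ground_truth.lower().split()
    if not gold_tokens:
        return 0.0
    hits = sum(1 for tok in gold_tokens if tok in pred_tokens)
    return hits / len(gold_tokens)
```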
legallens/
├── config/
│ └── settings.py # Pydantic-settings config (all params via .env)
├── src/
│ ├── data/
│ │ ├── download.py # CUAD dataset downloader (GitHub → SQuAD JSON)
│ │ └── chunker.py # Recursive + semantic chunking with metadata
│ ├── indexing/
│ │ ├── bm25_index.py # BM25Okapi sparse index with legal stopwords
│ │ └── chroma_index.py # ChromaDB dense index (all-MiniLM-L6-v2)
│ ├── retrieval/
│ │ ├── hybrid.py # RRF fusion + contract-aware filtering
│ │ ├── reranker.py # Cross-encoder reranking (ms-marco-MiniLM)
│ │ └── evaluator.py # Retrieval eval with query extraction + token F1
│ ├── agent/
│ │ ├── tools.py # 3 tools with neighbor-chunk context
│ │ └── graph.py # LangGraph agent with intent classification
│ ├── evaluation/
│ │ ├── llm_judge.py # LLM-as-a-judge (faithfulness, relevance, context)
│ │ └── tracker.py # MLflow experiment logging
│ └── api/
│ ├── main.py # FastAPI with Prometheus metrics
│ └── streamlit_app.py # Streamlit UI
├── scripts/
│ ├── run_phase1.py # Data pipeline orchestrator
│ ├── run_phase2.py # Retrieval evaluation
│ ├── run_phase3.py # Agent demo
│ ├── run_phase4.py # LLM-as-judge evaluation
│ └── run_phase5.py # Local server launcher
├── tests/
│ └── test_chunker.py # Unit tests
├── monitoring/
│ └── prometheus.yml # Prometheus scrape config
├── data/ # DVC-tracked (not in git)
├── Dockerfile
├── docker-compose.yml # 6-service stack
├── Makefile
├── pyproject.toml
├── requirements.txt
└── README.md
| Layer | Tool | Purpose |
|---|---|---|
| LLM | Groq API (Llama 3.3 70B) | Agent reasoning + evaluation |
| Embeddings | all-MiniLM-L6-v2 | Dense vector embeddings (384-dim) |
| Reranker | ms-marco-MiniLM-L-6-v2 | Cross-encoder reranking |
| Vector DB | ChromaDB | Persistent dense vector storage |
| Sparse Search | rank-bm25 | BM25Okapi keyword retrieval |
| Agent | LangGraph | Tool orchestration + state management |
| Evaluation | LLM-as-a-Judge | Faithfulness, relevance, context scoring |
| Tracking | MLflow | Experiment parameter/metric logging |
| Data Versioning | DVC | Git-based data versioning |
| API | FastAPI | REST API with Prometheus instrumentation |
| UI | Streamlit | Interactive query interface |
| Monitoring | Prometheus + Grafana | Latency and throughput dashboards |
| Containers | Docker Compose | One-command 6-service deployment |
Total cost: $0 — all tools are free-tier services or run locally.
Hybrid retrieval — BM25 catches exact keyword matches (legal terms, party names). Dense embeddings catch semantic similarity ("covenant not to compete" ≈ "non-compete clause"). RRF fusion combines both without score normalization.
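A minimal sketch of the RRF computation (assuming ranked lists of chunk IDs; names are illustrative, not the repo's API):

```python
# Reciprocal Rank Fusion: merge two ranked lists using only ranks,
# so BM25 scores and cosine similarities never need normalization.
def rrf_fuse(bm25_ranking: list[str], dense_ranking: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (bm25_ranking, dense_ranking):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The k = 60 constant is the convention from the original RRF paper; a chunk ranked highly by either retriever floats to the top of the fused list.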
Contract-aware retrieval — When a query mentions a specific company name, the system detects this and filters retrieval to that contract's chunks only. Without this, generic clause queries return results from whichever contract has the strongest keyword match.
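An illustrative sketch of that filter (the metadata layout and helper names are assumptions):

```python
# Detect a contract mention in the query, then restrict candidates
# to that contract's chunks before fusion and reranking.
def detect_contract(query: str, known_contracts: list[str]) -> str | None:
    q = query.lower()
    return next((name for name in known_contracts if name.lower() in q), None)

def filter_to_contract(chunks: list[dict], contract: str | None) -> list[dict]:
    if contract is None:
        return chunks  # no mention detected: search across all contracts
    return [c for c in chunks if c["metadata"]["contract_name"] == contract]
```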
Neighbor chunk context — Legal clauses often span chunk boundaries. When the retriever finds a relevant chunk, the system includes the previous and next chunks from the same contract, eliminating sentence cutoff problems.
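A sketch of that expansion, assuming each chunk's metadata carries its contract name and a sequential chunk_index (both names are illustrative):

```python
# Expand a retrieved hit with its immediate neighbors from the same
# contract so clauses split across chunk boundaries stay intact.
def with_neighbors(hit: dict, index: dict[tuple[str, int], dict]) -> list[dict]:
    contract = hit["metadata"]["contract_name"]
    pos = hit["metadata"]["chunk_index"]
    window = []
    for i in (pos - 1, pos, pos + 1):
        neighbor = index.get((contract, i))
        if neighbor is not None:
            window.append(neighbor)
    return window
```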
LLM-as-a-Judge — Scores faithfulness, relevance, and context quality on a 1-5 scale with reasoning. Simpler and more stable than RAGAS, using the same Groq API.
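A hedged sketch of the judge-call shape: one scoring prompt per metric, returning a score plus reasoning. The prompt wording is illustrative, not the project's actual template:

```python
import json

# One LLM call per metric; the model returns a 1-5 score plus a short
# justification as JSON for easy aggregation across queries.
JUDGE_TEMPLATE = """Rate the following RAG answer for {metric} on a 1-5 scale.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Respond with JSON only: {{"score": <1-5>, "reasoning": "<one sentence>"}}"""

def judge(llm, metric: str, question: str, context: str, answer: str) -> dict:
    # `llm` is any text-in/text-out callable (e.g. a thin Groq client wrapper)
    prompt = JUDGE_TEMPLATE.format(metric=metric, question=question,
                                   context=context, answer=answer)
    return json.loads(llm(prompt))
```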
Token overlap evaluation — CUAD ground-truth answers average 400-600 chars while indexed chunks average ~320 chars, so exact-span matching would fail. Token-level F1 with a 0.3 match threshold absorbs these answer-chunk boundary mismatches (the same approach SQuAD benchmarks use).
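A minimal sketch of SQuAD-style token F1; a chunk counts as containing the answer when F1 clears the 0.3 threshold:

```python
from collections import Counter

# Token-level F1 between a retrieved chunk and a ground-truth span;
# order-insensitive, so it tolerates boundary mismatches.
def token_f1(prediction: str, ground_truth: str) -> float:
    pred = prediction.lower().split()
    gold = ground_truth.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```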
Query extraction from CUAD templates — CUAD questions like "Highlight the parts related to 'Non-Compete'" get transformed to "non-compete clause" so BM25 gets useful signal instead of matching boilerplate.
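An illustrative version of that transform (the regex and fallback are assumptions about the template format):

```python
import re

# Turn a CUAD template question into a keyword query so BM25 matches
# the clause label rather than the boilerplate wording.
def extract_query(cuad_question: str) -> str:
    match = re.search(r"[\"']([^\"']+)[\"']", cuad_question)
    if match:
        return f"{match.group(1).lower()} clause"
    return cuad_question

# extract_query('Highlight the parts related to "Non-Compete"')
# -> 'non-compete clause'
```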
CUAD (Contract Understanding Atticus Dataset) — NeurIPS 2021
- 406 unique contracts from SEC EDGAR filings
- 22,450 QA annotations by expert lawyers
- 41 clause types (termination, indemnification, non-compete, IP, governing law, etc.)
- 69,522 indexed chunks (512-char windows, 128-char overlap; see the chunking sketch below)
- Licensed under CC BY 4.0
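The repo's chunker.py does recursive chunking with metadata; this sketch shows only the fixed-window arithmetic implied by the stated parameters:

```python
# Slide a 512-char window across the text, keeping 128 chars of overlap
# between consecutive chunks (384 fresh characters per step).
def chunk_text(text: str, size: int = 512, overlap: int = 128) -> list[str]:
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```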
| Service | Port | Purpose |
|---|---|---|
| api | 8000 | FastAPI REST API |
| streamlit | 8501 | Web UI |
| chromadb | 8100 | Vector database |
| mlflow | 5001 | Experiment tracking |
| prometheus | 9090 | Metrics collection |
| grafana | 3000 | Monitoring dashboards (admin / legallens) |
docker compose up -d --build # Start all
docker compose ps # Check status
docker compose logs -f api # Follow API logs
docker compose down            # Stop all

make setup          # Install deps + init DVC
make phase1 # Data pipeline (download + chunk + index)
make phase2 # Retrieval evaluation
make phase3 # Agent demo (3 queries)
make phase4 # LLM-as-judge evaluation
make phase5 # Start FastAPI + Streamlit locally
make docker-up # Start Docker stack
make docker-down # Stop Docker stack
make test # Run unit tests
make lint           # Run ruff linter

- Contract selector UI — Dropdown to pick a specific contract before querying
- PDF upload — Allow users to upload their own contracts for analysis
- Multi-contract comparison — "Compare indemnification across all software agreements"
- Fine-tuned embeddings — Domain-specific legal embeddings for better retrieval
- Semantic chunking — Embedding similarity breakpoints instead of fixed-size splits
- Output guardrails — JSON validation and hallucination detection
@article{hendrycks2021cuad,
title={CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review},
author={Dan Hendrycks and Collin Burns and Anya Chen and Spencer Ball},
journal={NeurIPS},
year={2021}
}

MIT