
⚖️ LegalLens — Agentic RAG for Legal Contract Analysis

An end-to-end, production-grade agentic RAG system that ingests 406 legal contracts, retrieves relevant clauses using hybrid search, reasons about contract risks using LLM agents, and evaluates itself against expert annotations — all deployable with a single docker compose up.



What It Does

Type a legal question, and LegalLens will:

  1. Search 69,000+ contract chunks using hybrid retrieval (BM25 + dense embeddings + RRF fusion)
  2. Rerank results using a cross-encoder for precision
  3. Detect which specific contract you're asking about (contract-aware retrieval)
  4. Route your query to the right agent tool (clause extraction, risk analysis, or summarization)
  5. Generate structured JSON answers grounded in the retrieved context

Example:

Query: "What is the governing law in the Todos Medical agreement with Care G.B. Plus?"

Response:
{
  "clause_type": "Governing Law",
  "found": true,
  "clauses": [{
    "text": "This Agreement shall be governed by and construed in accordance
             with the laws of the State of Israel, and the courts of Tel-Aviv, Israel",
    "summary": "Governed by laws of Israel with Tel-Aviv courts jurisdiction"
  }]
}
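Because answers are structured JSON, they can be consumed programmatically. A minimal parsing sketch (the field names mirror the example above; the `Clause` dataclass is illustrative, not the project's schema):

```python
import json
from dataclasses import dataclass


@dataclass
class Clause:
    text: str
    summary: str


def parse_response(raw: str) -> tuple[str, list[Clause]]:
    """Parse the agent's JSON answer into typed records."""
    data = json.loads(raw)
    clauses = []
    if data.get("found"):
        clauses = [Clause(c["text"], c["summary"]) for c in data.get("clauses", [])]
    return data["clause_type"], clauses


raw = ('{"clause_type": "Governing Law", "found": true, '
       '"clauses": [{"text": "governed by the laws of Israel", '
       '"summary": "Israeli law applies"}]}')
kind, clauses = parse_response(raw)
```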

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        CUAD Dataset                             │
│                   406 contracts · 69K chunks                    │
└──────────────┬──────────────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────────────┐
│  Data Pipeline (Phase 1)                                         │
│  Download → Parse → Recursive Chunking → DVC Versioning          │
└──────────┬───────────────────────────┬───────────────────────────┘
           │                           │
           ▼                           ▼
┌─────────────────────┐   ┌──────────────────────────┐
│   BM25 Sparse Index │   │  ChromaDB Dense Index    │
│   rank-bm25         │   │  all-MiniLM-L6-v2        │
│   1.9M tokens       │   │  69K vectors (384-dim)   │
└─────────┬───────────┘   └──────────┬───────────────┘
          │                          │
          └────────────┬─────────────┘
                       ▼
┌──────────────────────────────────────────────────────────────────┐
│  Hybrid Retrieval (Phase 2)                                      │
│  RRF Fusion → Contract Detection → Cross-Encoder Rerank          │
│  ms-marco-MiniLM-L-6-v2                                          │
└──────────────────────────┬───────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────┐
│  Agentic RAG (Phase 3)                     Groq / Llama 3.3 70B  │
│  ┌─────────────┐ ┌──────────────┐ ┌────────────────────┐         │
│  │   Clause    │ │    Risk      │ │    Contract        │         │
│  │  Extractor  │ │  Analyzer    │ │   Summarizer       │         │
│  └─────────────┘ └──────────────┘ └────────────────────┘         │
│  LangGraph orchestration · Intent classification · JSON output   │
└──────────────────────────┬───────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────┐
│  Evaluation (Phase 4)                                            │
│  LLM-as-a-Judge: Faithfulness 4.6/5 · Relevance 4.0/5            │
│  Context Relevance 4.2/5 · Overall 4.27/5                        │
│  MLflow experiment tracking · Retrieval metrics (MRR 0.29)       │
└──────────────────────────┬───────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────┐
│  Serving (Phase 5)                         Docker Compose        │
│  FastAPI · Streamlit · ChromaDB · MLflow · Prometheus · Grafana  │
│  6 services · One-command deployment · Health checks             │
└──────────────────────────────────────────────────────────────────┘

Quick Start

Option 1: Local Development

# Clone and setup
git clone https://github.com/Ahb1104/legallens.git
cd legallens
python -m venv .venv && source .venv/bin/activate
make setup

# Add your Groq API key (free at https://console.groq.com)
# Edit .env and set LEGALLENS_GROQ_API_KEY=gsk_your_key_here

# Run the full ML pipeline
python -m scripts.run_phase1 --no-benchmark    # ~4 min
make phase2-quick                               # ~30s
make phase3                                     # ~20s
make phase4-quick                               # ~50s

# Start the app
make phase5
# Streamlit UI:  http://localhost:8501
# API docs:      http://localhost:8000/docs

Option 2: Docker (Full Stack)

git clone https://github.com/Ahb1104/legallens.git
cd legallens
python -m venv .venv && source .venv/bin/activate
make setup

# Edit .env with your Groq API key
python -m scripts.run_phase1 --no-benchmark

# Launch all 6 services
docker compose up -d --build

# Streamlit UI:  http://localhost:8501
# API docs:      http://localhost:8000/docs
# Grafana:       http://localhost:3000  (admin / legallens)
# MLflow:        http://localhost:5001
# Prometheus:    http://localhost:9090

How to Use

Streamlit UI (http://localhost:8501)

Agent Mode — Type a natural language question:

| Query | Tool Used | What You Get |
|---|---|---|
| "Extract the non-compete clause" | Clause Extractor | Exact clause text + plain English summary |
| "What are the risks in termination provisions?" | Risk Analyzer | Risk list with severity + recommendations |
| "Summarize the key terms and obligations" | Summarizer | Executive summary, key terms, obligations |
| "What is the governing law in the Todos Medical agreement?" | Clause Extractor | Contract-specific clause with jurisdiction |
| "Is the distributor restricted from selling competing products?" | Clause Extractor | Cross-contract non-compete comparison |

Search Mode — Returns raw retrieved passages with relevance scores. Shows what the retriever finds before the LLM processes it.

REST API (http://localhost:8000)

# Full agent pipeline
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "Extract the non-compete clause", "top_k": 5}'

# Retrieval only (no LLM)
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "indemnification liability", "top_k": 10}'

# Health check
curl http://localhost:8000/health

Evaluation Results

Retrieval (Phase 2) — 41 Clause Types

| Method | MRR | Recall@1 | Recall@5 | Recall@10 | Recall@20 |
|---|---|---|---|---|---|
| BM25 only | 0.171 | 0.049 | 0.317 | 0.512 | 0.683 |
| Dense only | 0.175 | 0.073 | 0.317 | 0.439 | 0.610 |
| Hybrid (RRF) | 0.217 | 0.098 | 0.341 | 0.463 | 0.659 |
| Hybrid + Rerank | 0.290 | 0.171 | 0.415 | 0.561 | 0.659 |

Each layer adds measurable value. Hybrid + Rerank outperforms all individual methods.

Agent Quality (Phase 4) — LLM-as-a-Judge (1-5 scale)

| Metric | Score |
|---|---|
| Faithfulness (grounded in context) | 4.60 / 5 |
| Answer Relevance (addresses the question) | 4.00 / 5 |
| Context Relevance (right passages retrieved) | 4.20 / 5 |
| Overall | 4.27 / 5 |

End-to-End Pipeline (Full RAG) — 20 Clause Types

| Metric | Score |
|---|---|
| Answer Recall | 0.540 |
| Accuracy (Recall ≥ 0.5) | 55.0% |
| Answer F1 | 0.141 |
| Queries evaluated | 20 |

Top performing clauses: governing law (93% recall), termination for convenience (88%), no-solicit of employees (86%), audit rights (79%).

All metrics measured against expert lawyer annotations — 22,450 ground truth clause extractions labeled by trained law students and reviewed by experienced attorneys.

Note: F1 is low by design — the system generates analytical paragraphs while ground truth is short clause excerpts. Answer Recall is the primary metric for this analytical use case.
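Answer Recall here amounts to the fraction of ground-truth tokens that reappear in the generated answer. A simplified, set-based sketch (the project's exact tokenization may differ):

```python
def answer_recall(generated: str, gold: str) -> float:
    """Fraction of ground-truth answer tokens found in the generated answer."""
    gen = set(generated.lower().split())
    ref = set(gold.lower().split())
    return len(gen & ref) / len(ref) if ref else 0.0


# A long analytical answer can still score a perfect recall
# against a short ground-truth excerpt:
recall = answer_recall(
    "The agreement is governed by the laws of the State of Israel",
    "governed by the laws of Israel",
)
```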

Project Structure

legallens/
├── config/
│   └── settings.py              # Pydantic-settings config (all params via .env)
├── src/
│   ├── data/
│   │   ├── download.py          # CUAD dataset downloader (GitHub → SQuAD JSON)
│   │   └── chunker.py           # Recursive + semantic chunking with metadata
│   ├── indexing/
│   │   ├── bm25_index.py        # BM25Okapi sparse index with legal stopwords
│   │   └── chroma_index.py      # ChromaDB dense index (all-MiniLM-L6-v2)
│   ├── retrieval/
│   │   ├── hybrid.py            # RRF fusion + contract-aware filtering
│   │   ├── reranker.py          # Cross-encoder reranking (ms-marco-MiniLM)
│   │   └── evaluator.py         # Retrieval eval with query extraction + token F1
│   ├── agent/
│   │   ├── tools.py             # 3 tools with neighbor-chunk context
│   │   └── graph.py             # LangGraph agent with intent classification
│   ├── evaluation/
│   │   ├── llm_judge.py         # LLM-as-a-judge (faithfulness, relevance, context)
│   │   └── tracker.py           # MLflow experiment logging
│   └── api/
│       ├── main.py              # FastAPI with Prometheus metrics
│       └── streamlit_app.py     # Streamlit UI
├── scripts/
│   ├── run_phase1.py            # Data pipeline orchestrator
│   ├── run_phase2.py            # Retrieval evaluation
│   ├── run_phase3.py            # Agent demo
│   ├── run_phase4.py            # LLM-as-judge evaluation
│   └── run_phase5.py            # Local server launcher
├── tests/
│   └── test_chunker.py          # Unit tests
├── monitoring/
│   └── prometheus.yml           # Prometheus scrape config
├── data/                        # DVC-tracked (not in git)
├── Dockerfile
├── docker-compose.yml           # 6-service stack
├── Makefile
├── pyproject.toml
├── requirements.txt
└── README.md

Tech Stack

| Layer | Tool | Purpose |
|---|---|---|
| LLM | Groq API (Llama 3.3 70B) | Agent reasoning + evaluation |
| Embeddings | all-MiniLM-L6-v2 | Dense vector embeddings (384-dim) |
| Reranker | ms-marco-MiniLM-L-6-v2 | Cross-encoder reranking |
| Vector DB | ChromaDB | Persistent dense vector storage |
| Sparse Search | rank-bm25 | BM25Okapi keyword retrieval |
| Agent | LangGraph | Tool orchestration + state management |
| Evaluation | LLM-as-a-Judge | Faithfulness, relevance, context scoring |
| Tracking | MLflow | Experiment parameter/metric logging |
| Data Versioning | DVC | Git-based data versioning |
| API | FastAPI | REST API with Prometheus instrumentation |
| UI | Streamlit | Interactive query interface |
| Monitoring | Prometheus + Grafana | Latency and throughput dashboards |
| Containers | Docker Compose | One-command 6-service deployment |

Total cost: $0 — All tools are free tier or run locally.


Key Design Decisions

Hybrid retrieval — BM25 catches exact keyword matches (legal terms, party names). Dense embeddings catch semantic similarity ("covenant not to compete" ≈ "non-compete clause"). RRF fusion combines both without score normalization.
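RRF itself is only a few lines. A minimal sketch, assuming each retriever returns a ranked list of chunk IDs (`k = 60` is the conventional RRF constant, not necessarily this project's setting):

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_d).

    Works on ranks alone, so BM25 and cosine scores never need normalizing."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25 = ["c3", "c1", "c7"]   # keyword-ranked chunk IDs
dense = ["c1", "c9", "c3"]  # embedding-ranked chunk IDs
fused = rrf_fuse([bm25, dense])
# chunks ranked highly by BOTH retrievers rise to the top
```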

Contract-aware retrieval — When a query mentions a specific company name, the system detects this and filters retrieval to that contract's chunks only. Without this, generic clause queries return results from whichever contract has the strongest keyword match.

Neighbor chunk context — Legal clauses often span chunk boundaries. When the retriever finds a relevant chunk, the system includes the previous and next chunks from the same contract, eliminating sentence cutoff problems.
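A sketch of the expansion, assuming one contract's chunks are stored in document order (indices and text are illustrative):

```python
def with_neighbors(hit_ids: list[int], chunks: list[str]) -> list[str]:
    """Expand each retrieved chunk index with its previous and next neighbor,
    deduplicating overlaps and preserving document order."""
    keep: set[int] = set()
    for i in hit_ids:
        keep.update(j for j in (i - 1, i, i + 1) if 0 <= j < len(chunks))
    return [chunks[j] for j in sorted(keep)]


chunks = ["intro", "term begins", "termination clause", "notice period", "signatures"]
context = with_neighbors([2], chunks)
# → ["term begins", "termination clause", "notice period"]
```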

LLM-as-a-Judge — Scores faithfulness, relevance, and context quality on a 1-5 scale with reasoning. Simpler and more stable than RAGAS, using the same Groq API.
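The judge's per-example scores still need to be aggregated. A minimal sketch, assuming the judge is prompted to reply with a JSON object of three 1-5 scores (the field names are assumptions, not the project's exact schema):

```python
import json


def aggregate_judgements(raw_replies: list[str]) -> dict[str, float]:
    """Average per-dimension 1-5 scores across all judged examples."""
    dims = ("faithfulness", "answer_relevance", "context_relevance")
    totals = {d: 0.0 for d in dims}
    for raw in raw_replies:
        scores = json.loads(raw)  # one judge reply per evaluated query
        for d in dims:
            totals[d] += scores[d]
    return {d: round(totals[d] / len(raw_replies), 2) for d in dims}


replies = [
    '{"faithfulness": 5, "answer_relevance": 4, "context_relevance": 4}',
    '{"faithfulness": 4, "answer_relevance": 4, "context_relevance": 5}',
]
avg = aggregate_judgements(replies)
```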

Token overlap evaluation — CUAD ground truth answers average 400-600 chars but chunks are ~320 chars. Token-level F1 at 0.3 threshold handles answer-chunk boundary mismatches (same approach as SQuAD benchmarks).
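A simplified, set-based sketch of the thresholded token F1 (SQuAD-style scoring uses token multisets and answer normalization; this version only illustrates the idea):

```python
def token_f1(predicted: str, gold: str) -> float:
    """Token-level F1 between a retrieved chunk and a ground-truth span."""
    pred, ref = set(predicted.lower().split()), set(gold.lower().split())
    common = pred & ref
    if not common:
        return 0.0
    precision = len(common) / len(pred)
    recall = len(common) / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_hit(predicted: str, gold: str, threshold: float = 0.3) -> bool:
    """A chunk counts as relevant once token overlap clears the threshold,
    so answer/chunk boundary mismatches don't register as misses."""
    return token_f1(predicted, gold) >= threshold


hit = is_hit("governed by the laws of the State of Israel",
             "This Agreement shall be governed by the laws of Israel")
```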

Query extraction from CUAD templates — CUAD questions like "Highlight the parts related to 'Non-Compete'" get transformed to "non-compete clause" so BM25 gets useful signal instead of matching boilerplate.
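A minimal sketch of that transformation, assuming the clause name always appears in quotes in the CUAD template (the regex and fallback are illustrative):

```python
import re


def extract_query(cuad_question: str) -> str:
    """Turn a CUAD template question into a short keyword query.

    CUAD questions read "Highlight the parts ... related to 'X' ...";
    the quoted clause name is the only useful retrieval signal."""
    m = re.search(r"['\"]([^'\"]+)['\"]", cuad_question)
    if m:
        return f"{m.group(1).lower()} clause"
    return cuad_question  # fall back to the raw question


q = extract_query(
    "Highlight the parts (if any) of this contract related to 'Non-Compete'"
)
# → "non-compete clause"
```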


Dataset

CUAD (Contract Understanding Atticus Dataset) — NeurIPS 2021

  • 406 unique contracts from SEC EDGAR filings
  • 22,450 QA annotations by expert lawyers
  • 41 clause types (termination, indemnification, non-compete, IP, governing law, etc.)
  • 69,522 indexed chunks (512 chars, 128 overlap)
  • Licensed under CC BY 4.0
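The 512/128 figures imply a stride of 384 characters between windows. A toy sketch of that arithmetic (the project's actual chunker is recursive and metadata-aware; this shows only the size/overlap relationship):

```python
def sliding_chunks(text: str, size: int = 512, overlap: int = 128) -> list[str]:
    """Fixed-size character windows; each window shares `overlap` chars
    with its neighbor, so the stride is size - overlap."""
    stride = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), stride)]


doc = "x" * 1000
parts = sliding_chunks(doc)
# windows start at 0, 384, 768 — three chunks for a 1000-char document
```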

Docker Services

| Service | Port | Purpose |
|---|---|---|
| api | 8000 | FastAPI REST API |
| streamlit | 8501 | Web UI |
| chromadb | 8100 | Vector database |
| mlflow | 5001 | Experiment tracking |
| prometheus | 9090 | Metrics collection |
| grafana | 3000 | Monitoring dashboards (admin / legallens) |

docker compose up -d --build    # Start all
docker compose ps               # Check status
docker compose logs -f api      # Follow API logs
docker compose down             # Stop all

Makefile Commands

make setup           # Install deps + init DVC
make phase1          # Data pipeline (download + chunk + index)
make phase2          # Retrieval evaluation
make phase3          # Agent demo (3 queries)
make phase4          # LLM-as-judge evaluation
make phase5          # Start FastAPI + Streamlit locally
make docker-up       # Start Docker stack
make docker-down     # Stop Docker stack
make test            # Run unit tests
make lint            # Run ruff linter

Future Improvements

  • Contract selector UI — Dropdown to pick a specific contract before querying
  • PDF upload — Allow users to upload their own contracts for analysis
  • Multi-contract comparison — "Compare indemnification across all software agreements"
  • Fine-tuned embeddings — Domain-specific legal embeddings for better retrieval
  • Semantic chunking — Embedding similarity breakpoints instead of fixed-size splits
  • Output guardrails — JSON validation and hallucination detection

Citation

@article{hendrycks2021cuad,
  title={CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review},
  author={Dan Hendrycks and Collin Burns and Anya Chen and Spencer Ball},
  journal={NeurIPS},
  year={2021}
}

License

MIT

About

End-to-end Agentic RAG system for legal contract analysis — hybrid retrieval (BM25 + dense + RRF), cross-encoder reranking, LangGraph agent with 3 tools, LLM-as-a-judge evaluation, FastAPI + Streamlit + Docker Compose. Built on CUAD (406 contracts, 22K expert annotations). Zero cost — all free-tier tools.
