Who is this for? You don't need a pharma background to understand this. If you've ever Googled "is it safe to take drug X with drug Y?" and gotten a confusing wall of text, this project automates that research — but at a clinical level, across thousands of medical documents, in under 60 seconds.
A hospital pharmacist or pharmaceutical company safety officer is asked: "Is it safe for a 68-year-old patient with kidney disease to take metformin, lisinopril, and warfarin together?"
To answer this responsibly, they must:
- Search FDA's DailyMed for each drug's label (warnings, dosing adjustments, contraindications)
- Cross-reference drug-drug interaction databases
- Check medical literature (PubMed) for published studies on this combination
- Review clinical guidelines for kidney disease-specific recommendations
- Synthesize all findings into a safety assessment
This takes 2–4 hours per query. In a hospital setting, dozens of these reviews happen daily. At pharmaceutical companies, safety officers review thousands of drug combinations for trial design, regulatory submissions, and adverse event monitoring.
- Speed: Delayed reviews slow clinical trials, drug approvals, and patient care
- Inconsistency: Different analysts reading the same documents reach different conclusions
- Scale: The number of unique drug pairs grows quadratically: 10 drugs yield 45 pairs, and 100 drugs yield 4,950
- Hallucination risk: Off-the-shelf AI (like ChatGPT) can confidently invent drug interactions that don't exist, and in healthcare a wrong answer can harm patients
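The pair counts in the Scale bullet come from the binomial coefficient n(n-1)/2. A quick check, where the function name `unique_pairs` is purely illustrative:

```python
from math import comb


def unique_pairs(n_drugs: int) -> int:
    """Number of unordered pairs that can be formed from n_drugs distinct drugs."""
    return comb(n_drugs, 2)


print(unique_pairs(10))   # 45
print(unique_pairs(100))  # 4950
```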
PharmAgent is an Agentic RAG (Retrieval-Augmented Generation) system that autonomously:
- Decomposes complex drug safety questions into targeted sub-queries
- Retrieves evidence from three distinct medical knowledge bases simultaneously
- Grades every retrieved document for relevance (discards irrelevant chunks)
- Self-corrects when retrieval fails by rewriting the query
- Synthesizes a structured safety assessment with citations
- Verifies its own answer against source documents before returning it
Result: Drug safety reviews in under 60 seconds, with every claim cited to a source document.
A standard retrieve-then-answer pipeline (vanilla RAG) fails on this problem in several ways:
| Problem | Why vanilla RAG fails | How PharmAgent solves it |
|---|---|---|
| Multi-source reasoning | Single retrieve-then-answer pass can't cross-reference 3 databases | Routes sub-queries to the right knowledge base |
| Hallucination | LLMs invent plausible-sounding interactions | Hallucination grader rejects uncited claims |
| Query complexity | "Is it safe for a patient with X condition taking Y and Z?" requires parallel reasoning | Query decomposition breaks it into answerable sub-questions |
| Out-of-vocabulary drugs | AI guesses when it doesn't know | Hard rejection with explicit warning for unknown drugs |
| False premises | "What dose reduction prevents serotonin syndrome from warfarin + lisinopril?" (serotonin syndrome is impossible here) | Agent detects and rejects the false premise rather than inventing an answer |
User Query
│
▼
┌─────────────────────┐
│ Node 1: Analyze │ Classifies query type, identifies drugs,
│ & Route │ selects knowledge bases to search
└─────────┬───────────┘
│ valid query invalid query ──► REJECT (with reason)
▼
┌─────────────────────┐
│ Node 2: Retrieve │ Hybrid search (BM25 + dense vectors) across
│ │ FDA labels, PubMed, clinical guidelines
└─────────┬───────────┘
▼
┌─────────────────────┐
│ Node 3: Grade Docs │ LLM scores each chunk for relevance
│ │ Irrelevant chunks discarded
└─────────┬───────────┘
│ enough relevant docs?
│ NO ──► ┌──────────────────┐
│ │ Node 4: Rewrite │ ──► back to Node 2 (max 2 retries)
│ └──────────────────┘
│ YES
▼
┌─────────────────────┐
│ Node 5: Generate │ Llama 3.3 70B synthesizes structured
│ │ safety assessment with inline citations
└─────────┬───────────┘
▼
┌─────────────────────┐
│ Node 6: Check │ Verifies: (a) every claim is in source docs
│ Hallucination │ (b) answer actually addresses the question
└─────────┬───────────┘
│ failed? ──► back to Node 5 (max 1 retry)
│ passed?
▼
Final Safety Assessment
(Risk Level · Evidence · Contraindications · Monitoring · Citations)
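The control flow in the diagram above can be sketched in plain Python. This is not the project's LangGraph code: the node functions are injected as callables, and `Nodes`, `run_agent`, and `MIN_RELEVANT_DOCS` are hypothetical names. The diagram does not say what happens once the retries are exhausted, so this sketch assumes the agent proceeds with whatever it has.

```python
from dataclasses import dataclass
from typing import Callable

MAX_REWRITES = 2        # "max 2 retries" on the rewrite loop
MAX_REGENERATIONS = 1   # "max 1 retry" after a failed hallucination check
MIN_RELEVANT_DOCS = 1   # assumption: the actual threshold is not stated


@dataclass
class Nodes:
    analyze: Callable[[str], tuple[bool, str]]     # -> (is_valid, reason)
    retrieve: Callable[[str], list[str]]           # -> candidate chunks
    grade: Callable[[str, list[str]], list[str]]   # -> relevant chunks only
    rewrite: Callable[[str], str]                  # -> reformulated query
    generate: Callable[[str, list[str]], str]      # -> draft assessment
    check: Callable[[str, str, list[str]], bool]   # -> True if grounded


def run_agent(query: str, nodes: Nodes) -> str:
    # Node 1: reject invalid queries up front, with a reason.
    valid, reason = nodes.analyze(query)
    if not valid:
        return f"REJECTED: {reason}"

    # Nodes 2-4: retrieve, grade, and rewrite the query if too little survives.
    current = query
    relevant: list[str] = []
    for attempt in range(MAX_REWRITES + 1):
        relevant = nodes.grade(current, nodes.retrieve(current))
        if len(relevant) >= MIN_RELEVANT_DOCS:
            break
        if attempt < MAX_REWRITES:
            current = nodes.rewrite(current)

    # Nodes 5-6: generate, verify, and regenerate at most once on failure.
    answer = nodes.generate(query, relevant)
    for _ in range(MAX_REGENERATIONS):
        if nodes.check(query, answer, relevant):
            break
        answer = nodes.generate(query, relevant)
    return answer
```

Bounding both loops with explicit constants keeps the worst-case number of LLM calls per query fixed, which matters when running against a rate-limited free tier.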
| Source | What it contains | Why we use it |
|---|---|---|
| FDA DailyMed | Official drug package inserts (~150,000 drugs) | Ground truth for warnings, dosing, contraindications |
| PubMed (MedRAG) | 23.9M biomedical research snippets | Published evidence for interactions and adverse events |
| StatPearls | 9,330 clinical reference articles | Evidence-based clinical guidelines and protocols |
Each knowledge base uses hybrid search combining:
- BM25 (keyword matching): catches exact drug name and medical term matches
- Dense vector search (semantic similarity via all-MiniLM-L6-v2 embeddings in ChromaDB): catches conceptual matches even when exact words differ
- Reciprocal Rank Fusion (RRF): merges both result lists into a unified ranking
- Cross-encoder reranking: final precision pass before documents reach the grader
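The Reciprocal Rank Fusion step in the list above is small enough to sketch directly. Each document scores 1/(k + rank) in every list it appears in, and the scores are summed. The constant k = 60 is the commonly used default and an assumption here, since the project's value is not stated. The document IDs are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["warfarin_label", "aspirin_label", "metformin_label"]
dense_hits = ["aspirin_label", "fish_oil_study", "warfarin_label"]
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
# ['aspirin_label', 'warfarin_label', 'fish_oil_study', 'metformin_label']
```

Because RRF uses only ranks, it needs no calibration between BM25 scores and cosine similarities, which live on different scales.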
| Task | Model | Reason |
|---|---|---|
| Query classification, grading, hallucination check | Llama 3.1 8B (Groq) | Fast, cheap, sufficient for structured binary decisions |
| Safety assessment synthesis | Llama 3.3 70B (Groq) | Maximum reasoning capability for the critical generation step |
Total cost: $0. Everything runs on Groq's free tier.
We tested PharmAgent against 5 complex pharmacovigilance scenarios designed to stress-test the system. Target drugs: Aspirin, Lisinopril, Metformin, Semaglutide, Warfarin.
| Test Case | Query Type | Result | Risk Level Returned | Notes |
|---|---|---|---|---|
| Semaglutide + Warfarin (pharmacokinetic mechanism) | Multi-hop reasoning | Safe Failure | MODERATE (50%) | Did NOT hallucinate the mechanism. Correctly admitted missing data. |
| Aspirin + Lisinopril post-MI with CKD | Guideline contradiction | Partial Success | MODERATE (60%) | Identified drug clash; missed guideline retrieval |
| Lisinopril + Warfarin → "Serotonin Syndrome" | False premise detection | Complete Success | Rejected false premise | Refused to invent a non-existent interaction |
| Metformin + Lisinopril + Dehydration (lactic acidosis risk) | Environmental trigger → biochemical cascade | Strong Success | HIGH (80%) | Correctly connected dehydration → AKI → lactic acidosis |
| Aspirin + Warfarin + Fish Oil + OTC cold meds (triple threat) | Multi-part out-of-vocabulary | Safe Failure | HIGH (40%) with OOV warning | Correctly flagged unknown ingredients; flagged GI bleed risk |
What works exceptionally well:
- Hallucination prevention: The system refuses to invent medical mechanisms when source documents don't support them. In the false-premise test (serotonin syndrome), it correctly rejected the question rather than fabricating an answer.
- Safety-first failure mode: When the system can't fully answer, it returns a conservative risk level with explicit uncertainty rather than a confident wrong answer.
- Out-of-vocabulary handling: Unknown drugs trigger explicit warnings ("This drug is NOT in the knowledge base") instead of silent guesses.
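A deterministic out-of-vocabulary check like the one described above might look as follows. This is a minimal sketch and not the contents of safety_guardrails.py. It assumes drug names have already been extracted from the query by an upstream step, and `KNOWN_DRUGS`, `find_unknown_drugs`, and `oov_warnings` are hypothetical names.

```python
# The five drugs covered by the demo knowledge base.
KNOWN_DRUGS = frozenset({"aspirin", "lisinopril", "metformin", "semaglutide", "warfarin"})


def find_unknown_drugs(mentioned: list[str], known: frozenset[str] = KNOWN_DRUGS) -> list[str]:
    """Return the mentioned drug names that are not in the knowledge base."""
    return [name for name in mentioned if name.strip().lower() not in known]


def oov_warnings(mentioned: list[str]) -> list[str]:
    """Build one explicit warning per unknown drug instead of guessing silently."""
    return [
        f'"{name}": This drug is NOT in the knowledge base'
        for name in find_unknown_drugs(mentioned)
    ]


print(oov_warnings(["Warfarin", "Aspirin", "Fish Oil"]))
# ['"Fish Oil": This drug is NOT in the knowledge base']
```

Keeping this check outside the LLM means an unknown drug triggers a warning regardless of how the model responds.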
Current limitations:
- Multi-hop decomposition: Some queries need the agent to first look up a mechanism (e.g., "semaglutide slows gastric emptying") and then apply it to a second drug interaction. This requires explicit query decomposition, which the current router doesn't always perform.
- 5-drug scope: The demo knowledge base covers only 5 drugs. Scaling to the full DailyMed library requires production infrastructure.
| Component | Technology | Why |
|---|---|---|
| Agent orchestration | LangGraph | Widely used for stateful, cyclic AI workflows with checkpointing |
| LLM inference | Groq API (free tier) | Sub-second latency; Llama 3.3 70B quality at zero cost |
| Vector database | ChromaDB | Zero-config, embedded, Apache 2.0 licensed |
| Keyword search | rank-bm25 | Complementary to dense retrieval; critical for exact drug name matching |
| Embeddings | all-MiniLM-L6-v2 (sentence-transformers) | Free local inference; biomedical text performance sufficient for this scope |
| Drug label data | DailyMed (NLM) | Official FDA-approved drug package inserts |
| Literature data | MedRAG/PubMed (HuggingFace) | Pre-processed for RAG; 23.9M snippets |
| Clinical guidelines | MedRAG/StatPearls (HuggingFace) | Peer-reviewed clinical reference |
| Evaluation | RAGAS + DeepEval | Open-source RAG evaluation; measures faithfulness, context precision |
| UI | Streamlit | Rapid deployment; free hosting on Community Cloud |
| Observability | Langfuse (self-hosted) | Traces every agent decision, retrieval, and generation step |
| Configuration | pydantic-settings | Type-safe environment variable management |
| Logging | structlog | Structured JSON logs for every agent node decision |
- Python 3.11+
- Groq API key (free, no credit card required)
git clone https://github.com/<your-username>/pharma-agent.git
cd pharma-agent
pip install -e ".[dev]"
cp .env.example .env
# Add your GROQ_API_KEY to .env
python scripts/ingest_demo.py
This downloads and indexes FDA labels for Aspirin, Lisinopril, Metformin, Semaglutide, and Warfarin into ChromaDB and BM25 stores.
streamlit run pharmagent/ui/app.py
pip install -e ".[eval]"
python pharmagent/evaluation/run_eval.py
pharmagent/
├── agent/
│ ├── graph.py # LangGraph state graph definition (the 6-node loop)
│ ├── nodes.py # Each node's logic (analyze, retrieve, grade, rewrite, generate, check)
│ ├── state.py # AgentState TypedDict
│ └── llm.py # LLM clients + token budget tracker
├── core/
│ ├── hybrid_retriever.py # BM25 + ChromaDB fusion with cross-encoder reranking
│ ├── document_grader.py # LLM-based relevance scoring
│ ├── query_rewriter.py # Query reformulation for failed retrievals
│ ├── synthesizer.py # Safety assessment generation
│ ├── hallucination_checker.py # Faithfulness + answer relevance verification
│ ├── safety_guardrails.py # Deterministic guardrails (OOV detection, risk escalation)
│ ├── schemas.py # SafetyAssessment Pydantic model
│ ├── vectorstore.py # ChromaDB client
│ ├── bm25_store.py # BM25 index persistence
│ └── embeddings.py # Sentence-transformer embeddings
├── ingestion/
│ ├── dailymed.py # DailyMed XML download and parsing
│ ├── medrag_loader.py # PubMed + StatPearls HuggingFace dataset loader
│ ├── chunker.py # Text chunking for drug labels
│ └── build_index.py # Full index build pipeline
├── evaluation/
│ ├── golden_set.py # Ground truth Q&A pairs
│ └── run_eval.py # RAGAS + DeepEval evaluation runner
└── ui/
└── app.py # Streamlit interface
scripts/
└── ingest_demo.py # One-command demo data ingestion
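The tree lists schemas.py as holding a SafetyAssessment Pydantic model. Its real fields are not shown here, so the following is only a guess at its shape, based on the sections named in the final output (Risk Level, Evidence, Contraindications, Monitoring, Citations). It uses a standard-library dataclass as a stand-in for Pydantic, and the `LOW` level and the citation rule are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class RiskLevel(str, Enum):
    LOW = "LOW"            # assumption: only MODERATE and HIGH appear in the results
    MODERATE = "MODERATE"
    HIGH = "HIGH"


@dataclass
class SafetyAssessment:
    risk_level: RiskLevel
    evidence: list[str]
    contraindications: list[str] = field(default_factory=list)
    monitoring: list[str] = field(default_factory=list)
    citations: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Hypothetical rule mirroring "every claim cited to a source document".
        if self.evidence and not self.citations:
            raise ValueError("evidence must be backed by at least one citation")
```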
Try these in the Streamlit UI:
- "What are the contraindications for metformin in patients with renal impairment?"
- "What are the risks of taking warfarin and aspirin together?"
- "Is semaglutide safe for a patient with a history of pancreatitis?"
- "Is metformin safe for a 68-year-old patient with stage 3 CKD who is also taking lisinopril and warfarin?"
- "A patient taking metformin develops severe vomiting and dehydration — what are the immediate biochemical dangers?"
| Metric | Target | What it measures |
|---|---|---|
| Review time per query | < 60 seconds | Speed vs. 2–4 hour manual baseline |
| Retrieval Precision@10 (RAGAS) | > 0.75 | Are the right documents retrieved? |
| Faithfulness score (RAGAS) | > 0.90 | Are all claims grounded in source documents? |
| Hallucination rate | < 6% | How often does the system invent unsupported facts? |
| Multi-hop accuracy | > 85% | Complex multi-drug queries answered correctly vs. pharmacist ground truth |
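Two of the metrics in the table reduce to simple ratios once relevance and claim support have been labeled. The sketch below uses the plain textbook definitions. RAGAS computes its own LLM-judged variants, which may differ, and both function names are illustrative.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved document IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def hallucination_rate(claim_supported: list[bool]) -> float:
    """Fraction of generated claims NOT supported by any source document."""
    if not claim_supported:
        return 0.0
    return claim_supported.count(False) / len(claim_supported)


print(precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=4))  # 0.5
print(hallucination_rate([True, True, False, True]))          # 0.25
```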
This tool is designed for:
- Hospital pharmacists conducting medication reconciliation
- Pharmaceutical safety officers reviewing adverse event signals
- Clinical researchers evaluating drug combinations for trial design
- Regulatory affairs teams preparing FDA submissions
MIT