# VERIFAI — Verified Evidence-based Radiology Interpretive Framework with Agentic Intelligence

`Python` · `PyTorch` · `CUDA` · `Next.js` · `MedGemma` · `LangGraph`

A multi-agent AI system for chest X-ray interpretation that produces clinically trustworthy, evidence-backed diagnoses with built-in safety guardrails, adversarial debate, and human-in-the-loop review.


VERIFAI orchestrates nine specialized AI agents through a LangGraph state machine to analyze chest X-rays, cross-reference patient history and medical literature, debate diagnostic uncertainty, and produce auditable diagnoses — all runnable on a single 24 GB+ GPU (80 GB A100 recommended).




## Architecture Overview

```mermaid
flowchart TD
    A[🩻 Chest X-ray Input] --> B[Radiologist Agent\nMedGemma 4B-IT + MedSigLIP]
    B --> C[CheXbert Agent\nStructured 14-label extraction]
    C --> D{Parallel Evidence Gathering}
    D --> E[Historian Agent\nFHIR + DuckDB + FAISS]
    D --> F[Literature Agent\nPubMed + Europe PMC]
    E --> G[Critic Agent\nAdversarial verification]
    F --> G
    G --> H[Debate\nDempster-Shafer fusion\nUp to 3 rounds]
    H --> I[Validator\nCXR-RePaiR + RadGraph + Rules Engine]
    I --> J{Verdict}
    J -- FINALIZE --> K[✅ Final Diagnosis\nSHA-256 Hash + Evidence Packet]
    J -- FLAG_FOR_HUMAN --> L[🔵 Human-in-the-Loop Review\nApprove / Reject + Re-enter at Critic]
    L -- Approved --> K
    L -- Rejected --> G
```

## Agent Pipeline

| # | Agent | Role | Model / Tool | Output |
|---|-------|------|--------------|--------|
| 1 | Radiologist | Analyze CXR images, generate free-text findings & impression | MedGemma 4B-IT (base) + MedSigLIP fine-tuned classifier + LRP heatmaps | Findings, impression, disease probabilities, attention heatmaps |
| 2 | CheXbert | Extract 14 structured pathology labels from the report text | f1chexbert BERT model | Present / absent / uncertain per label |
| 3 | Historian | Retrieve supporting & contradicting patient history from EHR | DuckDB + FAISS vector search + shared MedGemma | Clinical facts, FHIR resource IDs, clinical summary |
| 4 | Literature | Search PubMed / Europe PMC for biomedical evidence | BioPython E-Utilities + Europe PMC REST API | Ranked citations, evidence synthesis |
| 5 | Critic | Adversarial validation — detect overconfidence, surface past mistakes | Rule-based analysis + Past-Mistakes FAISS | Safety score, concern flags, uncertainty delta |
| 6 | Debate | Multi-round structured debate with Dempster-Shafer fusion | LangGraph orchestration (up to 3 rounds) | Consensus direction, confidence adjustment |
| 7 | Validator | Three-layer quality gate before finalizing | MedSigLIP FAISS (visual) + RadGraph NLP + Rules Engine | FINALIZE / FINALIZE_LOW_CONFIDENCE / FLAG_FOR_HUMAN |
| 8 | Finalize | Build the final diagnosis with a reproducibility hash | SHA-256 + Pydantic | FinalDiagnosis (diagnosis, confidence, hash, evidence packet) |
| 9 | Human Review | Doctor approves or rejects with feedback | LangGraph interrupt() | Approve → finalize; Reject → re-enter pipeline at Critic |

## Pipeline Sequence

```mermaid
sequenceDiagram
    participant U as Clinician
    participant R as Radiologist
    participant C as CheXbert
    participant H as Historian
    participant L as Literature
    participant Cr as Critic
    participant D as Debate
    participant V as Validator
    participant F as Finalize

    U->>R: Upload CXR (AP/PA/Lateral)
    R-->>C: Free-text findings + impression
    C-->>H: 14 structured pathology labels
    C-->>L: 14 structured pathology labels
    par Evidence Gathering (parallel)
        H-->>Cr: FHIR clinical facts
        L-->>Cr: PubMed citations
    end
    Cr-->>D: Safety score + concerns
    D-->>V: Consensus + adjusted confidence
    V-->>F: Recommendation
    F-->>U: Final diagnosis + SHA-256 hash
    alt Flagged for Human Review
        F-->>U: Review panel — Approve or Reject
        U-->>Cr: Feedback injected, pipeline re-runs
    end
```

## Monotonic Uncertainty Cascade (MUC)

Every agent updates a single shared value `system_uncertainty` ∈ [0.05, 0.95] through Bayesian log-odds updates. This gives the system a live, cascading confidence measure that rises when agents disagree and falls when they confirm each other.

Core update rule:

```
IG(k)    = α · confidence(k) + β · alignment(k) + γ · direction(k)
U_sys(k) = clamp( U_sys(k−1) − IG(k), 0.05, 0.95 )
```

`U_sys` starts at 1.0 (maximum uncertainty) and is clamped into [0.05, 0.95] from the first update onward; confirming evidence drives it down. A contradiction (direction = −1) or low confidence shrinks IG(k) — or makes it negative — keeping uncertainty high.
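
A minimal Python sketch of one MUC step, with illustrative weights α, β, γ (the real values and the full derivation live in `uncertainty/muc.py` and `docs/PRINCIPLED_UNCERTAINTY.md`):

```python
def muc_update(u_prev: float, confidence: float, alignment: float,
               direction: int, alpha: float = 0.3, beta: float = 0.3,
               gamma: float = 0.2) -> float:
    """One MUC step: subtract the information gain, then clamp to [0.05, 0.95].

    direction is +1 for confirming evidence and -1 for contradicting evidence.
    The weights alpha/beta/gamma here are illustrative, not VERIFAI's tuned values.
    """
    info_gain = alpha * confidence + beta * alignment + gamma * direction
    return max(0.05, min(0.95, u_prev - info_gain))


u_sys = 1.0  # maximum uncertainty before any evidence arrives
u_sys = muc_update(u_sys, confidence=0.9, alignment=0.8, direction=+1)  # confirming → U falls
u_sys = muc_update(u_sys, confidence=0.3, alignment=0.1, direction=-1)  # weak contradiction → U rises
print(f"U_sys = {u_sys:.2f}")
```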

```mermaid
graph LR
    A["U_sys = 1.0\nat start"] --> B[Radiologist\naligns with image → ↓ U]
    B --> C[CheXbert\nlabel match → ↓ U]
    C --> D[Historian/Literature\ncontradiction → ↑ U]
    D --> E[Critic\npast mistake found → ↑ U]
    E --> F[Debate consensus → ↓ U]
    F --> G["U_sys final\n(shown in UI chart)"]
```
| Agent | Confirming Signal | Contradicting Signal |
|-------|-------------------|----------------------|
| Radiologist | High MedSigLIP disease probability | Low classification score |
| CheXbert | Label matches impression | Label absent in free text |
| Historian | FHIR conditions support the diagnosis | Contradicting clinical facts |
| Literature | Evidence corroborates findings | Literature contradicts impression |
| Critic | No overconfidence / no past mistakes | Overconfidence flag raised |
| Debate | All agents reach consensus | Agents violently disagree (K ≥ 0.99) |
| Validator | High visual retrieval similarity + entity match | Weak entity F1 / retrieval consensus mismatch |

The Debate stage uses Dempster-Shafer evidence fusion. When agents strongly conflict (K ≥ 0.99), the system remains uncertain rather than picking a winner arbitrarily.
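
A toy sketch of the fusion rule for two agents over singleton hypotheses, illustrating why the system stays uncertain at K ≥ 0.99 (a simplified stand-in, not the Debate agent's actual implementation):

```python
def dempster_combine(m1: dict[str, float], m2: dict[str, float],
                     conflict_cap: float = 0.99) -> dict[str, float] | None:
    """Dempster's rule restricted to singleton hypotheses: only identical
    hypotheses reinforce each other; everything else is conflict mass K."""
    hypotheses = set(m1) | set(m2)
    agreement = {h: m1.get(h, 0.0) * m2.get(h, 0.0) for h in hypotheses}
    k = 1.0 - sum(agreement.values())  # total conflict between the two sources
    if k >= conflict_cap:
        return None  # near-total conflict: stay uncertain instead of picking a winner
    return {h: mass / (1.0 - k) for h, mass in agreement.items()}


# Two agents that mostly agree: the fused belief sharpens.
print(dempster_combine({"pneumonia": 0.8, "normal": 0.2},
                       {"pneumonia": 0.7, "normal": 0.3}))
# Two agents in total opposition: K = 1.0, so the system stays uncertain.
print(dempster_combine({"pneumonia": 1.0}, {"normal": 1.0}))
```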

The uncertainty history is streamed live to the frontend and rendered as an SVG line chart in the results page.


## Key Features

### Core

- **Multi-Agent Orchestration** — 9 specialized agents coordinated via a LangGraph state machine with typed state
- **Multi-View CXR Support** — Accepts AP, PA, and Lateral views simultaneously via `<PA>`, `<AP>`, `<LATERAL>` tokens
- **MUC Uncertainty Framework** — Real-time, bidirectional cascading system entropy across the full pipeline
- **Multi-Agent Debate** — Up to 3 Dempster-Shafer-fused debate rounds before consensus
- **Human-in-the-Loop** — LangGraph interrupt-based doctor review; rejected diagnoses re-enter at the Critic with full context

### Safety & Trust

- **Medical Safety Guardrails** — Rule-based + embedding-based checks for critical finding hallucinations
- **Reproducibility Hash** — SHA-256 fingerprint (image + FHIR + config) for an FDA 21 CFR Part 11 audit trail
- **Past Mistakes Memory** — DuckDB HNSW vector index stores rejected diagnoses; the Critic retrieves similar past errors via neural re-ranking (temporal recency + clinical relevance scoring)
- **CheXbert Cross-Validation** — Structured labels act as a second opinion, independent of the Radiologist's free text
- **Validator Quality Gate** — Three layers: visual FAISS retrieval (MedSigLIP), RadGraph NLP entity matching, clinical rules engine

### Infrastructure

- **SSE Real-Time Streaming** — Live agent progress streamed from backend to frontend via Server-Sent Events
- **LRP Heatmaps** — Transformer explainability via Chefer et al. (CVPR 2021) Layer-wise Relevance Propagation for MedSigLIP
- **Observability Dashboard** — Prometheus-style metrics (latency, confidence, safety scores, per-agent duration)
- **Evidence Report Generator** — HTML reports with citations, heatmaps, and a full audit trail
- **Shared Model Loader** — Historian, Literature, and Critic share a single thread-safe MedGemma 4B instance (~9 GB VRAM total vs ~27 GB otherwise); see the sketch below
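
A minimal sketch of the double-checked-locking singleton pattern such a loader can use; names here are hypothetical, and the real implementation is `app/shared_model_loader.py`:

```python
import threading
from typing import Any


def _load_medgemma() -> Any:
    """Stand-in for the expensive load; the real loader pulls
    google/medgemma-1.5-4b-it from Hugging Face in FP16 (~9 GB VRAM)."""
    return object()  # placeholder for the loaded model handle


class SharedMedGemma:
    _instance: Any = None
    _lock = threading.Lock()

    @classmethod
    def get(cls) -> Any:
        # Double-checked locking: the fast path avoids the lock entirely,
        # while the locked re-check guarantees the load happens exactly once
        # even when Historian, Literature, and Critic race at startup.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = _load_medgemma()
        return cls._instance
```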

## Project Structure

```
VERIFAI/
├── agents/                      # AI Agent implementations
│   ├── radiologist/             # MedGemma 4B-IT VLM + MedSigLIP classifier + LRP
│   │   ├── model.py             # Model loading (FP16) + VLM inference
│   │   ├── agent.py             # Radiologist agent logic
│   │   ├── classifier.py        # MedGemmaVisionHead: frozen SigLIP + trainable head
│   │   ├── lrp.py               # Chefer et al. (CVPR 2021) LRP for SigLIP
│   │   └── prompts.py           # Structured JSON generation prompts
│   ├── chexbert/                # CheXbert structured labeling
│   │   ├── agent.py             # Extract 14 CXR condition labels
│   │   └── model.py             # f1chexbert BERT model wrapper + transformers 5.x patch
│   ├── historian/               # FHIR patient history retrieval + clinical reasoning
│   │   ├── agent.py             # DuckDB + FAISS vector search orchestrator
│   │   ├── fhir_client.py       # FHIR R4 resource parser + FAISS retriever
│   │   └── reasoner.py          # Clinical reasoning synthesizer (via shared MedGemma)
│   ├── literature/              # PubMed / Europe PMC search
│   │   ├── agent.py             # Literature search orchestrator
│   │   ├── pubmed_entrez.py     # BioPython E-Utilities wrapper
│   │   ├── europe_pmc.py        # Europe PMC REST API
│   │   └── rate_limiter.py      # Adaptive rate limiter (3 req/s NCBI)
│   ├── critic/                  # Adversarial verification
│   │   ├── agent.py             # Overconfidence detection + past mistakes retrieval
│   │   └── model.py             # Rule-based linguistic certainty analysis
│   ├── debate/                  # Multi-agent debate protocol
│   │   └── agent.py             # 3-round Dempster-Shafer debate
│   ├── validator/               # Final quality gate
│   │   ├── agent.py             # Validator orchestrator
│   │   ├── retrieval_tool.py    # CXR-RePaiR: MedSigLIP FAISS visual retrieval
│   │   ├── radgraph_tool.py     # RadGraph NLP entity matching + transformers 5.x patch
│   │   └── rules_engine.py      # Clinical rules engine
│   └── feedback/                # Doctor feedback handler
│       └── agent.py             # Rejection → re-enter pipeline at Critic
│
├── graph/                       # LangGraph workflow
│   ├── state.py                 # VerifaiState TypedDict + Pydantic models
│   ├── workflow.py              # Full graph + node wrappers + interrupt()
│   └── router.py                # Uncertainty-based routing logic
│
├── app/                         # FastAPI backend
│   ├── main.py                  # App entry point + middleware
│   ├── api.py                   # REST endpoints (start, status, resume, SSE)
│   ├── config.py                # Settings (models, thresholds, feature flags)
│   ├── streaming.py             # SSE event bus
│   └── shared_model_loader.py   # Thread-safe MedGemma singleton
│
├── frontend/                    # Next.js 14 dashboard (TypeScript + Tailwind CSS)
│   └── src/app/
│       ├── page.tsx             # Landing page — "The Council"
│       ├── diagnose/page.tsx    # Upload X-ray + start workflow
│       ├── results/[id]/page.tsx # Live results + SSE agent feed + HITL review
│       └── observability/page.tsx # Metrics dashboard
│
├── db/                          # Database layer
│   ├── logger.py                # Session-scoped structured logging (SQLite & Supabase)
│   ├── connection.py            # SQLite connection pool + schema
│   ├── adapter.py               # Database adapter (sqlite / supabase)
│   ├── past_mistakes.py         # Past Mistakes DB (DuckDB + HNSW vector)
│   └── rerank_mistakes.py       # Neural re-ranking (temporal decay + clinical relevance)
│
├── uncertainty/                 # MUC framework
│   ├── muc.py                   # Monotonic Uncertainty Cascade (Information Gain)
│   └── kle.py                   # KL-divergence Epistemic uncertainty
│
├── safety/                      # Medical safety guardrails
│   └── guardrails.py
│
├── monitoring/                  # Observability
│   └── metrics.py               # Prometheus-style counters + histograms
│
├── utils/
│   ├── evidence_report.py       # HTML evidence report generator
│   └── inference.py             # Robust JSON extraction from LLM output
│
├── tests/
│   ├── test_workflow.py         # Primary end-to-end integration test
│   └── ...                      # Additional unit tests
│
├── scripts/
│   ├── build_retrieval_index.py # Build FAISS index from MIMIC-CXR
│   ├── install_radgraph_model.py# RadGraph model first-time setup
│   └── seed_pb.py               # Seed past mistakes database
│
├── train_classifier.py          # MedSigLIP classifier training
├── qlora_medgemma.py            # QLoRA fine-tuning pipeline (optional)
└── docs/
    └── PRINCIPLED_UNCERTAINTY.md # Full MUC derivation
```

## Prerequisites

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| Python | 3.10 | 3.10 |
| CUDA | 12.1 | 12.6 |
| GPU VRAM | 24 GB | 80 GB (A100) |
| RAM | 32 GB | 64 GB |
| Disk | 30 GB | 60 GB |
| Node.js | 18 | 20+ |
| OS | Ubuntu 22.04 | Ubuntu 22.04+ |

Required accounts:

- **Hugging Face** — Token for gated models (`google/medgemma-1.5-4b-it`, `google/medsiglip-448`)
- **NCBI** — API key for PubMed access (recommended)
- **(Optional) Supabase** — For cloud database logging

## Installation

### 1. Clone the Repository

```bash
git clone https://github.com/your-org/VERIFAI.git
cd VERIFAI
```

### 2. Set Up Conda Environment

```bash
conda create -n verifai python=3.10 -y
conda activate verifai
```

### 3. Install PyTorch with CUDA

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Verify CUDA
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0)}')"
```

### 4. Install Python Dependencies

```bash
pip install -r requirements.txt
```

### 5. Install NLTK Data

```bash
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
```

### 6. Install RadGraph Model (for NLP validation)

```bash
python scripts/install_radgraph_model.py
```

RadGraph will auto-download from Hugging Face on first use if the local path is not found.

### 7. Install Frontend

```bash
cd frontend
npm install
cd ..
```

## Environment Configuration

```bash
cp .env.example .env
```

Edit `.env` with your values:

```bash
# ── Required ──────────────────────────────────────────────────────
HUGGINGFACE_TOKEN=hf_your_token_here    # MedGemma/MedSigLIP access

# ── PubMed (Recommended) ──────────────────────────────────────────
NCBI_EMAIL=your.email@example.com       # Required by NCBI policy
NCBI_API_KEY=your_ncbi_key              # Enables 10 req/s (vs 3 req/s)

# ── Database ──────────────────────────────────────────────────────
DATABASE_MODE=sqlite                    # Use sqlite for local / supabase for cloud

# ── Models ────────────────────────────────────────────────────────
MEDGEMMA_4B_MODEL=google/medgemma-1.5-4b-it
MEDSIGLIP_BASE_MODEL=google/medsiglip-448
MEDSIGLIP_WEIGHTS_PATH=medsiglip_full_model.pt   # Local fine-tuned classifier weights

# ── Optional: Supabase Cloud DB ───────────────────────────────────
SUPABASE_URL=https://xxx.supabase.co
SUPABASE_KEY=your_anon_key
SUPABASE_SERVICE_KEY=your_service_key

# ── Optional: Semantic Scholar ────────────────────────────────────
SEMANTIC_SCHOLAR_API_KEY=your_key

# ── Optional: Workflow Flags ──────────────────────────────────────
MOCK_MODELS=False       # True = skip model download, use mocks (no GPU needed)
ENABLE_LLM_CRITIC=False # Extra MedGemma semantic critic pass (slower)
```

Set `MOCK_MODELS=True` for a first run to verify the pipeline wiring before downloading model weights.


## Running the System

### Quick Start — Full Pipeline Test (No Frontend)

```bash
conda activate verifai
export CUDA_VISIBLE_DEVICES=0
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

python tests/test_workflow.py
```

This runs all 9 agents on a sample chest X-ray and prints:

- Radiologist findings & impression
- CheXbert structured labels
- Critic safety assessment
- Debate consensus
- Final diagnosis with confidence & reproducibility hash
- Full audit trace

### Start Full Stack (Backend + Frontend)

```bash
# Terminal 1 — Backend
conda activate verifai
export CUDA_VISIBLE_DEVICES=0
python -m uvicorn app.main:app --host 0.0.0.0 --port 8000

# Terminal 2 — Frontend
cd frontend
npm run dev
# → http://localhost:3000
```

### API Only (cURL)

```bash
# Start workflow
curl -X POST http://localhost:8000/api/v1/workflows/start \
  -F "images=@chest_xray_AP.jpg" \
  -F "views=AP" \
  -F "patient_id=patient-123"

# Response:
# {"session_id": "abc-123"}

# Poll status
curl http://localhost:8000/api/v1/workflows/abc-123/status

# Subscribe to live SSE progress
curl -N http://localhost:8000/api/v1/workflows/abc-123/stream
```
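
The same SSE stream can be consumed from Python for scripted monitoring. A minimal sketch assuming only the `requests` library and standard `data: <json>` SSE framing (the payload fields are whatever the backend emits):

```python
import json
import requests

session_id = "abc-123"  # returned by /workflows/start
url = f"http://localhost:8000/api/v1/workflows/{session_id}/stream"

# stream=True keeps the HTTP connection open; SSE events arrive line by line
with requests.get(url, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            event = json.loads(line[len("data:"):].strip())
            print(event)  # one progress event per agent step
```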

## Models Used

| Model | Source | Purpose |
|-------|--------|---------|
| MedGemma 4B-IT | `google/medgemma-1.5-4b-it` (Hugging Face, gated) | Primary VLM for Radiologist, Historian, Literature synthesis |
| MedSigLIP | `google/medsiglip-448` (Hugging Face, gated) + local `medsiglip_full_model.pt` | Fine-tuned disease classifier + visual FAISS retrieval |
| CheXbert (f1chexbert) | `f1chexbert` pip package | 14-label structured pathology labeling |
| RadGraph | Auto-downloaded from Hugging Face | Clinical NLP entity extraction for Validator |
| all-MiniLM-L6-v2 | `sentence-transformers/all-MiniLM-L6-v2` | Semantic embeddings for FHIR retrieval + Critic |

MedGemma is used as the **base** model (no LoRA adapters). The fine-tuned MedSigLIP classifier (`medsiglip_full_model.pt`) must be placed at the path specified by `MEDSIGLIP_WEIGHTS_PATH` in `.env`.


## Building the FHIR + FAISS Retrieval Index

The Historian agent uses a FAISS vector index over patient FHIR records for semantic retrieval.

### Step 1: Extract FHIR Bundles to DuckDB

```bash
python extract_fhir_to_duckdb.py \
  --fhir_dir path/to/fhir/bundles \
  --output verifai_fhir.duckdb
```

### Step 2: Build FAISS Index

```bash
python scripts/build_retrieval_index.py \
  --duckdb_path verifai_fhir.duckdb \
  --output_faiss verifai_fhir.faiss \
  --output_mapping verifai_fhir_mapping.json
```

This produces:

- `verifai_fhir.faiss` — Vector index for fast similarity search
- `verifai_fhir_mapping.json` — Maps FAISS IDs to FHIR resource IDs
- `verifai_fhir.duckdb` — Structured patient data for SQL queries
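
A sketch of how the built artifacts might be queried, assuming the index stores normalized all-MiniLM-L6-v2 embeddings and the mapping keys FAISS row IDs to FHIR resource IDs (variable names are illustrative; the real retrieval logic lives in `agents/historian/`):

```python
import json

import faiss
from sentence_transformers import SentenceTransformer

index = faiss.read_index("verifai_fhir.faiss")
with open("verifai_fhir_mapping.json") as f:
    id_to_fhir = json.load(f)  # assumed shape: FAISS row id (str) -> FHIR resource id

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query = encoder.encode(["history of pleural effusion"], normalize_embeddings=True)

scores, ids = index.search(query, 5)  # top-5 nearest patient facts
for score, idx in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {id_to_fhir[str(idx)]}")
```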

### Step 3: (Optional) Seed Past Mistakes Database

```bash
python scripts/seed_pb.py
```

## Frontend Dashboard

The frontend is a Next.js 14 app (TypeScript + Tailwind CSS) with a premium dark-themed clinical interface.

### Pages

| Route | Description |
|-------|-------------|
| `/` | Landing page — "The Council" of 6 agents |
| `/diagnose` | Upload CXR + optional FHIR report + patient ID |
| `/results/[session_id]` | Live SSE agent feed → final diagnosis with evidence tabs |
| `/observability` | System metrics dashboard (auto-refreshes every 15 s) |

### Results Page Tabs

| Tab | Content |
|-----|---------|
| Visual Proof | Original DICOM + MedSigLIP LRP heatmap side-by-side |
| Clinical | Supporting & contradicting FHIR facts from the Historian |
| Literature | PubMed citations with relevance summaries |
| Safety | Safety guardrails report (score, flags, critical findings) |
| Audit Trail | SHA-256 hash + full execution trace |

### Human-in-the-Loop Panel

When the Validator flags a case for human review, a Human-in-the-Loop panel appears on the results page. The clinician can:

- **Approve the diagnosis** — marks the workflow as completed
- **Provide Feedback + Rerun** — injects clinical notes and re-runs the pipeline from the Critic, with a live SSE feed showing the rerun progress (see the sketch below)
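
A hedged sketch of driving that review programmatically via the resume endpoint; the JSON field names (`action`, `feedback`) are assumptions about the payload shape, not the documented API:

```python
import requests

session_id = "abc-123"  # the flagged workflow's session id
resp = requests.post(
    f"http://localhost:8000/api/v1/workflows/{session_id}/resume",
    json={
        "action": "reject",  # or "approve"; field names are illustrative
        "feedback": "Effusion appears loculated; please reassess laterality.",
    },
    timeout=30,
)
print(resp.json())  # a rejection re-enters the pipeline at the Critic
```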

## Observability & Monitoring

```bash
# Metrics snapshot (JSON)
curl http://localhost:8000/api/v1/metrics/summary
```

Returns:

- `system` — total workflows, active, deferrals, critical findings
- `agents` — per-agent latency (mean, p95, p99), invocation counts
- `diagnostics` — confidence, uncertainty, safety score distributions, debate rounds
- `safety` — safety flags, errors by component

A visual dashboard is available at http://localhost:3000/observability.

When running via `test_workflow.py`, metrics are saved to `metrics_snapshot.json` and automatically served by the API.

## Database Logging

| Mode | When to use | Storage |
|------|-------------|---------|
| `DATABASE_MODE=sqlite` | Local testing | SQLite at `verifai_logs.db` |
| `DATABASE_MODE=supabase` | Production / cloud | Supabase PostgreSQL (see `db/supabase_schema.sql`) |

Key tables: `workflow_sessions`, `agent_invocations`, `radiologist_logs`, `critic_logs`, `historian_logs`, `debate_logs`, `validator_logs`, `doctor_feedback`, `past_mistakes`.


## Testing

```bash
# Full end-to-end integration test (recommended first test)
conda activate verifai
export CUDA_VISIBLE_DEVICES=0
python tests/test_workflow.py

# Unit tests
pytest tests/ -v

# Frontend build verification
cd frontend && npx next build
```

## API Reference

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/v1/workflows/start` | Upload image(s) + FHIR data, start async workflow |
| GET | `/api/v1/workflows/{id}/status` | Poll workflow status + results |
| GET | `/api/v1/workflows/{id}/stream` | SSE stream for live agent progress |
| POST | `/api/v1/workflows/{id}/resume` | Submit doctor feedback (approve / reject) |
| POST | `/api/v1/safety/validate` | Run safety guardrails on a diagnosis string |
| GET | `/api/v1/metrics/summary` | Observability metrics |
| GET | `/api/v1/health` | Server health check |
| POST | `/api/past-mistakes/insert` | Insert a validated diagnostic mistake |
| POST | `/api/past-mistakes/search` | Search for similar past mistakes |
| GET | `/api/past-mistakes/statistics` | Aggregate past-mistakes statistics |

### Example: Start Workflow

```bash
curl -X POST http://localhost:8000/api/v1/workflows/start \
  -F "images=@chest_xray_AP.jpg" \
  -F "images=@chest_xray_LAT.jpg" \
  -F "views=AP" \
  -F "views=LATERAL" \
  -F "patient_id=patient-123" \
  -F "fhir_report=@patient_fhir_bundle.json"
```

### Example: Status Response

```json
{
  "session_id": "abc-123",
  "status": "completed",
  "final_result": {
    "diagnosis": "Right lower lobe pneumonia with associated pleural effusion",
    "confidence": 0.87,
    "reproducibility_hash": "a3f9c2e1b7d4082f...",
    "evidence_packet": { "...": "..." },
    "trace": ["[RAD] Findings generated", "[CHEXBERT] 3 labels found", "..."]
  }
}
```

## Configuration Reference

All settings live in `app/config.py` and can be overridden via `.env`:

### Models

| Variable | Default | Description |
|----------|---------|-------------|
| `MEDGEMMA_4B_MODEL` | `google/medgemma-1.5-4b-it` | Base MedGemma model (no LoRA) |
| `MEDSIGLIP_BASE_MODEL` | `google/medsiglip-448` | Vision encoder backbone |
| `MEDSIGLIP_WEIGHTS_PATH` | `medsiglip_full_model.pt` | Local fine-tuned classifier checkpoint |
| `TEXT_EMBEDDING_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | Semantic embeddings |
| `MOCK_MODELS` | `False` | Skip real models, use mocks |

### Workflow

| Variable | Default | Description |
|----------|---------|-------------|
| `DATABASE_MODE` | `sqlite` | `sqlite` (local) or `supabase` (cloud) |
| `DEBATE_MAX_ROUNDS` | `3` | Maximum debate rounds |
| `DEBATE_CONSENSUS_THRESHOLD` | `0.15` | Max disagreement delta for consensus |
| `MAX_ROUTING_STEPS` | `5` | Prevent infinite routing loops |
| `ENABLE_LLM_CRITIC` | `False` | Enable MedGemma semantic critic pass |
| `ENABLE_PAST_MISTAKES_MEMORY` | `True` | Historical error retrieval at Critic |
| `ENABLE_DOCTOR_FEEDBACK` | `True` | Enable feedback reprocessing loop |
| `USE_DEBATE_WORKFLOW` | `True` | Enable multi-agent debate |
| `USE_PARALLEL_AGENTS` | `True` | Run Historian + Literature in parallel |

### API Keys

| Variable | Required | Description |
|----------|----------|-------------|
| `HUGGINGFACE_TOKEN` | Yes | Access gated Hugging Face models |
| `NCBI_EMAIL` | Yes | PubMed API policy requirement |
| `NCBI_API_KEY` | Recommended | Higher rate limits (10 req/s) |
| `SEMANTIC_SCHOLAR_API_KEY` | Optional | Semantic Scholar access |
| `SUPABASE_URL` / `SUPABASE_KEY` | Optional | Cloud database logging |

## Technical Design Decisions

### Why Multi-Agent (Not a Single Model)?

A single LLM producing diagnoses has no internal checks — it can hallucinate confidently. VERIFAI uses structured adversarial verification:

```mermaid
graph LR
    A["Radiologist\n(may hallucinate)"] -->|Free-text report| B["CheXbert\n(independent label check)"]
    B -->|14 structured labels| C["Critic\n(challenges confidence)"]
    A -->|Impression| D["Literature\n(evidence check)"]
    A -->|Impression| E["Historian\n(FHIR cross-check)"]
    C & D & E -->|Multi-agent debate| F["Validator\n(3-layer quality gate)"]
    F --> G["Final Diagnosis\n(with SHA-256 audit trail)"]
```

Each agent provides an independent, adversarial check on the previous one. No single model can silently propagate a hallucination through all layers.

### Why LangGraph?

- **Typed state** (`TypedDict`) shared across all agents — no message-passing overhead
- **Checkpointing** — workflow can be interrupted and resumed mid-run (critical for human review); see the sketch below
- **Deterministic routing** — graph edges, not LLM-decided next steps
- **Thread safety** — concurrent workflows with fully isolated state
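
A toy wiring sketch showing these properties with the public LangGraph API (typed state, deterministic conditional edges, and a checkpointer for resumability); it is an illustration, not VERIFAI's actual `graph/workflow.py`:

```python
from typing import TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, START, StateGraph


class State(TypedDict):
    report: str
    uncertainty: float


def critic(state: State) -> dict:
    # A node returns a partial state update; LangGraph merges it in.
    return {"uncertainty": state["uncertainty"] - 0.1}


def route(state: State) -> str:
    # Deterministic routing: a plain function over typed state, not an LLM call.
    return "finalize" if state["uncertainty"] < 0.3 else "critic"


builder = StateGraph(State)
builder.add_node("critic", critic)
builder.add_node("finalize", lambda state: state)
builder.add_edge(START, "critic")
builder.add_conditional_edges("critic", route, {"critic": "critic", "finalize": "finalize"})
builder.add_edge("finalize", END)

# The checkpointer persists state per thread_id, which is what makes
# interrupt-and-resume (human review) possible mid-run.
graph = builder.compile(checkpointer=MemorySaver())
result = graph.invoke({"report": "...", "uncertainty": 0.5},
                      config={"configurable": {"thread_id": "demo-1"}})
print(result["uncertainty"])
```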

### Why a Reproducibility Hash?

FDA 21 CFR Part 11 requires electronic records to be auditable. The SHA-256 hash encodes which image was analyzed, what patient context was used, what model versions were active, and what configuration was set at the time of diagnosis. This provides provenance rather than exact reproduction: LLMs are stochastic, but the context that produced the output is fully traceable.
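
A minimal sketch of how such a fingerprint can be computed; the field choices are illustrative, and the real packet is assembled by the Finalize agent:

```python
import hashlib
import json


def reproducibility_hash(image_bytes: bytes, fhir_bundle: dict, config: dict) -> str:
    """SHA-256 over the image, the patient context, and the active configuration.

    sort_keys makes the JSON serialization deterministic, so identical
    inputs always yield the identical fingerprint.
    """
    h = hashlib.sha256()
    h.update(image_bytes)
    h.update(json.dumps(fhir_bundle, sort_keys=True).encode())
    h.update(json.dumps(config, sort_keys=True).encode())
    return h.hexdigest()


fingerprint = reproducibility_hash(
    image_bytes=open("chest_xray_AP.jpg", "rb").read(),
    fhir_bundle={"resourceType": "Bundle", "entry": []},
    config={"model": "google/medgemma-1.5-4b-it", "debate_rounds": 3},
)
print(fingerprint)
```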


## License

This project is for research and educational purposes. Clinical deployment requires regulatory review and validation.
