Extract insights from FDA rejection patterns to accelerate drug development
This toolkit downloads, parses, and analyzes FDA Complete Response Letters (CRLs) from the OpenFDA database to reveal patterns in regulatory decision-making that can help drug developers avoid common pitfalls and increase their probability of approval.
The analyses below are ranked by their impact on the drug discovery community and their ability to reveal FDA decision-making patterns:
Impact: ⭐⭐⭐⭐⭐ (Highest)
Why this matters:
- This is the holy grail question: What separates drugs that recover from CRLs vs. those that don't?
- Directly actionable for drug developers designing trials and regulatory strategies
- Can reveal "hidden rules" in FDA decision-making that aren't in official guidance
What it reveals:
- Which deficiency types are "death sentences" vs. "speed bumps"
- Whether certain combinations of issues are particularly problematic
- If application type (NDA vs. BLA) or therapeutic area influences recovery
- The threshold of deficiency count/severity that predicts failure
Key outputs:
- Feature importance rankings (Random Forest, XGBoost)
- ROC curves showing prediction accuracy
- SHAP values for interpretability
- Risk score calculator for new applications
Example insight: "CRLs citing both safety concerns AND requests for new clinical trials have only 12% rescue rate vs. 78% for manufacturing-only issues"
```python
from src.analysis import CRLAnalyzer

analyzer = CRLAnalyzer(data_path="data/processed/parsed_crls.json")
results = analyzer.build_approval_classifier()
analyzer.plot_feature_importance(results, save_path="outputs/feature_importance.png")
```

Impact: ⭐⭐⭐⭐⭐ (Highest)
Why this matters:
- Quantifies the "recoverability" of each deficiency type
- Essential for portfolio risk management and investment decisions
- Helps sponsors prioritize resources on fixable vs. terminal issues
What it reveals:
- Manufacturing issues → High rescue rate (often fixable)
- Fundamental efficacy failures → Low rescue rate (rarely recoverable)
- Which combination patterns have synergistic negative effects
- Time-to-resolution by deficiency type
Key outputs:
- Rescue rate bar chart by category
- Kaplan-Meier survival curves for time-to-approval
- Sankey diagram showing CRL → outcome flow
- Risk matrix (deficiency type × severity → rescue probability)
Example insight: "CMC/manufacturing deficiencies have 87% rescue rate with median 8-month resolution, while 'failed to demonstrate efficacy' has 23% rescue rate with median 2.5-year resolution"
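The Kaplan-Meier survival curves listed under key outputs can be sketched with a minimal pure-NumPy estimator. The durations and approval flags below are hypothetical, and the toolkit may well use a survival library such as lifelines instead, so treat this as an illustration of the technique only:

```python
import numpy as np

def kaplan_meier(durations, events):
    """Kaplan-Meier estimate of P(still unapproved at time t).

    durations: months from CRL to approval (or to last follow-up)
    events: 1 if the application was eventually approved, 0 if censored
    """
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.sort(np.unique(durations[events == 1]))
    surv, s = [], 1.0
    for t in event_times:
        n_at_risk = np.sum(durations >= t)            # still unresolved just before t
        d = np.sum((durations == t) & (events == 1))  # approvals occurring at t
        s *= 1.0 - d / n_at_risk
        surv.append(s)
    return event_times, np.array(surv)

# Hypothetical CMC-deficiency resolution times (months); the 14-month case is censored
times, surv = kaplan_meier([6, 8, 8, 10, 14], [1, 1, 1, 1, 0])
```

With real data, each deficiency category would get its own curve so their time-to-approval profiles can be compared on one plot.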
```python
rescue_rates = analyzer.calculate_rescue_rates()
analyzer.plot_rescue_rates(save_path="outputs/rescue_rates.png")
```

Impact: ⭐⭐⭐⭐ (Very High)
Why this matters:
- CRLs contain nuanced, unstructured language that keyword matching misses
- LLMs can extract FDA's implicit severity signals from phrasing
- Enables semantic search across the corpus ("find all CRLs with hepatotoxicity concerns")
- Creates structured database from free-text for downstream ML
What it reveals:
- Granular deficiency taxonomy beyond top-level categories
- FDA's "tone" and implicit severity (confident rejection vs. request for clarification)
- Specific remediation recommendations and their implied complexity
- Patterns in FDA reviewer language over time
Key outputs:
- Structured deficiency database with severity scores
- Semantic embeddings for similarity search
- Automatic classification of resubmission class (I vs. II)
- Remediation complexity estimates
Example insight: "CRLs using language like 'cannot determine' have 34% lower rescue rate than those with 'additional data needed', suggesting FDA confidence level is predictive"
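The semantic similarity search listed under key outputs reduces to cosine similarity over embedding vectors. The sketch below uses toy NumPy vectors; `top_k_similar`, the embedding values, and the 4-dimensional shape are illustrative and not part of the toolkit's API:

```python
import numpy as np

def top_k_similar(query_vec, doc_matrix, k=3):
    """Return indices of the k documents most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = docs @ q
    return np.argsort(scores)[::-1][:k], scores

# Toy 4-dimensional embeddings for three CRLs (placeholder values)
embeddings = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.2, 0.0],
    [0.7, 0.2, 0.1, 0.0],
])
query = np.array([1.0, 0.0, 0.0, 0.0])  # e.g. embedded "hepatotoxicity concerns"
idx, _ = top_k_similar(query, embeddings, k=2)
```

In practice the query vector would come from the same embedding model used to encode the corpus, and the matrix would hold one row per CRL.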
```python
from src.llm_analysis import LLMAnalyzer

llm = LLMAnalyzer(api_key="your-key")
extractions = llm.analyze_batch(documents, output_path="outputs/llm_extractions.json")

# Comparative analysis
comparison = llm.comparative_analysis(approved_extractions, unapproved_extractions)
```

```bash
# Clone/download project
cd fda_crl_analysis

# Create virtual environment
python -m venv venv
source venv/bin/activate   # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt
```

```bash
# Download, parse, analyze, and run language analysis
python main.py --download --parse --analyze --language

# With LLM analysis (requires API key)
export ANTHROPIC_API_KEY="your-key"
python main.py --download --parse --llm-analyze --analyze --language
```

```bash
# Download only
python main.py --download

# Parse PDFs (requires downloaded data)
python main.py --parse --limit 50   # Test with 50 docs

# Parse with raw text (needed for language analysis)
python main.py --parse --include-raw-text

# Analysis only (requires parsed data)
python main.py --analyze --parsed-data data/processed/parsed_crls.json

# Language analysis only
python main.py --language --parsed-data data/processed/parsed_crls.json
```

```
fda_crl_analysis/
├── main.py                    # Pipeline orchestrator
├── requirements.txt           # Dependencies
├── README.md                  # This file
├── CLAUDE.md                  # Claude Code implementation guide
├── src/
│   ├── data_acquisition.py    # Download from OpenFDA
│   ├── pdf_parser.py          # PDF text extraction & parsing
│   ├── llm_analysis.py        # Claude-based deep analysis
│   ├── analysis.py            # Statistical analysis & visualization
│   └── language_analysis.py   # NLP, sentiment & latent space viz
├── scripts/
│   ├── generate_embeddings.py # Generate vector embeddings for search
│   └── generate_highlights.py # Generate PDF highlight annotations
├── data/
│   ├── raw/
│   │   ├── approved_crls/     # Downloaded approved CRL PDFs
│   │   └── unapproved_crls/   # Downloaded unapproved CRL PDFs
│   └── processed/
│       └── parsed_crls.json   # Structured extracted data
├── outputs/
│   ├── deficiency_frequency.png
│   ├── feature_importance.png
│   ├── roc_curves.png
│   ├── rescue_rates.png
│   ├── cooccurrence_heatmap.png
│   ├── statistical_comparison.png
│   ├── analysis_summary.json
│   └── language/              # Language analysis outputs
│       ├── wordcloud_comparison.png
│       ├── wordcloud_severity.png
│       ├── tsne_embeddings.png
│       ├── umap_embeddings.png
│       ├── cluster_analysis.png
│       ├── topic_model.png
│       └── severity_landscape.png
├── website/                   # Interactive Next.js website
│   ├── app/                   # Next.js app pages
│   ├── components/            # React components
│   ├── lib/                   # Search & embedding logic
│   └── public/data/           # Static data & embeddings
└── notebooks/
    └── exploratory_analysis.ipynb
```
| # | Analysis | Module | Impact |
|---|----------|--------|--------|
| 1 | Discriminative Features (Approved vs. Unapproved) | analysis.py | ⭐⭐⭐⭐⭐ |
| 2 | Rescue Rate by Deficiency Category | analysis.py | ⭐⭐⭐⭐⭐ |
| 3 | LLM Severity & Taxonomy Extraction | llm_analysis.py | ⭐⭐⭐⭐ |
| 4 | Deficiency Co-occurrence Patterns | analysis.py | ⭐⭐⭐⭐ |
| 5 | Time-to-Resolution Analysis | analysis.py | ⭐⭐⭐⭐ |
| 6 | Therapeutic Area Risk Profiling | analysis.py | ⭐⭐⭐ |
| 7 | Manufacturing Facility Network Analysis | TBD | ⭐⭐⭐ |
| 8 | Sponsor Communication Gap Analysis | TBD | ⭐⭐⭐ |
| 9 | Temporal Trends & Policy Impact | analysis.py | ⭐⭐⭐ |
| 10 | Predictive "Red Flag" Tool | analysis.py | ⭐⭐⭐⭐ |
- Feature Importance - Which CRL characteristics predict approval
- ROC Curves - Model performance comparison
- Rescue Rate Chart - Recovery probability by deficiency type
- Co-occurrence Heatmap - Deficiency category relationships
- Statistical Comparison - Distributions with significance tests
- Comparative Word Clouds - Approved vs. unapproved language
- Severity-Colored Word Cloud - Terms colored by FDA severity score
- N-gram Comparison - Top bigrams/trigrams by outcome
- Severity Distribution - Histogram + boxplot by approval status
- Action Type Radar - FDA requested actions visualization
- Sentiment Trajectory - Sentiment flow through a document
- t-SNE Embeddings - Documents in latent space (by approval)
- UMAP Embeddings - Alternative dimensionality reduction
- Cluster Analysis - K-means with topic labels
- LDA Topic Model - Topic-word distributions
- Severity Landscape - Latent space colored by severity
```python
from src.analysis import CRLAnalyzer

analyzer = CRLAnalyzer(data_path="path/to/parsed_crls.json")

# Run all analyses
summary = analyzer.generate_full_analysis(output_dir="outputs/")

# Individual analyses
freq = analyzer.deficiency_frequency_analysis()
rates = analyzer.calculate_rescue_rates()
classifier = analyzer.build_approval_classifier()
stats = analyzer.statistical_tests()
```

```python
from src.llm_analysis import LLMAnalyzer

llm = LLMAnalyzer(api_key="sk-ant-...", model="claude-sonnet-4-20250514")

# Single document
extraction = llm.analyze_single_crl(crl_text, file_hash)

# Batch processing
extractions = llm.analyze_batch(documents, output_path="extractions.json")

# Comparative analysis
comparison = llm.comparative_analysis(approved_list, unapproved_list)
```

```python
from src.language_analysis import (
    CRLLanguageAnalysisSuite,
    FDASentimentAnalyzer,
    CRLTextVisualizer,
    CRLLatentSpaceVisualizer,
)

# Full analysis suite
suite = CRLLanguageAnalysisSuite()
results = suite.run_full_analysis(documents, output_dir="outputs/language")

# Individual sentiment analysis
analyzer = FDASentimentAnalyzer()
severity = analyzer.calculate_severity_score(text)
certainty = analyzer.calculate_certainty_score(text)
actions = analyzer.extract_action_types(text)

# Text visualizations
text_viz = CRLTextVisualizer()
text_viz.plot_comparative_wordcloud(approved_texts, unapproved_texts, save_path="wordcloud.png")
text_viz.plot_severity_distribution(documents, save_path="severity.png")
text_viz.plot_action_type_radar(documents, save_path="radar.png")

# Latent space visualizations
latent_viz = CRLLatentSpaceVisualizer()
latent_viz.plot_tsne_embeddings(documents, save_path="tsne.png")
latent_viz.plot_umap_embeddings(documents, save_path="umap.png")
latent_viz.plot_cluster_analysis(documents, n_clusters=5, save_path="clusters.png")
latent_viz.plot_topic_model(documents, n_topics=5, save_path="topics.png")
```

- Approved CRLs: https://download.open.fda.gov/approved_CRLs.zip
- Unapproved CRLs: https://download.open.fda.gov/unapproved_CRLs.zip
- CRL Search Table: https://open.fda.gov/crltable/
- OpenFDA API: https://api.fda.gov/other/crl.json
- Redactions: CRLs are heavily redacted for trade secrets/CCI, limiting some analyses
- Sample Size: ~200 approved + ~89 unapproved CRLs may limit statistical power
- Selection Bias: Unapproved CRLs are recent (2024-2025); approved span 2020-2024
- OCR Quality: Some older PDFs may have extraction errors
- Missing Context: CRLs don't include sponsor responses or negotiation history
The project includes a fully interactive website for exploring CRL data:
Live Demo: https://fda-crl-analysis.vercel.app
- Hybrid Search: BM25 + Vector semantic search across all 297 CRLs
- Keyword mode for exact matches (drug names, application numbers)
- Semantic mode for conceptual queries ("manufacturing issues", "safety concerns")
- Hybrid mode combines both using Reciprocal Rank Fusion
- Inline PDF Viewer: View CRLs with highlight annotations
- Interactive Dashboards: Overview, deficiencies, language analysis, predictive models
- Fully Offline: Uses transformers.js for client-side embeddings (no API calls)
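Reciprocal Rank Fusion, used by the hybrid mode above, can be sketched in a few lines. The document ids and rankings below are made up, and `k=60` is the constant suggested in the original RRF paper, not necessarily what the website uses:

```python
def rrf_fuse(ranked_lists, k=60):
    """Combine ranked result lists via Reciprocal Rank Fusion.

    Each list holds doc ids ordered best-first; a document's fused
    score is the sum of 1 / (k + rank) across every list it appears in.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical BM25 and vector-search rankings over CRL ids
bm25 = ["crl_017", "crl_204", "crl_088"]
vector = ["crl_204", "crl_131", "crl_017"]
fused = rrf_fuse([bm25, vector])
```

Documents ranked well by both retrievers (here the one appearing near the top of each list) float to the front, which is why RRF works without score normalization between BM25 and cosine similarity.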
```bash
cd website
npm install
npm run dev
# Open http://localhost:3000
```

See website/README.md for detailed documentation.
Areas for contribution:
- Additional visualization types (Sankey, network graphs)
- Time series analysis for temporal trends
- Integration with Drugs@FDA for approval dates
- Additional ML models (neural networks, survival analysis)
MIT License - See LICENSE file
If you use this toolkit in research, please cite:
FDA CRL Analysis Toolkit (2025)
https://github.com/your-repo/fda-crl-analysis
- FDA Press Release: "FDA Embraces Radical Transparency by Publishing Complete Response Letters" (July 2025)
- Lurie et al. (2015) BMJ Analysis of CRL Disclosure Gaps
- OpenFDA Documentation: https://open.fda.gov/apis/transparency/completeresponseletters/