FDA Complete Response Letter (CRL) Analysis Pipeline

Extract insights from FDA rejection patterns to accelerate drug development

This toolkit downloads, parses, and analyzes FDA Complete Response Letters from the OpenFDA database to reveal patterns in regulatory decision-making that can help drug developers avoid common pitfalls and increase approval probability.

🏆 Top 3 High-Impact Analyses

Ranked by impact on the drug discovery community and ability to reveal FDA decision-making patterns:


#1: Approved vs. Unapproved Discriminative Feature Analysis

Impact: ⭐⭐⭐⭐⭐ (Highest)

Why this matters:

  • This is the holy grail question: What separates drugs that recover from CRLs vs. those that don't?
  • Directly actionable for drug developers designing trials and regulatory strategies
  • Can reveal "hidden rules" in FDA decision-making that aren't in official guidance

What it reveals:

  • Which deficiency types are "death sentences" vs. "speed bumps"
  • Whether certain combinations of issues are particularly problematic
  • If application type (NDA vs. BLA) or therapeutic area influences recovery
  • The threshold of deficiency count/severity that predicts failure

Key outputs:

  • Feature importance rankings (Random Forest, XGBoost)
  • ROC curves showing prediction accuracy
  • SHAP values for interpretability
  • Risk score calculator for new applications

Example insight: "CRLs citing both safety concerns AND requests for new clinical trials have only 12% rescue rate vs. 78% for manufacturing-only issues"

from src.analysis import CRLAnalyzer

analyzer = CRLAnalyzer(data_path="data/processed/parsed_crls.json")
results = analyzer.build_approval_classifier()
analyzer.plot_feature_importance(results, save_path="outputs/feature_importance.png")
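
The SHAP values listed under Key outputs could be computed directly from the trained classifier. A minimal sketch, assuming the results dict exposes the fitted tree model and feature matrix under the hypothetical keys "model" and "X" (check build_approval_classifier() for the actual return structure):

import shap  # model interpretability library

# Hypothetical keys; adapt to what build_approval_classifier() actually returns
model = results["model"]   # e.g., a fitted RandomForestClassifier
X = results["X"]           # feature matrix the model was trained on

explainer = shap.TreeExplainer(model)          # efficient explainer for tree ensembles
shap_values = explainer.shap_values(X)         # per-document, per-feature attributions
shap.summary_plot(shap_values, X, show=False)  # beeswarm summary of feature impact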

#2: Rescue Rate by Deficiency Category

Impact: ⭐⭐⭐⭐⭐ (Highest)

Why this matters:

  • Quantifies the "recoverability" of each deficiency type
  • Essential for portfolio risk management and investment decisions
  • Helps sponsors prioritize resources toward fixable rather than terminal issues

What it reveals:

  • Manufacturing issues → High rescue rate (often fixable)
  • Fundamental efficacy failures → Low rescue rate (rarely recoverable)
  • Which combination patterns have synergistic negative effects
  • Time-to-resolution by deficiency type

Key outputs:

  • Rescue rate bar chart by category
  • Kaplan-Meier survival curves for time-to-approval
  • Sankey diagram showing CRL → outcome flow
  • Risk matrix (deficiency type × severity → rescue probability)

Example insight: "CMC/manufacturing deficiencies have 87% rescue rate with median 8-month resolution, while 'failed to demonstrate efficacy' has 23% rescue rate with median 2.5-year resolution"

rescue_rates = analyzer.calculate_rescue_rates()
analyzer.plot_rescue_rates(save_path="outputs/rescue_rates.png")
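
The Kaplan-Meier time-to-approval curves listed under Key outputs could be drawn with the lifelines package. A minimal sketch, assuming hypothetical fields months_to_resolution (duration) and approved (event indicator) in the parsed data:

import pandas as pd
from lifelines import KaplanMeierFitter

# Hypothetical columns; adapt to the fields actually present in parsed_crls.json
df = pd.read_json("data/processed/parsed_crls.json")

kmf = KaplanMeierFitter()
kmf.fit(durations=df["months_to_resolution"], event_observed=df["approved"], label="All CRLs")
ax = kmf.plot_survival_function()  # survival = probability of still awaiting approval
ax.figure.savefig("outputs/time_to_approval_km.png")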

#3: LLM-Powered Deficiency Extraction & Severity Scoring

Impact: ⭐⭐⭐⭐ (Very High)

Why this matters:

  • CRLs contain nuanced, unstructured language that keyword matching misses
  • LLMs can extract FDA's implicit severity signals from phrasing
  • Enables semantic search across the corpus ("find all CRLs with hepatotoxicity concerns")
  • Creates structured database from free-text for downstream ML

What it reveals:

  • Granular deficiency taxonomy beyond top-level categories
  • FDA's "tone" and implicit severity (confident rejection vs. request for clarification)
  • Specific remediation recommendations and their implied complexity
  • Patterns in FDA reviewer language over time

Key outputs:

  • Structured deficiency database with severity scores
  • Semantic embeddings for similarity search
  • Automatic classification of resubmission class (Class 1 vs. Class 2)
  • Remediation complexity estimates

Example insight: "CRLs using language like 'cannot determine' have 34% lower rescue rate than those with 'additional data needed', suggesting FDA confidence level is predictive"

from src.llm_analysis import LLMAnalyzer

llm = LLMAnalyzer(api_key="your-key")
extractions = llm.analyze_batch(documents, output_path="outputs/llm_extractions.json")

# Comparative analysis
comparison = llm.comparative_analysis(approved_extractions, unapproved_extractions)
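
Each extraction is stored as a structured record for downstream ML. The exact schema lives in src/llm_analysis.py; the shape below is purely illustrative:

# Illustrative record shape only; see src/llm_analysis.py for the real schema
example_extraction = {
    "file_hash": "…",
    "approved": False,
    "deficiencies": [
        {"category": "clinical_efficacy", "severity": 4, "remediation": "new pivotal trial"},
        {"category": "cmc_manufacturing", "severity": 2, "remediation": "updated facility data"},
    ],
    "fda_certainty": "high",  # e.g., 'cannot determine' signals more certainty than 'additional data needed'
}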

Quick Start

Installation

# Clone/download project
cd fda_crl_analysis

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

Run Full Pipeline

# Download, parse, analyze, and run language analysis
python main.py --download --parse --analyze --language

# With LLM analysis (requires API key)
export ANTHROPIC_API_KEY="your-key"
python main.py --download --parse --llm-analyze --analyze --language

Run Individual Stages

# Download only
python main.py --download

# Parse PDFs (requires downloaded data)
python main.py --parse --limit 50  # Test with 50 docs

# Parse with raw text (needed for language analysis)
python main.py --parse --include-raw-text

# Analysis only (requires parsed data)
python main.py --analyze --parsed-data data/processed/parsed_crls.json

# Language analysis only
python main.py --language --parsed-data data/processed/parsed_crls.json

Project Structure

fda_crl_analysis/
├── main.py                 # Pipeline orchestrator
├── requirements.txt        # Dependencies
├── README.md              # This file
├── CLAUDE.md              # Claude Code implementation guide
├── src/
│   ├── data_acquisition.py # Download from OpenFDA
│   ├── pdf_parser.py       # PDF text extraction & parsing
│   ├── llm_analysis.py     # Claude-based deep analysis
│   ├── analysis.py         # Statistical analysis & visualization
│   └── language_analysis.py # NLP, sentiment & latent space viz
├── scripts/
│   ├── generate_embeddings.py # Generate vector embeddings for search
│   └── generate_highlights.py # Generate PDF highlight annotations
├── data/
│   ├── raw/
│   │   ├── approved_crls/  # Downloaded approved CRL PDFs
│   │   └── unapproved_crls/ # Downloaded unapproved CRL PDFs
│   └── processed/
│       └── parsed_crls.json # Structured extracted data
├── outputs/
│   ├── deficiency_frequency.png
│   ├── feature_importance.png
│   ├── roc_curves.png
│   ├── rescue_rates.png
│   ├── cooccurrence_heatmap.png
│   ├── statistical_comparison.png
│   ├── analysis_summary.json
│   └── language/           # Language analysis outputs
│       ├── wordcloud_comparison.png
│       ├── wordcloud_severity.png
│       ├── tsne_embeddings.png
│       ├── umap_embeddings.png
│       ├── cluster_analysis.png
│       ├── topic_model.png
│       └── severity_landscape.png
├── website/                # Interactive Next.js website
│   ├── app/               # Next.js app pages
│   ├── components/        # React components
│   ├── lib/               # Search & embedding logic
│   └── public/data/       # Static data & embeddings
└── notebooks/
    └── exploratory_analysis.ipynb

All 10 Analysis Ideas

#    Analysis                                            Module              Impact
1    Discriminative Features (Approved vs. Unapproved)   analysis.py         ⭐⭐⭐⭐⭐
2    Rescue Rate by Deficiency Category                  analysis.py         ⭐⭐⭐⭐⭐
3    LLM Severity & Taxonomy Extraction                  llm_analysis.py     ⭐⭐⭐⭐
4    Deficiency Co-occurrence Patterns                   analysis.py         ⭐⭐⭐⭐
5    Time-to-Resolution Analysis                         analysis.py         ⭐⭐⭐⭐
6    Therapeutic Area Risk Profiling                     analysis.py         ⭐⭐⭐
7    Manufacturing Facility Network Analysis             TBD                 ⭐⭐⭐
8    Sponsor Communication Gap Analysis                  TBD                 ⭐⭐⭐
9    Temporal Trends & Policy Impact                     analysis.py         ⭐⭐⭐
10   Predictive "Red Flag" Tool                          analysis.py         ⭐⭐⭐⭐

Key Visualizations

Statistical Analysis (analysis.py)

  1. Feature Importance - Which CRL characteristics predict approval
  2. ROC Curves - Model performance comparison
  3. Rescue Rate Chart - Recovery probability by deficiency type
  4. Co-occurrence Heatmap - Deficiency category relationships
  5. Statistical Comparison - Distributions with significance tests

Language & Sentiment Analysis (language_analysis.py)

  1. Comparative Word Clouds - Approved vs. unapproved language
  2. Severity-Colored Word Cloud - Terms colored by FDA severity score
  3. N-gram Comparison - Top bigrams/trigrams by outcome
  4. Severity Distribution - Histogram + boxplot by approval status
  5. Action Type Radar - FDA requested actions visualization
  6. Sentiment Trajectory - Sentiment flow through a document
  7. t-SNE Embeddings - Documents in latent space (by approval)
  8. UMAP Embeddings - Alternative dimensionality reduction
  9. Cluster Analysis - K-means with topic labels
  10. LDA Topic Model - Topic-word distributions
  11. Severity Landscape - Latent space colored by severity

API Reference

CRLAnalyzer

from src.analysis import CRLAnalyzer

analyzer = CRLAnalyzer(data_path="path/to/parsed_crls.json")

# Run all analyses
summary = analyzer.generate_full_analysis(output_dir="outputs/")

# Individual analyses
freq = analyzer.deficiency_frequency_analysis()
rates = analyzer.calculate_rescue_rates()
classifier = analyzer.build_approval_classifier()
stats = analyzer.statistical_tests()

LLMAnalyzer

from src.llm_analysis import LLMAnalyzer

llm = LLMAnalyzer(api_key="sk-ant-...", model="claude-sonnet-4-20250514")

# Single document
extraction = llm.analyze_single_crl(crl_text, file_hash)

# Batch processing
extractions = llm.analyze_batch(documents, output_path="extractions.json")

# Comparative analysis
comparison = llm.comparative_analysis(approved_list, unapproved_list)

Language Analysis

from src.language_analysis import (
    CRLLanguageAnalysisSuite,
    FDASentimentAnalyzer,
    CRLTextVisualizer,
    CRLLatentSpaceVisualizer
)

# Full analysis suite
suite = CRLLanguageAnalysisSuite()
results = suite.run_full_analysis(documents, output_dir="outputs/language")

# Individual sentiment analysis
analyzer = FDASentimentAnalyzer()
severity = analyzer.calculate_severity_score(text)
certainty = analyzer.calculate_certainty_score(text)
actions = analyzer.extract_action_types(text)

# Text visualizations
text_viz = CRLTextVisualizer()
text_viz.plot_comparative_wordcloud(approved_texts, unapproved_texts, save_path="wordcloud.png")
text_viz.plot_severity_distribution(documents, save_path="severity.png")
text_viz.plot_action_type_radar(documents, save_path="radar.png")

# Latent space visualizations
latent_viz = CRLLatentSpaceVisualizer()
latent_viz.plot_tsne_embeddings(documents, save_path="tsne.png")
latent_viz.plot_umap_embeddings(documents, save_path="umap.png")
latent_viz.plot_cluster_analysis(documents, n_clusters=5, save_path="clusters.png")
latent_viz.plot_topic_model(documents, n_topics=5, save_path="topics.png")

Data Sources

CRLs are downloaded from the OpenFDA Complete Response Letters endpoint (see References) and stored under data/raw/, split into approved and unapproved cohorts.

Limitations

  1. Redactions: CRLs are heavily redacted for trade secrets and confidential commercial information (CCI), limiting some analyses
  2. Sample Size: ~200 approved + ~89 unapproved CRLs may limit statistical power
  3. Selection Bias: Unapproved CRLs are recent (2024-2025), while approved CRLs span 2020-2024
  4. OCR Quality: Some older PDFs may have extraction errors
  5. Missing Context: CRLs don't include sponsor responses or negotiation history

Interactive Website

The project includes a fully interactive website for exploring CRL data:

Live Demo: https://fda-crl-analysis.vercel.app

Website Features

  • Hybrid Search: BM25 + Vector semantic search across all 297 CRLs
    • Keyword mode for exact matches (drug names, application numbers)
    • Semantic mode for conceptual queries ("manufacturing issues", "safety concerns")
    • Hybrid mode combines both using Reciprocal Rank Fusion (see the sketch after this list)
  • Inline PDF Viewer: View CRLs with highlight annotations
  • Interactive Dashboards: Overview, deficiencies, language analysis, predictive models
  • Fully Offline: Uses transformers.js for client-side embeddings (no API calls)
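
The search itself is implemented in TypeScript under website/lib/; the fusion step is standard Reciprocal Rank Fusion. A minimal Python sketch of the idea, assuming the two ranked lists of document IDs come from the BM25 and vector searches:

def reciprocal_rank_fusion(bm25_ids, vector_ids, k=60):
    """Fuse two ranked ID lists; k=60 is the conventional RRF constant."""
    scores = {}
    for ranked in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g., reciprocal_rank_fusion(["crl_012", "crl_007"], ["crl_007", "crl_099"])
# ranks crl_007 first because it appears near the top of both lists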

Running Locally

cd website
npm install
npm run dev
# Open http://localhost:3000

See website/README.md for detailed documentation.

Contributing

Areas for contribution:

  • Additional visualization types (Sankey, network graphs)
  • Time series analysis for temporal trends
  • Integration with Drugs@FDA for approval dates
  • Additional ML models (neural networks, survival analysis)

License

MIT License - See LICENSE file

Citation

If you use this toolkit in research, please cite:

FDA CRL Analysis Toolkit (2025)
https://github.com/inventcures/fda_crl_analysis

References

  1. FDA Press Release: "FDA Embraces Radical Transparency by Publishing Complete Response Letters" (July 2025)
  2. Lurie et al. (2015) BMJ Analysis of CRL Disclosure Gaps
  3. OpenFDA Documentation: https://open.fda.gov/apis/transparency/completeresponseletters/
