Extract insights from FDA rejection patterns to accelerate drug development
This toolkit downloads, parses, and analyzes FDA Complete Response Letters (CRLs) from the OpenFDA database to reveal patterns in regulatory decision-making that can help drug developers avoid common pitfalls and increase their probability of approval.
The analyses below are ranked by their impact on the drug discovery community and their ability to reveal FDA decision-making patterns:
Impact: ⭐⭐⭐⭐⭐ (Highest)
Why this matters:
- This is the holy grail question: What separates drugs that recover from CRLs vs. those that don't?
- Directly actionable for drug developers designing trials and regulatory strategies
- Can reveal "hidden rules" in FDA decision-making that aren't in official guidance
What it reveals:
- Which deficiency types are "death sentences" vs. "speed bumps"
- Whether certain combinations of issues are particularly problematic
- If application type (NDA vs. BLA) or therapeutic area influences recovery
- The threshold of deficiency count/severity that predicts failure
Key outputs:
- Feature importance rankings (Random Forest, XGBoost)
- ROC curves showing prediction accuracy
- SHAP values for interpretability
- Risk score calculator for new applications
Example insight: "CRLs citing both safety concerns AND requests for new clinical trials have only 12% rescue rate vs. 78% for manufacturing-only issues"
```python
from src.analysis import CRLAnalyzer

analyzer = CRLAnalyzer(data_path="data/processed/parsed_crls.json")
results = analyzer.build_approval_classifier()
analyzer.plot_feature_importance(results, save_path="outputs/feature_importance.png")
```

Impact: ⭐⭐⭐⭐⭐ (Highest)
Why this matters:
- Quantifies the "recoverability" of each deficiency type
- Essential for portfolio risk management and investment decisions
- Helps sponsors prioritize resources on fixable vs. terminal issues
What it reveals:
- Manufacturing issues → High rescue rate (often fixable)
- Fundamental efficacy failures → Low rescue rate (rarely recoverable)
- Which combination patterns have synergistic negative effects
- Time-to-resolution by deficiency type
Key outputs:
- Rescue rate bar chart by category
- Kaplan-Meier survival curves for time-to-approval
- Sankey diagram showing CRL → outcome flow
- Risk matrix (deficiency type × severity → rescue probability)
Example insight: "CMC/manufacturing deficiencies have 87% rescue rate with median 8-month resolution, while 'failed to demonstrate efficacy' has 23% rescue rate with median 2.5-year resolution"
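The Kaplan-Meier survival curves listed under key outputs can be sketched with a minimal pure-NumPy estimator. The durations and approval flags below are hypothetical, and the toolkit may well use a survival library such as lifelines instead, so treat this as an illustration of the technique only:

```python
import numpy as np

def kaplan_meier(durations, events):
    """Kaplan-Meier estimate of P(still unapproved at time t).

    durations: months from CRL to approval (or to last follow-up)
    events: 1 if the application was eventually approved, 0 if censored
    """
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.sort(np.unique(durations[events == 1]))
    surv, s = [], 1.0
    for t in event_times:
        n_at_risk = np.sum(durations >= t)            # still unresolved just before t
        d = np.sum((durations == t) & (events == 1))  # approvals occurring at t
        s *= 1.0 - d / n_at_risk
        surv.append(s)
    return event_times, np.array(surv)

# Hypothetical CMC-deficiency resolution times (months); the 14-month case is censored
times, surv = kaplan_meier([6, 8, 8, 10, 14], [1, 1, 1, 1, 0])
```

With real data, each deficiency category would get its own curve so their time-to-approval profiles can be compared on one plot.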
```python
rescue_rates = analyzer.calculate_rescue_rates()
analyzer.plot_rescue_rates(save_path="outputs/rescue_rates.png")
```

Impact: ⭐⭐⭐⭐ (Very High)
Why this matters:
- CRLs contain nuanced, unstructured language that keyword matching misses
- LLMs can extract FDA's implicit severity signals from phrasing
- Enables semantic search across the corpus ("find all CRLs with hepatotoxicity concerns")
- Creates structured database from free-text for downstream ML
What it reveals:
- Granular deficiency taxonomy beyond top-level categories
- FDA's "tone" and implicit severity (confident rejection vs. request for clarification)
- Specific remediation recommendations and their implied complexity
- Patterns in FDA reviewer language over time
Key outputs:
- Structured deficiency database with severity scores
- Semantic embeddings for similarity search
- Automatic classification of resubmission class (I vs. II)
- Remediation complexity estimates
Example insight: "CRLs using language like 'cannot determine' have 34% lower rescue rate than those with 'additional data needed', suggesting FDA confidence level is predictive"
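The semantic similarity search listed under key outputs reduces to cosine similarity over embedding vectors. The sketch below uses toy NumPy vectors; `top_k_similar`, the embedding values, and the 4-dimensional shape are illustrative and not part of the toolkit's API:

```python
import numpy as np

def top_k_similar(query_vec, doc_matrix, k=3):
    """Return indices of the k documents most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = docs @ q
    return np.argsort(scores)[::-1][:k], scores

# Toy 4-dimensional embeddings for three CRLs (placeholder values)
embeddings = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.2, 0.0],
    [0.7, 0.2, 0.1, 0.0],
])
query = np.array([1.0, 0.0, 0.0, 0.0])  # e.g. embedded "hepatotoxicity concerns"
idx, _ = top_k_similar(query, embeddings, k=2)
```

In practice the query vector would come from the same embedding model used to encode the corpus, and the matrix would hold one row per CRL.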
```python
from src.llm_analysis import LLMAnalyzer

llm = LLMAnalyzer(api_key="your-key")
extractions = llm.analyze_batch(documents, output_path="outputs/llm_extractions.json")

# Comparative analysis
comparison = llm.comparative_analysis(approved_extractions, unapproved_extractions)
```

```bash
# Clone/download project
cd fda_crl_analysis

# Create virtual environment
python -m venv venv
source venv/bin/activate   # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt
```

```bash
# Download, parse, analyze, and run language analysis
python main.py --download --parse --analyze --language

# With LLM analysis (requires API key)
export ANTHROPIC_API_KEY="your-key"
python main.py --download --parse --llm-analyze --analyze --language
```

```bash
# Download only
python main.py --download

# Parse PDFs (requires downloaded data)
python main.py --parse --limit 50   # Test with 50 docs

# Parse with raw text (needed for language analysis)
python main.py --parse --include-raw-text

# Analysis only (requires parsed data)
python main.py --analyze --parsed-data data/processed/parsed_crls.json

# Language analysis only
python main.py --language --parsed-data data/processed/parsed_crls.json
```

```
fda_crl_analysis/
├── main.py                    # Pipeline orchestrator
├── requirements.txt           # Dependencies
├── README.md                  # This file
├── CLAUDE.md                  # Claude Code implementation guide
├── src/
│   ├── data_acquisition.py    # Download from OpenFDA
│   ├── pdf_parser.py          # PDF text extraction & parsing
│   ├── llm_analysis.py        # Claude-based deep analysis
│   ├── analysis.py            # Statistical analysis & visualization
│   └── language_analysis.py   # NLP, sentiment & latent space viz
├── scripts/
│   ├── generate_embeddings.py # Generate vector embeddings for search
│   └── generate_highlights.py # Generate PDF highlight annotations
├── data/
│   ├── raw/
│   │   ├── approved_crls/     # Downloaded approved CRL PDFs
│   │   └── unapproved_crls/   # Downloaded unapproved CRL PDFs
│   └── processed/
│       └── parsed_crls.json   # Structured extracted data
├── outputs/
│   ├── deficiency_frequency.png
│   ├── feature_importance.png
│   ├── roc_curves.png
│   ├── rescue_rates.png
│   ├── cooccurrence_heatmap.png
│   ├── statistical_comparison.png
│   ├── analysis_summary.json
│   └── language/              # Language analysis outputs
│       ├── wordcloud_comparison.png
│       ├── wordcloud_severity.png
│       ├── tsne_embeddings.png
│       ├── umap_embeddings.png
│       ├── cluster_analysis.png
│       ├── topic_model.png
│       └── severity_landscape.png
├── website/                   # Interactive Next.js website
│   ├── app/                   # Next.js app pages
│   ├── components/            # React components
│   ├── lib/                   # Search & embedding logic
│   └── public/data/           # Static data & embeddings
└── notebooks/
    └── exploratory_analysis.ipynb
```
| # | Analysis | Module | Impact |
|---|----------|--------|--------|
| 1 | Discriminative Features (Approved vs. Unapproved) | analysis.py | ⭐⭐⭐⭐⭐ |
| 2 | Rescue Rate by Deficiency Category | analysis.py | ⭐⭐⭐⭐⭐ |
| 3 | LLM Severity & Taxonomy Extraction | llm_analysis.py | ⭐⭐⭐⭐ |
| 4 | Deficiency Co-occurrence Patterns | analysis.py | ⭐⭐⭐⭐ |
| 5 | Time-to-Resolution Analysis | analysis.py | ⭐⭐⭐⭐ |
| 6 | Therapeutic Area Risk Profiling | analysis.py | ⭐⭐⭐ |
| 7 | Manufacturing Facility Network Analysis | TBD | ⭐⭐⭐ |
| 8 | Sponsor Communication Gap Analysis | TBD | ⭐⭐⭐ |
| 9 | Temporal Trends & Policy Impact | analysis.py | ⭐⭐⭐ |
| 10 | Predictive "Red Flag" Tool | analysis.py | ⭐⭐⭐⭐ |
- Feature Importance - Which CRL characteristics predict approval
- ROC Curves - Model performance comparison
- Rescue Rate Chart - Recovery probability by deficiency type
- Co-occurrence Heatmap - Deficiency category relationships
- Statistical Comparison - Distributions with significance tests
- Comparative Word Clouds - Approved vs. unapproved language
- Severity-Colored Word Cloud - Terms colored by FDA severity score
- N-gram Comparison - Top bigrams/trigrams by outcome
- Severity Distribution - Histogram + boxplot by approval status
- Action Type Radar - FDA requested actions visualization
- Sentiment Trajectory - Sentiment flow through a document
- t-SNE Embeddings - Documents in latent space (by approval)
- UMAP Embeddings - Alternative dimensionality reduction
- Cluster Analysis - K-means with topic labels
- LDA Topic Model - Topic-word distributions
- Severity Landscape - Latent space colored by severity
```python
from src.analysis import CRLAnalyzer

analyzer = CRLAnalyzer(data_path="path/to/parsed_crls.json")

# Run all analyses
summary = analyzer.generate_full_analysis(output_dir="outputs/")

# Individual analyses
freq = analyzer.deficiency_frequency_analysis()
rates = analyzer.calculate_rescue_rates()
classifier = analyzer.build_approval_classifier()
stats = analyzer.statistical_tests()
```

```python
from src.llm_analysis import LLMAnalyzer

llm = LLMAnalyzer(api_key="sk-ant-...", model="claude-sonnet-4-20250514")

# Single document
extraction = llm.analyze_single_crl(crl_text, file_hash)

# Batch processing
extractions = llm.analyze_batch(documents, output_path="extractions.json")

# Comparative analysis
comparison = llm.comparative_analysis(approved_list, unapproved_list)
```

```python
from src.language_analysis import (
    CRLLanguageAnalysisSuite,
    FDASentimentAnalyzer,
    CRLTextVisualizer,
    CRLLatentSpaceVisualizer,
)

# Full analysis suite
suite = CRLLanguageAnalysisSuite()
results = suite.run_full_analysis(documents, output_dir="outputs/language")

# Individual sentiment analysis
analyzer = FDASentimentAnalyzer()
severity = analyzer.calculate_severity_score(text)
certainty = analyzer.calculate_certainty_score(text)
actions = analyzer.extract_action_types(text)

# Text visualizations
text_viz = CRLTextVisualizer()
text_viz.plot_comparative_wordcloud(approved_texts, unapproved_texts, save_path="wordcloud.png")
text_viz.plot_severity_distribution(documents, save_path="severity.png")
text_viz.plot_action_type_radar(documents, save_path="radar.png")

# Latent space visualizations
latent_viz = CRLLatentSpaceVisualizer()
latent_viz.plot_tsne_embeddings(documents, save_path="tsne.png")
latent_viz.plot_umap_embeddings(documents, save_path="umap.png")
latent_viz.plot_cluster_analysis(documents, n_clusters=5, save_path="clusters.png")
latent_viz.plot_topic_model(documents, n_topics=5, save_path="topics.png")
```

- Approved CRLs: https://download.open.fda.gov/approved_CRLs.zip
- Unapproved CRLs: https://download.open.fda.gov/unapproved_CRLs.zip
- CRL Search Table: https://open.fda.gov/crltable/
- OpenFDA API: https://api.fda.gov/other/crl.json
- Redactions: CRLs are heavily redacted for trade secrets/CCI, limiting some analyses
- Sample Size: ~200 approved + ~89 unapproved CRLs may limit statistical power
- Selection Bias: Unapproved CRLs are recent (2024-2025); approved span 2020-2024
- OCR Quality: Some older PDFs may have extraction errors
- Missing Context: CRLs don't include sponsor responses or negotiation history
The project includes a fully interactive website for exploring CRL data:
Live Demo: https://fda-crl-analysis.vercel.app
- Hybrid Search: BM25 + Vector semantic search across all 297 CRLs
- Keyword mode for exact matches (drug names, application numbers)
- Semantic mode for conceptual queries ("manufacturing issues", "safety concerns")
- Hybrid mode combines both using Reciprocal Rank Fusion
- Inline PDF Viewer: View CRLs with highlight annotations
- Interactive Dashboards: Overview, deficiencies, language analysis, predictive models
- Fully Offline: Uses transformers.js for client-side embeddings (no API calls)
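Reciprocal Rank Fusion, used by the hybrid mode above, can be sketched in a few lines. The document ids and rankings below are made up, and `k=60` is the constant suggested in the original RRF paper, not necessarily what the website uses:

```python
def rrf_fuse(ranked_lists, k=60):
    """Combine ranked result lists via Reciprocal Rank Fusion.

    Each list holds doc ids ordered best-first; a document's fused
    score is the sum of 1 / (k + rank) across every list it appears in.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical BM25 and vector-search rankings over CRL ids
bm25 = ["crl_017", "crl_204", "crl_088"]
vector = ["crl_204", "crl_131", "crl_017"]
fused = rrf_fuse([bm25, vector])
```

Documents ranked well by both retrievers (here the one appearing near the top of each list) float to the front, which is why RRF works without score normalization between BM25 and cosine similarity.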
```bash
cd website
npm install
npm run dev
# Open http://localhost:3000
```

See website/README.md for detailed documentation.
Areas for contribution:
- Additional visualization types (Sankey, network graphs)
- Time series analysis for temporal trends
- Integration with Drugs@FDA for approval dates
- Additional ML models (neural networks, survival analysis)
MIT License - See LICENSE file
If you use this toolkit in research, please cite:
FDA CRL Analysis Toolkit (2025)
https://github.com/your-repo/fda-crl-analysis
- FDA Press Release: "FDA Embraces Radical Transparency by Publishing Complete Response Letters" (July 2025)
- Lurie et al. (2015) BMJ Analysis of CRL Disclosure Gaps
- OpenFDA Documentation: https://open.fda.gov/apis/transparency/completeresponseletters/