Mining clinical trials is essential for accelerating drug discovery and biomedical research. ClinicalTrials.gov contains a wealth of information on ongoing and completed studies, including interventions, conditions, and outcomes. Systematic extraction and integration of this data enables:
- Mapping of drug–disease relationships
- Identification of drug repurposing opportunities
- Analysis of intervention efficacy and safety
This project provides tools to fetch, process, and annotate clinical trial data directly from the AACT (Aggregate Analysis of ClinicalTrials.gov) database. It is designed to facilitate large-scale mining and integration of clinical trials for drug discovery applications.
- Direct Connection to AACT: Uses a robust connector to securely access the AACT PostgreSQL database.
- Automated Table Loading: Loads and filters relevant tables (studies, interventions, conditions, etc.) using Polars for scalable processing.
- Drug and Disease Mapping: Integrates external drug and disease vocabularies to annotate interventions and indications in trials.
- Output: Produces harmonized datasets suitable for downstream analytics, including drug–disease mappings.
**AACT Database**

The AACT database is a PostgreSQL database containing clinical trial data from ClinicalTrials.gov. We use Polars to connect to the database and return query results as DataFrames. Credentials and connection parameters are provided via configuration.
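As a minimal sketch of what such a connection looks like with Polars (the project itself reads credentials from `config.yaml`; the query and placeholders below are illustrative only):

```python
import polars as pl

# AACT's public PostgreSQL endpoint; substitute your registered AACT credentials.
uri = "postgresql://<user>:<password>@aact-db.ctti-clinicaltrials.org:5432/aact"

# Pull a small slice of the studies table into a Polars DataFrame
# (requires the connectorx backend used by read_database_uri).
studies = pl.read_database_uri(
    query="SELECT nct_id, brief_title, overall_status FROM studies LIMIT 100",
    uri=uri,
)
print(studies.head())
```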
**ChEMBL Drug Indication Data**

A JSON file storing indications for approved drugs and clinical candidates, drawn from a variety of sources (e.g., FDA, EMA, WHO ATC, ClinicalTrials.gov, INN, USAN).
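As a rough illustration, the file can be loaded and trimmed to a drug–disease table with Polars. The field names below follow ChEMBL's public drug_indication schema, the file path is a placeholder, and the exact file layout may differ depending on how the export was produced:

```python
import polars as pl

# Hypothetical path; point this at the downloaded ChEMBL indication file.
indications = pl.read_json("chembl_drug_indications.json")

# Keep the core drug–disease columns (names per ChEMBL's drug_indication schema).
drug_disease = indications.select(
    "molecule_chembl_id", "efo_id", "efo_term", "max_phase_for_ind"
)
```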
**ChEMBL Clinical Trials Pipeline**

The private DRUGBASE_CURATION database in ChEMBL stores metadata related to clinical trials. After processing data from ClinicalTrials.gov, ChEMBL's internal pipeline automatically assigns an EFO ID to each condition and a ChEMBL ID to each intervention mentioned in the trials. We use this database to map the conditions and interventions in the AACT database to ChEMBL and EFO IDs; accessing it requires a valid ChEMBL user account.
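Conceptually, the mapping amounts to a join between AACT records and the curated lookup. The sketch below uses toy data and hypothetical column names, since the real mapping lives in the private DRUGBASE_CURATION database:

```python
import polars as pl

# Toy stand-ins for AACT interventions and the ChEMBL lookup table.
interventions = pl.DataFrame(
    {"nct_id": ["NCT00000102"], "intervention_name": ["aspirin"]}
)
chembl_map = pl.DataFrame(
    {"intervention_name": ["aspirin"], "chembl_id": ["CHEMBL25"]}
)

# Annotate each trial intervention with its ChEMBL ID.
annotated = interventions.join(chembl_map, on="intervention_name", how="left")
```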
**PMDA Database**

The PMDA (Pharmaceuticals and Medical Devices Agency) is the Japanese regulatory agency. We extract drug/disease associations from its PDF list of approved products, which can be downloaded from: https://www.pmda.go.jp/english/review-services/reviews/approved-information/drugs/0002.html
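One possible approach to the extraction, sketched with pdfplumber (not necessarily what this project uses; the file name and column layout are assumptions):

```python
import pdfplumber
import polars as pl

rows = []
with pdfplumber.open("pmda_approved_products.pdf") as pdf:  # hypothetical file name
    for page in pdf.pages:
        for table in page.extract_tables():
            rows.extend(table)

# Column names are assumptions; adjust to the actual PDF table layout.
df = pl.DataFrame(
    rows,
    schema=["brand_name", "active_ingredient", "indication"],
    orient="row",
)
```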
- **Database Credentials**: Open `src/clinical_mining/config.yaml` and fill in your AACT database credentials under the `db_properties` section.
- **File Paths**: Ensure the paths under the `datasets` section point to the correct locations for your input data files (an illustrative excerpt follows this list).
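An illustrative excerpt of what these two sections might look like; key names under `db_properties` and `datasets` beyond those mentioned above are assumptions:

```yaml
db_properties:
  host: aact-db.ctti-clinicaltrials.org
  port: 5432
  database: aact
  user: <your_user>
  password: <your_password>

datasets:
  chembl_indications: data/chembl_drug_indications.json  # hypothetical path
  pmda_approved_products: data/pmda_approved_products.pdf  # hypothetical path
```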
Execute the main script from the root directory of the project:
```bash
uv run clinical_mining
```

You can override configuration parameters from the command line if needed:

```bash
uv run clinical_mining db_properties.user=<your_user> db_properties.password=<your_password>
```

This pipeline is designed to be config-driven, allowing you to rearrange the steps to produce different outputs without changing the core Python code. The entire workflow is defined in `src/clinical_mining/config.yaml`.
The pipeline is structured as a Directed Acyclic Graph (DAG) with three main stages, executed in order:
- `setup`: Steps that prepare data needed by later processes, such as loading mapping tables or performing initial filtering on large datasets.
- `union`: Steps that generate the primary drug/indication DataFrames. The outputs of all steps in this section are combined into a single DataFrame.
- `process`: Final transformation steps that run sequentially on the unified DataFrame.
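A hypothetical skeleton of the `pipeline` section illustrating the three stages; the step names and function paths are invented for illustration and are not the project's actual steps:

```yaml
pipeline:
  setup:
    - name: load_studies
      function: clinical_mining.io.load_table  # hypothetical function path
      parameters:
        table: studies
  union:
    - name: aact_drug_indications
      function: clinical_mining.extract.drug_indications  # hypothetical
      parameters:
        studies: $load_studies
  process:
    - name: final_df
      function: clinical_mining.transform.deduplicate  # hypothetical
```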
The final, annotated dataset (named `final_df` in the config) is saved in Parquet format for further analysis.
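The output can then be loaded for downstream analysis, for example (the path below is an assumption; check your configuration for the actual output location):

```python
import polars as pl

# Placeholder path; use the location configured for final_df.
final_df = pl.read_parquet("output/final_df.parquet")
```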
Each step within a stage is a dictionary with the following keys:
- `name`: A unique name for the step. The output of this step is stored and can be referenced by this name in later steps.
- `function`: The full Python path to the function you want to execute.
- `parameters`: A dictionary of arguments to pass to the function.
  - Reference other DataFrames: To pass the output of a previous step or an initial input source as a parameter, use a `$` prefix.
  - Literal values: Any value without a `$` prefix is treated as a literal.
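For example, a single step might look like this (the function path and parameter names are hypothetical):

```yaml
- name: filter_late_phase
  function: clinical_mining.transform.filter_by_phase  # hypothetical
  parameters:
    studies: $load_studies  # "$" references the output of the load_studies step
    min_phase: 3            # no "$" prefix, so passed as a literal
```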
The pipeline can be customised by editing the `pipeline` section of `config.yaml`. You can reorder steps, add new steps that call existing functions, or change parameters.
If you need to add a new data transformation or a new input source, please submit a pull request with the new functionality. Once merged, your function can be integrated into the pipeline via the configuration file.