OpenADMET · mariacm12 · Mar 25, 2026 · Mar 19, 2026 · Mar 19, 2026 · Mar 20, 2026
diff --git a/.github/workflows/NB_CI.yaml b/.github/workflows/NB_CI.yaml
@@ -0,0 +1,52 @@
+name: CI
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+    branches:
+      - main
+  schedule:
+    - cron: '30 5 * * 1'
+  workflow_dispatch:
+
+jobs:
+  test:
+    name: Test on ${{ matrix.os }}, Python ${{ matrix.python-version }}
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os: [macOS-latest, ubuntu-latest]
+        python-version: ["3.12"]
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Additional info about the build
+        shell: bash
+        run: |
+          uname -a
+          df -h
+          ulimit -a
+
+      - uses: mamba-org/setup-micromamba@v1
+        with:
+          environment-file: environment.yaml
+          environment-name: oadmet_pxr_tutorial
+          condarc: |
+            channels:
+              - conda-forge
+          create-args: >-
+            python=${{ matrix.python-version }}
+
+      - name: Install notebook test dependencies
+        shell: bash -l {0}
+        run: |
+          python -m pip install nbmake pytest-xdist
+
+      - name: Run notebook tests
+        shell: bash -l {0}
+        run: |
+          pytest -n=auto --nbmake --nbmake-timeout=1200 --maxfail=0 --disable-warnings notebooks/
diff --git a/README.md b/README.md
@@ -1,2 +1,87 @@
 # PXR-Challenge-Tutorial
-Tutorial for the OpenADMET-PXR blind challenge
+[![Logo](https://img.shields.io/badge/OSMF-OpenADMET-%23002f4a)](https://openadmet.org/)
+
+This repo provides a guide and example workflows for participating in the [**OpenADMET - PXR Blind Challenge**](https://huggingface.co/spaces/openadmet/pxr-challenge).
+
+Following the success of our previous ExpansionRx challenge, this new community-driven initiative focuses on benchmarking models for predicting **PXR (Pregnane X Receptor) induction**. Evaluating PXR liabilities is a fundamental pillar of a late-stage ADMET cascade. PXR is a notoriously difficult "anti-target" to model due to its unusually large and flexible binding pocket, making it a perfect test case for evaluating model generalization in real-world drug discovery.
+
+We provide dedicated starter notebooks for each track to help you build your baseline models and format your submissions:
+
+* [**Activity Track Tutorial**](./notebooks/activity_prediction.ipynb): Guide for training regression models on pEC50 data.
+* [**Structure Track Tutorial**](./notebooks/structure_prediction.ipynb): Guide for structural modeling and docking-based approaches.
+
+We have a dedicated [Discord server](https://discord.gg/4mERqNsQh7) for Q&A, discussion, and support during the challenge. Our evaluation logic is also open source and available in this repo. We welcome feedback and community discussion on all aspects of the challenge!
+
+---
+
+## 📦 Dataset
+
+### The Activity Dataset 
+
+  Available on [Hugging Face](https://huggingface.co/datasets/openadmet/openadmet-expansionrx-challenge-train-data)  
+  Includes SMILES and ADMET measurements for a series of molecules.
+
+  At Octant, OpenADMET has generated a PXR induction dataset of more than 11,000 compounds using a low-cost, high-fidelity in-house assay. Compounds were sourced primarily from two Enamine libraries (Discovery Diversity 10 set and FDA Approved Drugs set) along with subsequent orders of follow-on compounds, and profiled through a rigorous multi-step assay flow reminiscent of an on-target drug discovery program.
+
+The dataset was built through the following stages:
+
+- Primary Screen: 11,362 diverse compounds screened at a single concentration.
+- Dose-Response: 4,779 compounds selected for an 8-concentration dose-response (with extensive counter-screening in a PXR-null cell line to evaluate specificity).
+- Refinement: 114 compounds showed EC50 ≤ 1 µM (pEC50 ≥ 6).
+- Counter-Screen: 63 compounds selected based on minimal activity in a PXR-null cell line to confirm on-target specificity.
+- Analog Expansion Set: Similarity searches (ECFP4 Tanimoto > 0.4) of these 63 actives yielded the 513-compound test set, ordered from the Enamine US on-demand catalog and fully assayed with dose-response curves.
+
+This design mimics a lead optimization scenario, shifting from broad hit-finding to detailed exploration of Structure-Activity Relationships (SAR). The analog set contains detailed SAR and activity cliffs that should prove challenging for models. Cumulatively, this represents the largest PXR activity dataset available in the literature.
+
+### The Structure Dataset
+
+PXR's large, flexible binding pocket is highly dynamic and capable of recognising ligands of vastly different sizes and shapes — a structural plasticity that represents a significant challenge for structure-based design methodologies.
+
+The structure dataset comprises 110 fragment-sized small molecules for which X-ray crystal structures have been determined at UCSF (Fraser Lab) but remain blinded until the challenge concludes. Fragments were drawn from the DSI-poised library, Enamine Essential fragments library, and an in-house UCSF library. Fragments were soaked into apo crystals in the P2₁2₁2₁ crystal form at a nominal concentration of 10 mM. Data were collected at NSLS-II using the AMX and FMX beamlines. Data were reduced using Autoproc; electron density maps were analysed for fragment binding events using PanDDA. Ligands were modelled in COOT using coordinates and restraints generated by phenix.elbow, and models were refined with phenix.refine.
+
+
+In addition, 68 structures from the PDB have been re-refined and will be released as part of the training data package.
+
+The test set is blinded: predictions must be submitted to the challenge platform. The blinded test data are also available on [Hugging Face](https://huggingface.co/datasets/openadmet/openadmet-expansionrx-challenge-test-data-blinded).
+
+---
+
+## 🧪 The Challenge Tracks
+Participants can compete in either or both tracks.
+
+1. Activity Prediction Track
+Participants predict pEC50 values for the 513-compound analog set. An extensive data package will be provided for the training set, including PXR pEC50 and Emax, null-line pEC50 and Emax, and supporting raw data.
+
+The track proceeds in two phases:
+
+**Phase 1:** Predict activity for all 513 compounds. Analog Set 1 will serve as a live leaderboard during this period.  
+**Phase 2:** EC50 values for Analog Set 1 are unblinded; participants then refine predictions for the remaining Analog Set 2. There is no live leaderboard for Phase 2, and predictions are fully blinded until the deadline.  
+The primary evaluation metric is RAE (Relative Absolute Error) on pEC50. Extensive secondary metrics (MAE, R², Spearman ρ, Kendall's τ) and error estimation via bootstrapping will also be reported.
+
+2. Structure Prediction Track
+Participants predict the bound protein-ligand complex for each of the 110 fragment ligands, given their SMILES strings. Participants may use protein structure prediction tools or existing PDB structures.
+
+A live leaderboard is maintained using half of the 110 structures; the remaining half are held out and only scored at the final deadline.
+
+The primary evaluation metric is LDDT-PLI (Local Distance Difference Test for Protein-Ligand Interactions). BiSyRMSD and LDDT-LP are also reported as secondary metrics.
+
+---
+
+## ✅ How to Participate
+
+1. **Register:** Create an account with Hugging Face.
+2. **Download the Data:** Training and test sets are released April 1.
+3. **Walk Through the Tutorial:** Work through the starter notebooks in this repo.
+4. **Join the Community:** Get support and coordinate in the #pxr-challenge channel on our [Discord](https://discord.gg/4mERqNsQh7).
+5. **Build and Refine:** Use the training data to build your models. In Phase 2, use the unblinded Analog Set 1 results to refine predictions for Analog Set 2.
+6. **Submit:** Submissions open April 1 via the Submit tab on the [challenge platform](https://huggingface.co/spaces/openadmet/pxr-challenge).
+
+---
+
+## 📅 Key Dates
+
+- 🗓 **Training/Test sets released**: April 1
+- 📊 **Phase 1 concludes and interim Activity leaderboard**: May 25
+- 👁️ **Analog Set 1 unblinded**: May 26
+- ⏳ **Submission Deadline for all tracks**: July 1  
+
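Reviewer note on the README's activity-track numbers: below is a minimal, standard-library-only sketch of the two quantities it relies on, the EC50-to-pEC50 conversion behind "EC50 ≤ 1 µM (pEC50 ≥ 6)" and RAE on pEC50. The function names are hypothetical. The RAE shown is the textbook definition (summed absolute error divided by the summed absolute error of always predicting the mean), which is an assumption here; the challenge's own evaluation code may differ in detail.

```python
import math
from typing import Sequence


def pec50_from_ec50_um(ec50_um: float) -> float:
    """Convert an EC50 given in micromolar to pEC50 = -log10(EC50 in molar)."""
    if ec50_um <= 0:
        raise ValueError("EC50 must be positive")
    # 1 uM = 1e-6 M, so -log10(ec50_um * 1e-6) = 6 - log10(ec50_um)
    return 6.0 - math.log10(ec50_um)


def relative_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Textbook RAE: sum|y - y_hat| / sum|y - mean(y)| (assumed definition)."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    abs_err = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    baseline = sum(abs(t - mean_true) for t in y_true)
    if baseline == 0:
        raise ValueError("RAE is undefined when all true values are identical")
    return abs_err / baseline


# Toy values for illustration only; these are not challenge data.
toy_true = [5.0, 6.0, 7.0]
toy_pred = [5.5, 6.0, 6.5]
toy_rae = relative_absolute_error(toy_true, toy_pred)
```

Under this definition, an RAE of 1.0 matches a model that always predicts the mean of the true values, and lower is better.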
diff --git a/environment.yaml b/environment.yaml
@@ -0,0 +1,21 @@
+name: oadmet_pxr_tutorial
+channels:
+  - conda-forge
+  - defaults
+dependencies:
+  - python
+  - rdkit
+  - pandas
+  - scikit-learn
+  - matplotlib
+  - seaborn
+  - jupyterlab
+  - numpy
+  - lightgbm
+  - tqdm
+  - fsspec
+  - datasets
+  - scipy
+  - biotite
+  - pip:
+      - git+https://github.com/PatWalters/useful_rdkit_utils.git@master
diff --git a/inputs/PXR_protein_sequence.fasta b/inputs/PXR_protein_sequence.fasta
@@ -0,0 +1,2 @@
+>PXR_chain_A
+GLTEEQRMMIRELMDAQMKTFDTTFSHFKNFRLPGVLSSGCELPESLQAPSREEAAKWSQVRKDLCSLKVSLQLRGEDGSVWNYKPPADSGGKEIFSLLPHMADMSTYMFKGIISFAKVISYFRDLPIEDQISLLKGAAFELCQLRFNTVFNAETGTWECGRLSYCLEDTAGGFQQLLLEPMLKFHYMLKKLQLHEEEYVLMQAISLFSPDRPGVLQHRVVDQLQEQFAITLKSYIECNRPQPAHRFLFLKIMAMLTELRSINAQHTQRLLRIQDIHPFATPLMQELFGITGS
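Reviewer note: for readers who want to load this FASTA file without extra dependencies, here is a minimal sketch of a standard-library reader. The function name `read_fasta` is hypothetical, and the toy record is only the first 20 residues of the sequence above, split across two lines to show that wrapped sequences are joined.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # join wrapped sequence lines
    return records


# Toy input: a truncated prefix of the PXR sequence, for illustration only.
example_lines = [">PXR_chain_A", "GLTEEQRMMI", "RELMDAQMKT"]
records = read_fasta(example_lines)
```

Because an open file handle is an iterable of lines, the same function can be applied to `inputs/PXR_protein_sequence.fasta` by passing the handle from `open(...)`.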
diff --git a/inputs/pxr_x01378-1.yaml b/inputs/pxr_x01378-1.yaml
@@ -0,0 +1,6 @@
+version: 1
+sequences:
+- protein: {id: A, sequence: GLTEEQRMMIRELMDAQMKTFDTTFSHFKNFRLPGVLSSGCELPESLQAPSREEAAKWSQVRKDLCSLKVSLQLRGEDGSVWNYKPPADSGGKEIFSLLPHMADMSTYMFKGIISFAKVISYFRDLPIEDQISLLKGAAFELCQLRFNTVFNAETGTWECGRLSYCLEDTAGGFQQLLLEPMLKFHYMLKKLQLHEEEYVLMQAISLFSPDRPGVLQHRVVDQLQEQFAITLKSYIECNRPQPAHRFLFLKIMAMLTELRSINAQHTQRLLRIQDIHPFATPLMQELFGITGS}
+- ligand: {id: B, smiles: CCC=1C=CC(=CC1)C=2N=C(N)SC2C}
+properties:
+- affinity: {binder: B}
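Reviewer note: the input file above pairs one protein chain with one ligand SMILES and requests an affinity property for the ligand. The layout looks like the input schema of a Boltz-style co-folding tool, though the diff does not name the tool, so treat that as an assumption. Below is a minimal sketch of generating the same layout for other ligands using plain string formatting. The function name `build_complex_yaml` is hypothetical. Unlike the committed file, the sketch single-quotes the SMILES, because characters such as `[`, `]`, `,` and `#` are legal in SMILES but have special meaning inside a YAML flow mapping.

```python
def build_complex_yaml(sequence: str, smiles: str) -> str:
    """Return YAML text in the same layout as inputs/pxr_x01378-1.yaml."""
    if not sequence.isalpha():
        raise ValueError("sequence should contain only one-letter residue codes")
    if "'" in smiles:
        raise ValueError("unexpected single quote in SMILES")
    lines = [
        "version: 1",
        "sequences:",
        f"- protein: {{id: A, sequence: {sequence}}}",
        # Quote the SMILES so brackets, commas and '#' stay inside the scalar.
        f"- ligand: {{id: B, smiles: '{smiles}'}}",
        "properties:",
        "- affinity: {binder: B}",
    ]
    return "\n".join(lines) + "\n"


# Toy call: a shortened sequence and the ligand SMILES from the file above.
example_yaml = build_complex_yaml("GLTEEQ", "CCC=1C=CC(=CC1)C=2N=C(N)SC2C")
```

A YAML parser reads the quoted and unquoted forms of a plain SMILES as the same string, so for this ligand the output is equivalent to the committed file apart from the shortened sequence.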