Merged
45 commits
0a37b89
feat: add MDposit dataset scraping script.
Jan 19, 2026
caf2865
feat(models): add MDPOSIT repository and MDDB project fields.
Jan 19, 2026
9147f32
feat(cli): add README command and scrape-mdposit entry point.
Jan 19, 2026
f809832
merge: sync main into update-mdposit-scraper
Jan 29, 2026
e1a4e9d
refactor(simulation-model): add molecule type field (protein, lipid, …
Jan 29, 2026
fb283e1
chore(ruff): disable PERF401 for model instance appends
Jan 29, 2026
064d94b
refactor(mdposit-scraper): update to scrape using both nodes of MDDB …
Jan 29, 2026
e150d24
docs: adding the mddb documentation + update the readme and command …
Feb 4, 2026
e3c5e38
feat: refactor the code and resolve AttributeError
Feb 4, 2026
cfe2622
merge: sync main into update-mdposit-scraper
Essmaw Feb 5, 2026
5b01789
feat: add URL computation for ExternalIdentifier based on database name
Essmaw Feb 5, 2026
5533d8b
Fix merging of new datasource names into DatasetSourceName instead of…
Essmaw Feb 5, 2026
9ebc838
feat: enhance molecule extraction to fit the new model and adding Un…
Essmaw Feb 5, 2026
96793e5
test(simulation): test URL computation for ExternalIdentifier
Essmaw Feb 5, 2026
f031e28
tests: refactor tests for ExternalIdentifier to account for automatic…
Essmaw Feb 5, 2026
6cb949d
refactor: rename number_of_molecules to number_of_this_molecule_type_…
Essmaw Feb 5, 2026
3871d22
refactor: rename number_of_this_molecule_type_in_system to number_of_…
Essmaw Feb 6, 2026
c9be76f
tests: refactor with `number_of_molecules` attribute and adding speci…
Essmaw Feb 6, 2026
542f54a
fixes(mddb scraper): correct spelling errors, improve parameter descr…
Essmaw Feb 6, 2026
21943fc
docs: correct spelling errors
Essmaw Feb 6, 2026
d826989
fix: Revert to 'software' field
pierrepo Feb 6, 2026
671008c
refactor: Reduce usage and scope of try/except blocks
pierrepo Feb 6, 2026
f987ea7
feat: Add default DatasetSourceName
pierrepo Feb 7, 2026
059d51f
feat: Coexerce verstion to str
pierrepo Feb 7, 2026
ebf4470
docs: Update MDDB documentation and examples
pierrepo Feb 7, 2026
63181fa
refactor: Remove more try/except
pierrepo Feb 7, 2026
7a5f580
refactor: Split log message
pierrepo Feb 7, 2026
d0324ee
fix: Fix error when forcefield metadata is undifiend
pierrepo Feb 7, 2026
8b57c76
fix: Handle case with no protein sequence nor Uniprot identifier
pierrepo Feb 7, 2026
024efa9
fix: Handle case when no software is available
pierrepo Feb 7, 2026
88b9955
feat: Add InChIKey field for Molecule model
pierrepo Feb 7, 2026
dd724a7
fix: Fix dataset_url_in_repository field
pierrepo Feb 7, 2026
9e0374f
docs: Print dataset URL in API
pierrepo Feb 7, 2026
6b959da
feat: Align uniprot identifiers with protein sequences
pierrepo Feb 7, 2026
e3a353c
feat: Add replicas logic in file metadata extraction
pierrepo Feb 7, 2026
7068584
feat: Add rules to avoid lengthy try / except blocks
pierrepo Feb 7, 2026
9cd0a88
fix: Add special case for 'inr' (INRIA) node name
pierrepo Feb 7, 2026
40ea3ca
feat: Add Cineca MDDB node
pierrepo Feb 8, 2026
a8ed77b
feat: Add another way to get protein name from Uniprot
pierrepo Feb 8, 2026
7884275
fix: Update logic to fetch protein name from Uniprot
pierrepo Feb 8, 2026
71f7c43
docs: Fix typos
pierrepo Feb 11, 2026
3d003b3
docs: Relax scraping time
pierrepo Feb 11, 2026
cf32a04
chore: Reallow PERF401 rules
pierrepo Feb 11, 2026
91595f1
docs: Remove MDDB node names
pierrepo Feb 11, 2026
6658973
refactor: Clean code
pierrepo Feb 11, 2026
3 changes: 2 additions & 1 deletion AGENTS.md
@@ -24,8 +24,9 @@ When writing code:

When writing functions, always:

- Add descriptive docstrings.
- Add descriptive docstrings
- Use early returns for error conditions
- Limit size of try / except blocks to the strict minimum

Never import libraries by yourself. Always ask before adding dependencies.

18 changes: 18 additions & 0 deletions README.md
@@ -170,6 +170,24 @@ This command will:
4. Validate entries using Pydantic models
5. Save the extracted metadata to Parquet files

## Scrape MDDB

See [MDDB](docs/mddb.md) to understand how we scrape metadata from MDDB.

Scrape MDDB to collect molecular dynamics (MD) datasets and files:

```bash
uv run scrape-mddb --output-dir data
```

This command will:

1. List all datasets and files through the main MDposit nodes.
2. Parse metadata and validate it using the Pydantic models
   `DatasetMetadata` and `FileMetadata`.
3. Save the validated dataset and file metadata.

The scraping process takes about 2 hours, depending on your network connection and hardware.

## Analyze Gromacs mdp and gro files

87 changes: 87 additions & 0 deletions docs/mddb.md
@@ -0,0 +1,87 @@
# MDDB

> The [MDDB (Molecular Dynamics Data Bank) project](https://mddbr.eu/about/) is an initiative to collect, preserve, and share molecular dynamics (MD) simulation data. As part of this project, **MDposit** is an open platform that provides web access to atomistic MD simulations. Its goal is to facilitate and promote data sharing within the global scientific community to advance research.

The MDposit infrastructure is distributed across several MDposit nodes. All metadata are accessible through the global node:

MDposit MMB node:

- web site: <https://mdposit.mddbr.eu/>
- documentation: <https://mdposit.mddbr.eu/#/help>
- API: <https://mdposit.mddbr.eu/api/rest/docs/>
- API base URL: <https://mdposit.mddbr.eu/api/rest/v1>

No account / token is needed to access the MDposit API.

## Getting metadata

### Datasets

In MDposit, a dataset (a simulation and its related files) is called a "[project](https://mdposit.mddbr.eu/api/rest/docs/#/projects/get_projects_summary)".

API endpoint to get the total number of projects:

- Endpoint: `/projects/summary`
- HTTP method: GET
- [documentation](https://mdposit.mddbr.eu/api/rest/docs/#/projects/get_projects_summary)

A project can contain multiple replicas, each identified by `project_id`.`replica_id`.

For example, the project [MD-A003ZP](https://mdposit.mddbr.eu/#/id/MD-A003ZP/overview) contains ten replicas:

- `MD-A003ZP.1`: https://mdposit.mddbr.eu/#/id/MD-A003ZP.1/overview
- `MD-A003ZP.2`: https://mdposit.mddbr.eu/#/id/MD-A003ZP.2/overview
- `MD-A003ZP.3`: https://mdposit.mddbr.eu/#/id/MD-A003ZP.3/overview
- ...

API endpoint to get all datasets at once:

- Endpoint: `/projects`
- HTTP method: GET
- [documentation](https://mdposit.mddbr.eu/api/rest/docs/#/projects/get_projects)

### Files

API endpoint to get files for a given replica of a project:

- Endpoint: `/projects/{project_id.replica_id}/filenotes`
- HTTP method: GET
- [documentation](https://mdposit.mddbr.eu/api/rest/docs/#/filenotes/get_projects__projectAccessionOrID__filenotes)
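
The endpoints described in this section can be assembled from the API base URL given earlier. The sketch below only builds URLs and does not perform any request; the function names are hypothetical, and query parameters (pagination, for instance) are deliberately left out because they are not described here.

```python
# API base URL as listed at the top of this page.
API_BASE_URL = "https://mdposit.mddbr.eu/api/rest/v1"


def projects_summary_url(base_url: str = API_BASE_URL) -> str:
    """Return the URL of the endpoint giving the total number of projects."""
    return f"{base_url}/projects/summary"


def projects_url(base_url: str = API_BASE_URL) -> str:
    """Return the URL of the endpoint listing all projects (datasets)."""
    return f"{base_url}/projects"


def filenotes_url(
    project_id: str, replica_id: int, base_url: str = API_BASE_URL
) -> str:
    """Return the URL of the endpoint listing files for one replica."""
    return f"{base_url}/projects/{project_id}.{replica_id}/filenotes"
```

For example, `filenotes_url("MD-A003ZP", 1)` gives the file listing URL for the first replica of project `MD-A003ZP`, which can then be fetched with any HTTP client.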

## Examples

### Project `MD-A003ZP`

Title:

> MDBind 3x1k

Description:

> 10 ns simulation of 1ma4m pdb structure from MDBind dataset, a dynamic view of the PDBBind database

- [project on MDposit GUI](https://mdposit.mddbr.eu/#/id/MD-A003ZP/overview)
- [project on MDposit API](https://mdposit.mddbr.eu/api/rest/current/projects/MD-A003ZP)

Files for replica 1:

- [files on MDposit GUI](https://mdposit.mddbr.eu/#/id/MD-A003ZP.1/files)
- [files on MDposit API](https://mdposit.mddbr.eu/api/rest/current/projects/MD-A003ZP.1/filenotes)

### Project `MD-A001T1`

Title:

> All-atom molecular dynamics simulations of SARS-CoV-2 envelope protein E in the monomeric form, C4 popc

Description:

> The trajectories of all-atom MD simulations were obtained based on 4 starting representative conformations from the CG simulation. For each starting structure, there are six trajectories of the E protein: 3 with the protein embedded in the membrane containing POPC, and 3 with the membrane mimicking the natural ERGIC membrane (Mix: 50% POPC, 25% POPE, 10% POPI, 5% POPS, 10% cholesterol).

- [project on MDposit GUI](https://mdposit.mddbr.eu/#/id/MD-A001T1/overview)
- [project on MDposit API](https://mdposit.mddbr.eu/api/rest/current/projects/MD-A001T1)

Files for replica 1:

- [files on MDposit GUI](https://mdposit.mddbr.eu/#/id/MD-A001T1.1/files)
- [files on MDposit API](https://mdposit.mddbr.eu/api/rest/current/projects/MD-A001T1.1/filenotes)
1 change: 1 addition & 0 deletions pyproject.toml
@@ -73,3 +73,4 @@ scrape-figshare = "mdverse_scrapers.scrapers.figshare:main"
scrape-nomad = "mdverse_scrapers.scrapers.nomad:main"
scrape-atlas = "mdverse_scrapers.scrapers.atlas:main"
scrape-gpcrmd = "mdverse_scrapers.scrapers.gpcrmd:main"
scrape-mddb = "mdverse_scrapers.scrapers.mddb:main"
2 changes: 1 addition & 1 deletion src/mdverse_scrapers/models/dataset.py
@@ -170,7 +170,7 @@ def format_dates(cls, value: datetime | str | None) -> str | None:

        Parameters
        ----------
        cls : type[BaseDataset]
        cls : type[DatasetMetadata]
            The Pydantic model class being validated.
        value : datetime | str | None
            The input value of the 'date' field to validate.
16 changes: 16 additions & 0 deletions src/mdverse_scrapers/models/enums.py
@@ -20,10 +20,26 @@ class DatasetSourceName(StrEnum):
    ATLAS = "atlas"
    GPCRMD = "gpcrmd"
    NMRLIPIDS = "nmrlipids"
    MDDB = "mddb"
    MDPOSIT_INRIA_NODE = "mdposit_inria_node"
    MDPOSIT_MMB_NODE = "mdposit_mmb_node"
    MDPOSIT_CINECA_NODE = "mdposit_cineca_node"


class ExternalDatabaseName(StrEnum):
    """External database names."""

    PDB = "pdb"
    UNIPROT = "uniprot"


class MoleculeType(StrEnum):
    """Common molecular types found in molecular dynamics simulations."""

    PROTEIN = "protein"
    NUCLEIC_ACID = "nucleic_acid"
    ION = "ion"
    LIPID = "lipid"
    CARBOHYDRATE = "carbohydrate"
    SOLVENT = "solvent"
    SMALL_MOLECULE = "small_molecule"
62 changes: 51 additions & 11 deletions src/mdverse_scrapers/models/simulation.py
@@ -3,9 +3,16 @@
import re
from typing import Annotated

from pydantic import BaseModel, ConfigDict, Field, StringConstraints, field_validator
from pydantic import (
    BaseModel,
    ConfigDict,
    Field,
    StringConstraints,
    field_validator,
    model_validator,
)

from .enums import ExternalDatabaseName
from .enums import ExternalDatabaseName, MoleculeType

DOI = Annotated[
str,
@@ -37,6 +44,30 @@ class ExternalIdentifier(BaseModel):
        None, min_length=1, description="Direct URL to the identifier in the database"
    )

    @model_validator(mode="after")
    def compute_url(self) -> "ExternalIdentifier":
        """Compute the URL for the external identifier.

        Parameters
        ----------
        self: ExternalIdentifier
            The model instance being validated, with all fields already validated.

        Returns
        -------
        ExternalIdentifier
            The model instance with the URL field computed if it was not provided.
        """
        if self.url is not None:
            return self

        if self.database_name == ExternalDatabaseName.PDB:
            self.url = f"https://www.rcsb.org/structure/{self.identifier}"
        elif self.database_name == ExternalDatabaseName.UNIPROT:
            self.url = f"https://www.uniprot.org/uniprotkb/{self.identifier}"

        return self


class Molecule(BaseModel):
"""Molecule in a simulation."""
@@ -45,18 +76,25 @@ class Molecule(BaseModel):
    model_config = ConfigDict(extra="forbid")

    name: str = Field(..., description="Name of the molecule.")
    type: MoleculeType | None = Field(
        None,
        description="Type of the molecule."
        "Allowed values in the MoleculeType enum. "
Copilot AI Feb 6, 2026

Missing space in description: Line 81 starts with 'Type of the molecule.' but line 82 continues with 'Allowed values' without a space between sentences. There should be a space at the beginning of line 82: ' Allowed values in the MoleculeType enum. '

Suggested change
"Allowed values in the MoleculeType enum. "
" Allowed values in the MoleculeType enum. "

        "Examples: PROTEIN, ION, LIPID...",
    )
    number_of_molecules: int | None = Field(
        None,
        ge=0,
        description="Number of molecules of this type in the simulation.",
    )
    number_of_atoms: int | None = Field(
        None, ge=0, description="Number of atoms in the molecule."
    )
    formula: str | None = Field(None, description="Chemical formula of the molecule.")
    sequence: str | None = Field(
        None, description="Sequence of the molecule for protein and nucleic acid."
    )
    number_of_molecules: int | None = Field(
        None,
        ge=0,
        description="Number of molecules of this type in the simulation.",
    )
    inchikey: str | None = Field(None, description="InChIKey of the molecule.")
    external_identifiers: list[ExternalIdentifier] | None = Field(
        None,
        description=("List of external database identifiers for this molecule."),
@@ -66,8 +104,9 @@ class Molecule(BaseModel):
class ForceFieldModel(BaseModel):
    """Forcefield or Model used in a simulation."""

    # Ensure scraped metadata matches the expected schema exactly.
    model_config = ConfigDict(extra="forbid")
    # Ensure scraped metadata matches the expected schema exactly
    # and version is coerced to string when needed.
    model_config = ConfigDict(extra="forbid", coerce_numbers_to_str=True)

    name: str = Field(
        ...,
@@ -81,8 +120,9 @@ class ForceFieldModel(BaseModel):
class Software(BaseModel):
    """Simulation software or tool used in a simulation."""

    # Ensure scraped metadata matches the expected schema exactly.
    model_config = ConfigDict(extra="forbid")
    # Ensure scraped metadata matches the expected schema exactly
    # and version is coerced to string when needed.
    model_config = ConfigDict(extra="forbid", coerce_numbers_to_str=True)

    name: str = Field(
        ...,
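
The effect of `coerce_numbers_to_str=True` in the two hunks above can be seen on a reduced model. This is a sketch under assumptions: the `version: str` field type is assumed (the field definition is not visible in this diff), and the class names `StrictSoftware` and `CoercedSoftware` are hypothetical.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class StrictSoftware(BaseModel):
    """Without coercion, a numeric version is rejected for a str field."""

    model_config = ConfigDict(extra="forbid")

    name: str
    version: str  # assumed field type


class CoercedSoftware(BaseModel):
    """With coercion, numeric versions are converted to strings."""

    model_config = ConfigDict(extra="forbid", coerce_numbers_to_str=True)

    name: str
    version: str  # assumed field type


coerced = CoercedSoftware(name="GROMACS", version=2023)

try:
    StrictSoftware(name="GROMACS", version=2023)
    strict_rejected = False
except ValidationError:
    strict_rejected = True
```

This is useful when a source API returns versions sometimes as numbers and sometimes as strings, since both forms then validate against the same `str` field.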