MDverse · pierrepo · Feb 11, 2026 · Jan 19, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -24,8 +24,9 @@ When writing code:
 
 When writing functions, always:
 
-- Add descriptive docstrings.
+- Add descriptive docstrings
 - Use early returns for error conditions
+- Limit size of try / except blocks to the strict minimum
 
 Never import libraries by yourself. Always ask before adding dependencies.
 

diff --git a/README.md b/README.md
@@ -170,6 +170,24 @@ This command will:
 4. Validate entries using Pydantic models
 5. Save the extracted metadata to Parquet files
 
+## Scrape MDDB
+
+See [MDDB](docs/mddb.md) to understand how we scrape metadata from MDDB.
+
+Scrape MDDB to collect molecular dynamics (MD) datasets and files:
+
+```bash
+uv run scrape-mddb --output-dir data
+```
+
+This command will:
+
+1. List all datasets and files through the main MDposit nodes.
+2. Parse metadata and validate them using the Pydantic models
+   `DatasetMetadata` and `FileMetadata`.
+3. Save the validated dataset and file metadata.
+
+The scraping process takes about 2 hours, depending on your network connection and hardware.
 
 ## Analyze Gromacs mdp and gro files
 

diff --git a/docs/mddb.md b/docs/mddb.md
@@ -0,0 +1,87 @@
+# MDDB
+
+> The [MDDB (Molecular Dynamics Data Bank) project](https://mddbr.eu/about/) is an initiative to collect, preserve, and share molecular dynamics (MD) simulation data. As part of this project, **MDposit** is an open platform that provides web access to atomistic MD simulations. Its goal is to facilitate and promote data sharing within the global scientific community to advance research.
+
+The MDposit infrastructure is distributed across several MDposit nodes. All metadata are accessible through the global node:
+
+MDposit MMB node:
+
+- web site: <https://mdposit.mddbr.eu/>
+- documentation: <https://mdposit.mddbr.eu/#/help>
+- API: <https://mdposit.mddbr.eu/api/rest/docs/>
+- API base URL: <https://mdposit.mddbr.eu/api/rest/v1>
+
+No account or token is needed to access the MDposit API.
+
+## Getting metadata
+
+### Datasets
+
+In MDposit, a dataset (a simulation and its related files) is called a "[project](https://mdposit.mddbr.eu/api/rest/docs/#/projects/get_projects_summary)".
+
+API endpoint to get the total number of projects:
+
+- Endpoint: `/projects/summary`
+- HTTP method: GET
+- [documentation](https://mdposit.mddbr.eu/api/rest/docs/#/projects/get_projects_summary)
+
+A project can contain multiple replicas, each identified by `project_id`.`replica_id`.
+
+For example, the project [MD-A003ZP](https://mdposit.mddbr.eu/#/id/MD-A003ZP/overview) contains ten replicas:
+
+- `MD-A003ZP.1`: https://mdposit.mddbr.eu/#/id/MD-A003ZP.1/overview
+- `MD-A003ZP.2`: https://mdposit.mddbr.eu/#/id/MD-A003ZP.2/overview
+- `MD-A003ZP.3`: https://mdposit.mddbr.eu/#/id/MD-A003ZP.3/overview
+- ...
+
+API endpoint to get all datasets at once:
+
+- Endpoint: `/projects`
+- HTTP method: GET
+- [documentation](https://mdposit.mddbr.eu/api/rest/docs/#/projects/get_projects)
+
+### Files
+
+API endpoint to get files for a given replica of a project:
+
+- Endpoint: `/projects/{project_id.replica_id}/filenotes`
+- HTTP method: GET
+- [documentation](https://mdposit.mddbr.eu/api/rest/docs/#/filenotes/get_projects__projectAccessionOrID__filenotes)
+
+## Examples
+
+### Project `MD-A003ZP`
+
+Title:
+
+> MDBind 3x1k
+
+Description:
+
+> 10 ns simulation of 1ma4m pdb structure from MDBind dataset, a dynamic view of the PDBBind database
+
+- [project on MDposit GUI](https://mdposit.mddbr.eu/#/id/MD-A003ZP/overview)
+- [project on MDposit API](https://mdposit.mddbr.eu/api/rest/current/projects/MD-A003ZP)
+
+Files for replica 1:
+
+- [files on MDposit GUI](https://mdposit.mddbr.eu/#/id/MD-A003ZP.1/files)
+- [files on MDposit API](https://mdposit.mddbr.eu/api/rest/current/projects/MD-A003ZP.1/filenotes)
+
+### Project `MD-A001T1`
+
+Title:
+
+> All-atom molecular dynamics simulations of SARS-CoV-2 envelope protein E in the monomeric form, C4 popc
+
+Description:
+
+> The trajectories of all-atom MD simulations were obtained based on 4 starting representative conformations from the CG simulation. For each starting structure, there are six trajectories of the E protein: 3 with the protein embedded in the membrane containing POPC, and 3 with the membrane mimicking the natural ERGIC membrane (Mix: 50% POPC, 25% POPE, 10% POPI, 5% POPS, 10% cholesterol).
+
+- [project on MDposit GUI](https://mdposit.mddbr.eu/#/id/MD-A001T1/overview)
+- [project on MDposit API](https://mdposit.mddbr.eu/api/rest/current/projects/MD-A001T1)
+
+Files for replica 1:
+
+- [files on MDposit GUI](https://mdposit.mddbr.eu/#/id/MD-A001T1.1/files)
+- [files on MDposit API](https://mdposit.mddbr.eu/api/rest/current/projects/MD-A001T1.1/filenotes)
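The endpoints documented in `docs/mddb.md` can be assembled into request URLs as in the sketch below. It assumes only the base URL and the three paths listed in that file. The helper names (`projects_summary_url`, `projects_url`, `filenotes_url`, `fetch_json`) are hypothetical and are not taken from the scraper. The structure of the JSON responses is not described in this diff, so the sketch returns the decoded payload untouched.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URL as documented in docs/mddb.md.
BASE_URL = "https://mdposit.mddbr.eu/api/rest/v1"


def projects_summary_url() -> str:
    """URL of the endpoint reporting the total number of projects."""
    return f"{BASE_URL}/projects/summary"


def projects_url() -> str:
    """URL of the endpoint listing datasets (projects)."""
    return f"{BASE_URL}/projects"


def filenotes_url(project_id: str, replica_id: int) -> str:
    """URL of the endpoint listing files for one replica of a project."""
    # A replica is addressed as "<project_id>.<replica_id>".
    accession = quote(f"{project_id}.{replica_id}", safe="")
    return f"{BASE_URL}/projects/{accession}/filenotes"


def fetch_json(url: str, timeout: float = 30.0):
    """GET a URL and decode the JSON body (no account or token needed)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_json(filenotes_url("MD-A003ZP", 1))` would request the file list of the first replica of the project used as an example in the documentation.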
diff --git a/pyproject.toml b/pyproject.toml
@@ -73,3 +73,4 @@ scrape-figshare = "mdverse_scrapers.scrapers.figshare:main"
 scrape-nomad = "mdverse_scrapers.scrapers.nomad:main"
 scrape-atlas = "mdverse_scrapers.scrapers.atlas:main"
 scrape-gpcrmd = "mdverse_scrapers.scrapers.gpcrmd:main"
+scrape-mddb = "mdverse_scrapers.scrapers.mddb:main"
diff --git a/src/mdverse_scrapers/models/dataset.py b/src/mdverse_scrapers/models/dataset.py
@@ -170,7 +170,7 @@ def format_dates(cls, value: datetime | str | None) -> str | None:
 
         Parameters
         ----------
-        cls : type[BaseDataset]
+        cls : type[DatasetMetadata]
             The Pydantic model class being validated.
         value : datetime | str | None
             The input value of the 'date' field to validate.

diff --git a/src/mdverse_scrapers/models/enums.py b/src/mdverse_scrapers/models/enums.py
@@ -20,10 +20,26 @@ class DatasetSourceName(StrEnum):
     ATLAS = "atlas"
     GPCRMD = "gpcrmd"
     NMRLIPIDS = "nmrlipids"
+    MDDB = "mddb"
+    MDPOSIT_INRIA_NODE = "mdposit_inria_node"
+    MDPOSIT_MMB_NODE = "mdposit_mmb_node"
+    MDPOSIT_CINECA_NODE = "mdposit_cineca_node"
 
 
 class ExternalDatabaseName(StrEnum):
     """External database names."""
 
     PDB = "pdb"
     UNIPROT = "uniprot"
+
+
+class MoleculeType(StrEnum):
+    """Common molecular types found in molecular dynamics simulations."""
+
+    PROTEIN = "protein"
+    NUCLEIC_ACID = "nucleic_acid"
+    ION = "ion"
+    LIPID = "lipid"
+    CARBOHYDRATE = "carbohydrate"
+    SOLVENT = "solvent"
+    SMALL_MOLECULE = "small_molecule"
diff --git a/src/mdverse_scrapers/models/simulation.py b/src/mdverse_scrapers/models/simulation.py
@@ -3,9 +3,16 @@
 import re
 from typing import Annotated
 
-from pydantic import BaseModel, ConfigDict, Field, StringConstraints, field_validator
+from pydantic import (
+    BaseModel,
+    ConfigDict,
+    Field,
+    StringConstraints,
+    field_validator,
+    model_validator,
+)
 
-from .enums import ExternalDatabaseName
+from .enums import ExternalDatabaseName, MoleculeType
 
 DOI = Annotated[
     str,
@@ -37,6 +44,30 @@ class ExternalIdentifier(BaseModel):
         None, min_length=1, description="Direct URL to the identifier into the database"
     )
 
+    @model_validator(mode="after")
+    def compute_url(self) -> "ExternalIdentifier":
+        """Compute the URL for the external identifier.
+
+        Parameters
+        ----------
+        self : ExternalIdentifier
+            The model instance being validated, with all fields already validated.
+
+        Returns
+        -------
+        ExternalIdentifier
+            The model instance with the URL field computed if it was not provided.
+        """
+        if self.url is not None:
+            return self
+
+        if self.database_name == ExternalDatabaseName.PDB:
+            self.url = f"https://www.rcsb.org/structure/{self.identifier}"
+        elif self.database_name == ExternalDatabaseName.UNIPROT:
+            self.url = f"https://www.uniprot.org/uniprotkb/{self.identifier}"
+
+        return self
+
 
 class Molecule(BaseModel):
     """Molecule in a simulation."""
@@ -45,18 +76,25 @@ class Molecule(BaseModel):
     model_config = ConfigDict(extra="forbid")
 
     name: str = Field(..., description="Name of the molecule.")
+    type: MoleculeType | None = Field(
+        None,
+        description="Type of the molecule. "
+        "Allowed values in the MoleculeType enum. "
+        "Examples: PROTEIN, ION, LIPID...",
+    )
+    number_of_molecules: int | None = Field(
+        None,
+        ge=0,
+        description="Number of molecules of this type in the simulation.",
+    )
     number_of_atoms: int | None = Field(
         None, ge=0, description="Number of atoms in the molecule."
     )
     formula: str | None = Field(None, description="Chemical formula of the molecule.")
     sequence: str | None = Field(
         None, description="Sequence of the molecule for protein and nucleic acid."
     )
-    number_of_molecules: int | None = Field(
-        None,
-        ge=0,
-        description="Number of molecules of this type in the simulation.",
-    )
+    inchikey: str | None = Field(None, description="InChIKey of the molecule.")
     external_identifiers: list[ExternalIdentifier] | None = Field(
         None,
         description=("List of external database identifiers for this molecule."),
@@ -66,8 +104,9 @@ class Molecule(BaseModel):
 class ForceFieldModel(BaseModel):
     """Forcefield or Model used in a simulation."""
 
-    # Ensure scraped metadata matches the expected schema exactly.
-    model_config = ConfigDict(extra="forbid")
+    # Ensure scraped metadata matches the expected schema exactly
+    # and version is coerced to string when needed.
+    model_config = ConfigDict(extra="forbid", coerce_numbers_to_str=True)
 
     name: str = Field(
         ...,
@@ -81,8 +120,9 @@ class ForceFieldModel(BaseModel):
 class Software(BaseModel):
     """Simulation software or tool used in a simulation."""
 
-    # Ensure scraped metadata matches the expected schema exactly.
-    model_config = ConfigDict(extra="forbid")
+    # Ensure scraped metadata matches the expected schema exactly
+    # and version is coerced to string when needed.
+    model_config = ConfigDict(extra="forbid", coerce_numbers_to_str=True)
 
     name: str = Field(
         ...,