Conversation
Pull request overview
Adds support for scraping MDposit (MDDB) datasets/files and integrates this new source into the existing MDverse scraping CLI and data model.
Changes:
- Introduces a new MDposit scraper to collect dataset and file metadata from two MDDB nodes.
- Extends the data model with new enums (MDDB/MDposit sources) and adds Molecule.type for molecule classification.
- Registers a new CLI entry point and documents how to run the scraper.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/mdverse_scrapers/scrapers/mdposit.py | New scraper implementation for MDposit datasets + file metadata extraction and CLI. |
| src/mdverse_scrapers/models/simulation.py | Adds optional type field to Molecule for molecule classification. |
| src/mdverse_scrapers/models/enums.py | Adds MDDB/MDposit source names and introduces MoleculeType. |
| src/mdverse_scrapers/models/dataset.py | Docstring correction for validator cls type annotation. |
| ruff.toml | Adds a Ruff ignore for PERF401. |
| pyproject.toml | Registers scrape-mdposit console script entry point. |
| README.md | Documents how to run the new MDposit scraper. |
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 15 comments.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
```python
MDDB_REPOSITORIES = {
    DatasetSourceName.MDPOSIT_MMB_NODE: "https://mmb-dev.mddbr.eu/api/rest/v1",
    DatasetSourceName.MDPOSIT_INRIA_NODE: "https://inria.mddbr.eu/api/rest/v1",
}
```
The configured MMB node base URL uses https://mmb-dev.mddbr.eu/api/rest/v1, but the newly added documentation (docs/mddb.md) states the API base URL is https://mmb.mddbr.eu/api/rest/v1. This inconsistency is likely to break scraping against the intended production endpoint; align the code and docs on the correct base URL(s).
```python
version = dataset_metadata.get("VERSION")
if not name:
    return None
return [Software(name=name, version=str(version))]
```
When VERSION is missing, str(version) becomes the literal string 'None', which will be stored in Software.version. Prefer leaving version=None when it's absent, and only cast to str when the value is not None.
Suggested change:

```diff
-return [Software(name=name, version=str(version))]
+return [
+    Software(name=name, version=str(version) if version is not None else None)
+]
```
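As a minimal sketch of the pitfall the comment describes (the helper name is hypothetical, not part of the codebase):

```python
def normalize_version(version):
    """Cast a raw version to str, preserving None when the field is absent."""
    return str(version) if version is not None else None

# str(None) produces the literal string "None", which is what the review flags:
assert str(None) == "None"
assert normalize_version(None) is None
assert normalize_version(2023.2) == "2023.2"
```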
```python
return "Unknow protein"
# Defaut value for protein name:
```
Spelling/wording issues in the return value and comments: "Unknow protein" should be "Unknown protein", and # Defaut value should be # Default value (or similar).
Suggested change:

```diff
-return "Unknow protein"
-# Defaut value for protein name:
+return "Unknown protein"
+# Default value for protein name:
```
```
Returns
-------
str
    Protein full name if available, None otherwise.
```
The docstring says the function returns a protein name "if available, None otherwise", but the implementation always returns a string (either a fetched name or a default). Update the docstring to match the actual behavior (or change the implementation to return None on failure if that's desired).
Suggested change:

```diff
-    Protein full name if available, None otherwise.
+    Protein full name if available, or a fallback descriptive name otherwise.
```
docs/mddb.md (Outdated)

```markdown
- Project id: `A025U.1`
- [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/overview)
- [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2)

Remark: no description is provided for this dataset.

- [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/files)
```
In the A025U example, the text says the project id is A025U.1, but both the API link and the files link use A025U.2. Please make the example consistent so readers don't try to query the wrong replica.
Suggested change:

```diff
-- Project id: `A025U.1`
-- [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/overview)
-- [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2)
-Remark: no description is provided for this dataset.
-- [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/files)
+- Project id: `A025U.2`
+- [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U.2/overview)
+- [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2)
+Remark: no description is provided for this dataset.
+- [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U.2/files)
```
```python
a = dataset_metadata.get("AUTHORS")
author_names = a if isinstance(a, list) else [a] if a else None
metadata = {
    "dataset_repository_name": node_name.value,
```
dataset_repository_name is being set to node_name.value (a string), but DatasetMetadata.dataset_repository_name is typed as DatasetSourceName. This will fail validation in normalize_datasets_metadata (other scrapers pass the enum member directly). Set this field to node_name instead of node_name.value (and keep using .value only for filesystem paths/logging).
Suggested change:

```diff
-    "dataset_repository_name": node_name.value,
+    "dataset_repository_name": node_name,
```
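A small self-contained illustration of the member-vs-value distinction; the enum below is a hypothetical stand-in for the real DatasetSourceName in models/enums.py:

```python
from enum import Enum

class DatasetSourceName(Enum):
    # Hypothetical member/value pair for illustration only.
    MDPOSIT_MMB_NODE = "mdposit_mmb_node"

node_name = DatasetSourceName.MDPOSIT_MMB_NODE

# A field typed as DatasetSourceName expects the enum member itself...
metadata = {"dataset_repository_name": node_name}
# ...while .value (a plain string) suits filesystem paths and log lines.
log_label = node_name.value

assert metadata["dataset_repository_name"] is DatasetSourceName.MDPOSIT_MMB_NODE
assert log_label == "mdposit_mmb_node"
```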
```python
if node_name is DatasetSourceName.MDPOSIT_MMB_NODE:
    dataset_url = f"https://mmb-dev.mddbr.eu/#/id/{dataset_id}/overview"
elif node_name is DatasetSourceName.MDPOSIT_INRIA_NODE:
    dataset_url = f"https://dynarepo.inria.fr/#/id/{dataset_id}/overview"
else:
    logger.warning(
        f"Unknown MDDB node '{node_name}'."
        f"Cannot build entry URL for dataset {dataset_id}."
    )
```
dataset_url is only assigned in the if/elif branches; in the else branch you log a warning but still use dataset_url when building metadata, which will raise UnboundLocalError. Initialize dataset_url to a safe default (e.g., None or empty string) before the conditional, or raise/continue in the else branch.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 18 comments.
```python
logger.warning("Uniprot ID is weird. Abording.")
return "Unknown protein"
```
Fix typos/casing in this log message (e.g., “Uniprot” → “UniProt”, “Abording” → “Aborting”).
```
mddb_nodes: dict
    Dictionnary of MDDB nodes.
logger: "loguru.Logger"
```
Spelling in docstring: “Dictionnary” should be “Dictionary”.
```
dict
    Dictionnary for replicas by dataset.
"""
```
Spelling in docstring: “Dictionnary” should be “Dictionary”.
docs/mddb.md (Outdated)

```markdown
APY entrypoint to get the total number of projects:

- Endpoint: `/projects/summary`
- HTTP methode: GET
```
Typo in docs: “APY entrypoint” should be “API entrypoint” (or “API entry point”).
Suggested change:

```diff
-APY entrypoint to get the total number of projects:
-- Endpoint: `/projects/summary`
-- HTTP methode: GET
+API entrypoint to get the total number of projects:
+- Endpoint: `/projects/summary`
+- HTTP method: GET
```
docs/mddb.md (Outdated)

```markdown
APY entrypoint to get the total number of projects:

- Endpoint: `/projects/summary`
- HTTP methode: GET
```
Typo in docs: “HTTP methode” should be “HTTP method”.
Suggested change:

```diff
-- HTTP methode: GET
+- HTTP method: GET
```
```
dataset: DatasetMetadata
    Normalized dataset to get files metadata for.
replica_id: int
    Identifer of the corresponding replica associated with the files.
```
Typo in docstring: “Identifer” should be “Identifier”.
Suggested change:

```diff
-    Identifer of the corresponding replica associated with the files.
+    Identifier of the corresponding replica associated with the files.
```
```python
# See for instance; https://rest.uniprot.org/uniprotkb/Q16968
if submission_name and isinstance(submission_name, list):
    protein_name = submission_name[0].get("fullName", {}).get("value")
# Or a dictionnary.
```
Typo in comment: “dictionnary” should be “dictionary”.
Suggested change:

```diff
-# Or a dictionnary.
+# Or a dictionary.
```
```python
if uniprot_id in ("noref", "notfound"):
    logger.warning("Uniprot ID is weird. Abording.")
    return "Unknown protein"
# Defaut value for protein name:
```
Typo in comment: “Defaut” should be “Default”.
Suggested change:

```diff
-# Defaut value for protein name:
+# Default value for protein name:
```
```python
protein_name = (
    response.json()
    .get("proteinDescription", {})
    .get("recommendedName", {})
    .get("fullName", {})
    .get("value")
)
# Second option: try to get the submitted name.
# See for instance: https://rest.uniprot.org/uniprotkb/Q51760
if not protein_name:
    submission_name = (
        response.json().get("proteinDescription", {}).get("submissionNames")
    )
```
response.json() is called multiple times in this function; each call re-parses the response body. Parse once into a local variable (e.g., data = response.json()) and reuse it when extracting proteinDescription fields to reduce overhead when scraping many proteins.
```python
def extract_datasets_metadata(
    datasets: list[dict],
    mddb_nodes: dict,
    client: httpx.Client,
    logger: "loguru.Logger" = loguru.logger,
) -> tuple[list[dict], dict]:
```
This new scraper adds a fair amount of parsing logic (e.g., extract_datasets_metadata, extract_proteins, extract_files_metadata) but there are no tests covering these transformations. Given the repo already has scraper tests (e.g., tests/scrapers/test_figshare.py), consider adding unit tests using small fixture JSON payloads to lock down expected field mapping and edge cases (missing sequences/refs, multiple replicas, etc.).