Skip to content

Comments

Feature/add mdposit scraper#61

Merged
pierrepo merged 45 commits intomainfrom
feature/add-mdposit-scraper
Feb 11, 2026
Merged

Feature/add mdposit scraper#61
pierrepo merged 45 commits intomainfrom
feature/add-mdposit-scraper

Conversation

@Essmaw
Copy link
Collaborator

@Essmaw Essmaw commented Jan 19, 2026

No description provided.

@Essmaw Essmaw marked this pull request as draft January 19, 2026 23:07
@Essmaw Essmaw marked this pull request as ready for review January 29, 2026 15:59
@Essmaw Essmaw requested a review from pierrepo January 29, 2026 16:00
@Essmaw Essmaw added the scraper label Jan 29, 2026
@Essmaw Essmaw requested a review from Copilot January 29, 2026 16:08
Copy link
Member

@pierrepo pierrepo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Essmaw
Could please add the docs/mddb.md file with the related documentation?

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds support for scraping MDposit (MDDB) datasets/files and integrates this new source into the existing MDverse scraping CLI and data model.

Changes:

  • Introduces a new MDposit scraper to collect dataset and file metadata from two MDDB nodes.
  • Extends the data model with new enums (MDDB/MDposit sources) and adds Molecule.type classification.
  • Registers a new CLI entry point and documents how to run the scraper.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/mdverse_scrapers/scrapers/mdposit.py New scraper implementation for MDposit datasets + file metadata extraction and CLI.
src/mdverse_scrapers/models/simulation.py Adds optional type field to Molecule for molecule classification.
src/mdverse_scrapers/models/enums.py Adds MDDB/MDposit source names and introduces MoleculeType.
src/mdverse_scrapers/models/dataset.py Docstring correction for validator cls type annotation.
ruff.toml Adds a Ruff ignore for PERF401.
pyproject.toml Registers scrape-mdposit console script entry point.
README.md Documents how to run the new MDposit scraper.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Essmaw Essmaw marked this pull request as draft February 5, 2026 12:52
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 15 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 37 to 40
MDDB_REPOSITORIES = {
DatasetSourceName.MDPOSIT_MMB_NODE: "https://mmb-dev.mddbr.eu/api/rest/v1",
DatasetSourceName.MDPOSIT_INRIA_NODE: "https://inria.mddbr.eu/api/rest/v1",
}
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The configured MMB node base URL uses https://mmb-dev.mddbr.eu/api/rest/v1, but the newly added documentation (docs/mddb.md) states the API base URL is https://mmb.mddbr.eu/api/rest/v1. This inconsistency is likely to break scraping against the intended production endpoint; align the code and docs on the correct base URL(s).

Copilot uses AI. Check for mistakes.
version = dataset_metadata.get("VERSION")
if not name:
return None
return [Software(name=name, version=str(version))]
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When VERSION is missing, str(version) becomes the literal string 'None', which will be stored in Software.version. Prefer leaving version=None when it's absent, and only cast to str when the value is not None.

Suggested change
return [Software(name=name, version=str(version))]
return [
Software(name=name, version=str(version) if version is not None else None)
]

Copilot uses AI. Check for mistakes.
Comment on lines 234 to 235
return "Unknow protein"
# Defaut value for protein name:
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling/wording issues in the return value and comments: "Unknow protein" should be "Unknown protein", and # Defaut value should be # Default value (or similar).

Suggested change
return "Unknow protein"
# Defaut value for protein name:
return "Unknown protein"
# Default value for protein name:

Copilot uses AI. Check for mistakes.
Returns
-------
str
Protein full name if available, None otherwise.
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring says the function returns a protein name "if available, None otherwise", but the implementation always returns a string (either a fetched name or a default). Update the docstring to match the actual behavior (or change the implementation to return None on failure if that's desired).

Suggested change
Protein full name if available, None otherwise.
Protein full name if available, or a fallback descriptive name otherwise.

Copilot uses AI. Check for mistakes.
docs/mddb.md Outdated
Comment on lines 70 to 76
- Project id: `A025U.1`
- [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/overview)
- [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2)

Remark: no description is provided for this dataset.

- [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/files)
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the A025U example, the text says the project id is A025U.1, but both the API link and the files link use A025U.2. Please make the example consistent so readers don't try to query the wrong replica.

Suggested change
- Project id: `A025U.1`
- [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/overview)
- [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2)
Remark: no description is provided for this dataset.
- [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/files)
- Project id: `A025U.2`
- [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U.2/overview)
- [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2)
Remark: no description is provided for this dataset.
- [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U.2/files)

Copilot uses AI. Check for mistakes.
a = dataset_metadata.get("AUTHORS")
author_names = a if isinstance(a, list) else [a] if a else None
metadata = {
"dataset_repository_name": node_name.value,
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataset_repository_name is being set to node_name.value (a string), but DatasetMetadata.dataset_repository_name is typed as DatasetSourceName. This will fail validation in normalize_datasets_metadata (other scrapers pass the enum member directly). Set this field to node_name instead of node_name.value (and keep using .value only for filesystem paths/logging).

Suggested change
"dataset_repository_name": node_name.value,
"dataset_repository_name": node_name,

Copilot uses AI. Check for mistakes.
Comment on lines 518 to 526
if node_name is DatasetSourceName.MDPOSIT_MMB_NODE:
dataset_url = f"https://mmb-dev.mddbr.eu/#/id/{dataset_id}/overview"
elif node_name is DatasetSourceName.MDPOSIT_INRIA_NODE:
dataset_url = f"https://dynarepo.inria.fr/#/id/{dataset_id}/overview"
else:
logger.warning(
f"Unknown MDDB node '{node_name}'."
f"Cannot build entry URL for dataset {dataset_id}."
)
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataset_url is only assigned in the if/elif branches; in the else branch you log a warning but still use dataset_url when building metadata, which will raise UnboundLocalError. Initialize dataset_url to a safe default (e.g., None or empty string) before the conditional, or raise/continue in the else branch.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 18 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 250 to 251
logger.warning("Uniprot ID is weird. Abording.")
return "Unknown protein"
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix typos/casing in this log message (e.g., “Uniprot” → “UniProt”, “Abording” → “Aborting”).

Copilot uses AI. Check for mistakes.
Comment on lines 609 to 611
mddb_nodes: dict
Dictionnary of MDDB nodes.
logger: "loguru.Logger"
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling in docstring: “Dictionnary” should be “Dictionary”.

Copilot uses AI. Check for mistakes.
Comment on lines 618 to 620
dict
Dictionnary for replicas by dataset.
"""
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling in docstring: “Dictionnary” should be “Dictionary”.

Copilot uses AI. Check for mistakes.
docs/mddb.md Outdated
Comment on lines 22 to 25
APY entrypoint to get the total number of projects:

- Endpoint: `/projects/summary`
- HTTP methode: GET
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in docs: “APY entrypoint” should be “API entrypoint” (or “API entry point”).

Suggested change
APY entrypoint to get the total number of projects:
- Endpoint: `/projects/summary`
- HTTP methode: GET
API entrypoint to get the total number of projects:
- Endpoint: `/projects/summary`
- HTTP method: GET

Copilot uses AI. Check for mistakes.
docs/mddb.md Outdated
APY entrypoint to get the total number of projects:

- Endpoint: `/projects/summary`
- HTTP methode: GET
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in docs: “HTTP methode” should be “HTTP method”.

Suggested change
- HTTP methode: GET
- HTTP method: GET

Copilot uses AI. Check for mistakes.
dataset: DatasetMetadata
Normalized dataset to get files metadata for.
replica_id: int
Identifer of the corresponding replica associated with the files.
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in docstring: “Identifer” should be “Identifier”.

Suggested change
Identifer of the corresponding replica associated with the files.
Identifier of the corresponding replica associated with the files.

Copilot uses AI. Check for mistakes.
# See for instance; https://rest.uniprot.org/uniprotkb/Q16968
if submission_name and isinstance(submission_name, list):
protein_name = submission_name[0].get("fullName", {}).get("value")
# Or a dictionnary.
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in comment: “dictionnary” should be “dictionary”.

Suggested change
# Or a dictionnary.
# Or a dictionary.

Copilot uses AI. Check for mistakes.
if uniprot_id in ("noref", "notfound"):
logger.warning("Uniprot ID is weird. Abording.")
return "Unknown protein"
# Defaut value for protein name:
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in comment: “Defaut” should be “Default”.

Suggested change
# Defaut value for protein name:
# Default value for protein name:

Copilot uses AI. Check for mistakes.
Comment on lines 265 to 277
protein_name = (
response.json()
.get("proteinDescription", {})
.get("recommendedName", {})
.get("fullName", {})
.get("value")
)
# Second option: try to get the submitted name.
# See for instance: https://rest.uniprot.org/uniprotkb/Q51760
if not protein_name:
submission_name = (
response.json().get("proteinDescription", {}).get("submissionNames")
)
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

response.json() is called multiple times in this function; each call re-parses the response body. Parse once into a local variable (e.g., data = response.json()) and reuse it when extracting proteinDescription fields to reduce overhead when scraping many proteins.

Copilot uses AI. Check for mistakes.
Comment on lines +596 to +601
def extract_datasets_metadata(
datasets: list[dict],
mddb_nodes: dict,
client: httpx.Client,
logger: "loguru.Logger" = loguru.logger,
) -> tuple[list[dict], dict]:
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new scraper adds a fair amount of parsing logic (e.g., extract_datasets_metadata, extract_proteins, extract_files_metadata) but there are no tests covering these transformations. Given the repo already has scraper tests (e.g., tests/scrapers/test_figshare.py), consider adding unit tests using small fixture JSON payloads to lock down expected field mapping and edge cases (missing sequences/refs, multiple replicas, etc.).

Copilot uses AI. Check for mistakes.
@pierrepo pierrepo merged commit c5b11a6 into main Feb 11, 2026
2 checks passed
@pierrepo pierrepo deleted the feature/add-mdposit-scraper branch February 11, 2026 13:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants