Conversation
Pull request overview
Adds support for scraping MDposit (MDDB) datasets/files and integrates this new source into the existing MDverse scraping CLI and data model.
Changes:
- Introduces a new MDposit scraper to collect dataset and file metadata from two MDDB nodes.
- Extends the data model with new enums (MDDB/MDposit sources) and adds Molecule.type for molecule classification.
- Registers a new CLI entry point and documents how to run the scraper.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/mdverse_scrapers/scrapers/mdposit.py | New scraper implementation for MDposit datasets + file metadata extraction and CLI. |
| src/mdverse_scrapers/models/simulation.py | Adds optional type field to Molecule for molecule classification. |
| src/mdverse_scrapers/models/enums.py | Adds MDDB/MDposit source names and introduces MoleculeType. |
| src/mdverse_scrapers/models/dataset.py | Docstring correction for validator cls type annotation. |
| ruff.toml | Adds a Ruff ignore for PERF401. |
| pyproject.toml | Registers scrape-mdposit console script entry point. |
| README.md | Documents how to run the new MDposit scraper. |
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 15 comments.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
```python
MDDB_REPOSITORIES = {
    DatasetSourceName.MDPOSIT_MMB_NODE: "https://mmb-dev.mddbr.eu/api/rest/v1",
    DatasetSourceName.MDPOSIT_INRIA_NODE: "https://inria.mddbr.eu/api/rest/v1",
}
```
The configured MMB node base URL uses https://mmb-dev.mddbr.eu/api/rest/v1, but the newly added documentation (docs/mddb.md) states the API base URL is https://mmb.mddbr.eu/api/rest/v1. This inconsistency is likely to break scraping against the intended production endpoint; align the code and docs on the correct base URL(s).
```python
version = dataset_metadata.get("VERSION")
if not name:
    return None
return [Software(name=name, version=str(version))]
```
When VERSION is missing, str(version) becomes the literal string 'None', which will be stored in Software.version. Prefer leaving version=None when it's absent, and only cast to str when the value is not None.
Suggested change:

```diff
-return [Software(name=name, version=str(version))]
+return [
+    Software(name=name, version=str(version) if version is not None else None)
+]
```
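As a minimal sketch of the pitfall the comment describes (the helper name is hypothetical, not part of the codebase):

```python
def normalize_version(version):
    """Cast a raw version to str, preserving None when the field is absent."""
    return str(version) if version is not None else None

# str(None) produces the literal string "None", which is what the review flags:
assert str(None) == "None"
assert normalize_version(None) is None
assert normalize_version(2023.2) == "2023.2"
```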
```python
return "Unknow protein"
# Defaut value for protein name:
```
Spelling/wording issues in the return value and comments: "Unknow protein" should be "Unknown protein", and # Defaut value should be # Default value (or similar).
Suggested change:

```diff
-return "Unknow protein"
-# Defaut value for protein name:
+return "Unknown protein"
+# Default value for protein name:
```
```
Returns
-------
str
    Protein full name if available, None otherwise.
```
The docstring says the function returns a protein name "if available, None otherwise", but the implementation always returns a string (either a fetched name or a default). Update the docstring to match the actual behavior (or change the implementation to return None on failure if that's desired).
Suggested change:

```diff
-    Protein full name if available, None otherwise.
+    Protein full name if available, or a fallback descriptive name otherwise.
```
docs/mddb.md (Outdated)

```markdown
- Project id: `A025U.1`
- [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/overview)
- [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2)

Remark: no description is provided for this dataset.

- [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/files)
```
In the A025U example, the text says the project id is A025U.1, but both the API link and the files link use A025U.2. Please make the example consistent so readers don't try to query the wrong replica.
Suggested change:

```diff
-- Project id: `A025U.1`
-- [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/overview)
-- [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2)
-Remark: no description is provided for this dataset.
-- [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U/files)
+- Project id: `A025U.2`
+- [project on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U.2/overview)
+- [project on MDposit API](https://mmb.mddbr.eu/api/rest/current/projects/A025U.2)
+Remark: no description is provided for this dataset.
+- [files on MDposit GUI](https://mmb.mddbr.eu/#/id/A025U.2/files)
```
```python
a = dataset_metadata.get("AUTHORS")
author_names = a if isinstance(a, list) else [a] if a else None
metadata = {
    "dataset_repository_name": node_name.value,
```
dataset_repository_name is being set to node_name.value (a string), but DatasetMetadata.dataset_repository_name is typed as DatasetSourceName. This will fail validation in normalize_datasets_metadata (other scrapers pass the enum member directly). Set this field to node_name instead of node_name.value (and keep using .value only for filesystem paths/logging).
Suggested change:

```diff
-    "dataset_repository_name": node_name.value,
+    "dataset_repository_name": node_name,
```
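A small self-contained illustration of the member-vs-value distinction; the enum below is a hypothetical stand-in for the real DatasetSourceName in models/enums.py:

```python
from enum import Enum

class DatasetSourceName(Enum):
    # Hypothetical member/value pair for illustration only.
    MDPOSIT_MMB_NODE = "mdposit_mmb_node"

node_name = DatasetSourceName.MDPOSIT_MMB_NODE

# A field typed as DatasetSourceName expects the enum member itself...
metadata = {"dataset_repository_name": node_name}
# ...while .value (a plain string) suits filesystem paths and log lines.
log_label = node_name.value

assert metadata["dataset_repository_name"] is DatasetSourceName.MDPOSIT_MMB_NODE
assert log_label == "mdposit_mmb_node"
```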
```python
if node_name is DatasetSourceName.MDPOSIT_MMB_NODE:
    dataset_url = f"https://mmb-dev.mddbr.eu/#/id/{dataset_id}/overview"
elif node_name is DatasetSourceName.MDPOSIT_INRIA_NODE:
    dataset_url = f"https://dynarepo.inria.fr/#/id/{dataset_id}/overview"
else:
    logger.warning(
        f"Unknown MDDB node '{node_name}'."
        f"Cannot build entry URL for dataset {dataset_id}."
    )
```
dataset_url is only assigned in the if/elif branches; in the else branch you log a warning but still use dataset_url when building metadata, which will raise UnboundLocalError. Initialize dataset_url to a safe default (e.g., None or empty string) before the conditional, or raise/continue in the else branch.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 18 comments.
```python
logger.warning("Uniprot ID is weird. Abording.")
return "Unknown protein"
```
Fix typos/casing in this log message (e.g., “Uniprot” → “UniProt”, “Abording” → “Aborting”).
```
mddb_nodes: dict
    Dictionnary of MDDB nodes.
logger: "loguru.Logger"
```
Spelling in docstring: “Dictionnary” should be “Dictionary”.
```
dict
    Dictionnary for replicas by dataset.
"""
```
Spelling in docstring: “Dictionnary” should be “Dictionary”.
docs/mddb.md (Outdated)

```markdown
APY entrypoint to get the total number of projects:

- Endpoint: `/projects/summary`
- HTTP methode: GET
```
Typo in docs: “APY entrypoint” should be “API entrypoint” (or “API entry point”).
Suggested change:

```diff
-APY entrypoint to get the total number of projects:
-- Endpoint: `/projects/summary`
-- HTTP methode: GET
+API entrypoint to get the total number of projects:
+- Endpoint: `/projects/summary`
+- HTTP method: GET
```
docs/mddb.md (Outdated)

```markdown
APY entrypoint to get the total number of projects:

- Endpoint: `/projects/summary`
- HTTP methode: GET
```
Typo in docs: “HTTP methode” should be “HTTP method”.
Suggested change:

```diff
-- HTTP methode: GET
+- HTTP method: GET
```
```
dataset: DatasetMetadata
    Normalized dataset to get files metadata for.
replica_id: int
    Identifer of the corresponding replica associated with the files.
```
Typo in docstring: “Identifer” should be “Identifier”.
Suggested change:

```diff
-    Identifer of the corresponding replica associated with the files.
+    Identifier of the corresponding replica associated with the files.
```
```python
# See for instance; https://rest.uniprot.org/uniprotkb/Q16968
if submission_name and isinstance(submission_name, list):
    protein_name = submission_name[0].get("fullName", {}).get("value")
# Or a dictionnary.
```
Typo in comment: “dictionnary” should be “dictionary”.
Suggested change:

```diff
-# Or a dictionnary.
+# Or a dictionary.
```
```python
if uniprot_id in ("noref", "notfound"):
    logger.warning("Uniprot ID is weird. Abording.")
    return "Unknown protein"
# Defaut value for protein name:
```
Typo in comment: “Defaut” should be “Default”.
Suggested change:

```diff
-# Defaut value for protein name:
+# Default value for protein name:
```
```python
protein_name = (
    response.json()
    .get("proteinDescription", {})
    .get("recommendedName", {})
    .get("fullName", {})
    .get("value")
)
# Second option: try to get the submitted name.
# See for instance: https://rest.uniprot.org/uniprotkb/Q51760
if not protein_name:
    submission_name = (
        response.json().get("proteinDescription", {}).get("submissionNames")
    )
```
response.json() is called multiple times in this function; each call re-parses the response body. Parse once into a local variable (e.g., data = response.json()) and reuse it when extracting proteinDescription fields to reduce overhead when scraping many proteins.
```python
def extract_datasets_metadata(
    datasets: list[dict],
    mddb_nodes: dict,
    client: httpx.Client,
    logger: "loguru.Logger" = loguru.logger,
) -> tuple[list[dict], dict]:
```
This new scraper adds a fair amount of parsing logic (e.g., extract_datasets_metadata, extract_proteins, extract_files_metadata) but there are no tests covering these transformations. Given the repo already has scraper tests (e.g., tests/scrapers/test_figshare.py), consider adding unit tests using small fixture JSON payloads to lock down expected field mapping and edge cases (missing sequences/refs, multiple replicas, etc.).