Feature/add mdposit scraper #61
Merged
45 commits
0a37b89 feat: add MDposit dataset scraping script.
caf2865 feat(models): add MDPOSIT repository and MDDB project fields.
9147f32 feat(cli): add README command and scrape-mdposit entry point.
f809832 merge: sync main into update-mdposit-scraper
e1a4e9d refactor(simulation-model): add molecule type field (protein, lipid, …
fb283e1 chore(ruff): disable PERF401 for model instance appends
064d94b refactor(mdposit-scraper): update to scrape using both nodes of MDDB …
e150d24 docs: adding the mddb documentation + update the readme and command …
e3c5e38 feat: refactor the code and resolve AttributeError
cfe2622 merge: sync main into update-mdposit-scraper (Essmaw)
5b01789 feat: add URL computation for ExternalIdentifier based on database name (Essmaw)
5533d8b Fix merging of new datasource names into DatasetSourceName instead of… (Essmaw)
9ebc838 feat: enhance molecule extraction to fit the new model and adding Un… (Essmaw)
96793e5 test(simulation): test URL computation for ExternalIdentifier (Essmaw)
f031e28 tests: refactor tests for ExternalIdentifier to account for automatic… (Essmaw)
6cb949d refactor: rename number_of_molecules to number_of_this_molecule_type_… (Essmaw)
3871d22 refactor: rename number_of_this_molecule_type_in_system to number_of_… (Essmaw)
c9be76f tests: refactor with `number_of_molecules` attribute and adding speci… (Essmaw)
542f54a fixes(mddb scraper): correct spelling errors, improve parameter descr… (Essmaw)
21943fc docs: correct spelling errors (Essmaw)
d826989 fix: Revert to 'software' field (pierrepo)
671008c refactor: Reduce usage and scope of try/except blocks (pierrepo)
f987ea7 feat: Add default DatasetSourceName (pierrepo)
059d51f feat: Coexerce verstion to str (pierrepo)
ebf4470 docs: Update MDDB documentation and examples (pierrepo)
63181fa refactor: Remove more try/except (pierrepo)
7a5f580 refactor: Split log message (pierrepo)
d0324ee fix: Fix error when forcefield metadata is undifiend (pierrepo)
8b57c76 fix: Handle case with no protein sequence nor Uniprot identifier (pierrepo)
024efa9 fix: Handle case when no software is available (pierrepo)
88b9955 feat: Add InChIKey field for Molecule model (pierrepo)
dd724a7 fix: Fix dataset_url_in_repository field (pierrepo)
9e0374f docs: Print dataset URL in API (pierrepo)
6b959da feat: Align uniprot identifiers with protein sequences (pierrepo)
e3a353c feat: Add replicas logic in file metadata extraction (pierrepo)
7068584 feat: Add rules to avoid lengthy try / except blocks (pierrepo)
9cd0a88 fix: Add special case for 'inr' (INRIA) node name (pierrepo)
40ea3ca feat: Add Cineca MDDB node (pierrepo)
a8ed77b feat: Add another way to get protein name from Uniprot (pierrepo)
7884275 fix: Update logic to fetch protein name from Uniprot (pierrepo)
71f7c43 docs: Fix typos (pierrepo)
3d003b3 docs: Relax scraping time (pierrepo)
cf32a04 chore: Reallow PERF401 rules (pierrepo)
91595f1 docs: Remove MDDB node names (pierrepo)
6658973 refactor: Clean code (pierrepo)
| @@ -0,0 +1,87 @@ | ||
| # MDDB | ||
|
|
||
| > The [MDDB (Molecular Dynamics Data Bank) project](https://mddbr.eu/about/) is an initiative to collect, preserve, and share molecular dynamics (MD) simulation data. As part of this project, **MDposit** is an open platform that provides web access to atomistic MD simulations. Its goal is to facilitate and promote data sharing within the global scientific community to advance research. | ||
|
|
||
| The MDposit infrastructure is distributed across several MDposit nodes. All metadata are accessible through the global node: | ||
|
|
||
| MDposit MMB node: | ||
|
|
||
| - web site: <https://mdposit.mddbr.eu/> | ||
| - documentation: <https://mdposit.mddbr.eu/#/help> | ||
| - API: <https://mdposit.mddbr.eu/api/rest/docs/> | ||
| - API base URL: <https://mdposit.mddbr.eu/api/rest/v1> | ||
|
|
||
| No account / token is needed to access the MDposit API. | ||
|
|
||
| ## Getting metadata | ||
|
|
||
| ### Datasets | ||
|
|
||
| In MDposit, a dataset (a simulation and its related files) is called a "[project](https://mdposit.mddbr.eu/api/rest/docs/#/projects/get_projects_summary)". | ||
|
|
||
| API endpoint to get the total number of projects: | ||
|
|
||
| - Endpoint: `/projects/summary` | ||
| - HTTP method: GET | ||
| - [documentation](https://mdposit.mddbr.eu/api/rest/docs/#/projects/get_projects_summary) | ||
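
A minimal sketch of how the summary endpoint could be queried with the Python standard library only. The base URL and the `/projects/summary` path come from the documentation above; the helper names (`build_endpoint_url`, `fetch_json`, `fetch_projects_summary`) are hypothetical, and nothing is assumed about the layout of the JSON response.

```python
import json
from typing import Any
from urllib.request import urlopen

# Base URL as listed in the documentation above.
API_BASE_URL = "https://mdposit.mddbr.eu/api/rest/v1"


def build_endpoint_url(endpoint: str, base_url: str = API_BASE_URL) -> str:
    """Join the base URL and an endpoint path with exactly one slash."""
    return f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}"


def fetch_json(url: str, timeout: float = 30.0) -> Any:
    """Send an unauthenticated GET request and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def fetch_projects_summary() -> Any:
    """Query `/projects/summary`; the response layout is not assumed here."""
    return fetch_json(build_endpoint_url("/projects/summary"))
```

Calling `fetch_projects_summary()` performs the network request; inspecting the keys of the returned object is probably the safest way to locate the project count.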
|
|
||
| A project can contain multiple replicas, each identified by `project_id`.`replica_id`. | ||
|
|
||
| For example, the project [MD-A003ZP](https://mdposit.mddbr.eu/#/id/MD-A003ZP/overview) contains ten replicas: | ||
|
|
||
| - `MD-A003ZP.1`: https://mdposit.mddbr.eu/#/id/MD-A003ZP.1/overview | ||
| - `MD-A003ZP.2`: https://mdposit.mddbr.eu/#/id/MD-A003ZP.2/overview | ||
| - `MD-A003ZP.3`: https://mdposit.mddbr.eu/#/id/MD-A003ZP.3/overview | ||
| - ... | ||
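
The `project_id.replica_id` convention can be handled with a couple of small helpers. This is only a sketch: the function names are hypothetical, and numbering replicas contiguously from 1 is an assumption drawn from the `MD-A003ZP.1`, `.2`, `.3` examples above, not a documented guarantee.

```python
from typing import List, NamedTuple, Optional

# GUI URL prefix taken from the example links above.
GUI_BASE_URL = "https://mdposit.mddbr.eu/#/id"


class Accession(NamedTuple):
    project_id: str
    replica_id: Optional[int]  # None when the accession names a whole project


def parse_accession(accession: str) -> Accession:
    """Split 'MD-A003ZP.1' into ('MD-A003ZP', 1); a bare project has no replica."""
    project_id, dot, replica = accession.partition(".")
    if not project_id:
        raise ValueError(f"Missing project identifier in {accession!r}")
    if not dot:
        return Accession(project_id, None)
    if not (replica.isascii() and replica.isdigit()):
        raise ValueError(f"Replica part of {accession!r} is not a number")
    return Accession(project_id, int(replica))


def replica_accessions(project_id: str, n_replicas: int) -> List[str]:
    """List replica accessions, ASSUMING replicas are numbered 1..n_replicas."""
    return [f"{project_id}.{index}" for index in range(1, n_replicas + 1)]


def overview_url(accession: str) -> str:
    """Build the GUI overview URL for a project or replica accession."""
    return f"{GUI_BASE_URL}/{accession}/overview"
```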
|
|
||
| API endpoint to get all datasets at once: | ||
|
|
||
| - Endpoint: `/projects` | ||
| - HTTP method: GET | ||
| - [documentation](https://mdposit.mddbr.eu/api/rest/docs/#/projects/get_projects) | ||
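
If the `/projects` endpoint returns its results in pages, a scraper has to loop over them. The sketch below rests on assumptions this page does not confirm: `limit` and `page` query parameters, and a `projects` key in the response. They should be checked against the API documentation linked above. The `fetch_json` argument is any callable that maps a URL to decoded JSON, which keeps the loop testable without network access.

```python
from typing import Any, Callable, Dict, Iterator
from urllib.parse import urlencode

API_BASE_URL = "https://mdposit.mddbr.eu/api/rest/v1"


def projects_page_url(page: int, limit: int = 50, base_url: str = API_BASE_URL) -> str:
    """Build the URL of one page; `limit` and `page` are ASSUMED parameter names."""
    query = urlencode({"limit": limit, "page": page})
    return f"{base_url}/projects?{query}"


def iter_projects(
    fetch_json: Callable[[str], Dict[str, Any]], limit: int = 50
) -> Iterator[Dict[str, Any]]:
    """Yield projects page by page until a short or empty page is returned."""
    page = 1
    while True:
        payload = fetch_json(projects_page_url(page, limit))
        # ASSUMPTION: the project list is stored under the "projects" key.
        projects = payload.get("projects", [])
        if not projects:
            return
        yield from projects
        if len(projects) < limit:
            return
        page += 1
```

Passing the fetch function in, rather than calling `urlopen` inside the loop, also makes it easy to add rate limiting or retries around the HTTP call.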
|
|
||
| ### Files | ||
|
|
||
| API endpoint to get files for a given replica of a project: | ||
|
|
||
| - Endpoint: `/projects/{project_id.replica_id}/filenotes` | ||
| - HTTP method: GET | ||
| - [documentation](https://mdposit.mddbr.eu/api/rest/docs/#/filenotes/get_projects__projectAccessionOrID__filenotes) | ||
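
The `{project_id.replica_id}` path segment can be assembled as follows. The function name `filenotes_url` is hypothetical; the URL pattern itself is the one documented above.

```python
from urllib.parse import quote

API_BASE_URL = "https://mdposit.mddbr.eu/api/rest/v1"


def filenotes_url(project_id: str, replica_id: int, base_url: str = API_BASE_URL) -> str:
    """Build the `/projects/{project_id.replica_id}/filenotes` URL for one replica."""
    # Percent-encode the accession so unexpected characters cannot alter the path.
    accession = quote(f"{project_id}.{replica_id}", safe="")
    return f"{base_url}/projects/{accession}/filenotes"
```

For example, `filenotes_url("MD-A003ZP", 1)` yields `https://mdposit.mddbr.eu/api/rest/v1/projects/MD-A003ZP.1/filenotes`.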
|
|
||
| ## Examples | ||
|
|
||
| ### Project `MD-A003ZP` | ||
|
|
||
| Title: | ||
|
|
||
| > MDBind 3x1k | ||
|
|
||
| Description: | ||
|
|
||
| > 10 ns simulation of 1ma4m pdb structure from MDBind dataset, a dynamic view of the PDBBind database | ||
|
|
||
| - [project on MDposit GUI](https://mdposit.mddbr.eu/#/id/MD-A003ZP/overview) | ||
| - [project on MDposit API](https://mdposit.mddbr.eu/api/rest/current/projects/MD-A003ZP) | ||
|
|
||
| Files for replica 1: | ||
|
|
||
| - [files on MDposit GUI](https://mdposit.mddbr.eu/#/id/MD-A003ZP.1/files) | ||
| - [files on MDposit API](https://mdposit.mddbr.eu/api/rest/current/projects/MD-A003ZP.1/filenotes) | ||
|
|
||
| ### Project `MD-A001T1` | ||
|
|
||
| Title: | ||
|
|
||
| > All-atom molecular dynamics simulations of SARS-CoV-2 envelope protein E in the monomeric form, C4 popc | ||
|
|
||
| Description: | ||
|
|
||
| > The trajectories of all-atom MD simulations were obtained based on 4 starting representative conformations from the CG simulation. For each starting structure, there are six trajectories of the E protein: 3 with the protein embedded in the membrane containing POPC, and 3 with the membrane mimicking the natural ERGIC membrane (Mix: 50% POPC, 25% POPE, 10% POPI, 5% POPS, 10% cholesterol). | ||
|
|
||
| - [project on MDposit GUI](https://mdposit.mddbr.eu/#/id/MD-A001T1/overview) | ||
| - [project on MDposit API](https://mdposit.mddbr.eu/api/rest/current/projects/MD-A001T1) | ||
|
|
||
| Files for replica 1: | ||
|
|
||
| - [files on MDposit GUI](https://mdposit.mddbr.eu/#/id/MD-A001T1.1/files) | ||
| - [files on MDposit API](https://mdposit.mddbr.eu/api/rest/current/projects/MD-A001T1.1/filenotes) |
|
|
@@ -3,9 +3,16 @@ | |||||
| import re | ||||||
| from typing import Annotated | ||||||
|
|
||||||
| from pydantic import BaseModel, ConfigDict, Field, StringConstraints, field_validator | ||||||
| from pydantic import ( | ||||||
| BaseModel, | ||||||
| ConfigDict, | ||||||
| Field, | ||||||
| StringConstraints, | ||||||
| field_validator, | ||||||
| model_validator, | ||||||
| ) | ||||||
|
|
||||||
| from .enums import ExternalDatabaseName | ||||||
| from .enums import ExternalDatabaseName, MoleculeType | ||||||
|
|
||||||
| DOI = Annotated[ | ||||||
| str, | ||||||
|
|
@@ -37,6 +44,30 @@ class ExternalIdentifier(BaseModel): | |||||
| None, min_length=1, description="Direct URL to the identifier into the database" | ||||||
| ) | ||||||
|
|
||||||
| @model_validator(mode="after") | ||||||
| def compute_url(self) -> "ExternalIdentifier": | ||||||
| """Compute the URL for the external identifier. | ||||||
|
|
||||||
| Parameters | ||||||
| ---------- | ||||||
| self: ExternalIdentifier | ||||||
| The model instance being validated, with all fields already validated. | ||||||
|
|
||||||
| Returns | ||||||
| ------- | ||||||
| ExternalIdentifier | ||||||
| The model instance with the URL field computed if it was not provided. | ||||||
| """ | ||||||
| if self.url is not None: | ||||||
| return self | ||||||
|
|
||||||
| if self.database_name == ExternalDatabaseName.PDB: | ||||||
| self.url = f"https://www.rcsb.org/structure/{self.identifier}" | ||||||
| elif self.database_name == ExternalDatabaseName.UNIPROT: | ||||||
| self.url = f"https://www.uniprot.org/uniprotkb/{self.identifier}" | ||||||
|
|
||||||
| return self | ||||||
|
|
||||||
|
|
||||||
| class Molecule(BaseModel): | ||||||
| """Molecule in a simulation.""" | ||||||
|
|
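
The behaviour of `compute_url` in the hunk above can be illustrated without pydantic. The sketch below is a stand-in rather than the PR's code: it replaces the `model_validator(mode="after")` hook with a standard-library dataclass `__post_init__`, and the enum values and the `OTHER` member are assumptions added for illustration. Only the two URL patterns are taken from the diff.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ExternalDatabaseName(str, Enum):
    """Stand-in for the project's enum; member values are assumptions."""

    PDB = "pdb"
    UNIPROT = "uniprot"
    OTHER = "other"  # hypothetical member with no known URL pattern


# URL patterns copied from the diff above.
URL_TEMPLATES = {
    ExternalDatabaseName.PDB: "https://www.rcsb.org/structure/{identifier}",
    ExternalDatabaseName.UNIPROT: "https://www.uniprot.org/uniprotkb/{identifier}",
}


@dataclass
class ExternalIdentifier:
    database_name: ExternalDatabaseName
    identifier: str
    url: Optional[str] = None

    def __post_init__(self) -> None:
        # Keep an explicitly provided URL; otherwise derive it when possible.
        if self.url is not None:
            return
        template = URL_TEMPLATES.get(self.database_name)
        if template is not None:
            self.url = template.format(identifier=self.identifier)
```

A lookup table is used here instead of the `if`/`elif` chain in the diff; both leave `url` as `None` for databases without a known pattern.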
@@ -45,18 +76,25 @@ class Molecule(BaseModel): | |||||
| model_config = ConfigDict(extra="forbid") | ||||||
|
|
||||||
| name: str = Field(..., description="Name of the molecule.") | ||||||
| type: MoleculeType | None = Field( | ||||||
| None, | ||||||
| description="Type of the molecule." | ||||||
| "Allowed values in the MoleculeType enum. " | ||||||
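|
||||||
| Suggested change (adds the missing space between the two concatenated string literals): | ||||||
| - "Allowed values in the MoleculeType enum. " | |
| + " Allowed values in the MoleculeType enum. " |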