Skip to content

Conversation

@asmacdo
Copy link
Member

@asmacdo asmacdo commented Apr 22, 2025

No description provided.

@asmacdo
Copy link
Member Author

asmacdo commented Apr 22, 2025

https://github.com/dandi/dandi-archive/pull/2350/files#r2054748163

If version_doi is False, we should point to the DLP instead of a specific version, which will prevent needing additional updates.

Comment on lines 60 to 75
DANDI_ID_PATTERN = r"\d{6}"
VERSION_PATTERN = rf"{DANDI_ID_PATTERN}/\d+\.\d+\.\d+"
DANDI_DOI_WITH_VERSION = rf"^10.(48324|80507)/dandi\.{VERSION_PATTERN}"
DANDI_DOI_NO_VERSION = r"^10\.(48324|80507)/dandi\.\d{6}"
DANDI_DOI_PATTERN = rf"{DANDI_DOI_WITH_VERSION}|{DANDI_DOI_NO_VERSION}"
DANDI_PUBID_PATTERN = rf"^DANDI:{VERSION_PATTERN}"

PUBLISHED_DANDISET_URL_PATTERN = (
rf"^{DANDI_INSTANCE_URL_PATTERN}/dandiset/{DANDI_ID_PATTERN}"
)
PUBLISHED_VERSION_URL_PATTERN = (
rf"^{DANDI_INSTANCE_URL_PATTERN}/dandiset/{VERSION_PATTERN}$"
)
PUBLISHED_URL_PATTERN = (
rf"{PUBLISHED_VERSION_URL_PATTERN}|{PUBLISHED_DANDISET_URL_PATTERN}"
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: this has gotten really messy, it can be cleaned up (i just hacked it out to work for now)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of those will be outdated soon. We are modifying some of them in #294. The goal for that PR is not to clean the mess though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@candleindark thanks for lookin out, nice catch!

Copy link
Member Author

@asmacdo asmacdo May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the model changes are still necessary at all? I've gone through several implementations that were trying to pass all the data through Pydantic validation first (currently we are going PublishedDandiset or unvalidated), and separately we've also made the decision to store the dandiset-wide doi on the draft version. If we aren't going to soften the Dandiset pydantic model so we can use that for a draft dandiset, I'm not sure if the draft could ever pass PublishedDandiset validation. So there isnt really a need to change the regex to accept that other format of doi...

@asmacdo asmacdo force-pushed the enh-dandiset-dois branch from edd18b6 to e6ebaa2 Compare May 13, 2025 21:47
Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid creating new classes if feasible

@asmacdo asmacdo force-pushed the enh-dandiset-dois branch from 120c277 to d6f4848 Compare May 23, 2025 17:38
@asmacdo
Copy link
Member Author

asmacdo commented May 29, 2025

For to_datacite we are now attempting to use a PublishedDandiset but falling back to an unvalidated model. In practice, this means that any Pydantic changes will not affect this very much because if the validation fails we will just fall back.

@candleindark
Copy link
Member

For to_datacite we are now attempting to use a PublishedDandiset but falling back to an unvalidated model. In practice, this means that any Pydantic changes will not affect this very much because if the validation fails we will just fall back.

This also means that we will get unvalidated data. Is it possible to think of another approach? I am aware that there are already existing uses of BaseModel.model_construct() in the project. However, we are trying to eliminate these uses since they are really headaches. They make one can't trust the validity of Pydantic model objects.

If the goal is to create a DOI for a dandiset, encompassing all versions, you may need less information than what is contained in a PublishedDandiset object. If that's that case, defining a separate model that contains the minimal set of information and using the model to validate the information to create the DOI may be a better way to go.

@asmacdo
Copy link
Member Author

asmacdo commented May 29, 2025

@candleindark I dont think its as bad to be unvalidated here as it might seem, but it does seem like we are bypassing the value of using the Pydantic models for this. Normal validation is still going on under the hood for Dandi itself, this is just executed after the data has been saved to the db, just validating the data we are sending to Datacite. If that doesnt conform to our models, thats ok, and if it doesnt conform to their spec, we will just log and move on.

@candleindark
Copy link
Member

@candleindark I dont think its as bad to be unvalidated here as it might seem, but it does seem like we are bypassing the value of using the Pydantic models for this. Normal validation is still going on under the hood for Dandi itself, this is just executed after the data has been saved to the db, just validating the data we are sending to Datacite. If that doesnt conform to our models, thats ok, and if it doesnt conform to their spec, we will just log and move on.

My concern is that once model_construct() is used to construct model objects that are unknown to be invalid. I can no longer trust the objects as they are typed. model_construct() really should be used to create "a new instance of the Model class with validated data". (See https://docs.pydantic.dev/latest/api/base_model/#pydantic.BaseModel.model_construct and https://docs.pydantic.dev/latest/concepts/models/#creating-models-without-validation). If you call model_construct() with data that is invalid/unvalidated data, what you get in return is just a dictionary with those data with the addition of default values for some fields that your data don't specify and without the key-value pairs that don't correspond to a field in the model.

Assuming that you are calling the model_construct() to populate some fields with the corresponding default values and filter out key-value pairs that don't correspond to a field in the model, you may want to consider following method illustrated in the following example instead.

from pydantic import BaseModel

class Bar(BaseModel):
    s: str

class Foo(BaseModel):
    bar: Bar = Bar(s="default string")

    x: int

# `model_construct()` is called to return an intermediate result so that the result
# is never treated as a `Foo` instance by other code. Calling dict() with result of
# `model_construct()` returns the raw field values of the result (including the
# default values and excluding the extra values).
f_dict: dict = dict(Foo.model_construct(x=42, y=3))

print(f_dict)
"""
{'bar': Bar(s='default string'), 'x': 42}
"""

In this example, there is no (potentially) invalid object. I think you can use this method in your construct_unvalidated_dandiset(), but you will have to change the return type to dict though.

@asmacdo asmacdo force-pushed the enh-dandiset-dois branch 2 times, most recently from 67fb5ef to 942f616 Compare June 12, 2025 18:47
@codecov
Copy link

codecov bot commented Jun 12, 2025

Codecov Report

Attention: Patch coverage is 89.71963% with 11 lines in your changes missing coverage. Please review.

Project coverage is 94.13%. Comparing base (6ac0414) to head (53adc4b).

Files with missing lines Patch % Lines
dandischema/datacite/__init__.py 82.00% 9 Missing ⚠️
dandischema/datacite/tests/test_datacite.py 96.49% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #297      +/-   ##
==========================================
- Coverage   97.88%   94.13%   -3.75%     
==========================================
  Files          16       16              
  Lines        1983     2080      +97     
==========================================
+ Hits         1941     1958      +17     
- Misses         42      122      +80     
Flag Coverage Δ
unittests 94.13% <89.71%> (-3.75%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@asmacdo asmacdo force-pushed the enh-dandiset-dois branch 2 times, most recently from 2465bbd to 278f7c9 Compare June 12, 2025 21:43
construct_unvalidated_dandiset was reorganized to use a dict for the
inner portions *prior* to initializing as a model. This allows mypy to
understand the expected types (otherwise we need to type ignore most of it)
@asmacdo asmacdo force-pushed the enh-dandiset-dois branch from 278f7c9 to 53adc4b Compare June 12, 2025 21:52
except ValidationError:
# mypy can't track that meta is still dict after failed PublishedDandiset(**meta)
assert isinstance(meta, dict)
if meta.get("version") == "draft":
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the version is a draft, it wont have fields like datePublished. However this can also happen when we are creating a Dandiset DOI from a published version-- in this case, the metadata is the published version, but the doi and the url fields wont pass validation (they both won't include the version).

Previously I modified the PublishedDandiset to accept either format for url and doi, but I dont think that really makes sense-- those aren't valid for a published dandiset, and we wouldnt want our output schema to reflect flexibility thats not really there.

@kabilar kabilar requested review from jjnesbitt and waxlamp July 14, 2025 18:59
@kabilar kabilar requested a review from mvandenburgh July 14, 2025 18:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants