enh: allow creation of dandiset dois (contrasted to a version doi) #297

asmacdo · 2025-04-22T18:03:27Z

No description provided.

dandischema/datacite/__init__.py

asmacdo · 2025-04-22T19:46:57Z

https://github.com/dandi/dandi-archive/pull/2350/files#r2054748163

If version_doi is False, we should point to the DLP instead of a specific version, which will prevent needing additional updates.
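A minimal sketch of the behavior suggested above, assuming "DLP" means the dandiset landing page and that URLs follow the `/dandiset/<id>[/<version>]` shape used by the URL patterns in this PR. The helper name `doi_target_url` and its parameters are hypothetical and are not part of dandischema.

```python
def doi_target_url(
    base_url: str, dandiset_id: str, version: str, version_doi: bool
) -> str:
    """Return the URL a DOI should resolve to (hypothetical helper).

    A version DOI points at one specific published version. A dandiset DOI
    points at the landing page, so it does not need updating on each publish.
    """
    landing = f"{base_url.rstrip('/')}/dandiset/{dandiset_id}"
    return f"{landing}/{version}" if version_doi else landing


if __name__ == "__main__":
    base = "https://dandiarchive.org"
    print(doi_target_url(base, "000027", "0.210831.2033", version_doi=True))
    print(doi_target_url(base, "000027", "0.210831.2033", version_doi=False))
```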

asmacdo · 2025-05-08T22:23:26Z

dandischema/models.py

+DANDI_ID_PATTERN = r"\d{6}"
+VERSION_PATTERN = rf"{DANDI_ID_PATTERN}/\d+\.\d+\.\d+"
+DANDI_DOI_WITH_VERSION = rf"^10.(48324|80507)/dandi\.{VERSION_PATTERN}"
+DANDI_DOI_NO_VERSION = r"^10\.(48324|80507)/dandi\.\d{6}"
+DANDI_DOI_PATTERN = rf"{DANDI_DOI_WITH_VERSION}|{DANDI_DOI_NO_VERSION}"
 DANDI_PUBID_PATTERN = rf"^DANDI:{VERSION_PATTERN}"
+
+PUBLISHED_DANDISET_URL_PATTERN = (
+    rf"^{DANDI_INSTANCE_URL_PATTERN}/dandiset/{DANDI_ID_PATTERN}"
+)
 PUBLISHED_VERSION_URL_PATTERN = (
    rf"^{DANDI_INSTANCE_URL_PATTERN}/dandiset/{VERSION_PATTERN}$"
 )
+PUBLISHED_URL_PATTERN = (
+    rf"{PUBLISHED_VERSION_URL_PATTERN}|{PUBLISHED_DANDISET_URL_PATTERN}"
+)


TODO: this has gotten really messy; it can be cleaned up (I just hacked it out to work for now).

candleindark · 2025-05-09

Some of those will be outdated soon. We are modifying some of them in #294. The goal of that PR is not to clean up the mess, though.

asmacdo · 2025-05-09

@candleindark thanks for looking out, nice catch!

asmacdo · 2025-05-29 (edited)

I wonder if the model changes are still necessary at all. I've gone through several implementations that tried to pass all the data through Pydantic validation first (currently we use either PublishedDandiset or an unvalidated model), and separately we've also decided to store the dandiset-wide DOI on the draft version. If we aren't going to soften the Dandiset Pydantic model so we can use it for a draft dandiset, I'm not sure the draft could ever pass PublishedDandiset validation. So there isn't really a need to change the regex to accept that other DOI format...
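To make the two DOI shapes discussed in this thread concrete, here is a small sketch that copies the DOI patterns from the diff verbatim and classifies a few sample strings. The `doi_kind` helper and the sample DOIs are illustrative assumptions, not code from the PR.

```python
import re

# Patterns copied as they appear in the diff for dandischema/models.py.
DANDI_ID_PATTERN = r"\d{6}"
VERSION_PATTERN = rf"{DANDI_ID_PATTERN}/\d+\.\d+\.\d+"
DANDI_DOI_WITH_VERSION = rf"^10.(48324|80507)/dandi\.{VERSION_PATTERN}"
DANDI_DOI_NO_VERSION = r"^10\.(48324|80507)/dandi\.\d{6}"
DANDI_DOI_PATTERN = rf"{DANDI_DOI_WITH_VERSION}|{DANDI_DOI_NO_VERSION}"


def doi_kind(doi: str) -> str:
    """Classify a DOI string (hypothetical helper, for illustration only)."""
    # Check the longer, versioned form first: the unversioned pattern has no
    # trailing anchor and would also match the prefix of a version DOI.
    if re.match(DANDI_DOI_WITH_VERSION, doi):
        return "version"
    if re.match(DANDI_DOI_NO_VERSION, doi):
        return "dandiset"
    return "invalid"


if __name__ == "__main__":
    for sample in (
        "10.48324/dandi.000027/0.210831.2033",
        "10.48324/dandi.000027",
        "10.1234/dandi.000027",
    ):
        print(sample, "->", doi_kind(sample))
```

As quoted in the diff, the versioned pattern leaves the dot after `10` unescaped and neither DOI pattern ends with `$`, so both accept slightly more than the canonical forms. Whether that matters depends on how the patterns are applied.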

yarikoptic

avoid creating new classes if feasible

dandischema/datacite/__init__.py

asmacdo · 2025-05-29T17:30:24Z

For to_datacite we are now attempting to use a PublishedDandiset but falling back to an unvalidated model. In practice, this means that any Pydantic changes will not affect this very much because if the validation fails we will just fall back.

candleindark · 2025-05-29T21:39:41Z

> For to_datacite we are now attempting to use a PublishedDandiset but falling back to an unvalidated model. In practice, this means that any Pydantic changes will not affect this very much because if the validation fails we will just fall back.

This also means that we will get unvalidated data. Is it possible to think of another approach? I am aware that there are already existing uses of BaseModel.model_construct() in the project. However, we are trying to eliminate these uses since they are real headaches. They make it impossible to trust the validity of Pydantic model objects.

If the goal is to create a DOI for a dandiset, encompassing all versions, you may need less information than what is contained in a PublishedDandiset object. If that's the case, defining a separate model that contains the minimal set of information, and using that model to validate the information needed to create the DOI, may be a better way to go.

asmacdo · 2025-05-29T22:56:14Z

@candleindark I don't think it's as bad to be unvalidated here as it might seem, but it does seem like we are bypassing the value of using the Pydantic models for this. Normal validation is still going on under the hood for Dandi itself; this step runs after the data has been saved to the db and only validates the data we are sending to DataCite. If that doesn't conform to our models, that's OK, and if it doesn't conform to their spec, we will just log and move on.

candleindark · 2025-06-02T06:15:45Z

> @candleindark I dont think its as bad to be unvalidated here as it might seem, but it does seem like we are bypassing the value of using the Pydantic models for this. Normal validation is still going on under the hood for Dandi itself, this is just executed after the data has been saved to the db, just validating the data we are sending to Datacite. If that doesnt conform to our models, thats ok, and if it doesnt conform to their spec, we will just log and move on.

My concern is that once model_construct() is used to construct model objects from data not known to be valid, I can no longer trust the objects as they are typed. model_construct() really should be used only to create "a new instance of the Model class with validated data". (See https://docs.pydantic.dev/latest/api/base_model/#pydantic.BaseModel.model_construct and https://docs.pydantic.dev/latest/concepts/models/#creating-models-without-validation). If you call model_construct() with invalid or unvalidated data, what you get in return is effectively just a dictionary of those data, with default values added for fields your data don't specify and without the key-value pairs that don't correspond to a field in the model.

Assuming that you are calling model_construct() to populate some fields with their default values and to filter out key-value pairs that don't correspond to a field in the model, you may want to consider the method illustrated in the following example instead.

from pydantic import BaseModel

class Bar(BaseModel):
    s: str

class Foo(BaseModel):
    bar: Bar = Bar(s="default string")

    x: int

# `model_construct()` is called to return an intermediate result so that the result
# is never treated as a `Foo` instance by other code. Calling dict() with result of
# `model_construct()` returns the raw field values of the result (including the
# default values and excluding the extra values).
f_dict: dict = dict(Foo.model_construct(x=42, y=3))

print(f_dict)
"""
{'bar': Bar(s='default string'), 'x': 42}
"""

In this example, there is no (potentially) invalid model object. I think you can use this method in your construct_unvalidated_dandiset(), but you will have to change the return type to dict.

codecov · 2025-06-12T18:47:03Z

Codecov Report

Attention: Patch coverage is 89.71963% with 11 lines in your changes missing coverage. Please review.

Project coverage is 94.13%. Comparing base (6ac0414) to head (53adc4b).

Files with missing lines	Patch %	Lines
dandischema/datacite/__init__.py	82.00%	9 Missing ⚠️
dandischema/datacite/tests/test_datacite.py	96.49%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #297      +/-   ##
==========================================
- Coverage   97.88%   94.13%   -3.75%     
==========================================
  Files          16       16              
  Lines        1983     2080      +97     
==========================================
+ Hits         1941     1958      +17     
- Misses         42      122      +80

Flag	Coverage Δ
unittests	`94.13% <89.71%> (-3.75%)`	⬇️


asmacdo · 2025-06-17T17:14:50Z

dandischema/datacite/__init__.py

+        except ValidationError:
+            # mypy can't track that meta is still dict after failed PublishedDandiset(**meta)
+            assert isinstance(meta, dict)
+            if meta.get("version") == "draft":


If the version is a draft, it won't have fields like datePublished. However, this can also happen when we are creating a Dandiset DOI from a published version. In that case the metadata is that of the published version, but the doi and url fields won't pass validation (neither includes the version).

Previously I modified PublishedDandiset to accept either format for url and doi, but I don't think that really makes sense: those aren't valid for a published dandiset, and we wouldn't want our output schema to reflect flexibility that's not really there.
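The two failure cases described above can be sketched as follows. This assumes Pydantic v2. `PublishedMetaSketch` is a hypothetical two-field stand-in for PublishedDandiset, and the branch taken for non-draft metadata is an assumption, since the quoted diff only shows the `version == "draft"` check.

```python
from pydantic import BaseModel, Field, ValidationError


class PublishedMetaSketch(BaseModel):
    """Hypothetical stand-in for PublishedDandiset; models only two fields."""

    # Strict version-DOI shape: a dandiset-wide DOI (no version) is rejected.
    doi: str = Field(pattern=r"^10\.(48324|80507)/dandi\.\d{6}/\d+\.\d+\.\d+$")
    datePublished: str


def classify(meta: dict) -> str:
    """Report which path the metadata would take (illustrative only)."""
    try:
        PublishedMetaSketch(**meta)
    except ValidationError:
        if meta.get("version") == "draft":
            # Drafts lack publication-only fields such as datePublished.
            return "draft"
        # Assumed branch: published metadata carrying a dandiset-wide DOI.
        return "dandiset DOI from a published version"
    return "published version"


if __name__ == "__main__":
    print(classify({"version": "draft", "doi": "10.48324/dandi.000027"}))
    print(
        classify(
            {
                "version": "0.210831.2033",
                "doi": "10.48324/dandi.000027",
                "datePublished": "2021-08-31",
            }
        )
    )
```

Keeping the strict pattern on the published model, as argued above, means the fallback path is the only place where the unversioned DOI is tolerated.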

asmacdo mentioned this pull request Apr 22, 2025

Introducing Dandiset DOIs dandi/dandi-archive#2350

Closed

19 tasks

yarikoptic reviewed Apr 22, 2025

yarikoptic mentioned this pull request Apr 30, 2025

Design document for the Zenodo like DOI per dandiset dandi/dandi-archive#2012

Merged

4 tasks

asmacdo force-pushed the enh-dandiset-dois branch from baeb839 to 2a7eaac Compare May 8, 2025 22:13

asmacdo commented May 8, 2025

asmacdo force-pushed the enh-dandiset-dois branch from edd18b6 to e6ebaa2 Compare May 13, 2025 21:47

yarikoptic requested changes May 14, 2025

asmacdo force-pushed the enh-dandiset-dois branch from 120c277 to d6f4848 Compare May 23, 2025 17:38

asmacdo added 6 commits June 10, 2025 15:52

enh: allow creation of dandiset-wide dois

cc85db1

- deprecate to_datacite(publish) in favor of event - If PublishedDandiset validation fails, fall back to unvalidated Dandiset

get non-datacite-api tests working

997bb88

parameterize similar event tests

ad6a89f

unskip a test that doesnt require datacite password

334eeaf

reduce log level for normal operation, add info for unexpected operation

d530322

Add draft dandiset metadata fixture and use for test with and without…

9bf653f

… datacite API

asmacdo force-pushed the enh-dandiset-dois branch 2 times, most recently from 67fb5ef to 942f616 Compare June 12, 2025 18:47

cleanup comments

9b28605

asmacdo force-pushed the enh-dandiset-dois branch 2 times, most recently from 2465bbd to 278f7c9 Compare June 12, 2025 21:43

fixup: type checking

53adc4b

construct_unvalidated_dandiset was reorganized to use a dict for the inner portions *prior* to initializing as a model. This allows mypy to understand the expected types (otherwise we need to type ignore most of it)

asmacdo force-pushed the enh-dandiset-dois branch from 278f7c9 to 53adc4b Compare June 12, 2025 21:52

asmacdo commented Jun 17, 2025

kabilar requested review from jjnesbitt and waxlamp July 14, 2025 18:59

kabilar requested a review from mvandenburgh July 14, 2025 18:59
