@yarikoptic
Member

This is just an initial attempt open for discussion. @satra please chime in

I ran into "blobDateModified" in the metadata of a Zarr and it raised my eyebrow, since that name is not really appropriate there and is confusing. Hence I decided to look into generalizing it. I also thought it would be valuable to make the "type" of the content an Asset points to explicit, although that could lead to inconsistencies, since the information is somewhat redundant with encodingFormat and could potentially also be deduced from contentUrl, as we have different endpoints on S3, etc.

Nevertheless I think it might be better to make it explicit. Or at least we have to rename blobDateModified.

  • ContentType name is quite suboptimal since there is a standard HTTP header https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type and thus we could cause confusion.

    But we should keep it a "Type" (not e.g. a Class) to be consistent with other type definitions among models.

    So the other part we could try to vary is "Content". Possible alternatives are "Object", "Data", "Resource"

  • ATM we call all Zarrs just Zarr but it is really a "ZarrFolder". I wonder if it is time to start introducing differentiation here by making it "ZarrFolder", as later we might get "ZarrHDF5" or the like.
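
To make the naming discussion concrete, here is a minimal sketch of what an explicit content type plus a generalized modification date could look like. It uses only the standard library rather than the project's pydantic models; the `ZarrFolder` member, the `AssetSketch` class, and the example path are assumptions for illustration, not the actual code of this PR.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class ContentType(Enum):
    """Kind of content an asset points to (member names are hypothetical)."""

    Blob = "Blob"  # a single object, e.g. one NWB file
    ZarrFolder = "ZarrFolder"  # a Zarr stored as a tree of objects


@dataclass
class AssetSketch:
    """Illustrative stand-in for an asset record, not the real model."""

    path: str
    encodingFormat: str
    contentType: ContentType
    # Generalized replacement for blobDateModified, meaningful for Zarrs too
    contentDateModified: Optional[datetime] = None


asset = AssetSketch(
    path="sub-01/sub-01_image.ome.zarr",  # made-up example path
    encodingFormat="application/x-zarr",
    contentType=ContentType.ZarrFolder,
    contentDateModified=datetime(2024, 1, 30, tzinfo=timezone.utc),
)
print(asset.contentType.value)
```

Note that `contentType` and `encodingFormat` carry overlapping information here, which is exactly the redundancy concern raised above.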

@yarikoptic yarikoptic requested a review from satra January 30, 2024 15:26
…contentDateModified

@codecov

codecov bot commented Jan 30, 2024

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (4423b41) 97.66% compared to head (ddd77da) 97.61%.

| Files | Patch % | Lines |
|---|---|---|
| dandischema/metadata.py | 75.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #220      +/-   ##
==========================================
- Coverage   97.66%   97.61%   -0.05%     
==========================================
  Files          18       18              
  Lines        1798     1806       +8     
==========================================
+ Hits         1756     1763       +7     
- Misses         42       43       +1     
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 97.61% <90.00%> (-0.05%) | ⬇️ |


@satra
Member

satra commented Feb 2, 2024

i'm fine with something like this being in the database, but not sure about changing the metadata model. perhaps we can brainstorm when we meet.

Member

@satra satra left a comment

We have both encodingFormat and dataType - do we need yet another field, or a clearer encodingFormat?

@yarikoptic
Member Author

  • dataType is "something else" which nobody has yet cared to fill out
dandischema/models.py-    # this is from C2M2 level 1 - using EDAM vocabularies - in our case we would
dandischema/models.py-    # need to come up with things for neurophys
dandischema/models.py-    # TODO: waiting on input <https://github.com/dandi/dandi-cli/pull/226>
dandischema/models.py:    dataType: Optional[AnyHttpUrl] = Field(
dandischema/models.py-        None, json_schema_extra={"nskey": DANDI_NSKEY}
dandischema/models.py-    )

(I see no hits among jsonld manifests ATM)

  • encodingFormat is indeed a close relative, which we do not really control ATM, and we could just start talking in those terms. But then we would need "wide bins" to place everything non-zarr into "blobs"
dandi@drogon:/mnt/backup/dandi/dandiset-manifests$ git grep -h encodingFormat -- *.jsonld | tr ',' '\n' | grep encodingFormat | sort | uniq -c | sort -n
      3 "encodingFormat":"application/pdf"
      4 "encodingFormat":"application/xml"
      5 "encodingFormat":"text/markdown"
      8 "encodingFormat":"video/x-ms-wmv"
     22 "encodingFormat":"text/plain"
     35 "encodingFormat":"text/csv"
     37 "encodingFormat":"application/x-sh"
     51 "encodingFormat":"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
     88 "encodingFormat":"video/quicktime"
     91 "encodingFormat":"text/x-python"
    262 "encodingFormat":"video/x-matroska"
    325 "encodingFormat":"image/jpeg"
    441 "encodingFormat":"application/vnd.wolfram.mathematica.package"
    514 "encodingFormat":"image/prs.btif"
    841 "encodingFormat":"application/gzip"
   1033 "encodingFormat":"text/tab-separated-values"
   1678 "encodingFormat":"image/png"
   1855 "encodingFormat":"video/x-msvideo"
   5030 "encodingFormat":"video/avi"
   5387 "encodingFormat":"application/x-zarr"
  11362 "encodingFormat":"video/mp4"
  22221 "encodingFormat":"image/tiff"
  27304 "encodingFormat":"application/json"
 123964 "encodingFormat":"application/octet-stream"
 250204 "encodingFormat":"application/x-nwb"
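
The "wide bins" idea can be sketched as a mapping from encodingFormat values onto just two bins. This is only an illustration: the bin names and the `content_bin` helper are assumptions, and only three rows of the listing above are copied in.

```python
from collections import Counter

# A few counts copied from the listing above; the remaining rows are omitted.
format_counts = {
    "application/x-nwb": 250204,
    "application/octet-stream": 123964,
    "application/x-zarr": 5387,
}


def content_bin(encoding_format: str) -> str:
    """Map an encodingFormat to a wide bin (bin names are assumptions)."""
    return "zarr" if encoding_format == "application/x-zarr" else "blob"


bins = Counter()
for encoding_format, count in format_counts.items():
    bins[content_bin(encoding_format)] += count

print(dict(bins))
```

Everything that is not `application/x-zarr` lands in the blob bin, so such a rule would work only as long as encodingFormat is filled in reliably for Zarrs.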

if "schemaKey" not in obj:
    obj["schemaKey"] = "Dandiset"
obj["schemaVersion"] = to_version
if version2tuple(schema_version) < version2tuple("0.7.0"):
Member Author

This will need to become some future version now that 0.7.0 is out... we might want to introduce a next branch, akin to next in Linux development, or more specifically next-0.8.0, and target this PR against it so that we could meanwhile absorb non-breaking changes

Suggested change
- if version2tuple(schema_version) < version2tuple("0.7.0"):
+ if version2tuple(schema_version) < version2tuple("0.8.0"):
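
For reference, a minimal sketch of why the gate compares version tuples rather than strings. `version_to_tuple` and `needs_migration` are hypothetical stand-ins written for this illustration, not the `version2tuple` helper used in the diff, and the "0.8.0" default simply follows the suggestion above.

```python
from typing import Tuple


def version_to_tuple(version: str) -> Tuple[int, ...]:
    """Stand-in helper: split 'X.Y.Z' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def needs_migration(schema_version: str, introduced_in: str = "0.8.0") -> bool:
    """True if a record predates the version that introduces the change."""
    return version_to_tuple(schema_version) < version_to_tuple(introduced_in)


print(needs_migration("0.6.4"))
print(needs_migration("0.10.0"))
# Plain string comparison would wrongly treat 0.10.0 as older than 0.8.0
print("0.10.0" < "0.8.0")
```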

or obj.get("path", "").endswith(".zarr")
)
else models.ContentType.Blob
)
Member Author

I am not yet 100% certain, since we have only 2 cases (it is either zarr or not, i.e. blob), but this field and code would be the "centralization" of the logic for DANDI-specific behavior of zarr-vs-blob. Otherwise, if we add some other "contentType" later on (e.g. remoteLink 🤷), such comparisons would need to be duplicated in every piece of code... we would likely also need to add validation for that logic at the pydantic or linkml level.
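
A rough sketch of what such centralization plus validation might look like, using plain dicts instead of the pydantic models. The `.zarr` path check mirrors the diff context above; the encodingFormat check, the `Zarr` member name, and both function names are assumptions made for this example.

```python
from enum import Enum
from typing import Any, Dict


class ContentType(Enum):
    Blob = "Blob"
    Zarr = "Zarr"  # member name assumed for this sketch


def deduce_content_type(obj: Dict[str, Any]) -> ContentType:
    """Single place holding the zarr-vs-blob rule."""
    is_zarr = (
        obj.get("encodingFormat") == "application/x-zarr"  # assumed condition
        or obj.get("path", "").endswith(".zarr")
    )
    return ContentType.Zarr if is_zarr else ContentType.Blob


def check_content_type(obj: Dict[str, Any]) -> None:
    """Raise if an explicit contentType disagrees with the deduced one."""
    declared = obj.get("contentType")
    expected = deduce_content_type(obj).value
    if declared is not None and declared != expected:
        raise ValueError(f"contentType {declared!r} conflicts with {expected!r}")
```

With this shape, adding another content type later would mean touching `deduce_content_type` only, and the same function could back a model-level validator.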
