diff --git a/docs/source/index.rst b/docs/source/index.rst
index c3ac6f2..b18a309 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -66,6 +66,7 @@ Index
    overview
    installation
    usage
+   metadata_source_spec
    pipeline_description
    metadata_formats
    catalog_schema
diff --git a/docs/source/metadata_source_spec.rst b/docs/source/metadata_source_spec.rst
new file mode 100644
index 0000000..64493fd
--- /dev/null
+++ b/docs/source/metadata_source_spec.rst
@@ -0,0 +1,138 @@
+Metadata source specification
+*****************************
+
+This metadata source specification defines how to structure a collection of metadata records
+that together form the source material for a ``datalad-catalog`` catalog instance.
+
+The specification benefits both users and developers in that it separates metadata formats
+from the tooling that processes them:
+
+- users can create and maintain such specification-compliant metadata collections without
+  having to employ ``datalad-catalog`` tooling
+- both generic and format-specific tooling can be developed and deployed, either as part of
+  ``datalad-catalog`` or as custom extensions, to transform specification-compliant metadata
+  collections into a state renderable by a catalog
+
+
+High-level design
+=================
+
+The metadata source specification supports:
+
+1. **Per-catalog versioned customizations**: the top-level functional unit of the source
+   specification is a catalog instance, which can be customized via a versioned configuration
+   file as defined in the section :doc:`catalog_config`. This means a specification-compliant
+   collection of records can specify the (version-specific) "look and feel" of a catalog,
+   in addition to its displayed content.
+2. **Multi-dataset, multi-version records**: the source specification has a filesystem layout
+   with a directory for each unique dataset identifier, which in turn has a subdirectory for
+   each unique version identifier of a given dataset. This ensures a modular setup within which
+   records for multiple versions of the same dataset can coexist.
+3. **Multi-format metadata records**: the specification places no restrictions on the number
+   and type of metadata records in a collection for a given dataset version, since in reality
+   metadata often originate from a variety of sources and exist in a variety of formats.
+   The transformation of different record formats into ``datalad-catalog``-compatible records
+   is conveniently shifted into the tooling domain, and is not part of the specification itself.
+
+
+The specification
+=================
+
+The following filesystem layout and record naming scheme should be adhered to for
+a given collection of records:
+
+.. code-block::
+
+   .
+   ├── config/
+   │   └── /
+   │       └── config.json
+   └── records/
+       └── /
+           ├── config.json
+           └── /
+               └── 
+
+
+``config/``
+-----------
+
+This directory should contain the catalog-level configuration file(s), one per version,
+with the name ``config.json``.
+
+````
+-----------------------
+
+This directory name specifies the version of the configuration file,
+and should have a unique string value.
+
+``records/``
+------------
+
+All metadata records for all versions of all datasets should be placed in the appropriate
+relative location within this directory.
+
+
+``/``
+-----------------
+
+All metadata records for all versions of *a specific dataset* should be placed in this
+directory. ```` should be a unique string identifying the dataset, avoiding
+white space and special characters.
+
+
+``/``
+-------------------------
+
+All metadata records for *a specific version* of *a specific dataset* should be placed
+in this directory. ```` should be a unique string identifying the version,
+avoiding white space and special characters.
+ +```` +--------------- + +This should be a unique filename of a single record, with identifying characters that +can be parsed in order to match the specific file format with a specific reader or processing +tool. There is no restriction on the number of files contained in a given ```` +directory, they should just all be unique. + + +An example +========== + +This is an example record collection: + +.. code-block:: + + . + ├── config/ + │ ├── v1/ + │ │ └── config.json + │ └── v2/ + │ └── config.json + └── records/ + └── myDatasetA/ + │ ├── v0.1.1/ + │ │ └── datacite.json + │ └── v0.1.2/ + │ ├── studyminimeta.yaml + │ └── datacite.json + └── myDatasetB/ + ├── config.json + └── latest/ + ├── dataset_description.json + ├── tabby.tsv + ├── data-package.json + ├── LICENSE + └── citations.cff + + +.. note:: + + **TO DO**: Construct and point to an actual specification-compliant collection of records + + +.. note:: + + **TO DO**: Point to the toolset description of how such a collection can be transformed + into a set of ``datalad-catalog``-compatible records \ No newline at end of file diff --git a/docs/source/pipeline_description.rst b/docs/source/pipeline_description.rst index f3ebf6e..11ef928 100644 --- a/docs/source/pipeline_description.rst +++ b/docs/source/pipeline_description.rst @@ -1,6 +1,15 @@ Pipeline Description ******************** +.. warning:: + + This section describes a functioning but outdated view of generating a catalog + entry from a DataLad dataset using ``datalad-metalad`` extractors and + ``datalad-catalog`` translators. This will soon be updated to suggest a + metadata ingestion pipeline using the :doc:`metadata_source_spec` and + dedicated toolset. + + The DataLad ecosystem provides a complete set of free and open source tools that, together, provide full control over dataset access and distribution, version control, provenance tracking, metadata addition, extraction, and