genestack · tropnikovvl · Apr 9, 2026 · Mar 26, 2026 · Mar 26, 2026 · Mar 27, 2026
@@ -0,0 +1,98 @@
+# Single-Cell HDF5 Transformations Overview
+
+This transformation converts a single-cell HDF5 file into ODM-compatible output files. It extracts expression data and related cell metadata, and can optionally harmonize metadata and create or update biosample objects in ODM. The output files are then imported and linked automatically.
+
+The result is feature-level indexed data that is ready for downstream analysis and cross-study discovery without manual file preparation.
+
+## The ODM entity model for single-cell data
+
+Understanding the transformation requires familiarity with how ODM represents single-cell experiments. ODM organises data around a hierarchy of entities:
+
+- **Sample, Library, and Preparation groups** (collectively referred to as SLP) represent the biological and experimental context of the data. A Sample describes a biological specimen; a Library describes the sequencing library prepared from it; a Preparation describes a preparation step. These entities already exist in ODM for most studies, or can be created by the transformation itself.
+
+- **A Cell Group** represents the collection of individual cells from an experiment, together with their metadata. Each Cell Group must be linked to exactly one parent SLP entity (a Sample, Library, or Preparation group). This linkage is what allows ODM to associate cell-level observations with the correct experimental context.
+
+- **An Expression Group** represents the gene-by-cell expression matrix, compressed for efficient retrieval, together with computed dataset statistics. An Expression Group is always linked to a Cell Group.
+
+The transformation creates the Cell Group and Expression Group and links them into the existing (or newly created) SLP structure. This is why the configuration requires specifying how the resulting Cell Group should be connected to its parent — the linking step is fundamental to how ODM organises and queries the data.
+
+## What the transformation reads from the source file
+
+The transformation extracts three types of data from an HDF5 source file:
+
+**Cell metadata** — extracted primarily from `obs` in an H5AD input file, or from the equivalent structure in a 10x H5 input file. This includes per-cell annotations such as barcodes, cluster assignments, quality control metrics, and any other experimental annotations. Multidimensional representations stored in `obsm` (such as PCA or UMAP coordinates) and pairwise cell annotations from `obsp` can also be extracted.
+
+**Feature metadata** — extracted from `var`, and optionally from `varm` and `varp`. This includes per-gene annotations such as gene identifiers and gene names. For supported species, the transformation can also map Ensembl or NCBI gene identifiers to gene names automatically (see [Gene ID to name mapping](attribute-mapping.md#gene-id-to-name-mapping)).
+
+**The expression matrix** — extracted from `X`, which contains count or normalized expression values. The transformation validates the matrix dimensions against the extracted cell and feature metadata, then writes the matrix in a Brotli-compressed format optimized for ODM ingestion.
+
+## The role of metadata curation
+
+Metadata curation is optional, but strongly recommended. It standardizes cell metadata so that it can be imported, linked, and indexed correctly in ODM. Certain fields must use the expected names and data types to ensure consistent linking and indexing. The transformation handles this for the user during processing. 
+
+As part of curation, the transformation performs automatic attribute mapping: commonly used attribute names from tools such as Seurat, Scanpy, or Cell Ranger are recognized and renamed to the canonical ODM API names without any configuration. Automatic attribute mapping helps harmonize metadata across datasets, which is essential for cross-study search and downstream analysis. Attributes that do not match any known name are retained, and their names are automatically converted to camelCase for consistency with the ODM naming convention. For the full list of recognized names, see the [Attribute Mapping Reference](attribute-mapping.md).
+
+Curation is applied only to the data produced by the transformation for import into ODM. The source file is not modified.
+
+## Biosample metadata and the aggregation model
+
+Some single-cell datasets store tissue, disease, or other biosample-level attributes in cell metadata, repeating the same values for every cell. The transformation can aggregate these attributes into the related biosample objects in ODM: Sample, Library, or Preparation (SLP) objects.
+
+Aggregation is performed by grouping cells using a designated biosample identifier. Only attributes that are consistent across all cells in the same biosample can be assigned to related biosample objects.
+
+Attributes assigned to biosample objects are automatically removed from the cell metadata. This reduces duplication and improves the overall structure of the imported data.
+
+## Linking created objects
+
+When the transformation uploads a Cell Group, it links it to a parent Sample, Library, or Preparation entity (SLP).
+
+This is usually handled automatically. If the transformation creates new SLP objects, the Cell Group is linked to them. Otherwise, the transformation identifies the most appropriate existing SLP target in ODM. Users can override the automatic behavior by specifying the target explicitly in the configuration.
+For details, see [Linking group determination](transformation-process-reference.md#13-linking-group-determination).
+
+The Expression Group created by the transformation is linked to the corresponding Cell Group.
+
+## Dry run mode
+
+Dry run mode lets users validate the transformation setup before running a full import. In this mode, the transformation performs the initial processing steps, including reading the input, extracting metadata, applying curation, and running validation checks. It skips the most time-consuming output-generation steps, such as creating the expression matrix, and does not upload data to ODM.
+
+Dry run mode is useful for checking that the configuration works as expected and that the required inputs, metadata mappings, and linkage settings are resolved correctly before a full run.
+
+When `biosample_metadata` is configured without any `columns_to_export` entries, dry run mode can also be used to inspect which attributes are uniform within each biosample and therefore eligible for reassignment to biosample objects.
+
+The recommended approach is to iterate on the configuration using dry runs until warnings are resolved, and then run the full transformation. For details, see [How to iterate on a configuration using dry runs](how-to-sc-hdf5-transformations.md#how-to-iterate-on-a-configuration-using-dry-runs).
+
+## Processors Controller API: configurations, images, and jobs
+
+The transformation is managed through the ODM Processors Controller API. It is based on three related components: configurations, images, and jobs.
+
+**Transformation configurations** are JSON documents that define how input files should be processed, including the input format, metadata extraction, and curation rules. Configurations can be created, retrieved, and updated independently of any particular run. The same configuration can be reused across multiple files with the same structure.
+
+**Transformation images** are versioned container images that run the processing logic. Available image versions can be queried through the API. The image used for single-cell HDF5 files is `hdf5-cells`. When starting a job, users can specify either `latest` or a specific release tag.
+
+**Transformation jobs** are the execution records. A job combines a configuration, an image, and one or more input files, runs the transformation, and produces the output and logs. Jobs are independent, so the same input can be run again with a different configuration or image when needed.
+
+## Transformation logs
+
+Each transformation job produces a log that records the processing steps, warnings, detected issues, and created outputs. The log also includes provenance information, such as the source file name and accession, and the accessions of the created objects.
+As part of the transformation, the log is uploaded to ODM and stored with the study as an attachment alongside the other generated files. This provides a persistent record of the transformation output. Logs are also available through the API for a limited time. By default, this retention period is two weeks.
+
+## Supported input formats
+
+The transformation supports the following HDF5-based input formats:
+
+- **H5AD (AnnData)** — the native format of the AnnData Python library, widely used for single-cell data processing.
+- **10x Genomics H5** — converted internally to H5AD before processing, so the same extraction workflow is used regardless of the input format.
+- **Legacy 10x Genomics H5 (v<3)** — supported only for files containing a single genome. Multi-genome legacy files are not supported.
+
+## Known limitations
+
+Currently, only one transformation process can be run per attachment. To run another transformation job on the same data, import a new copy of the attachment or create a new study.
+
+## See also
+
+- [Single-cell data in ODM: Getting Started](quickstart-sc.md) - quick start tutorial for working with single-cell data.
+- [How-to Guides](how-to-sc-hdf5-transformations.md) — step-by-step guidance for running the transformation.
+- [Configuration Reference](configuration-reference.md) — full configuration schema.
+- [Transformation Process Reference](transformation-process-reference.md) — internal processing pipeline.
+- [API Reference](api-reference.md) — API endpoints.
+
@@ -0,0 +1,256 @@
+# API Reference: Single-Cell HDF5 Transformation (Processors Controller)
+
+> **Related documentation:** For conceptual background on configurations, images, and jobs, see [About Single-Cell HDF5 Transformations in ODM](about-sc-hdf5-transformations.md). For step-by-step usage of these endpoints, see the [Single-cell data in ODM: Getting Started](quickstart-sc.md) and [How-to Guides](how-to-sc-hdf5-transformations.md). For the configuration `data` object schema, see the [Configuration Reference](configuration-reference.md).
+
+This reference describes all endpoints in the ODM Processors Controller API used to manage and execute single-cell HDF5 transformations. Endpoints are grouped into three resources: Transformation Configurations, Transformation Images, and Transformation Jobs.
+
+---
+
+## Quick Reference
+
+| Operation | Method | Endpoint |
+|---|---|---|
+| List configurations | `GET` | `/api/v1/transformations/configurations` |
+| Get a configuration | `GET` | `/api/v1/transformations/configurations/{id}` |
+| Create a configuration | `POST` | `/api/v1/transformations/configurations` |
+| Update a configuration | `PUT` | `/api/v1/transformations/configurations/{id}` |
+| List images | `GET` | `/api/v1/transformations/images` |
+| Submit a job | `POST` | `/api/v1/transformations/jobs` |
+| Get job status | `GET` | `/api/v1/transformations/jobs/{id}` |
+| Retrieve job logs | `POST` | `/api/v1/transformations/jobs/{id}/logs` |
+
+---
+
+## Transformation Configurations
+
+A transformation configuration is a stored JSON document that defines how a source file should be processed. It contains a human-readable name and description alongside the `data` object, which is the full processing specification passed to the transformation image.
+
+Configurations are independent of any particular run. The same configuration can be reused across multiple jobs and updated iteratively without affecting previous job results.
+
+### List configurations
+
+```
+GET /api/v1/transformations/configurations
+```
+
+Returns an array of configuration objects. Each entry includes:
+
+| Field | Type | Description |
+|---|---|---|
+| `id` | integer | Unique identifier for the configuration |
+| `name` | string | Human-readable name |
+| `description` | string | Human-readable description |
+
+Use this endpoint to discover existing configurations before deciding to create a new one or reuse an existing one.
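+
+As a rough sketch of what such a response could look like, the array below uses only the three fields listed in the table. The values are placeholders borrowed from the examples in this reference, and the actual response may include additional fields.
+
+```json
+[
+  {
+    "id": 42,
+    "name": "minimal_config",
+    "description": "Minimal transformation config for H5AD files"
+  }
+]
+```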
+
+### Get a configuration
+
+```
+GET /api/v1/transformations/configurations/{id}
+```
+
+Returns the full configuration object, including the `data` field with all processing rules. Use this to inspect an existing configuration before deciding to update or reuse it.
+
+**Path parameters:**
+
+| Parameter | Type | Description |
+|---|---|---|
+| `id` | integer | ID of the configuration to retrieve |
+
+### Create a configuration
+
+```
+POST /api/v1/transformations/configurations
+```
+
+Creates a new transformation configuration and returns its assigned `id`.
+
+**Request body:**
+
+| Field | Type | Required | Description |
+|---|---|---|---|
+| `name` | string | Yes | Human-readable name for this configuration |
+| `description` | string | Yes | Human-readable description |
+| `data` | object | Yes | The processing specification. See the [Configuration Reference](configuration-reference.md) for the full schema. |
+
+**Example request body:**
+
+```json
+{
+  "name": "minimal_config",
+  "description": "Minimal transformation config for H5AD files",
+  "data": {
+    "file_type": "h5ad",
+    "biosample_metadata": null,
+    "cell_metadata": {
+      "metadata_keys": {
+        "obs": "metadata"
+      }
+    },
+    "feature_metadata": {
+      "metadata_keys": {
+        "var": "metadata"
+      }
+    },
+    "cell_expression": {
+      "data_class": "Single-cell transcriptomics"
+    }
+  }
+}
+```
+
+**Response:** The response object includes the `id` assigned to the new configuration. This `id` is required when submitting a job.
+
+### Update a configuration
+
+```
+PUT /api/v1/transformations/configurations/{id}
+```
+
+Fully replaces the configuration at the given `id` with the provided content. 
+
+**Path parameters:**
+
+| Parameter | Type | Description |
+|---|---|---|
+| `id` | integer | ID of the configuration to update |
+
+**Request body:** Same structure as `POST /api/v1/transformations/configurations`.
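+
+Because `PUT` replaces the whole configuration, it is safest to assume that the request body must carry the complete document, including the parts that have not changed. The sketch below is illustrative only: it resends the minimal configuration from the create example with a revised description, and all values are placeholders.
+
+```json
+{
+  "name": "minimal_config",
+  "description": "Minimal transformation config for H5AD files (revised)",
+  "data": {
+    "file_type": "h5ad",
+    "biosample_metadata": null,
+    "cell_metadata": {
+      "metadata_keys": {
+        "obs": "metadata"
+      }
+    },
+    "feature_metadata": {
+      "metadata_keys": {
+        "var": "metadata"
+      }
+    },
+    "cell_expression": {
+      "data_class": "Single-cell transcriptomics"
+    }
+  }
+}
+```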
+
+---
+
+## Transformation Images
+
+A transformation image is a versioned, containerized processing environment that executes the transformation logic for a specific input format. Images are managed separately from configurations, enabling version-controlled upgrades.
+
+### List images
+
+```
+GET /api/v1/transformations/images
+```
+
+Returns an array of available image objects.
+
+**Response fields per image:**
+
+| Field | Description |
+|---|---|
+| `name` | Identifier used when referencing the image in a job (e.g. `"hdf5-cells"`) |
+| `description` | Human-readable description of the image's purpose |
+| `input_formats` | File formats accepted as input |
+| `output_formats` | File formats produced as output |
+| `version` | Version tag (e.g. `"latest"` or a specific release tag such as `"0.0.7"`) |
+
+Use this endpoint to confirm image availability and identify the version to specify when submitting a job.
+
+---
+
+## Transformation Jobs
+
+A transformation job binds a configuration and an image to one or more input file accessions and executes the processing pipeline. Each job produces an execution log and, when not in dry-run mode, creates or updates ODM objects.
+
+### Submit a job
+
+```
+POST /api/v1/transformations/jobs
+```
+
+Creates and submits a new transformation job. The response includes the `id` of the created job, which is required for status and log queries.
+
+**Request body:**
+
+| Field | Type | Required | Description |
+|---|---|---|---|
+| `configuration_id` | integer | Yes | ID of the transformation configuration to use |
+| `dry_run` | boolean | Yes | `true` to simulate the run without writing data to ODM; `false` for a full run |
+| `image_reference` | object | Yes | Specifies the image to use. Contains `name` (string) and `version` (string). |
+| `input_accessions` | array of strings | Yes | ODM accessions of the input files to process |
+| `volume_size` | integer | Yes | Scratch volume size in GB allocated for the job |
+
+**`image_reference` fields:**
+
+| Field | Type | Description |
+|---|---|---|
+| `name` | string | Image name. Use `"hdf5-cells"` for single-cell HDF5 transformations. |
+| `version` | string | Version tag. Use `"latest"` or a specific release tag (e.g. `"0.0.7"`). |
+
+**`volume_size` guidelines:**
+
+| Input format | Recommended `volume_size` |
+|---|---|
+| H5AD | ≥ 1.4 × size of the original attachment (GB) |
+| 10x H5 | ≥ 4 × size of the original attachment (GB) |
+
+10x H5 files require significantly more scratch space because they are converted internally to H5AD format.
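+
+As a worked example with assumed sizes: a 20 GB H5AD attachment calls for at least 1.4 × 20 = 28 GB of scratch space, while a 5 GB 10x H5 attachment calls for at least 4 × 5 = 20 GB. Since `volume_size` is an integer, rounding a fractional result up to the next whole number seems the safe choice. A request body for the 20 GB H5AD case might look like the sketch below. The configuration ID and accession are placeholders, and the pinned version `0.0.7` is simply the sample tag used in this reference, not a recommendation.
+
+```json
+{
+  "configuration_id": 42,
+  "dry_run": true,
+  "image_reference": {
+    "name": "hdf5-cells",
+    "version": "0.0.7"
+  },
+  "input_accessions": ["GSF020408"],
+  "volume_size": 28
+}
+```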
+
+**Example request body (dry run):**
+
+```json
+{
+  "configuration_id": 42,
+  "dry_run": true,
+  "image_reference": {
+    "name": "hdf5-cells",
+    "version": "latest"
+  },
+  "input_accessions": ["GSF020408"],
+  "volume_size": 30
+}
+```
+
+**Example request body (full run):**
+
+```json
+{
+  "configuration_id": 42,
+  "dry_run": false,
+  "image_reference": {
+    "name": "hdf5-cells",
+    "version": "latest"
+  },
+  "input_accessions": ["GSF020408"],
+  "volume_size": 30
+}
+```
+
+### Get job status
+
+```
+GET /api/v1/transformations/jobs/{id}
+```
+
+Returns the job object, including the current `status.state`.
+
+**Path parameters:**
+
+| Parameter | Type | Description |
+|---|---|---|
+| `id` | integer | ID of the job to query |
+
+**`status.state` values:**
+
+| State | Meaning |
+|---|---|
+| `RUNNING` | Job is in progress |
+| `DONE` | Job finished successfully |
+| `FAILED` | Job encountered an error |
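+
+The full shape of the job object is not reproduced in this reference. As a hedged sketch that shows only the fields named here (`id` and `status.state`), a response for a job that is still in progress could resemble the following. The ID value is a placeholder, and the real object likely carries additional fields.
+
+```json
+{
+  "id": 1234,
+  "status": {
+    "state": "RUNNING"
+  }
+}
+```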
+
+### Retrieve job logs
+
+```
+POST /api/v1/transformations/jobs/{id}/logs
+```
+
+Returns the log records for the specified job. Logs include:
+
+- Configuration validation messages.
+- Input file structure report (keys, data types, shapes, attribute names).
+- Warnings and errors encountered during metadata extraction and curation.
+- Linking validation results (dry-run only).
+- Accessions of ODM objects created or updated (full run only).
+
+**Path parameters:**
+
+| Parameter | Type | Description |
+|---|---|---|
+| `id` | integer | ID of the job whose logs to retrieve |