# Single-Cell HDF5 Transformations Overview

This transformation converts a single-cell HDF5 file into ODM-compatible output files. It extracts expression data and related cell metadata, and can optionally harmonize metadata and create or update biosample objects in ODM. The output files are then imported and linked automatically.

The result is feature-level indexed data that is ready for downstream analysis and cross-study discovery without manual file preparation.

## The ODM entity model for single-cell data

Understanding the transformation requires familiarity with how ODM represents single-cell experiments. ODM organizes data around a hierarchy of entities:

- **Sample, Library, and Preparation groups** (collectively referred to as SLP) represent the biological and experimental context of the data. A Sample describes a biological specimen; a Library describes the sequencing library prepared from it; a Preparation describes a preparation step. These entities already exist in ODM for most studies, or can be created by the transformation itself.

- **A Cell Group** represents the collection of individual cells from an experiment, together with their metadata. Each Cell Group must be linked to exactly one parent SLP entity (a Sample, Library, or Preparation group). This linkage is what allows ODM to associate cell-level observations with the correct experimental context.

- **An Expression Group** represents the gene-by-cell expression matrix, compressed for efficient retrieval, together with computed dataset statistics. An Expression Group is always linked to a Cell Group.

The transformation creates the Cell Group and Expression Group and links them into the existing (or newly created) SLP structure. This is why the configuration requires specifying how the resulting Cell Group should be connected to its parent — the linking step is fundamental to how ODM organizes and queries the data.

## What the transformation reads from the source file

The transformation extracts three types of data from an HDF5 source file:

**Cell metadata** — extracted primarily from `obs` in an H5AD input file, or the equivalent structure in 10x H5 input. This includes per-cell annotations such as barcodes, cluster assignments, quality control metrics, and any other experimental annotations. Multidimensional representations stored in `obsm` (such as PCA or UMAP coordinates) and pairwise cell annotations from `obsp` can also be extracted.

**Feature metadata** — extracted from `var`, and optionally from `varm` and `varp`. This includes per-gene annotations such as gene identifiers and gene names. For supported species, the transformation can also map Ensembl or NCBI gene identifiers to gene names automatically (see [Gene ID to name mapping](attribute-mapping.md#gene-id-to-name-mapping)).

**The expression matrix** — extracted from `X`, which contains count or normalized expression values. The transformation validates the matrix dimensions against the extracted cell and feature metadata, then writes the matrix in a Brotli-compressed format optimized for ODM ingestion.
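The dimension check described above can be sketched as follows. This is an illustration of the validation logic only; the function name and messages are hypothetical, not part of the transformation:

```python
def validate_matrix_dimensions(n_cells, n_features, matrix_shape):
    """Check that the expression matrix X agrees with the extracted metadata.

    X is expected to be (cells x features), matching the number of rows
    in the obs (cell metadata) and var (feature metadata) tables.
    Returns a list of human-readable problems; an empty list means the
    dataset is consistent.
    """
    rows, cols = matrix_shape
    problems = []
    if rows != n_cells:
        problems.append(f"X has {rows} rows but obs describes {n_cells} cells")
    if cols != n_features:
        problems.append(f"X has {cols} columns but var describes {n_features} features")
    return problems

# A consistent dataset: 3 cells, 2 features, a 3x2 matrix.
print(validate_matrix_dimensions(3, 2, (3, 2)))  # []
```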

## The role of metadata curation

Metadata curation is optional, but strongly recommended. It standardizes cell metadata so that it can be imported, linked, and indexed correctly in ODM. Certain fields must use the expected names and data types to ensure consistent linking and indexing. The transformation handles this for the user during processing.

As part of curation, the transformation performs automatic attribute mapping: commonly used attribute names from tools such as Seurat, Scanpy, or Cell Ranger are recognized and renamed to the canonical ODM API names without any configuration. Automatic attribute mapping helps harmonize metadata across datasets, which is essential for cross-study search and downstream analysis. Attributes that do not match any known name are retained and their names are automatically converted to camelCase for consistency with the ODM naming convention. For the full list of recognized names, see the [Attribute Mapping Reference](attribute-mapping.md).
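The renaming step can be sketched as a lookup with a camelCase fallback. The mapping entry below is purely illustrative — the actual recognized names and their ODM targets are listed in the Attribute Mapping Reference:

```python
# Hypothetical excerpt of the mapping table; see the Attribute Mapping
# Reference for the real recognized names and their canonical targets.
KNOWN_ATTRIBUTES = {
    "n_genes_by_counts": "numberOfGenes",  # illustrative target name
}

def to_camel_case(name: str) -> str:
    """Convert an attribute name such as 'pct_counts_mt' to camelCase."""
    parts = [p for p in name.replace("-", "_").split("_") if p]
    return parts[0].lower() + "".join(p.capitalize() for p in parts[1:])

def curate_name(name: str) -> str:
    """Rename a recognized attribute; otherwise fall back to camelCase."""
    return KNOWN_ATTRIBUTES.get(name, to_camel_case(name))

print(curate_name("pct_counts_mt"))  # pctCountsMt
```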

Curation is applied only to the data produced by the transformation for import into ODM. The source file is not modified.

## Biosample metadata and the aggregation model

Some single-cell datasets store tissue, disease, or other biosample-level attributes in cell metadata, repeating the same values for every cell. The transformation can aggregate these attributes into the related biosample objects in ODM: Sample, Library, or Preparation (SLP) objects.

Aggregation is performed by grouping cells using a designated biosample identifier. Only attributes that are consistent across all cells in the same biosample can be assigned to related biosample objects.

Attributes assigned to biosample objects are automatically removed from the cell metadata. This reduces duplication and improves the overall structure of the imported data.
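The grouping rule can be illustrated with a small sketch: for each biosample, only attributes whose value is identical across all of that biosample's cells are eligible for assignment to the biosample object. The function and field names here are illustrative:

```python
def uniform_attributes(cells, biosample_key):
    """Per biosample, return the attributes whose value is identical
    across all of that biosample's cells (eligible for assignment to
    the biosample object). Attributes that vary within a biosample
    stay in the cell metadata."""
    groups = {}
    for cell in cells:
        groups.setdefault(cell[biosample_key], []).append(cell)
    result = {}
    for sample_id, members in groups.items():
        keys = set(members[0]) - {biosample_key}
        result[sample_id] = {
            k: members[0][k]
            for k in keys
            if all(m.get(k) == members[0][k] for m in members)
        }
    return result

cells = [
    {"sample": "A", "tissue": "lung", "cluster": 1},
    {"sample": "A", "tissue": "lung", "cluster": 2},
    {"sample": "B", "tissue": "liver", "cluster": 1},
]
print(uniform_attributes(cells, "sample"))
# "tissue" is uniform within sample A, so it is eligible; "cluster" varies.
```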

## Linking created objects

When the transformation uploads a Cell Group, it links it to a parent Sample, Library, or Preparation entity (SLP).

This is usually handled automatically. If the transformation creates new SLP objects, the Cell Group is linked to them. Otherwise, the transformation identifies the most appropriate existing SLP target in ODM. Users can override the automatic behavior by specifying the target explicitly in the configuration.
For details, see [Linking group determination](transformation-process-reference.md#13-linking-group-determination).

The Expression Group created by the transformation is linked to the corresponding Cell Group.

## Dry run mode

Dry run mode lets users validate the transformation setup before running a full import. In this mode, the transformation performs the initial processing steps, including reading the input, extracting metadata, applying curation, and running validation checks. It skips the most time-consuming output-generation steps, such as creating the expression matrix, and does not upload data to ODM.

Dry run mode is useful for checking that the configuration works as expected and that the required inputs, metadata mappings, and linkage settings are resolved correctly before a full run.

When `biosample_metadata` is configured without any `columns_to_export` entries, dry run mode can also be used to inspect which attributes are uniform within each biosample and therefore eligible for reassignment.

The recommended approach is to iterate on the configuration using dry runs until warnings are resolved, and then run the full transformation. For details, see [How to iterate on a configuration using dry runs](how-to-sc-hdf5-transformations.md#how-to-iterate-on-a-configuration-using-dry-runs).

## Processors Controller API: configurations, images, and jobs

The transformation is managed through the ODM Processors Controller API. It is based on three related components: configurations, images, and jobs.

**Transformation configurations** are JSON documents that define how input files should be processed, including the input format, metadata extraction, and curation rules. Configurations can be created, retrieved, and updated independently of any particular run. The same configuration can be reused across multiple files with the same structure.

**Transformation images** are versioned container images that run the processing logic. Available image versions can be queried through the API. The image used for single-cell HDF5 files is `hdf5-cells`. When starting a job, users can specify either `latest` or a specific release tag.

**Transformation jobs** are the execution records. A job combines a configuration, an image, and one or more input files, runs the transformation, and produces the output and logs. Jobs are independent, so the same input can be run again with a different configuration or image when needed.

## Transformation logs

Each transformation job produces a log that records the processing steps, warnings, detected issues, and created outputs. The log also includes provenance information, such as the source file name and accession, and the accessions of the created objects.
As part of the transformation, the log is uploaded to ODM and stored with the study as an attachment alongside the other generated files. This provides a persistent record of the transformation output. Logs are also available through the API for a limited time. By default, this retention period is two weeks.

## Supported input formats

The transformation supports the following HDF5-based input formats:

- **H5AD (AnnData)** — the native format of the AnnData Python library, widely used for single-cell data processing.
- **10x Genomics H5** — converted internally to H5AD before processing, so the same extraction workflow is used regardless of the input format.
- **Legacy 10x Genomics H5 (v<3)** — supported only for files containing a single genome. Multi-genome legacy files are not supported.

## Known limitations

Currently, only one transformation process can be run per attachment. If there is a need to run another transformation job on the same data, a new copy of the attachment should be imported or a new study should be created.

## See also

- [Single-cell data in ODM: Getting Started](quickstart-sc.md) — a quick-start tutorial for working with single-cell data.
- [How-to Guides](how-to-sc-hdf5-transformations.md) — step-by-step guidance for running the transformation.
- [Configuration Reference](configuration-reference.md) — full configuration schema.
- [Transformation Process Reference](transformation-process-reference.md) — internal processing pipeline.
- [API Reference](api-reference.md) — API endpoints.

# API Reference: Single-Cell HDF5 Transformation (Processors Controller)

> **Related documentation:** For conceptual background on configurations, images, and jobs, see [About Single-Cell HDF5 Transformations in ODM](about-sc-hdf5-transformations.md). For step-by-step usage of these endpoints, see the [Single-cell data in ODM: Getting Started](quickstart-sc.md) and [How-to Guides](how-to-sc-hdf5-transformations.md). For the configuration `data` object schema, see the [Configuration Reference](configuration-reference.md).

This reference describes all endpoints in the ODM Processors Controller API used to manage and execute single-cell HDF5 transformations. Endpoints are grouped into three resources: Transformation Configurations, Transformation Images, and Transformation Jobs.

---

## Quick Reference

| Operation | Method | Endpoint |
|---|---|---|
| List configurations | `GET` | `/api/v1/transformations/configurations` |
| Get a configuration | `GET` | `/api/v1/transformations/configurations/{id}` |
| Create a configuration | `POST` | `/api/v1/transformations/configurations` |
| Update a configuration | `PUT` | `/api/v1/transformations/configurations/{id}` |
| List images | `GET` | `/api/v1/transformations/images` |
| Submit a job | `POST` | `/api/v1/transformations/jobs` |
| Get job status | `GET` | `/api/v1/transformations/jobs/{id}` |
| Retrieve job logs | `POST` | `/api/v1/transformations/jobs/{id}/logs` |

---

## Transformation Configurations

A transformation configuration is a stored JSON document that defines how a source file should be processed. It contains a human-readable name and description alongside the `data` object, which is the full processing specification passed to the transformation image.

Configurations are independent of any particular run. The same configuration can be reused across multiple jobs and updated iteratively without affecting previous job results.

### List configurations

```
GET /api/v1/transformations/configurations
```

Returns an array of configuration objects. Each entry includes:

| Field | Type | Description |
|---|---|---|
| `id` | integer | Unique identifier for the configuration |
| `name` | string | Human-readable name |
| `description` | string | Human-readable description |

Use this endpoint to discover existing configurations before deciding to create a new one or reuse an existing one.

### Get a configuration

```
GET /api/v1/transformations/configurations/{id}
```

Returns the full configuration object, including the `data` field with all processing rules. Use this to inspect an existing configuration before deciding to update or reuse it.

**Path parameters:**

| Parameter | Type | Description |
|---|---|---|
| `id` | integer | ID of the configuration to retrieve |

### Create a configuration

```
POST /api/v1/transformations/configurations
```

Creates a new transformation configuration and returns its assigned `id`.

**Request body:**

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Human-readable name for this configuration |
| `description` | string | Yes | Human-readable description |
| `data` | object | Yes | The processing specification. See the [Configuration Reference](configuration-reference.md) for the full schema. |

**Example request body:**

```json
{
"name": "minimal_config",
"description": "Minimal transformation config for H5AD files",
"data": {
"file_type": "h5ad",
"biosample_metadata": null,
"cell_metadata": {
"metadata_keys": {
"obs": "metadata"
}
},
"feature_metadata": {
"metadata_keys": {
"var": "metadata"
}
},
"cell_expression": {
"data_class": "Single-cell transcriptomics"
}
}
}
```

**Response:** The response object includes the `id` assigned to the new configuration. This `id` is required when submitting a job.
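Before sending the request, the required top-level fields can be checked locally. This is a client-side convenience sketch, not part of the API; the `data` schema itself is covered by the Configuration Reference:

```python
REQUIRED_FIELDS = {"name", "description", "data"}

def check_configuration_body(payload: dict) -> list:
    """Return a list of problems with a configuration request body.

    Checks only the top-level shape documented above (name, description,
    data); it does not validate the contents of the `data` object.
    """
    problems = [
        f"missing required field: {f}"
        for f in sorted(REQUIRED_FIELDS - payload.keys())
    ]
    if "data" in payload and not isinstance(payload["data"], dict):
        problems.append("`data` must be a JSON object")
    return problems

body = {"name": "minimal_config", "description": "Minimal config", "data": {}}
print(check_configuration_body(body))  # []
```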

### Update a configuration

```
PUT /api/v1/transformations/configurations/{id}
```

Fully replaces the configuration at the given `id` with the provided content.

**Path parameters:**

| Parameter | Type | Description |
|---|---|---|
| `id` | integer | ID of the configuration to update |

**Request body:** Same structure as `POST /api/v1/transformations/configurations`.

---

## Transformation Images

A transformation image is a versioned, containerized processing environment that executes the transformation logic for a specific input format. Images are managed separately from configurations, enabling version-controlled upgrades.

### List images

```
GET /api/v1/transformations/images
```

Returns an array of available image objects.

**Response fields per image:**

| Field | Description |
|---|---|
| `name` | Identifier used when referencing the image in a job (e.g. `"hdf5-cells"`) |
| `description` | Human-readable description of the image's purpose |
| `input_formats` | File formats accepted as input |
| `output_formats` | File formats produced as output |
| `version` | Version tag (e.g. `"latest"` or a specific release tag such as `"0.0.7"`) |

Use this endpoint to confirm image availability and identify the version to specify when submitting a job.

---

## Transformation Jobs

A transformation job binds a configuration and an image to one or more input file accessions and executes the processing pipeline. Each job produces an execution log and, when not in dry-run mode, creates or updates ODM objects.

### Submit a job

```
POST /api/v1/transformations/jobs
```

Creates and submits a new transformation job. The response includes the `id` of the created job, which is required for status and log queries.

**Request body:**

| Field | Type | Required | Description |
|---|---|---|---|
| `configuration_id` | integer | Yes | ID of the transformation configuration to use |
| `dry_run` | boolean | Yes | `true` to simulate the run without writing data to ODM; `false` for a full run |
| `image_reference` | object | Yes | Specifies the image to use. Contains `name` (string) and `version` (string). |
| `input_accessions` | array of strings | Yes | ODM accessions of the input files to process |
| `volume_size` | integer | Yes | Scratch volume size in GB allocated for the job |

**`image_reference` fields:**

| Field | Type | Description |
|---|---|---|
| `name` | string | Image name. Use `"hdf5-cells"` for single-cell HDF5 transformations. |
| `version` | string | Version tag. Use `"latest"` or a specific release tag (e.g. `"0.0.7"`). |

**`volume_size` guidelines:**

| Input format | Recommended `volume_size` |
|---|---|
| H5AD | ≥ 1.4 × size of the original attachment (GB) |
| 10x H5 | ≥ 4 × size of the original attachment (GB) |

10x H5 files require significantly more scratch space due to the internal conversion to the H5AD format.
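The guideline can be turned into a small helper that picks the minimum whole-GB `volume_size` for a given attachment size. The format keys are illustrative; the factors are the minimums from the table above:

```python
import math

# Minimum scratch-space factors from the guidelines above.
# The dictionary keys are illustrative labels, not API values.
SCRATCH_FACTOR = {"h5ad": 1.4, "10x_h5": 4.0}

def recommended_volume_size(attachment_size_gb: float, input_format: str) -> int:
    """Return the smallest whole-GB volume_size meeting the guideline."""
    factor = SCRATCH_FACTOR[input_format]
    # round() guards against float noise such as 10 * 1.4 == 14.000000000000002
    return math.ceil(round(attachment_size_gb * factor, 6))

print(recommended_volume_size(10, "h5ad"))  # 14
```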

**Example request body (dry run):**

```json
{
"configuration_id": 42,
"dry_run": true,
"image_reference": {
"name": "hdf5-cells",
"version": "latest"
},
"input_accessions": ["GSF020408"],
"volume_size": 30
}
```

**Example request body (full run):**

```json
{
"configuration_id": 42,
"dry_run": false,
"image_reference": {
"name": "hdf5-cells",
"version": "latest"
},
"input_accessions": ["GSF020408"],
"volume_size": 30
}
```

### Get job status

```
GET /api/v1/transformations/jobs/{id}
```

Returns the job object, including the current `status.state`.

**Path parameters:**

| Parameter | Type | Description |
|---|---|---|
| `id` | integer | ID of the job to query |

**`status.state` values:**

| State | Meaning |
|---|---|
| `RUNNING` | Job is in progress |
| `DONE` | Job finished successfully |
| `FAILED` | Job encountered an error |
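Since `RUNNING` is the only non-terminal state, a client typically polls this endpoint until the job reaches `DONE` or `FAILED`. A minimal polling sketch, with the HTTP call abstracted behind a callable so the loop itself is self-contained (the function name and poll interval are illustrative):

```python
import time

TERMINAL_STATES = {"DONE", "FAILED"}

def wait_for_job(fetch_state, poll_seconds=30, sleep=time.sleep):
    """Poll a job until it reaches a terminal state.

    `fetch_state` is any callable returning the current `status.state`
    string, e.g. a wrapper around GET /api/v1/transformations/jobs/{id}.
    Returns the terminal state ("DONE" or "FAILED").
    """
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)

# Example with a stubbed state sequence instead of real HTTP calls:
states = iter(["RUNNING", "RUNNING", "DONE"])
print(wait_for_job(lambda: next(states), sleep=lambda s: None))  # DONE
```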

### Retrieve job logs

```
POST /api/v1/transformations/jobs/{id}/logs
```

Returns the log records for the specified job. Logs include:

- Configuration validation messages.
- Input file structure report (keys, data types, shapes, attribute names).
- Warnings and errors encountered during metadata extraction and curation.
- Linking validation results (dry-run only).
- Accessions of ODM objects created or updated (full run only).

**Path parameters:**

| Parameter | Type | Description |
|---|---|---|
| `id` | integer | ID of the job whose logs to retrieve |