Single-cell data transformation documentation #193
Merged
75 commits
dc3bba9
Single-cell data transformation draft documents
8a6f027
Add different configuration reference options
0d0825f
Refactor: minor text edit
b8b90e8
Fix: Remove dry_run from configuration parameters across the files
c2e695d
Text edit
isabel-gomez-gs 1d87df4
Text edit
isabel-gomez-gs 939a6d1
Modify text describing goal of transformation
isabel-gomez-gs 235ea23
Text edit
isabel-gomez-gs 5062e5b
Text edit
isabel-gomez-gs 6ff6259
Include reference to single cell data user guide
isabel-gomez-gs 76271be
Reference linking group determination
isabel-gomez-gs 76bdaf1
Rename section to reference specific API endpoints
isabel-gomez-gs 5b4f6a1
Rephrase metadata and expression extraction description
isabel-gomez-gs 66dee86
Modify SLP update description
isabel-gomez-gs 489f0f8
Update SLP update behaviour
94effba
Update metadata curation description
isabel-gomez-gs ab066af
Update description of dry-run mode
isabel-gomez-gs 214220d
Update Processors controller API description
4334a13
Merge branch 'docs/sc-transformations' of https://github.com/genestac…
7d72262
Create attribute mapping reference
56b7ca5
Select desired configuration reference option, remove unwanted files
95738f0
Add gene ID - gene mapping in attribute mapping reference
db33768
Remove supported species from gene mapping from config schema
beba176
Update how to guide
82977f4
Update transformation reference
729b41c
Add quick tutorial and configurations for public datasets
a9d46dd
Fix configuration mapping
8c480ca
Update links to ODM documentation
274ece6
Rename quickstart guide
432bb30
refactor: text edit
isabel-gomez-gs 0135faf
refactor: text edit
isabel-gomez-gs dbdfcc4
refactor: text edit
isabel-gomez-gs b877810
refactor: text edit
isabel-gomez-gs 79d4cb1
refactor: text edit
isabel-gomez-gs 4c6b7ee
refactor: Reorganize section to preserve column names
isabel-gomez-gs e22e7ae
refactor: text edit
isabel-gomez-gs d932eea
refactor: text edit
isabel-gomez-gs 641a502
refactor: discovery mode section
isabel-gomez-gs 931c7fa
refactor: text edit
isabel-gomez-gs 7923809
Include reference to configuration mapping in how to
isabel-gomez-gs 18417af
Add reference to default values
6ef419e
text edit
isabel-gomez-gs 72c631d
rephrase columns_to_export
isabel-gomez-gs e20df5c
rearrange secion 2.3 of transformation process reference
98df0b8
rearrange explanation of qc metrics calculation
isabel-gomez-gs b6a524c
text edit
isabel-gomez-gs c501f87
rephrase gene mapping section
4cd3eaf
text edit
isabel-gomez-gs 6ed40ba
text edit
isabel-gomez-gs e79462e
text edit
isabel-gomez-gs f93120e
text edit
isabel-gomez-gs 6f19ee2
Reorganize New groups section
isabel-gomez-gs e0c88d9
minor text edit
isabel-gomez-gs 08f2492
reorganize volume size and ID section
isabel-gomez-gs f8fb4a8
minor text edit
isabel-gomez-gs 6a77761
minor text edit
isabel-gomez-gs 5104f65
minor text edit
isabel-gomez-gs 9470c39
Simplify biosample metadata configuration section
1032f49
Add link to public configs
debec54
Add example of group creation
isabel-gomez-gs c57bc29
Add example for linking group determination
0677fd1
Add public dataset template
497273e
Update public dataset links
0ba3ad3
Update internal references
e5a8cc5
Update quickstart
1f61be8
Update configs
ffaf72e
Add table of contents and missing references
5f82178
Update quickstart guide
2a58b06
Include modifications requested by Predrag M
e6b1a63
remove extended explanation from reference
ce32f83
update about-sc
129815f
rename notebooks
eca214d
minor change in notebook title
9b54894
include links to the rest of documentation
4714ce4
update references
98 changes: 98 additions & 0 deletions
docs/user-guide/doc-odm-user-guide/about-sc-hdf5-transformations.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,98 @@ | ||
| # Single-Cell HDF5 Transformations Overview | ||
|
|
||
| This transformation converts a single-cell HDF5 file into the ODM-compatible output files. It extracts expression data and related cell metadata, and can optionally harmonize metadata and create or update biosample objects in ODM. The output files are then imported and linked automatically. | ||
|
|
||
| The result is feature-level indexed data that is ready for downstream analysis and cross-study discovery without manual file preparation. | ||
|
|
||
| ## The ODM entity model for single-cell data | ||
|
|
||
|
|
||
| Understanding the transformation requires familiarity with how ODM represents single-cell experiments. ODM organises data around a hierarchy of entities: | ||
|
|
||
| - **Sample, Library, and Preparation groups** (collectively referred to as SLP) represent the biological and experimental context of the data. A Sample describes a biological specimen; a Library describes the sequencing library prepared from it; a Preparation describes a preparation step. These entities already exist in ODM for most studies, or can be created by the transformation itself. | ||
|
|
||
| - **A Cell Group** represents the collection of individual cells from an experiment, together with their metadata. Each Cell Group must be linked to exactly one parent SLP entity (a Sample, Library, or Preparation group). This linkage is what allows ODM to associate cell-level observations with the correct experimental context. | ||
|
|
||
| - **An Expression Group** represents the gene-by-cell expression matrix, compressed for efficient retrieval, together with computed dataset statistics. An Expression Group is always linked to a Cell Group. | ||
|
|
||
| The transformation creates the Cell Group and Expression Group and links them into the existing (or newly created) SLP structure. This is why the configuration requires specifying how the resulting Cell Group should be connected to its parent — the linking step is fundamental to how ODM organises and queries the data. | ||
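The linking rules described above can be pictured as a small data model. The following is a minimal sketch only: the class names, the `kind` labels, and the accession strings are hypothetical and are not the ODM API. It shows that a Cell Group carries exactly one SLP parent and that an Expression Group always points at a Cell Group.

```python
from dataclasses import dataclass

# Hypothetical labels for the three SLP entity types.
SLP_KINDS = {"sample", "library", "preparation"}


@dataclass(frozen=True)
class SLPEntity:
    """A Sample, Library, or Preparation group (illustrative model only)."""
    kind: str
    accession: str

    def __post_init__(self):
        if self.kind not in SLP_KINDS:
            raise ValueError(f"unknown SLP kind: {self.kind!r}")


@dataclass(frozen=True)
class CellGroup:
    """Cells and their metadata, linked to exactly one SLP parent."""
    accession: str
    parent: SLPEntity


@dataclass(frozen=True)
class ExpressionGroup:
    """Expression matrix, always linked to a Cell Group."""
    accession: str
    cell_group: CellGroup


# Placeholder accessions, for illustration only.
sample = SLPEntity("sample", "SAMPLE-1")
cells = CellGroup("CELLS-1", parent=sample)
expression = ExpressionGroup("EXPR-1", cell_group=cells)
```

Following `expression.cell_group.parent` walks from the matrix back to its experimental context, which is the path the linking step makes possible.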
|
|
||
| ## What the transformation reads from the source file | ||
|
|
||
| The transformation extracts three types of data from an HDF5 source file: | ||
|
|
||
| **Cell metadata** — extracted primarily from `obs` in an H5AD input file, or from the equivalent structure in a 10x H5 input file. This includes per-cell annotations such as barcodes, cluster assignments, quality control metrics, and any other experimental annotations. Multidimensional representations stored in `obsm` (such as PCA or UMAP coordinates) and pairwise cell annotations from `obsp` can also be extracted. | ||
|
|
||
| **Feature metadata** — extracted from `var`, and optionally from `varm` and `varp`. This includes per-gene annotations such as gene identifiers and gene names. For supported species, the transformation can also map Ensembl or NCBI gene identifiers to gene names automatically (see [Gene ID to name mapping](attribute-mapping.md#gene-id-to-name-mapping)). | ||
|
|
||
| **The expression matrix** — extracted from `X`, which contains count or normalized expression values. The transformation validates the matrix dimensions against the extracted cell and feature metadata, then writes the matrix in a Brotli-compressed format optimized for ODM ingestion. | ||
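The dimension check mentioned above can be sketched as follows. This is an illustration, not the transformation's actual validation code: it relies only on the AnnData convention that `X` has shape `(n_obs, n_vars)`, and the function name is hypothetical.

```python
def validate_matrix_shape(matrix_shape, n_cells, n_features):
    """Check that X has one row per cell and one column per feature.

    Follows the AnnData convention X.shape == (n_obs, n_vars).
    Raises ValueError on a mismatch, returns True otherwise.
    """
    expected = (n_cells, n_features)
    actual = tuple(matrix_shape)
    if actual != expected:
        raise ValueError(f"X has shape {actual}, expected {expected}")
    return True
```

For example, a matrix of shape `(5000, 20000)` passes only when 5000 cell records and 20000 feature records were extracted.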
|
|
||
| ## The role of metadata curation | ||
|
|
||
|
|
||
| Metadata curation is optional, but strongly recommended. It standardizes cell metadata so that it can be imported, linked, and indexed correctly in ODM. Certain fields must use the expected names and data types to ensure consistent linking and indexing. The transformation handles this for the user during processing. | ||
|
|
||
| As part of curation, the transformation performs automatic attribute mapping: commonly used attribute names from tools such as Seurat, Scanpy, or Cell Ranger are recognized and renamed to the canonical ODM API names without any configuration. Automatic attribute mapping helps harmonize metadata across datasets, which is essential for cross-study search and downstream analysis. Attributes that do not match any known name are retained, and their names are automatically converted to camelCase for consistency with the ODM naming convention. For the full list of recognized names, see the [Attribute Mapping Reference](attribute-mapping.md). | ||
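The two renaming rules can be illustrated with a short sketch. Everything here is an assumption made for illustration: the mapping table is a made-up two-entry excerpt with placeholder target names (the real names are in the Attribute Mapping Reference), and the camelCase rule is one plausible reading, not the exact rule the transformation applies.

```python
import re

# Hypothetical excerpt: source names are real tool conventions,
# but the target name is a placeholder, not the actual ODM name.
KNOWN_ATTRIBUTES = {
    "nCount_RNA": "totalCounts",    # Seurat-style column
    "total_counts": "totalCounts",  # Scanpy-style column
}


def to_camel_case(name):
    """Convert e.g. 'cell_type' or 'percent.mt' to camelCase."""
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", name) if p]
    if not parts:
        return name
    first = parts[0][0].lower() + parts[0][1:]
    return first + "".join(p[0].upper() + p[1:] for p in parts[1:])


def curate_names(names):
    """Map each source name to a known canonical name, else camelCase it."""
    return {n: KNOWN_ATTRIBUTES.get(n, to_camel_case(n)) for n in names}
```

Here `curate_names(["nCount_RNA", "cell_type"])` takes the first name from the table and falls back to camelCase for the second.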
|
|
||
| Curation is applied only to the data produced by the transformation for import into ODM. The source file is not modified. | ||
|
|
||
| ## Biosample metadata and the aggregation model | ||
|
|
||
| Some single-cell datasets store tissue, disease, or other biosample-level attributes in cell metadata, repeating the same values for every cell. The transformation can aggregate these attributes into the related biosample objects in ODM: Sample, Library, or Preparation (SLP) objects. | ||
|
|
||
| Aggregation is performed by grouping cells using a designated biosample identifier. Only attributes that are consistent across all cells in the same biosample can be assigned to related biosample objects. | ||
|
|
||
| Attributes assigned to biosample objects are automatically removed from the cell metadata. This reduces duplication and improves the overall structure of the imported data. | ||
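The aggregation model can be sketched in plain Python. This is a simplified illustration under stated assumptions: cells are dictionaries with identical keys, an attribute is treated as eligible only if it has a single value within every biosample, and the function and column names are hypothetical.

```python
from collections import defaultdict


def aggregate_biosample_attributes(cells, biosample_key):
    """Split cell records into biosample-level and cell-level attributes.

    Returns (biosamples, remaining_cells), where biosamples maps each
    biosample ID to the attributes that are uniform across its cells.
    """
    values = defaultdict(lambda: defaultdict(set))
    for cell in cells:
        sample_id = cell[biosample_key]
        for attr, value in cell.items():
            if attr != biosample_key:
                values[sample_id][attr].add(value)

    all_attrs = {a for per_sample in values.values() for a in per_sample}
    # Eligible: exactly one distinct value inside every biosample.
    eligible = {
        a for a in all_attrs
        if all(len(per_sample.get(a, ())) == 1 for per_sample in values.values())
    }

    biosamples = {
        sid: {a: next(iter(per_sample[a])) for a in sorted(eligible)}
        for sid, per_sample in values.items()
    }
    # Moved attributes are dropped from the cell metadata.
    remaining = [
        {k: v for k, v in cell.items() if k not in eligible} for cell in cells
    ]
    return biosamples, remaining
```

With two samples whose cells share a `tissue` value but differ in `cluster`, only `tissue` moves to the biosample level and `cluster` stays on the cells.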
|
|
||
| ## Linking created objects | ||
|
|
||
| When the transformation uploads a Cell Group, it links it to a parent Sample, Library, or Preparation entity (SLP). | ||
|
|
||
| This is usually handled automatically. If the transformation creates new SLP objects, the Cell Group is linked to them. Otherwise, the transformation identifies the most appropriate existing SLP target in ODM. Users can override the automatic behavior by specifying the target explicitly in the configuration. | ||
| For details, see [Linking group determination](transformation-process-reference.md#13-linking-group-determination). | ||
|
|
||
| The Expression Group created by the transformation is linked to the corresponding Cell Group. | ||
|
|
||
| ## Dry run mode | ||
|
|
||
| Dry run mode lets users validate the transformation setup before running a full import. In this mode, the transformation performs the initial processing steps, including reading the input, extracting metadata, applying curation, and running validation checks. It skips the most time-consuming output-generation steps, such as creating the expression matrix, and does not upload data to ODM. | ||
|
|
||
| Dry run mode is useful for checking that the configuration works as expected and that the required inputs, metadata mappings, and linkage settings are resolved correctly before a full run. | ||
|
|
||
| When `biosample_metadata` is configured without any `columns_to_export` entries, dry run mode can also be used to inspect which attributes are uniform within each biosample and therefore eligible for reassignment to biosample objects. | ||
|
|
||
| The recommended approach is to iterate on the configuration using dry runs until warnings are resolved, and then run the full transformation. For details, see [How to iterate on a configuration using dry runs](how-to-sc-hdf5-transformations.md#how-to-iterate-on-a-configuration-using-dry-runs). | ||
|
|
||
| ## Processors Controller API: configurations, images, and jobs | ||
|
|
||
| The transformation is managed through the ODM Processors Controller API. It is based on three related components: configurations, images, and jobs. | ||
|
|
||
| **Transformation configurations** are JSON documents that define how input files should be processed, including the input format, metadata extraction, and curation rules. Configurations can be created, retrieved, and updated independently of any particular run. The same configuration can be reused across multiple files with the same structure. | ||
|
|
||
| **Transformation images** are versioned container images that run the processing logic. Available image versions can be queried through the API. The image used for single-cell HDF5 files is `hdf5-cells`. When starting a job, users can specify either `latest` or a specific release tag. | ||
|
|
||
| **Transformation jobs** are the execution records. A job combines a configuration, an image, and one or more input files, runs the transformation, and produces the output and logs. Jobs are independent, so the same input can be run again with a different configuration or image when needed. | ||
|
|
||
| ## Transformation logs | ||
|
|
||
| Each transformation job produces a log that records the processing steps, warnings, detected issues, and created outputs. The log also includes provenance information, such as the source file name and accession, and the accessions of the created objects. | ||
| As part of the transformation, the log is uploaded to ODM and stored with the study as an attachment alongside the other generated files. This provides a persistent record of the transformation output. Logs are also available through the API for a limited time. By default, this retention period is two weeks. | ||
|
|
||
| ## Supported input formats | ||
|
|
||
| The transformation supports the following HDF5-based input formats: | ||
|
|
||
| - **H5AD (AnnData)** — the native format of the AnnData Python library, widely used for single-cell data processing. | ||
| - **10x Genomics H5** — converted internally to H5AD before processing, so the same extraction workflow is used regardless of the input format. | ||
| - **Legacy 10x Genomics H5 (v<3)** — supported only for files containing a single genome. Multi-genome legacy files are not supported. | ||
|
|
||
| ## Known limitations | ||
|
|
||
| Currently, only one transformation process can be run per attachment. To run another transformation job on the same data, import a new copy of the attachment or create a new study. | ||
|
|
||
| ## See also | ||
|
|
||
| - [Single-cell data in ODM: Getting Started](quickstart-sc.md) - quick start tutorial for working with single-cell data. | ||
| - [How-to Guides](how-to-sc-hdf5-transformations.md) — step-by-step guidance for running the transformation. | ||
| - [Configuration Reference](configuration-reference.md) — full configuration schema. | ||
| - [Transformation Process Reference](transformation-process-reference.md) — internal processing pipeline. | ||
| - [API Reference](api-reference.md) — API endpoints. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,256 @@ | ||
| # API Reference: Single-Cell HDF5 Transformation (Processors Controller) | ||
|
|
||
| > **Related documentation:** For conceptual background on configurations, images, and jobs, see [About Single-Cell HDF5 Transformations in ODM](about-sc-hdf5-transformations.md). For step-by-step usage of these endpoints, see the [Single-cell data in ODM: Getting Started](quickstart-sc.md) and [How-to Guides](how-to-sc-hdf5-transformations.md). For the configuration `data` object schema, see the [Configuration Reference](configuration-reference.md). | ||
|
|
||
| This reference describes all endpoints in the ODM Processors Controller API used to manage and execute single-cell HDF5 transformations. Endpoints are grouped into three resources: Transformation Configurations, Transformation Images, and Transformation Jobs. | ||
|
|
||
| --- | ||
|
|
||
| ## Quick Reference | ||
|
|
||
| | Operation | Method | Endpoint | | ||
| |---|---|---| | ||
| | List configurations | `GET` | `/api/v1/transformations/configurations` | | ||
| | Get a configuration | `GET` | `/api/v1/transformations/configurations/{id}` | | ||
| | Create a configuration | `POST` | `/api/v1/transformations/configurations` | | ||
| | Update a configuration | `PUT` | `/api/v1/transformations/configurations/{id}` | | ||
| | List images | `GET` | `/api/v1/transformations/images` | | ||
| | Submit a job | `POST` | `/api/v1/transformations/jobs` | | ||
| | Get job status | `GET` | `/api/v1/transformations/jobs/{id}` | | ||
| | Retrieve job logs | `POST` | `/api/v1/transformations/jobs/{id}/logs` | | ||
|
|
||
| --- | ||
|
|
||
| ## Transformation Configurations | ||
|
|
||
| A transformation configuration is a stored JSON document that defines how a source file should be processed. It contains a human-readable name and description alongside the `data` object, which is the full processing specification passed to the transformation image. | ||
|
|
||
| Configurations are independent of any particular run. The same configuration can be reused across multiple jobs and updated iteratively without affecting previous job results. | ||
|
|
||
| ### List configurations | ||
|
|
||
| ``` | ||
| GET /api/v1/transformations/configurations | ||
| ``` | ||
|
|
||
| Returns an array of configuration objects. Each entry includes: | ||
|
|
||
| | Field | Type | Description | | ||
| |---|---|---| | ||
| | `id` | integer | Unique identifier for the configuration | | ||
| | `name` | string | Human-readable name | | ||
| | `description` | string | Human-readable description | | ||
|
|
||
| Use this endpoint to discover existing configurations before deciding to create a new one or reuse an existing one. | ||
|
|
||
| ### Get a configuration | ||
|
|
||
| ``` | ||
| GET /api/v1/transformations/configurations/{id} | ||
| ``` | ||
|
|
||
| Returns the full configuration object, including the `data` field with all processing rules. Use this to inspect an existing configuration before deciding to update or reuse it. | ||
|
|
||
| **Path parameters:** | ||
|
|
||
| | Parameter | Type | Description | | ||
| |---|---|---| | ||
| | `id` | integer | ID of the configuration to retrieve | | ||
|
|
||
| ### Create a configuration | ||
|
|
||
| ``` | ||
| POST /api/v1/transformations/configurations | ||
| ``` | ||
|
|
||
| Creates a new transformation configuration and returns its assigned `id`. | ||
|
|
||
| **Request body:** | ||
|
|
||
| | Field | Type | Required | Description | | ||
| |---|---|---|---| | ||
| | `name` | string | Yes | Human-readable name for this configuration | | ||
| | `description` | string | Yes | Human-readable description | | ||
| | `data` | object | Yes | The processing specification. See the [Configuration Reference](configuration-reference.md) for the full schema. | | ||
|
|
||
| **Example request body:** | ||
|
|
||
| ```json | ||
| { | ||
| "name": "minimal_config", | ||
| "description": "Minimal transformation config for H5AD files", | ||
| "data": { | ||
| "file_type": "h5ad", | ||
| "biosample_metadata": null, | ||
| "cell_metadata": { | ||
| "metadata_keys": { | ||
| "obs": "metadata" | ||
| } | ||
| }, | ||
| "feature_metadata": { | ||
| "metadata_keys": { | ||
| "var": "metadata" | ||
| } | ||
| }, | ||
| "cell_expression": { | ||
| "data_class": "Single-cell transcriptomics" | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| **Response:** The response object includes the `id` assigned to the new configuration. This `id` is required when submitting a job. | ||
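A request for this endpoint could be assembled with the Python standard library roughly as follows. This is a sketch under assumptions: the host name is a placeholder, authentication headers depend on the deployment and are passed in by the caller, and the helper names are hypothetical. Only the path, method, and body fields come from this reference.

```python
import json
import urllib.request

BASE_URL = "https://odm.example.com"  # placeholder host, not a real instance


def build_create_config_request(name, description, data, auth_headers=None):
    """Build (but do not send) the POST request that creates a configuration."""
    body = json.dumps(
        {"name": name, "description": description, "data": data}
    ).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    headers.update(auth_headers or {})  # deployment-specific, supplied by caller
    return urllib.request.Request(
        BASE_URL + "/api/v1/transformations/configurations",
        data=body,
        headers=headers,
        method="POST",
    )


def send_and_get_id(request):
    """Send the request and return the `id` field of the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["id"]
```

Keeping request construction separate from sending makes the payload easy to inspect before anything is posted to a live instance.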
|
|
||
| ### Update a configuration | ||
|
|
||
| ``` | ||
| PUT /api/v1/transformations/configurations/{id} | ||
| ``` | ||
|
|
||
| Fully replaces the configuration at the given `id` with the provided content. | ||
|
|
||
| **Path parameters:** | ||
|
|
||
| | Parameter | Type | Description | | ||
| |---|---|---| | ||
| | `id` | integer | ID of the configuration to update | | ||
|
|
||
| **Request body:** Same structure as `POST /api/v1/transformations/configurations`. | ||
|
|
||
| --- | ||
|
|
||
| ## Transformation Images | ||
|
|
||
| A transformation image is a versioned, containerized processing environment that executes the transformation logic for a specific input format. Images are managed separately from configurations, enabling version-controlled upgrades. | ||
|
|
||
| ### List images | ||
|
|
||
| ``` | ||
| GET /api/v1/transformations/images | ||
| ``` | ||
|
|
||
| Returns an array of available image objects. | ||
|
|
||
| **Response fields per image:** | ||
|
|
||
| | Field | Description | | ||
| |---|---| | ||
| | `name` | Identifier used when referencing the image in a job (e.g. `"hdf5-cells"`) | | ||
| | `description` | Human-readable description of the image's purpose | | ||
| | `input_formats` | File formats accepted as input | | ||
| | `output_formats` | File formats produced as output | | ||
| | `version` | Version tag (e.g. `"latest"` or a specific release tag such as `"0.0.7"`) | | ||
|
|
||
| Use this endpoint to confirm image availability and identify the version to specify when submitting a job. | ||
|
|
||
| --- | ||
|
|
||
| ## Transformation Jobs | ||
|
|
||
| A transformation job binds a configuration and an image to one or more input file accessions and executes the processing pipeline. Each job produces an execution log and, when not in dry-run mode, creates or updates ODM objects. | ||
|
|
||
| ### Submit a job | ||
|
|
||
| ``` | ||
| POST /api/v1/transformations/jobs | ||
| ``` | ||
|
|
||
| Creates and submits a new transformation job. The response includes the `id` of the created job, which is required for status and log queries. | ||
|
|
||
| **Request body:** | ||
|
|
||
| | Field | Type | Required | Description | | ||
| |---|---|---|---| | ||
| | `configuration_id` | integer | Yes | ID of the transformation configuration to use | | ||
| | `dry_run` | boolean | Yes | `true` to simulate the run without writing data to ODM; `false` for a full run | | ||
| | `image_reference` | object | Yes | Specifies the image to use. Contains `name` (string) and `version` (string). | | ||
| | `input_accessions` | array of strings | Yes | ODM accessions of the input files to process | | ||
| | `volume_size` | integer | Yes | Scratch volume size in GB allocated for the job | | ||
|
|
||
| **`image_reference` fields:** | ||
|
|
||
| | Field | Type | Description | | ||
| |---|---|---| | ||
| | `name` | string | Image name. Use `"hdf5-cells"` for single-cell HDF5 transformations. | | ||
| | `version` | string | Version tag. Use `"latest"` or a specific release tag (e.g. `"0.0.7"`). | | ||
|
|
||
| **`volume_size` guidelines:** | ||
|
|
||
| | Input format | Recommended `volume_size` | | ||
| |---|---| | ||
| | H5AD | ≥ 1.4 × size of the original attachment (GB) | | ||
| | 10x H5 | ≥ 4 × size of the original attachment (GB) | | ||
|
|
||
| 10x H5 files require significantly more scratch space because they are converted internally to H5AD format before processing. | ||
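The guideline can be turned into a small helper that rounds up to a whole number of GB. This is a sketch with assumptions: the format labels `"h5ad"` and `"10x_h5"` are hypothetical keys chosen for this example, and a gigabyte is taken as 10^9 bytes.

```python
# 1.4x and 4x expressed as exact integer ratios to avoid float rounding.
MULTIPLIERS = {"h5ad": (7, 5), "10x_h5": (4, 1)}
BYTES_PER_GB = 10**9  # assumption: decimal gigabytes


def recommended_volume_size(attachment_bytes, input_format):
    """Return the minimum recommended `volume_size` in whole GB."""
    numerator, denominator = MULTIPLIERS[input_format]
    # Negated floor division is integer ceiling division.
    size_gb = -(-attachment_bytes * numerator // (denominator * BYTES_PER_GB))
    return max(1, size_gb)
```

A 10 GB H5AD attachment gives 14, and the same size as 10x H5 gives 40. The result is a lower bound, so rounding up further is safe.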
|
|
||
| **Example request body (dry run):** | ||
|
|
||
| ```json | ||
| { | ||
| "configuration_id": 42, | ||
| "dry_run": true, | ||
| "image_reference": { | ||
| "name": "hdf5-cells", | ||
| "version": "latest" | ||
| }, | ||
| "input_accessions": ["GSF020408"], | ||
| "volume_size": 30 | ||
| } | ||
| ``` | ||
|
|
||
| **Example request body (full run):** | ||
|
|
||
| ```json | ||
| { | ||
| "configuration_id": 42, | ||
| "dry_run": false, | ||
| "image_reference": { | ||
| "name": "hdf5-cells", | ||
| "version": "latest" | ||
| }, | ||
| "input_accessions": ["GSF020408"], | ||
| "volume_size": 30 | ||
| } | ||
| ``` | ||
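Because the two example bodies differ only in `dry_run`, a small builder can generate both. This is an illustrative helper, not part of the API: the function name is hypothetical, and defaulting to a dry run is a design choice made here so that a full run has to be requested explicitly.

```python
def build_job_payload(configuration_id, input_accessions, volume_size,
                      dry_run=True, image_version="latest"):
    """Return the JSON-serializable body for POST /api/v1/transformations/jobs."""
    return {
        "configuration_id": configuration_id,
        "dry_run": dry_run,
        "image_reference": {"name": "hdf5-cells", "version": image_version},
        "input_accessions": list(input_accessions),
        "volume_size": volume_size,
    }


# Same configuration and input, first as a dry run, then as a full run.
dry_run_body = build_job_payload(42, ["GSF020408"], 30)
full_run_body = build_job_payload(42, ["GSF020408"], 30, dry_run=False)
```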
|
|
||
| ### Get job status | ||
|
|
||
| ``` | ||
| GET /api/v1/transformations/jobs/{id} | ||
| ``` | ||
|
|
||
| Returns the job object, including the current `status.state`. | ||
|
|
||
| **Path parameters:** | ||
|
|
||
| | Parameter | Type | Description | | ||
| |---|---|---| | ||
| | `id` | integer | ID of the job to query | | ||
|
|
||
| **`status.state` values:** | ||
|
|
||
| | State | Meaning | | ||
| |---|---| | ||
| | `RUNNING` | Job is in progress | | ||
| | `DONE` | Job finished successfully | | ||
| | `FAILED` | Job encountered an error | | ||
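A client typically polls this endpoint until the job leaves the `RUNNING` state. The loop below is a minimal sketch: `fetch_job` is a hypothetical caller-supplied function that performs the `GET` and returns the parsed job object, the nesting `job["status"]["state"]` is inferred from the `status.state` notation above, and the interval and timeout defaults are arbitrary.

```python
import time

TERMINAL_STATES = {"DONE", "FAILED"}


def wait_for_job(fetch_job, job_id, interval_s=30.0, timeout_s=3600.0,
                 sleep=time.sleep):
    """Poll until the job reaches DONE or FAILED and return that state.

    Raises TimeoutError if no terminal state is seen within timeout_s.
    """
    waited = 0.0
    while True:
        state = fetch_job(job_id)["status"]["state"]
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} still {state} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

Passing `fetch_job` and `sleep` as arguments keeps the loop independent of any HTTP client and makes it easy to exercise without a live job.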
|
|
||
| ### Retrieve job logs | ||
|
|
||
| ``` | ||
| POST /api/v1/transformations/jobs/{id}/logs | ||
| ``` | ||
|
|
||
| Returns the log records for the specified job. Logs include: | ||
|
|
||
| - Configuration validation messages. | ||
| - Input file structure report (keys, data types, shapes, attribute names). | ||
| - Warnings and errors encountered during metadata extraction and curation. | ||
| - Linking validation results (dry-run only). | ||
| - Accessions of ODM objects created or updated (full run only). | ||
|
|
||
| **Path parameters:** | ||
|
|
||
| | Parameter | Type | Description | | ||
| |---|---|---| | ||
| | `id` | integer | ID of the job whose logs to retrieve | |