diff --git a/docs/user-guide/doc-odm-user-guide/about-sc-hdf5-transformations.md b/docs/user-guide/doc-odm-user-guide/about-sc-hdf5-transformations.md new file mode 100644 index 0000000..4b71ec9 --- /dev/null +++ b/docs/user-guide/doc-odm-user-guide/about-sc-hdf5-transformations.md @@ -0,0 +1,98 @@ +# Single-Cell HDF5 Transformations Overview + +This transformation converts a single-cell HDF5 file into the ODM-compatible output files. It extracts expression data and related cell metadata, and can optionally harmonize metadata and create or update biosample objects in ODM. The output files are then imported and linked automatically. + +The result is feature-level indexed data that is ready for downstream analysis and cross-study discovery without manual file preparation. + +## The ODM entity model for single-cell data + +Understanding the transformation requires familiarity with how ODM represents single-cell experiments. ODM organises data around a hierarchy of entities: + +- **Sample, Library, and Preparation groups** (collectively referred to as SLP) represent the biological and experimental context of the data. A Sample describes a biological specimen; a Library describes the sequencing library prepared from it; a Preparation describes a preparation step. These entities already exist in ODM for most studies, or can be created by the transformation itself. + +- **A Cell Group** represents the collection of individual cells from an experiment, together with their metadata. Each Cell Group must be linked to exactly one parent SLP entity (a Sample, Library, or Preparation group). This linkage is what allows ODM to associate cell-level observations with the correct experimental context. + +- **An Expression Group** represents the gene-by-cell expression matrix, compressed for efficient retrieval, together with computed dataset statistics. An Expression Group is always linked to a Cell Group. 
+ +The transformation creates the Cell Group and Expression Group and links them into the existing (or newly created) SLP structure. This is why the configuration requires specifying how the resulting Cell Group should be connected to its parent — the linking step is fundamental to how ODM organises and queries the data. + +## What the transformation reads from the source file + +The transformation extracts three types of data from an HDF5 source file: + +**Cell metadata** — extracted primarily from `obs` in H5AD input files, or from the equivalent structure in 10x H5 input files. This includes per-cell annotations such as barcodes, cluster assignments, quality control metrics, and any other experimental annotations. Multidimensional representations stored in `obsm` (such as PCA or UMAP coordinates) and pairwise cell annotations from `obsp` can also be extracted. + +**Feature metadata** — extracted from `var`, and optionally from `varm` and `varp`. This includes per-gene annotations such as gene identifiers and gene names. For supported species, the transformation can also map Ensembl or NCBI gene identifiers to gene names automatically (see [Gene ID to name mapping](attribute-mapping.md#gene-id-to-name-mapping)). + +**The expression matrix** — extracted from `X`, which contains count or normalized expression values. The transformation validates the matrix dimensions against the extracted cell and feature metadata, then writes the matrix in a Brotli-compressed format optimized for ODM ingestion. + +## The role of metadata curation + +Metadata curation is optional, but strongly recommended. It standardizes cell metadata so that it can be imported, linked, and indexed correctly in ODM. Certain fields must use the expected names and data types to ensure consistent linking and indexing. The transformation handles this for the user during processing. 
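As a rough illustration of the kind of renaming that curation applies, the sketch below looks up a source attribute name in a small table of known alternatives and otherwise falls back to camelCase. The lookup entries are a subset of the [Attribute Mapping Reference](attribute-mapping.md). The function name `canonical_name`, the exact-match lookup, and the camelCase splitting rule are assumptions made for this example and may differ from what the transformation actually does.

```python
import re

# A few alternative names from the Attribute Mapping Reference (not the full list).
KNOWN_NAMES = {
    "cell_type": "cellType",
    "seurat_clusters": "cluster",
    "n_counts": "nCounts",
    "percent.mt": "percentMito",
}


def canonical_name(source_name: str) -> str:
    """Return an ODM-style name for a source attribute (hypothetical helper)."""
    if source_name in KNOWN_NAMES:
        return KNOWN_NAMES[source_name]
    # Assumed fallback: split on non-alphanumeric characters, join as camelCase.
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", source_name) if p]
    if not parts:
        return source_name
    first = parts[0][:1].lower() + parts[0][1:]
    return first + "".join(p[:1].upper() + p[1:] for p in parts[1:])


print(canonical_name("seurat_clusters"))
print(canonical_name("Main_cluster_name"))
```

Names that should keep their original spelling would bypass a step like this; in the real configuration that is the purpose of `columns_to_preserve_name`.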
+ +As part of curation, the transformation performs automatic attribute mapping: commonly used attribute names from tools such as Seurat, Scanpy, or Cell Ranger are recognized and renamed to the canonical ODM API names without any configuration. Automatic attribute mapping helps harmonize metadata across datasets, which is essential for cross-study search and downstream analysis. Attributes that do not match any known name are retained, and their names are automatically converted to camelCase for consistency with the ODM naming convention. For the full list of recognized names, see the [Attribute Mapping Reference](attribute-mapping.md). + +Curation is applied only to the data produced by the transformation for import into ODM. The source file is not modified. + +## Biosample metadata and the aggregation model + +Some single-cell datasets store tissue, disease, or other biosample-level attributes in cell metadata, repeating the same values for every cell. The transformation can aggregate these attributes into the related biosample objects in ODM: Sample, Library, or Preparation (SLP) objects. + +Aggregation is performed by grouping cells using a designated biosample identifier. Only attributes that are consistent across all cells in the same biosample can be assigned to related biosample objects. + +Attributes assigned to biosample objects are automatically removed from the cell metadata. This reduces duplication and improves the overall structure of the imported data. + +## Linking created objects + +When the transformation uploads a Cell Group, it links it to a parent Sample, Library, or Preparation entity (SLP). + +This is usually handled automatically. If the transformation creates new SLP objects, the Cell Group is linked to them. Otherwise, the transformation identifies the most appropriate existing SLP target in ODM. Users can override the automatic behavior by specifying the target explicitly in the configuration. 
+For details, see [Linking group determination](transformation-process-reference.md#13-linking-group-determination). + +The Expression Group created by the transformation is linked to the corresponding Cell Group. + +## Dry run mode + +Dry run mode lets users validate the transformation setup before running a full import. In this mode, the transformation performs the initial processing steps, including reading the input, extracting metadata, applying curation, and running validation checks. It skips the most time-consuming output-generation steps, such as creating the expression matrix, and does not upload data to ODM. + +Dry run mode is useful for checking that the configuration works as expected and that the required inputs, metadata mappings, and linkage settings are resolved correctly before a full run. + +When `biosample_metadata` is configured without any `columns_to_export` entries, dry run mode can also be used to inspect which attributes are uniform within each biosample and therefore eligible for reassignment to biosample objects. + +The recommended approach is to iterate on the configuration using dry runs until warnings are resolved, and then run the full transformation. For details, see [How to iterate on a configuration using dry runs](how-to-sc-hdf5-transformations.md#how-to-iterate-on-a-configuration-using-dry-runs). + +## Processors Controller API: configurations, images, and jobs + +The transformation is managed through the ODM Processors Controller API. It is based on three related components: configurations, images, and jobs. + +**Transformation configurations** are JSON documents that define how input files should be processed, including the input format, metadata extraction, and curation rules. Configurations can be created, retrieved, and updated independently of any particular run. The same configuration can be reused across multiple files with the same structure. + +**Transformation images** are versioned container images that run the processing logic. 
Available image versions can be queried through the API. The image used for single-cell HDF5 files is `hdf5-cells`. When starting a job, users can specify either `latest` or a specific release tag. + +**Transformation jobs** are the execution records. A job combines a configuration, an image, and one or more input files, runs the transformation, and produces the output and logs. Jobs are independent, so the same input can be run again with a different configuration or image when needed. + +## Transformation logs + +Each transformation job produces a log that records the processing steps, warnings, detected issues, and created outputs. The log also includes provenance information, such as the source file name and accession, and the accessions of the created objects. +As part of the transformation, the log is uploaded to ODM and stored with the study as an attachment alongside the other generated files. This provides a persistent record of the transformation output. Logs are also available through the API for a limited time. By default, this retention period is two weeks. + +## Supported input formats + +The transformation supports the following HDF5-based input formats: + +- **H5AD (AnnData)** — the native format of the AnnData Python library, widely used for single-cell data processing. +- **10x Genomics H5** — converted internally to H5AD before processing, so the same extraction workflow is used regardless of the input format. +- **Legacy 10x Genomics H5 (v<3)** — supported only for files containing a single genome. Multi-genome legacy files are not supported. + +## Known limitations + +Currently, only one transformation process can be run per attachment. If another transformation job needs to be run on the same data, a new copy of the attachment should be imported or a new study should be created. + +## See also + +- [Single-cell data in ODM: Getting Started](quickstart-sc.md) - quick start tutorial for working with single-cell data. 
+- [How-to Guides](how-to-sc-hdf5-transformations.md) — step-by-step guidance for running the transformation. +- [Configuration Reference](configuration-reference.md) — full configuration schema. +- [Transformation Process Reference](transformation-process-reference.md) — internal processing pipeline. +- [API Reference](api-reference.md) — API endpoints. + diff --git a/docs/user-guide/doc-odm-user-guide/api-reference.md b/docs/user-guide/doc-odm-user-guide/api-reference.md new file mode 100644 index 0000000..4f896cd --- /dev/null +++ b/docs/user-guide/doc-odm-user-guide/api-reference.md @@ -0,0 +1,256 @@ +# API Reference: Single-Cell HDF5 Transformation (Processors Controller) + +> **Related documentation:** For conceptual background on configurations, images, and jobs, see [About Single-Cell HDF5 Transformations in ODM](about-sc-hdf5-transformations.md). For step-by-step usage of these endpoints, see the [Single-cell data in ODM: Getting Started](quickstart-sc.md) and [How-to Guides](how-to-sc-hdf5-transformations.md). For the configuration `data` object schema, see the [Configuration Reference](configuration-reference.md). + +This reference describes all endpoints in the ODM Processors Controller API used to manage and execute single-cell HDF5 transformations. Endpoints are grouped into three resources: Transformation Configurations, Transformation Images, and Transformation Jobs. 
+ +--- + +## Quick Reference + +| Operation | Method | Endpoint | +|---|---|---| +| List configurations | `GET` | `/api/v1/transformations/configurations` | +| Get a configuration | `GET` | `/api/v1/transformations/configurations/{id}` | +| Create a configuration | `POST` | `/api/v1/transformations/configurations` | +| Update a configuration | `PUT` | `/api/v1/transformations/configurations/{id}` | +| List images | `GET` | `/api/v1/transformations/images` | +| Submit a job | `POST` | `/api/v1/transformations/jobs` | +| Get job status | `GET` | `/api/v1/transformations/jobs/{id}` | +| Retrieve job logs | `POST` | `/api/v1/transformations/jobs/{id}/logs` | + +--- + +## Transformation Configurations + +A transformation configuration is a stored JSON document that defines how a source file should be processed. It contains a human-readable name and description alongside the `data` object, which is the full processing specification passed to the transformation image. + +Configurations are independent of any particular run. The same configuration can be reused across multiple jobs and updated iteratively without affecting previous job results. + +### List configurations + +``` +GET /api/v1/transformations/configurations +``` + +Returns an array of configuration objects. Each entry includes: + +| Field | Type | Description | +|---|---|---| +| `id` | integer | Unique identifier for the configuration | +| `name` | string | Human-readable name | +| `description` | string | Human-readable description | + +Use this endpoint to discover existing configurations before deciding to create a new one or reuse an existing one. + +### Get a configuration + +``` +GET /api/v1/transformations/configurations/{id} +``` + +Returns the full configuration object, including the `data` field with all processing rules. Use this to inspect an existing configuration before deciding to update or reuse it. 
+ +**Path parameters:** + +| Parameter | Type | Description | +|---|---|---| +| `id` | integer | ID of the configuration to retrieve | + +### Create a configuration + +``` +POST /api/v1/transformations/configurations +``` + +Creates a new transformation configuration and returns its assigned `id`. + +**Request body:** + +| Field | Type | Required | Description | +|---|---|---|---| +| `name` | string | Yes | Human-readable name for this configuration | +| `description` | string | Yes | Human-readable description | +| `data` | object | Yes | The processing specification. See the [Configuration Reference](configuration-reference.md) for the full schema. | + +**Example request body:** + +```json +{ + "name": "minimal_config", + "description": "Minimal transformation config for H5AD files", + "data": { + "file_type": "h5ad", + "biosample_metadata": null, + "cell_metadata": { + "metadata_keys": { + "obs": "metadata" + } + }, + "feature_metadata": { + "metadata_keys": { + "var": "metadata" + } + }, + "cell_expression": { + "data_class": "Single-cell transcriptomics" + } + } +} +``` + +**Response:** The response object includes the `id` assigned to the new configuration. This `id` is required when submitting a job. + +### Update a configuration + +``` +PUT /api/v1/transformations/configurations/{id} +``` + +Fully replaces the configuration at the given `id` with the provided content. + +**Path parameters:** + +| Parameter | Type | Description | +|---|---|---| +| `id` | integer | ID of the configuration to update | + +**Request body:** Same structure as `POST /api/v1/transformations/configurations`. + +--- + +## Transformation Images + +A transformation image is a versioned, containerized processing environment that executes the transformation logic for a specific input format. Images are managed separately from configurations, enabling version-controlled upgrades. 
+ +### List images + +``` +GET /api/v1/transformations/images +``` + +Returns an array of available image objects. + +**Response fields per image:** + +| Field | Description | +|---|---| +| `name` | Identifier used when referencing the image in a job (e.g. `"hdf5-cells"`) | +| `description` | Human-readable description of the image's purpose | +| `input_formats` | File formats accepted as input | +| `output_formats` | File formats produced as output | +| `version` | Version tag (e.g. `"latest"` or a specific release tag such as `"0.0.7"`) | + +Use this endpoint to confirm image availability and identify the version to specify when submitting a job. + +--- + +## Transformation Jobs + +A transformation job binds a configuration and an image to one or more input file accessions and executes the processing pipeline. Each job produces an execution log and, when not in dry-run mode, creates or updates ODM objects. + +### Submit a job + +``` +POST /api/v1/transformations/jobs +``` + +Creates and submits a new transformation job. The response includes the `id` of the created job, which is required for status and log queries. + +**Request body:** + +| Field | Type | Required | Description | +|---|---|---|---| +| `configuration_id` | integer | Yes | ID of the transformation configuration to use | +| `dry_run` | boolean | Yes | `true` to simulate the run without writing data to ODM; `false` for a full run | +| `image_reference` | object | Yes | Specifies the image to use. Contains `name` (string) and `version` (string). | +| `input_accessions` | array of strings | Yes | ODM accessions of the input files to process | +| `volume_size` | integer | Yes | Scratch volume size in GB allocated for the job | + +**`image_reference` fields:** + +| Field | Type | Description | +|---|---|---| +| `name` | string | Image name. Use `"hdf5-cells"` for single-cell HDF5 transformations. | +| `version` | string | Version tag. Use `"latest"` or a specific release tag (e.g. `"0.0.7"`). 
| + +**`volume_size` guidelines:** + +| Input format | Recommended `volume_size` | +|---|---| +| H5AD | ≥ 1.4 × size of the original attachment (GB) | +| 10x H5 | ≥ 4 × size of the original attachment (GB) | + +H5 files require significantly more scratch space due to the internal conversion to H5AD format. + +**Example request body (dry run):** + +```json +{ + "configuration_id": 42, + "dry_run": true, + "image_reference": { + "name": "hdf5-cells", + "version": "latest" + }, + "input_accessions": ["GSF020408"], + "volume_size": 30 +} +``` + +**Example request body (full run):** + +```json +{ + "configuration_id": 42, + "dry_run": false, + "image_reference": { + "name": "hdf5-cells", + "version": "latest" + }, + "input_accessions": ["GSF020408"], + "volume_size": 30 +} +``` + +### Get job status + +``` +GET /api/v1/transformations/jobs/{id} +``` + +Returns the job object, including the current `status.state`. + +**Path parameters:** + +| Parameter | Type | Description | +|---|---|---| +| `id` | integer | ID of the job to query | + +**`status.state` values:** + +| State | Meaning | +|---|---| +| `RUNNING` | Job is in progress | +| `DONE` | Job finished successfully | +| `FAILED` | Job encountered an error | + +### Retrieve job logs + +``` +POST /api/v1/transformations/jobs/{id}/logs +``` + +Returns the log records for the specified job. Logs include: + +- Configuration validation messages. +- Input file structure report (keys, data types, shapes, attribute names). +- Warnings and errors encountered during metadata extraction and curation. +- Linking validation results (dry-run only). +- Accessions of ODM objects created or updated (full run only). 
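Returning to job submission: the request fields and the `volume_size` guidelines described under "Submit a job" can be combined into a small helper. This is only a sketch. `recommended_volume_size` and `build_job_request` are hypothetical names, rounding up to a whole gigabyte is an assumption, and the resulting body would still need to be sent to `POST /api/v1/transformations/jobs` with whatever authentication your ODM instance requires.

```python
import math

# Multipliers from the volume_size guidelines (minimum recommended scratch space).
SCRATCH_FACTOR = {"h5ad": 1.4, "h5": 4.0}


def recommended_volume_size(attachment_gb: float, file_type: str) -> int:
    """Smallest whole number of GB meeting the guideline (rounding is an assumption)."""
    return max(1, math.ceil(attachment_gb * SCRATCH_FACTOR[file_type]))


def build_job_request(configuration_id, accession, attachment_gb, file_type,
                      dry_run=True, version="latest"):
    """Assemble the JSON body for a job submission (hypothetical helper)."""
    return {
        "configuration_id": configuration_id,
        "dry_run": dry_run,
        "image_reference": {"name": "hdf5-cells", "version": version},
        "input_accessions": [accession],
        "volume_size": recommended_volume_size(attachment_gb, file_type),
    }


# Start with a dry run; switch dry_run to False once the log shows no warnings.
request_body = build_job_request(42, "GSF020408", attachment_gb=10.5, file_type="h5ad")
print(request_body)
```

Defaulting `dry_run` to `True` mirrors the recommended workflow of iterating on a configuration with dry runs before a full import.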
+ +**Path parameters:** + +| Parameter | Type | Description | +|---|---|---| +| `id` | integer | ID of the job whose logs to retrieve | diff --git a/docs/user-guide/doc-odm-user-guide/attribute-mapping.md b/docs/user-guide/doc-odm-user-guide/attribute-mapping.md new file mode 100644 index 0000000..c4bc1be --- /dev/null +++ b/docs/user-guide/doc-odm-user-guide/attribute-mapping.md @@ -0,0 +1,62 @@ +# Attribute Mapping Reference + +During metadata curation, the transformation automatically maps commonly used attribute names found in source HDF5 files to the canonical ODM API names. + +Mapping is applied separately to cell metadata and feature metadata. When an attribute in the source file matches one of the known alternative names listed below, it is renamed to the corresponding ODM API display name. Attributes that do not match any known name are converted to camelCase. + +## Cell metadata attributes + +The table below lists the canonical ODM API name for each attribute alongside the alternative source names that are automatically recognized. 
+ +| ODM API display name | Alternative names | +|---|---| +| cellID | — | +| barcode | — | +| batch | `sample_id`, `sample`, `run_id` | +| cellType | `cell_type`, `celltype`, `ident`, `labels` | +| cluster | `cluster_louvain`, `cluster_leiden`, `seurat_clusters` | +| nCounts | `n_counts`, `umi_count`, `nCount_RNA`, `total_umi`, `n_umi`, `n_reads`, `nUMI`, `UMI_count` | +| percentMito | `percent_mito`, `percent_mt`, `percent.mt`, `pct_mt`, `pct_mito`, `pct_counts_mito`, `percent.mito`, `percent.mito.raw`, `mito_ratio`, `pct_counts_mt` | +| umap | `X_umap`, `UMAP` | +| pca | `X_pca`, `PCA` | +| tsne | `X_tsne`, `tSNE` | +| pcaHarmony | `pca_harmony`, `X_harmony`, `harmony_embedding`, `X_pca_harmony` | +| nGenes | `n_genes`, `n_genes_by_counts`, `nGene`, `n_features`, `nFeature_RNA`, `genes_detected`, `detected_genes`, `gene_count`, `Total_Genes_Detected` | +| mitoCounts | `mito_counts`, `total_counts_mt`, `total_counts_mito`, `subsets_mt_sum`, `mt_sum`, `MT_sum` | +| riboCounts | `ribo_counts`, `total_counts_ribo`, `total_counts_rb`, `subsets_ribo_sum`, `rb_counts`, `rb_sum` | +| percentRibo | `percent_ribo`, `percent_rb`, `percent.rb`, `pct_counts_ribo`, `ribo_ratio`, `pct_ribo`, `pct_counts_rb`, `pct_counts_rrna` | +| percentHemoglobin | `percent_hb`, `pct_hb`, `hemoglobin_fraction`, `prop_hb`, `percent_hemoglobin` | +| doubletStatus | `doublet_status`, `is_doublet`, `predicted_doublet`, `multiplet_status` | +| doubletScore | `doublet_score`, `scrublet_score`, `doublet_probability`, `multiplet_score`, `doublet_stat` | +| sScore | `S_score`, `s.score`, `S.Score`, `s_phase_score`, `S_phase_probability` | +| g2mScore | `G2M_score`, `g2m.score`, `G2M.Score`, `g2m_phase_score`, `G2M_phase_probability` | +| cellCycle | `phase`, `cell_cycle_phase`, `cc_phase`, `cycle_stage` | +| ambientFraction | `ambient_fraction`, `decontX_score`, `rho`, `contamination_fraction`, `ambient_rna_percent`, `soup_fraction`, `soup_frac` | + +## Feature metadata attributes + +The table below 
lists the canonical ODM API name for each feature attribute alongside the alternative source names that are automatically recognized. + +| ODM API display name | Alternative names | +|---|---| +| geneId | `gene_id` (index), `gene_ids`, `ensembl_id`, `feature_id`, `stable_id`, `ENSEMBL` | +| gene | `symbol`, `symbols`, `gene_symbol`, `gene_symbols`, `feature_name`, `display_name`, `name`, `gene_name` | +| totalCounts | `total_counts`, `gene_total`, `sum_counts`, `count_sum`, `total_umis` | +| nCellsByCounts | `n_cells_by_counts`, `n_cells`, `num_cells`, `n_obs`, `num_cells_expressed` | +| meanCounts | `mean_counts`, `avg_exp`, `obs_mean`, `means` | +| pctDropoutByCounts | `pct_dropout_by_counts`, `pct_dropout`, `percent_dropout`, `dropout_rate` | + +### Gene ID to name mapping + +When feature metadata contains a `geneId` column but no gene name column, the transformation can automatically resolve gene names from a built-in reference. This is controlled by the `map_gene_ids_to_names` parameter in the `feature_metadata` configuration block, which is enabled by default. Set it to `false` for proteomics or other non-gene-ID data where this behaviour is not appropriate. + +The mapping is performed using Ensembl and NCBI reference data. Both Ensembl gene IDs (e.g. `ENSG...`) and NCBI gene IDs are supported. The following organisms are supported in `hdf5-cells`: + +| Organism | Genome version | Ensembl release | NCBI release | +|----------|----------------|-----------------|--------------| +| *Homo sapiens* | GRCh38.p14 | 115 | GCF_000001405.40-RS_2025_08 | +| *Mus musculus* | GRCm39 | 115 | GCF_000001635.27-RS_2024_02 | +| *Rattus norvegicus* | GRCr8 | 115 | GCF_036323735.1-RS_2024_02 | +| *Sus scrofa* | Sscrofa11.1 | 115 | 106 | + +> The gene ID column must be named `geneId` for mapping to be performed. 
If the column has a different name in the source file, ensure it is covered by the feature metadata attribute mapping above so that it is renamed to `geneId` before this step runs. \ No newline at end of file diff --git a/docs/user-guide/doc-odm-user-guide/configuration-reference.md b/docs/user-guide/doc-odm-user-guide/configuration-reference.md new file mode 100644 index 0000000..e1f6742 --- /dev/null +++ b/docs/user-guide/doc-odm-user-guide/configuration-reference.md @@ -0,0 +1,145 @@ +# Configuration Reference: Single-Cell HDF5 Transformation + +> **Related documentation:** [About SC HDF5 Transformations](about-sc-hdf5-transformations.md) · [How-to Guides](how-to-sc-hdf5-transformations.md) · [API Reference](api-reference.md) · [Transformation Process Reference](transformation-process-reference.md) + +The configuration is validated at the start of every run. If `file_type` is missing or invalid, the pipeline raises an error immediately. All other validation errors are collected and reported together. Unrecognised keys are ignored with a warning. + +--- + +## Top-level parameters + +| Parameter | Type | Required | Default | Description | +|-----------|------|----------|---------|-------------| +| `file_type` | `string` | **Yes** | — | Format of the input file. Accepted values: `"h5ad"`, `"h5"`. | +| `save_logs` | `boolean` | No | `true` | When `false`, logs are not saved as an attachment after the run. Has no effect when the job is submitted with `dry_run: true`. | + +--- + +## `biosample_metadata` + +Settings for extracting, transforming, and exporting cell-level metadata to Sample, Library, or Preparation entities. The entire section is optional. + +| Parameter | Type | Required | Default | Description | +|-----------|------|----------|---------|-------------| +| `metadata_keys` | `dict[string, string]` | Yes | — | Maps HDF5 group keys to metadata types. Use `"obs": "metadata"` to read standard cell metadata. 
| +| `biosample_column_name` | `string` | Yes | — | Column identifying which biosample each cell belongs to. Rows are grouped by this column for aggregation. | + +**`metadata_keys` example:** +```json +{ "obs": "metadata" } +``` + +### `biosample_metadata.sample` + +Settings for exporting metadata to the Sample entity. Optional. + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `create_new_group` | `boolean` | `false` | When `true`, creates a new Sample group in ODM and links it to the study. | +| `template_id` | `string` | — | Template ID for the new Sample group. Falls back to the study default if omitted. | +| `columns_to_export` | `list[string]` | — | Cell metadata columns to include in the exported Sample metadata. Only columns constant per biosample are eligible; exported columns are dropped from cell metadata. | +| `columns_renaming_map` | `dict[string, string]` | — | Maps source column names to new names in the exported metadata. | +| `columns_to_fill_missing_values` | `dict[string, string]` | — | Default values for missing entries in specified columns. | +| `columns_to_curate_values` | `dict[string, dict[string, string]]` | — | Maps specific values in a column to replacement values. | + +**Examples:** +```json +{ "columns_renaming_map": { "tissue_type": "tissueType" } } +{ "columns_to_fill_missing_values": { "disease": "unknown" } } +{ "columns_to_curate_values": { "tissue": { "PBMCs": "peripheral blood mononuclear cells" } } } +``` + +### `biosample_metadata.library` + +Accepts the same parameters as `biosample_metadata.sample`, plus: + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `linking_group` | `string` | — | Accession of an existing Sample group to link the new Library group to. If omitted, the pipeline uses a Sample group from the same run or pre-fetched accessions. 
| + +### `biosample_metadata.preparation` + +Accepts the same parameters as `biosample_metadata.library`, including `linking_group`. + +> **Constraint:** Only one of `library` or `preparation` may have `columns_to_export` set in the same configuration. + +--- + +## `cell_metadata` + +Settings for extracting and transforming cell-level metadata. Optional. If absent, no Cell Group is created. + +| Parameter | Type | Required | Default | Description | +|-----------|------|----------|---------|-------------| +| `metadata_keys` | `dict[string, string]` | Yes | — | Maps HDF5 group keys to metadata types. At least one key with value `"metadata"` is required. | +| `linking_group` | `dict[string, string \| list[string] \| null]` | No | — | Specifies the parent SLP entity (`sample`, `library` or `preparation`) to link the Cell Group to. Empty value triggers auto-discovery of all available accessions. For full linking resolution rules, see [Linking group determination](transformation-process-reference.md#13-linking-group-determination). | +| `columns_to_drop` | `list[string]` | No | — | Column names to remove before processing. | +| `columns_renaming_map` | `dict[string, string]` | No | — | Maps source column names to new names. | +| `columns_to_fill_missing_values` | `dict[string, string]` | No | — | Default values for missing entries. | +| `columns_to_curate_values` | `dict[string, dict[string, string]]` | No | — | Replacement values for specific entries in specified columns. | +| `set_column_value` | `dict[string, string]` | No | — | Sets a constant value for all rows. Can add new columns or overwrite existing ones. | +| `columns_to_preserve_name` | `list[string]` | No | — | Columns to exempt from internal name standardisation (e.g. Leiden cluster columns with decimal suffixes). | +| `add_qc_metrics` | `boolean` | No | `true` | When `true`, adds QC metrics (counts, genes, mitochondrial/ribosomal presence) if not already present. 
Skipped when the job is submitted with `dry_run: true`. | + +**`metadata_keys` accepted values (H5AD):** + +| Key | Value | Description | +|-----|-------|-------------| +| `obs` | `metadata` | Standard cell annotations | +| `obsm` | `embedding` | Multidimensional cell data (PCA, UMAP, etc.) | +| `obsp` | `pairwise` | Pairwise cell annotations | + +For H5 files, use the same H5AD key names — the transformation maps them to the correct internal structure. + +**Examples:** +```json +{ "metadata_keys": { "obs": "metadata", "obsm": "embedding" } } +{ "linking_group": { "library": "GSF017080" } } +{ "columns_to_drop": ["taxon", "organism_id"] } +{ "columns_renaming_map": { "sample": "batch", "pctmt": "percentMito" } } +{ "set_column_value": { "sample_id": "lung_1" } } +{ "columns_to_preserve_name": ["cluster_leiden_0.5"] } +``` + +--- + +## `feature_metadata` + +Settings for extracting and transforming feature (gene)-level metadata. Optional. + +| Parameter | Type | Required | Default | Description | +|-----------|------|----------|---------|-------------| +| `metadata_keys` | `dict[string, string]` | Yes | — | Maps HDF5 group keys to metadata types. At least one key with value `"metadata"` is required. | +| `columns_to_drop` | `list[string]` | No | — | Column names to remove from feature metadata. | +| `columns_renaming_map` | `dict[string, string]` | No | — | Maps source column names to new names. | +| `columns_to_fill_missing_values` | `dict[string, string]` | No | — | Default values for missing entries. | +| `columns_to_curate_values` | `dict[string, dict[string, string]]` | No | — | Replacement values for specific entries. | +| `set_column_value` | `dict[string, string]` | No | — | Sets a constant value for all rows. | +| `columns_to_preserve_name` | `list[string]` | No | — | Columns to exempt from internal name standardisation. 
| +| `map_gene_ids_to_names` | `boolean` | No | `true` | When `true`, maps gene IDs to gene names if names are absent and `geneId` column is present. Set to `false` for proteomics or non-gene-ID data. | + +**`metadata_keys` accepted values (H5AD):** + +| Key | Value | Description | +|-----|-------|-------------| +| `var` | `metadata` | Standard feature annotations | +| `varm` | `embedding` | Multidimensional feature data | +| `varp` | `pairwise` | Pairwise feature annotations | + +--- + +## `cell_expression` + +Settings for extracting and uploading the cell expression matrix. Optional. If absent, no Expression Group is created. + +| Parameter | Type | Required | Default | Description | +|-----------|------|----------|---------|-------------| +| `data_class` | `string` | **Yes** | — | Data class label for the expression data (e.g. `"Single-cell transcriptomics"`). | +| `compression_level` | `integer` (0–9) | No | `4` | Brotli compression level. Higher values produce smaller files at the cost of longer compression time. | +| `chunk_size` | `integer` | No | inferred | Number of features processed per chunk. Calculated automatically from available memory if omitted. | +| `max_buffer_size` | `integer` | No | `50` | Amount of data (in MB) held in memory before being flushed to disk during writing. | +| `number_format` | `string` | No | inferred | Numeric precision of output values. Accepts printf-style (`"%.7g"`, `"%d"`) or NumPy dtype (`"float32"`, `"int64"`). | +| `columns_to_drop` | `list[string]` | No | — | Column names to remove from expression metadata. | +| `columns_renaming_map` | `dict[string, string]` | No | — | Maps source column names to new names. | +| `set_column_value` | `dict[string, string]` | No | — | Sets a constant value for all rows in specified columns. | +| `source_file_metadata` | `boolean` | No | `true` | When `true`, metadata from the source HDF5 attachment is read and included in expression metadata. 
Summary statistics (cell count, feature count, sparsity, etc.) are always appended regardless of this flag. | diff --git a/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/GSE156793.json b/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/GSE156793.json new file mode 100644 index 0000000..459f4de --- /dev/null +++ b/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/GSE156793.json @@ -0,0 +1,111 @@ +{ + "name": "GSE156793.json", + "description": "Config to transform GSE156793 dataset", + "data": { + "file_type": "h5ad", + "biosample_metadata": { + "metadata_keys": { + "obs": "metadata" + }, + "biosample_column_name": "RT_group", + "sample": { + "create_new_group": false, + "template_id": null, + "linking_group": null, + "columns_to_export": [ + "Fetus_id", + "Development_day" + ], + "columns_renaming_map": { + "Fetus_id": "Donor ID", + "Development_day": "Donor Age" + }, + "columns_to_fill_missing_values": null, + "columns_to_curate_values": null + }, + "library": { + "create_new_group": false, + "template_id": null, + "linking_group": null, + "columns_to_export": [ + "Assay" + ], + "columns_renaming_map": { + "Assay": "Assay Type" + }, + "columns_to_fill_missing_values": null, + "columns_to_curate_values": null + } + }, + "cell_metadata": { + "metadata_keys": { + "obs": "metadata", + "obsm": "embedding", + "obsp": "pairwise" + }, + "linking_group": null, + "columns_to_drop": [ + "batch", + "Organ", + "Sex", + "Batch", + "Experiment_batch" + ], + "columns_renaming_map": { + "_index": "barcode", + "RT_group": "batch", + "Main_cluster_name": "cluster", + "Organ_cell_lineage": "cell_type" + }, + "columns_to_curate_values": { + "matched_mca_cell_name": { + "nan": "" + }, + "bca_cluster_info": { + "nan": "" + }, + "matched_bca_cell_name": { + "nan": "" + }, + "X_umap": { + "nan,nan": "" + } + }, + "columns_to_fill_missing_values": { + "batch": "unknown" + }, + "columns_to_preserve_name": [ + "X_umap" + ], + "add_qc_metrics": true + }, + 
"feature_metadata": { + "metadata_keys": { + "var": "metadata", + "varm": "embedding", + "varp": "pairwise" + }, + "columns_to_drop": null, + "columns_renaming_map": { + "_index": "geneId", + "gene_short_name": "gene" + }, + "columns_to_fill_missing_values": null, + "columns_to_curate_values": null, + "set_column_value": null, + "columns_to_preserve_name": null, + "map_gene_ids_to_names": true + }, + "cell_expression": { + "data_class": "Single-cell transcriptomics", + "compression_level": null, + "chunk_size": null, + "max_buffer_size": null, + "number_format": null, + "columns_to_drop": null, + "columns_renaming_map": null, + "set_column_value": null, + "source_file_metadata": true + } + } +} diff --git a/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/GSE165045.json b/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/GSE165045.json new file mode 100644 index 0000000..8a7b0ce --- /dev/null +++ b/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/GSE165045.json @@ -0,0 +1,49 @@ +{ + "name": "GSE165045.json", + "description": "Config to transform GSE165045 dataset", + "data": { + "file_type": "h5ad", + "biosample_metadata": null, + "cell_metadata": { + "metadata_keys": { + "obs": "metadata" + }, + "linking_group": null, + "columns_to_drop": null, + "columns_renaming_map": { + "sample": "batch", + "_index": "barcode" + }, + "columns_to_fill_missing_values": null, + "columns_to_curate_values": null, + "set_column_value": null, + "columns_to_preserve_name": null, + "add_qc_metrics": true + }, + "feature_metadata": { + "metadata_keys": { + "var": "metadata" + }, + "columns_to_drop": null, + "columns_renaming_map": { + "_index": "gene" + }, + "columns_to_fill_missing_values": null, + "columns_to_curate_values": null, + "set_column_value": null, + "columns_to_preserve_name": null, + "map_gene_ids_to_names": true + }, + "cell_expression": { + "compression_level": null, + "chunk_size": null, + "max_buffer_size": null, + "data_class": 
"Single-cell transcriptomics", + "number_format": null, + "columns_to_drop": null, + "columns_renaming_map": null, + "set_column_value": null, + "source_file_metadata": true + } + } +} diff --git a/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/aggregated_config_1.json b/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/aggregated_config_1.json new file mode 100644 index 0000000..9a6c90c --- /dev/null +++ b/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/aggregated_config_1.json @@ -0,0 +1,83 @@ +{ + "name": "aggregated_config_1.json", + "description": "Aggregated config 1 to transform several public datasets", + "data": { + "file_type": "h5ad", + "biosample_metadata": null, + "cell_metadata": { + "metadata_keys": { + "obs": "metadata", + "obsm": "embedding", + "obsp": "pairwise" + }, + "linking_group": null, + "columns_to_drop": [ + "barcode", + "Species", + "sex", + "age", + "disease", + "biosample_id", + "lvef" + ], + "columns_renaming_map": { + "index": "barcode", + "_index": "barcode", + "donor_id": "batch", + "sample": "batch", + "sample_id": "batch", + "Sample_Name": "batch", + "biological.individual": "batch", + "GSM_ID": "gsm_id", + "cell_type_leiden0.6": "cell_type", + "SubCluster": "cluster", + "cellbender_ncount": "n_counts", + "cellbender_ngenes": "n_genes", + "cellranger_percent_mito": "percent_mito", + "cellbender_entropy": "entropy", + "cellranger_doublet_scores": "doublet_scores" + }, + "columns_to_fill_missing_values": null, + "columns_to_curate_values": null, + "set_column_value": null, + "columns_to_preserve_name": null, + "add_qc_metrics": true + }, + "feature_metadata": { + "metadata_keys": { + "var": "metadata", + "varm": "embedding", + "varp": "pairwise" + }, + "columns_to_drop": [ + "feature_biotype", + "feature_types", + "genome" + ], + "columns_renaming_map": { + "_index": "gene", + "index": "gene", + "GENE": "gene", + "var_index": "geneId", + "feature_is_filtered": "is_filtered" + }, + 
"columns_to_fill_missing_values": null, + "columns_to_curate_values": null, + "set_column_value": null, + "columns_to_preserve_name": null, + "map_gene_ids_to_names": true + }, + "cell_expression": { + "compression_level": null, + "chunk_size": null, + "max_buffer_size": null, + "data_class": "Single-cell transcriptomics", + "number_format": null, + "columns_to_drop": null, + "columns_renaming_map": null, + "set_column_value": null, + "source_file_metadata": true + } + } +} + diff --git a/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/aggregated_config_2.json b/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/aggregated_config_2.json new file mode 100644 index 0000000..064fbd0 --- /dev/null +++ b/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/aggregated_config_2.json @@ -0,0 +1,107 @@ +{ + "name": "aggregated_config_2.json", + "description": "Aggregated config 2 to transform several public datasets", + "data": { + "file_type": "h5ad", + "biosample_metadata": { + "metadata_keys": { + "obs": "metadata" + }, + "biosample_column_name": "sample", + "sample": { + "create_new_group": false, + "template_id": null, + "linking_group": null, + "columns_to_export": [ + "sex_ontology_term_id", + "development_stage_ontology_term_id", + "ethnicity_ontology_term_id", + "HbA1c", + "insulin_content", + "glucose_SI" + ], + "columns_renaming_map": { + "sex_ontology_term_id": "Donor Sex Term ID", + "development_stage_ontology_term_id": "Developmental Stage Term ID", + "ethnicity_ontology_term_id": "Donor Ethnicity Term ID", + "HbA1c": "Hemoglobin A1c (HbA1c) Concentration Value", + "insulin_content": "Fasting Insulin Concentration Value", + "glucose_SI": "Fasting Glucose Concentration Value" + }, + "columns_to_fill_missing_values": null, + "columns_to_curate_values": null + }, + "library": { + "create_new_group": false, + "template_id": null, + "linking_group": null, + "columns_to_export": [ + "assay_ontology_term_id" + ], + 
"columns_renaming_map": { + "assay_ontology_term_id": "Assay Type Term ID" + }, + "columns_to_fill_missing_values": null, + "columns_to_curate_values": null + } + }, + "cell_metadata": { + "metadata_keys": { + "obs": "metadata", + "obsm": "embedding" + }, + "linking_group": null, + "columns_to_drop": [ + "id", + "BMI", + "organism_ontolology_term_id", + "disease_ontology_term_id", + "is_primary_data", + "tissue_ontology_term_id" + ], + "columns_renaming_map": { + "_index": "barcode", + "sample": "batch", + "louvain_anno_broad": "louvain", + "louvain_anno_fine": "louvain_fine", + "cell_type_ontology_term_id": "cell_type", + "mt_frac": "percent_mito" + }, + "columns_to_fill_missing_values": null, + "columns_to_curate_values": null, + "set_column_value": null, + "columns_to_preserve_name": null, + "add_qc_metrics": true + }, + "feature_metadata": { + "metadata_keys": { + "var": "metadata" + }, + "columns_to_drop": [ + "feature_biotype" + ], + "columns_renaming_map": { + "ensembl_ID": "geneId", + "human_ensembl_ID": "human_ensembl_id", + "feature_is_filtered": "is_filtered", + "filtered_mapped_human_ensembl_ID": "filtered_mapped_human_ensembl_id" + }, + "columns_to_fill_missing_values": null, + "columns_to_curate_values": null, + "set_column_value": null, + "columns_to_preserve_name": null, + "map_gene_ids_to_names": true + }, + "cell_expression": { + "compression_level": null, + "chunk_size": null, + "max_buffer_size": null, + "data_class": "Single-cell transcriptomics", + "number_format": null, + "columns_to_drop": null, + "columns_renaming_map": null, + "set_column_value": null, + "source_file_metadata": true + } + } +} \ No newline at end of file diff --git a/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/aggregated_config_3.json b/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/aggregated_config_3.json new file mode 100644 index 0000000..b9d9c9e --- /dev/null +++ 
b/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/aggregated_config_3.json @@ -0,0 +1,117 @@ +{ + "name": "aggregated_config_3.json", + "description": "Aggregated config 3 to transform several public datasets", + "data": { + "file_type": "h5ad", + "biosample_metadata": { + "metadata_keys": { + "obs": "metadata" + }, + "biosample_column_name": "sample_id", + "sample": { + "create_new_group": false, + "template_id": null, + "linking_group": null, + "columns_to_export": [ + "Condition", + "self_reported_ethnicity_ontology_term_id", + "tissue_type" + ], + "columns_renaming_map": { + "Condition": "Condition Group", + "self_reported_ethnicity_ontology_term_id": "Donor Ethnicity Term ID", + "tissue_type": "Cell Source" + }, + "columns_to_fill_missing_values": null, + "columns_to_curate_values": { + "Sample Source ID": { + "AM031": "Liver-32", + "AM042": "Liver-13", + "AM048": "Liver-14", + "AM061": "Liver-18", + "AM062": "Liver-33", + "AM072": "Liver-34" + } + } + }, + "library": { + "create_new_group": null, + "template_id": null, + "linking_group": null, + "columns_to_export": null, + "columns_to_fill_missing_values": null, + "columns_to_curate_values": null + } + }, + "cell_metadata": { + "metadata_keys": { + "obs": "metadata", + "obsm": "embedding", + "obsp": "pairwise" + }, + "linking_group": null, + "columns_to_drop": [ + "barcode", + "Sex", + "Age", + "batch", + "organism_ontology_term_id", + "donor_id", + "development_stage_ontology_term_id", + "sex_ontology_term_id", + "disease_ontology_term_id", + "tissue_ontology_term_id" + ], + "columns_renaming_map": { + "_index": "barcode", + "sample_id": "batch", + "log10GenesPerUMI_injured": "log10_genes_per_umi_injured", + "CellType_injured": "cell_type_injured", + "log10GenesPerUMI_healthy": "log10_genes_per_umi_healthy", + "CellType_healthy": "cell_type_healthy", + "cell_type_ontology_term_id": "cell_type" + }, + "columns_to_fill_missing_values": null, + "columns_to_curate_values": { + "batch": { + "AM031": 
"Liver-32", + "AM042": "Liver-13", + "AM048": "Liver-14", + "AM061": "Liver-18", + "AM062": "Liver-33", + "AM072": "Liver-34" + } + }, + "set_column_value": null, + "columns_to_preserve_name": null, + "add_qc_metrics": true + }, + "feature_metadata": { + "metadata_keys": { + "var": "metadata", + "varm": "embedding", + "varp": "pairwise" + }, + "columns_to_drop": null, + "columns_renaming_map": { + "_index": "gene" + }, + "columns_to_fill_missing_values": null, + "columns_to_curate_values": null, + "set_column_value": null, + "columns_to_preserve_name": null, + "map_gene_ids_to_names": true + }, + "cell_expression": { + "compression_level": null, + "chunk_size": null, + "max_buffer_size": null, + "data_class": "Single-cell transcriptomics", + "number_format": "float32", + "columns_to_drop": null, + "columns_renaming_map": null, + "set_column_value": null, + "source_file_metadata": true + } + } +} \ No newline at end of file diff --git a/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/dataset-import-commands.md b/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/dataset-import-commands.md new file mode 100644 index 0000000..fd17906 --- /dev/null +++ b/docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/extras/dataset-import-commands.md @@ -0,0 +1,195 @@ +# Curated Public Datasets: Import Commands + +The commands below load each curated single-cell dataset into an ODM instance using the `odm-import-data` CLI. Each command uploads study and sample/library metadata alongside the H5AD attachment, ready for transformation. + +Replace ``, ``, and `