Takuan is a REST API data service for transcriptomics data.
It is intended to ingest, organize and query data from transcriptomics experiments through an API.
Takuan stores its data in a PostgreSQL database.
Takuan handles data produced in transcriptomics experiments.
For a given experiment, samples taken from study participants are selected for RNA sequencing.
At the end of an RNA sequencing pipeline, results are usually stored in TSV/CSV format. Takuan handles two result formats:

- Multi-sample Raw Count Matrices (RCM)
  - Defines the expression levels for each feature (gene) and sample pair
  - Can only ingest one count type at a time (raw, TPM, TMM, GETMM or FPKM)
- Single-sample detailed counts
  - Defines the expression levels for each feature (gene) for the given sample
  - Can ingest all count types at once (raw, TPM, TMM, GETMM and FPKM)
Once the data is produced, it can be ingested into Takuan to allow downstream analysis of the results.
Takuan expects to receive RCM files in CSV format, where the columns correspond to unique sample identifiers, rows to unique feature identifiers (genes) and cells to the observed count for the sample-gene pair.
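For illustration, a minimal RCM could look like the sketch below (the sample IDs, counts, and the header of the feature ID column are made up):

```
gene_id,SAMPLE_001,SAMPLE_002,SAMPLE_003
ENSG00000000003,100,0,52
ENSG00000000005,1559,803,1275
```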
Takuan expects to receive single-sample data files in CSV or TSV format, where the columns correspond to specific expression measures, and the rows to feature IDs (genes).
For example:

```
gene_id          raw_count  tpm_count  tmm_count  getmm   fpkm_count
ENSG00000000003  0          0          0          0       0
ENSG00000000005  1559       2.3567     0.2369     7.566   0.369
...
```
With single-sample data files, you can save time by ingesting the raw counts and the pre-normalised values at the same time.
The single-sample endpoint even supports column header mappings, allowing you to flexibly ingest files that don't share header names.
To ingest data into Takuan and query it, follow these steps (an end-to-end example is sketched after this list):

- Create an experiment, in which we will later ingest gene expression data
  - POST `/experiment`
    - JSON body describing the experiment, where
      - `experiment_result_id` is a unique identifier for the experiment
      - `assembly_id` is the assembly accession ID used in the experiment
      - `assembly_name` is the genome assembly name used in the experiment
      - `extra_properties` is a JSON object where you can place additional metadata
- Ingest an RCM into the experiment you created
  - POST `/experiment/{experiment_result_id}/ingest`
    - Where `experiment_result_id` must correspond to an existing experiment ID in Takuan
    - A valid RCM file must be in the request's body as `rcm_file`
  - During the ingestion, Takuan creates a `gene_expression` row for every sample-gene pair
- OR ingest single-sample data
  - POST `/experiment/{experiment_result_id}/ingest/single`
    - Where `experiment_result_id` must correspond to an existing experiment ID in Takuan
    - A valid TSV/CSV file must be in the request's body as `data`
  - During the ingestion, Takuan creates a `gene_expression` row for every expression row in the file
- The `gene_expression` table now contains rows with the `raw_count` column filled
- (Optional) Normalized counts can be computed on demand and stored in the database
  - POST `/normalize/{experiment_result_id}/{method}`
    - `experiment_result_id` is the ID of an experiment with raw gene expressions
    - `method` is the normalization method to use (TPM, TMM or GETMM)
    - TPM and GETMM both require that you include a `gene_lengths` CSV file in the body
  - Normalized values are added in the appropriate column of `gene_expression`
- Query the experiments and gene expressions in your DB!
  - POST `/expressions` to get expression data results
    - JSON request body for filtering results and pagination
  - POST `/experiment/{experiment_result_id}/samples` to get the sample IDs for an experiment
  - POST `/experiment/{experiment_result_id}/features` to get the gene IDs for an experiment
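As a rough end-to-end sketch of these steps with `curl` (assembly values and file names are illustrative, the multipart field names follow the descriptions above, and the `/expressions` filter body is hypothetical; check the OpenAPI docs for the authoritative request schemas):

```bash
# Create an experiment (JSON body fields as described above)
curl -X POST http://localhost:5000/experiment \
  -H "Content-Type: application/json" \
  -d '{
        "experiment_result_id": "EXP_0001",
        "assembly_id": "GCA_000001405.15",
        "assembly_name": "GRCh38",
        "extra_properties": {"study": "demo"}
      }'

# Ingest a multi-sample RCM into the experiment, sent as "rcm_file"
curl -X POST http://localhost:5000/experiment/EXP_0001/ingest \
  -F "rcm_file=@counts_matrix_group_1.csv"

# (Optional) Normalize the raw counts with TPM, which requires gene lengths
curl -X POST http://localhost:5000/normalize/EXP_0001/TPM \
  -F "gene_lengths=@gene_lengths.csv"

# Query expressions; the filter/pagination fields below are illustrative only
curl -X POST http://localhost:5000/expressions \
  -H "Content-Type: application/json" \
  -d '{"experiments": ["EXP_0001"], "page": 1, "page_size": 100}'
```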
Valid synthetic transcriptomics data can be produced for Takuan.
To do so, please follow the steps detailed in the Bento Demo Dataset repository.
After following the instructions, 4 relevant files are produced for Takuan:
- counts_matrix_group_1.csv
- counts_matrix_group_2.csv
- counts_matrix_group_3.csv
- gene_lengths.csv
The count matrices can be ingested into Takuan as is. The gene lengths file can be used as is to normalize ingested expressions.
For testing purposes, this repository includes an RCM and a gene lengths file.
The following environment variables should be set when running a Takuan container:
| Name | Description | Default |
|---|---|---|
| `AUTHZ_ENABLED` | Enables/disables the authorization plugin | False |
| `CORS_ORIGINS` | List of allowed CORS origins | Null |
| `DB_HOST` | IP or hostname of the database | tds-db |
| `DB_PORT` | Database port | 5432 |
| `DB_USER` | Database username | tds_user |
| `DB_NAME` | Database name | tds |
| `DB_PASSWORD` | `DB_USER`'s database password | Null |
| `DB_PASSWORD_FILE` | Docker secret file for `DB_USER`'s database password | Null |
| `TDS_USER_NAME` | Non-root container user name running the server process | Null |
| `TDS_UID` | UID of `TDS_USER_NAME` | 1000 |
Note: Only use DB_PASSWORD or DB_PASSWORD_FILE, not both, since they serve the same purpose in different ways.
The Takuan Config object has its values populated from environment variables and secrets at startup.
The Config.db_password value is populated by either:
- `DB_PASSWORD=<a secure password>` if using an environment variable
  - As seen in docker-compose.dev.yaml
- `DB_PASSWORD_FILE=/run/secrets/db_password` if using a Docker secret (recommended)
  - As seen in docker-compose.secrets.dev.yaml
Using a Docker secret is recommended for security, as environment variables are more prone to be leaked.
DB_PASSWORD should only be considered for local development, or if the database is secured and isolated from public access in a private network.
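As a minimal sketch of the Docker secret approach (service and file names here are hypothetical; docker-compose.secrets.dev.yaml in this repository is the working reference):

```yaml
services:
  takuan:
    image: ghcr.io/bento-platform/takuan:latest
    environment:
      - DB_HOST=tds-db
      - DB_PORT=5432
      - DB_USER=tds_user
      - DB_NAME=tds
      # Point Takuan at the secret file instead of exposing the password as an env var
      - DB_PASSWORD_FILE=/run/secrets/db_password
    secrets:
      - db_password

secrets:
  db_password:
    file: ./secrets/db_password.txt  # hypothetical local path holding the password
```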
The Transcriptomics Data Service is meant to be a reusable microservice that can be integrated into existing stacks. Since authorization schemes vary across projects, Takuan allows adopters to code their own authorization plugin, enabling them to leverage their existing access control code, tools and policies.
See the authorization docs for more information on how to create and use the authz plugin with Takuan.
Start the Takuan server with a local PostgreSQL database for testing by running the following commands.

```bash
# start
docker compose up --build -d

# stop
docker compose down
```

The `--build` argument forces the image to be rebuilt. Be sure to use it if you want code changes to be present.
You can now interact with the API by querying `localhost:5000/{endpoint}`.
For the OpenAPI browser interface, go to `localhost:5000/docs`.
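For a quick smoke test of a running container, you can hit the service-info endpoint (see the endpoint table below):

```bash
# Should return a GA4GH service-info JSON object
curl http://localhost:5000/service-info
```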
For local development, you can use the docker-compose.dev.yaml file to start a Takuan
development container that mounts the local directory.
The server starts in reload mode to quickly reflect local changes, and debugpy is listening on the container's internal port 9511.
```bash
# Set UID for directory permissions in the container
export UID=$(id -u)

# start
docker compose -f ./docker-compose.dev.yaml up --build -d

# stop
docker compose -f ./docker-compose.dev.yaml down
```

You can then attach VS Code to the takuan container, and use the preconfigured Python Debugger (Takuan) for interactive debugging.
This service implements GA4GH's Service-Info spec.
If left unconfigured, a default service info object will be returned.
For adopters outside of the Bento stack, we recommend that you provide a custom service info object when deploying.
This can be done by simply mounting a JSON file in the Takuan container.
When starting, the service will look for a JSON file at /tds/lib/service-info.json.
If the file exists, it will be served from the GET /service-info endpoint, otherwise the default is used.
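For example, a custom definition mounted at /tds/lib/service-info.json could look like the sketch below (all values are placeholders; the field names follow the GA4GH service-info spec):

```json
{
  "id": "org.example.takuan",
  "name": "Takuan",
  "type": {
    "group": "org.example",
    "artifact": "transcriptomics-data-service",
    "version": "1.0.0"
  },
  "description": "Transcriptomics data service for the Example project",
  "organization": {
    "name": "Example Organization",
    "url": "https://example.org"
  },
  "version": "1.0.0",
  "environment": "prod"
}
```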
The service exposes the following endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/experiment` | GET | Get all experiments |
| `/experiment` | POST | Create an experiment |
| `/experiment/{experiment_result_id}` | GET | Get an experiment by unique ID |
| `/experiment/{experiment_result_id}` | DELETE | Delete an experiment by unique ID |
| `/experiment/{experiment_result_id}/samples` | POST | Retrieve the samples for a given experiment |
| `/experiment/{experiment_result_id}/features` | POST | Retrieve the features for a given experiment |
| `/experiment/{experiment_result_id}/ingest` | POST | Ingest multi-sample transcriptomics data into an experiment |
| `/experiment/{experiment_result_id}/ingest/single` | POST | Ingest single-sample transcriptomics data into an experiment |
| `/normalize/{experiment_result_id}/{method}` | POST | Normalize an experiment's gene expressions with one of the supported methods (TPM, TMM, GETMM) |
| `/expressions` | POST | Retrieve expressions with filter parameters |
| `/service-info` | GET | Returns a GA4GH service-info object describing the service |
Note: For more thorough API documentation, please refer to the OpenAPI release artifacts (openapi.json), or consult the hosted docs (link to come).
An openapi.json file is produced and attached to every release.
A Takuan deployment can be customized by mounting certain files to the container. The table below lists the files that can be mounted to a Takuan container to customize its behaviour.
| Container path | Description |
|---|---|
| `/run/secrets/` | Docker secrets directory |
| `/tds/lib/.env` | Extra environment variables for an authz plugin |
| `/tds/lib/authz.module.py` | Custom authorization plugin implementation |
| `/tds/lib/requirements.txt` | Extra Python dependencies to install for an authz plugin |
| `/tds/lib/service-info.json` | Custom GA4GH service-info JSON definition |
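For example, a compose service that provides a custom authz plugin and service-info definition could mount them like this (host-side paths are hypothetical):

```yaml
services:
  takuan:
    image: ghcr.io/bento-platform/takuan:latest
    volumes:
      # Container paths are the ones listed in the table above
      - ./authz/authz.module.py:/tds/lib/authz.module.py
      - ./authz/requirements.txt:/tds/lib/requirements.txt
      - ./authz/.env:/tds/lib/.env
      - ./service-info.json:/tds/lib/service-info.json
```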
The Transcriptomics Data Service is packaged and released as a Docker image using GitHub Actions.
Images are published in GitHub's container registry, here.
Images are built and published using the following tags:
- `<version>`: Build for a tagged release
- `latest`: Build for the latest tagged release
- `edge`: The top of the `main` branch
- `pr-<number>`: Build for a pull request that targets `main`
Note: Images with the -dev suffix (e.g. edge-dev) are based on dev.Dockerfile for local development.
To pull an image, or reference it in a compose file, use this pattern:
```bash
docker pull ghcr.io/bento-platform/takuan:<TAG>
```

List of example scripts to interact with a Takuan API:
