divkila/Teiko_Technical

Teiko Technical

Instructions to Run the Code and Reproduce Outputs

1. Install dependencies

make setup

2. Run the full data pipeline (initializes the database, loads data, and generates all results)

make pipeline

3. Launch the dashboard

make dashboard

When running in GitHub Codespaces, port 8501 will be forwarded automatically and a browser tab will open with the dashboard URL. If it does not open automatically, go to the Ports tab in the Codespaces panel and click the forwarded address for port 8501.

All pipeline outputs are written to the results/ folder:

  • results/freq_summary_table.csv: relative cell population frequencies per sample (Part 2)
  • results/boxplot.png: responder vs non-responder boxplot (Part 3)
  • results/data_subsets.txt: baseline melanoma + miraclib subset analysis (Part 4)

Explanation of Schema

The database (cell_counts.db) uses a normalized relational schema with four tables:

projects: one row per clinical project (project_id as primary key).

subjects: one row per patient, storing all subject-level clinical metadata: condition, age, sex, treatment, and response. Each subject belongs to one project via a foreign key. Subject attributes are stored here rather than repeated on every sample row, so a patient's metadata is stored exactly once regardless of how many samples they contribute.

samples: one row per blood draw, storing sample_type and time_from_treatment_start. Each sample belongs to one subject via a foreign key.

cell_counts: one row per cell population per sample, in long format. Rather than storing each cell type as a separate column, each row has a cell_type and count field. This means adding a new cell population requires inserting new rows, not altering the table schema.
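The four tables above can be sketched as SQLite DDL. This is a minimal illustration, not the actual `load_data.py` schema: the primary-key column names `subject_id` and `sample_id` are assumptions, since the text names only the other columns.

```python
import sqlite3

# Sketch of the normalized schema described above.
# subject_id and sample_id are assumed names for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (
    project_id TEXT PRIMARY KEY
);
CREATE TABLE subjects (
    subject_id TEXT PRIMARY KEY,
    project_id TEXT REFERENCES projects(project_id),
    condition  TEXT,
    age        INTEGER,
    sex        TEXT,
    treatment  TEXT,
    response   TEXT
);
CREATE TABLE samples (
    sample_id   TEXT PRIMARY KEY,
    subject_id  TEXT REFERENCES subjects(subject_id),
    sample_type TEXT,
    time_from_treatment_start INTEGER
);
-- Long format: one row per (sample, cell population) pair.
CREATE TABLE cell_counts (
    sample_id TEXT REFERENCES samples(sample_id),
    cell_type TEXT,
    count     INTEGER
);
""")
```

Because cell_counts is long-format, a new cell population is just a new `cell_type` value in inserted rows; no `ALTER TABLE` is ever needed.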

Scalability rationale:

  • Efficiently querying across hundreds of projects: the projects table is a lightweight lookup, so queries filter by project_id without scanning unrelated data, and indexes on foreign keys keep joins fast.
  • Organizing large numbers of samples: normalizing subject metadata means it is stored once per subject, not once per sample, keeping the subjects table concise even as the samples table grows.
  • Appending new data for analytics: the long-format cell_counts table supports arbitrary aggregations by cell type, timepoint, condition, treatment, or response with standard SQL joins, no schema changes needed. New cell types, sample types, or subject-level covariates can be added by inserting rows rather than altering the table structure.
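As one concrete illustration of the long-format aggregation claim, relative cell frequencies per sample reduce to a GROUP BY plus a join. The table and column names follow the schema description; the sample values are invented for the example.

```python
import sqlite3

# Relative frequency of each cell population within its sample,
# computed entirely in SQL over the long-format table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cell_counts (sample_id TEXT, cell_type TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO cell_counts VALUES (?, ?, ?)",
    [("s1", "b_cell", 100), ("s1", "t_cell", 300)],  # made-up values
)
query = """
SELECT c.sample_id,
       c.cell_type,
       1.0 * c.count / t.total AS relative_frequency
FROM cell_counts c
JOIN (SELECT sample_id, SUM(count) AS total
      FROM cell_counts
      GROUP BY sample_id) t
  ON c.sample_id = t.sample_id
"""
for sample_id, cell_type, freq in conn.execute(query):
    print(sample_id, cell_type, freq)  # e.g. s1 b_cell 0.25
```

Slicing by timepoint, condition, treatment, or response is the same query with extra joins to samples and subjects.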

Code Structure Overview

load_data.py       # Part 1: initializes the SQLite database and loads cell-count.csv
data_analysis.py   # Parts 2-4: computes frequencies, runs statistics, generates outputs
dashboard.py       # Streamlit dashboard displaying all results interactively
cell-count.csv     # Input data
requirements.txt   # Python dependencies
Makefile           # Defines setup, pipeline, and dashboard targets
results/           # Generated outputs (created by make pipeline)

load_data.py is kept separate from the analysis as required by the specifications. Re-running it deletes and rebuilds the database from scratch so results are always reproducible.

data_analysis.py is structured as a set of independent functions, one per part. This makes each step easy to test and lets each part of the analysis be reasoned about in isolation. All outputs are written to results/ so the dashboard can read from files rather than re-running potentially expensive queries.
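The one-function-per-part pattern can be sketched as below. The function name, DataFrame layout, and values are hypothetical; only the long-format columns (`sample_id`, `cell_type`, `count`) come from the schema description.

```python
import pandas as pd

def compute_frequencies(counts: pd.DataFrame) -> pd.DataFrame:
    """Part 2 sketch: relative frequency of each cell type per sample.

    Expects long-format columns: sample_id, cell_type, count.
    """
    totals = counts.groupby("sample_id")["count"].transform("sum")
    return counts.assign(relative_frequency=counts["count"] / totals)

# Made-up input to show the shape of the computation.
counts = pd.DataFrame({
    "sample_id": ["s1", "s1", "s2"],
    "cell_type": ["b_cell", "t_cell", "b_cell"],
    "count": [100, 300, 50],
})
freqs = compute_frequencies(counts)
```

Each such function can write its own file under results/ (e.g. a CSV or PNG), so the parts stay independently runnable and testable.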

dashboard.py reads pre-generated files from results/ where possible, and only queries the database directly for lightweight lookups (the Part 3 cohort filter and Part 4 subset breakdowns). This avoids memory issues from loading large DataFrames at dashboard startup. Results are organized into three labeled tabs corresponding to Parts 2, 3, and 4.

Link to Dashboard

The dashboard runs locally via Streamlit. To view it, run make dashboard and navigate to the URL shown in the terminal. In GitHub Codespaces, a prompt will appear offering to open the forwarded port URL, which opens the live dashboard.
