1. Install dependencies:

   make setup

2. Run the full data pipeline (initializes the database, loads data, and generates all results):

   make pipeline

3. Launch the dashboard:

   make dashboard

When running in GitHub Codespaces, port 8501 is forwarded automatically and a browser tab opens with the dashboard URL. If it does not open automatically, go to the Ports tab in the Codespaces panel and click the forwarded address for port 8501.
All pipeline outputs are written to the results/ folder:
- `results/freq_summary_table.csv`: relative cell population frequencies per sample (Part 2)
- `results/boxplot.png`: responder vs non-responder boxplot (Part 3)
- `results/data_subsets.txt`: baseline melanoma + miraclib subset analysis (Part 4)
The database (cell_counts.db) uses a normalized relational schema with four tables:
- `projects`: one row per clinical project (`project_id` as primary key).
- `subjects`: one row per patient, storing all subject-level clinical metadata (condition, age, sex, treatment, and response). Each subject belongs to one project via a foreign key. Subject attributes live here rather than being repeated on every sample row, so a patient's metadata is stored exactly once regardless of how many samples they contribute.
- `samples`: one row per blood draw, storing `sample_type` and `time_from_treatment_start`. Each sample belongs to one subject via a foreign key.
- `cell_counts`: one row per cell population per sample, in long format. Rather than storing each cell type as a separate column, each row has a `cell_type` and a `count` field, so adding a new cell population means inserting new rows, not altering the table schema.
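As a sketch, the four tables described above might be created like this (column names beyond those mentioned in the text are assumptions; the actual DDL lives in `load_data.py` and may differ):

```python
import sqlite3

# Hypothetical schema sketch for illustration only; exact column names
# and types in load_data.py may differ.
SCHEMA = """
CREATE TABLE projects (
    project_id TEXT PRIMARY KEY
);
CREATE TABLE subjects (
    subject_id TEXT PRIMARY KEY,
    project_id TEXT NOT NULL REFERENCES projects(project_id),
    condition  TEXT,
    age        INTEGER,
    sex        TEXT,
    treatment  TEXT,
    response   TEXT
);
CREATE TABLE samples (
    sample_id   TEXT PRIMARY KEY,
    subject_id  TEXT NOT NULL REFERENCES subjects(subject_id),
    sample_type TEXT,
    time_from_treatment_start INTEGER
);
-- Long format: one row per (sample, cell population) pair.
CREATE TABLE cell_counts (
    sample_id TEXT NOT NULL REFERENCES samples(sample_id),
    cell_type TEXT NOT NULL,
    count     INTEGER NOT NULL,
    PRIMARY KEY (sample_id, cell_type)
);
-- Indexes on foreign keys keep joins fast as the tables grow.
CREATE INDEX idx_subjects_project ON subjects(project_id);
CREATE INDEX idx_samples_subject  ON samples(subject_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```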
Scalability rationale:
- Efficiently querying across hundreds of projects: the `projects` table is a lightweight lookup that filters by `project_id` without scanning unrelated data, and indexes on foreign keys keep joins fast.
- Organizing large numbers of samples: normalizing subject metadata means it is stored once per subject, not once per sample, keeping the `subjects` table concise even as `samples` grows.
- Appending new data for analytics: the long-format `cell_counts` table supports arbitrary aggregations by cell type, timepoint, condition, treatment, or response using standard SQL joins, with no schema changes. New cell types, sample types, or subject-level covariates can be added by inserting rows rather than altering the table structure.
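To make the long-format point concrete, here is a minimal sketch of a cross-table aggregation (mean count per cell type split by response) against a toy in-memory database with made-up rows; a new cell type would simply be more rows in `cell_counts`, with the query unchanged:

```python
import sqlite3

# Toy in-memory database with invented rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subjects (subject_id TEXT PRIMARY KEY, response TEXT);
CREATE TABLE samples  (sample_id TEXT PRIMARY KEY, subject_id TEXT);
CREATE TABLE cell_counts (sample_id TEXT, cell_type TEXT, count INTEGER);
""")
conn.executemany("INSERT INTO subjects VALUES (?, ?)",
                 [("s1", "yes"), ("s2", "no")])
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [("a", "s1"), ("b", "s2")])
conn.executemany("INSERT INTO cell_counts VALUES (?, ?, ?)",
                 [("a", "b_cell", 100), ("a", "t_cell", 300),
                  ("b", "b_cell", 200), ("b", "t_cell", 200)])

# Because cell_counts is long format, grouping by any combination of
# columns is a plain GROUP BY over joins; no schema change is needed.
rows = conn.execute("""
    SELECT sub.response, cc.cell_type, AVG(cc.count) AS avg_count
    FROM cell_counts cc
    JOIN samples  sa  ON sa.sample_id  = cc.sample_id
    JOIN subjects sub ON sub.subject_id = sa.subject_id
    GROUP BY sub.response, cc.cell_type
    ORDER BY sub.response, cc.cell_type
""").fetchall()
```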
load_data.py # Part 1: initializes the SQLite database and loads cell-count.csv
data_analysis.py # Parts 2-4: computes frequencies, runs statistics, generates outputs
dashboard.py # Streamlit dashboard displaying all results interactively
cell-count.csv # Input data
requirements.txt # Python dependencies
Makefile # Defines setup, pipeline, and dashboard targets
results/ # Generated outputs (created by make pipeline)
load_data.py is kept separate from the analysis, as required by the specifications. Re-running it deletes and rebuilds the database from scratch, so results are always reproducible.
data_analysis.py is structured as a set of independent functions, one per part. This makes each step easy to test and to reason about in isolation. All outputs are written to results/ so the dashboard can read from files rather than re-running potentially expensive queries.
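As an illustration, the Part 2 frequency computation reduces to a per-sample total plus a ratio. The sketch below is a standalone reimplementation, not the actual `data_analysis.py` code; the function name and row layout are assumptions:

```python
from collections import defaultdict

def relative_frequencies(counts):
    """Hypothetical sketch of the Part 2 computation.

    counts: list of (sample_id, cell_type, count) rows in long format.
    Returns {(sample_id, cell_type): percentage of that sample's total}.
    """
    totals = defaultdict(int)
    for sample_id, _, count in counts:
        totals[sample_id] += count
    return {(s, ct): 100.0 * c / totals[s] for s, ct, c in counts}

# Example: one sample with 25 B cells and 75 T cells.
freqs = relative_frequencies([("a", "b_cell", 25), ("a", "t_cell", 75)])
```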
dashboard.py reads pre-generated files from results/ where possible, and only queries the database directly for lightweight lookups (the Part 3 cohort filter and Part 4 subset breakdowns). This avoids memory issues from loading large DataFrames at dashboard startup. Results are organized into three labeled tabs corresponding to Parts 2, 3, and 4.
The dashboard runs locally via Streamlit. To view it, run make dashboard and navigate to the URL shown in the terminal. In GitHub Codespaces, a prompt will appear to open the forwarded port URL, which leads to the live dashboard.