1. Install dependencies:

   make setup

2. Run the full data pipeline (initializes the database, loads data, and generates all results):

   make pipeline

3. Launch the dashboard:

   make dashboard

When running in GitHub Codespaces, port 8501 is forwarded automatically and a browser tab opens with the dashboard URL. If it does not open automatically, go to the Ports tab in the Codespaces panel and click the forwarded address for port 8501.
All pipeline outputs are written to the results/ folder:
- `results/freq_summary_table.csv`: relative cell population frequencies per sample (Part 2)
- `results/boxplot.png`: responder vs non-responder boxplot (Part 3)
- `results/data_subsets.txt`: baseline melanoma + miraclib subset analysis (Part 4)
The database (cell_counts.db) uses a normalized relational schema with four tables:
- `projects`: one row per clinical project (`project_id` as primary key).
- `subjects`: one row per patient, storing all subject-level clinical metadata (condition, age, sex, treatment, and response). Each subject belongs to one project via a foreign key. Subject attributes live here rather than being repeated on every sample row, so a patient's metadata is stored exactly once regardless of how many samples they contribute.
- `samples`: one row per blood draw, storing `sample_type` and `time_from_treatment_start`. Each sample belongs to one subject via a foreign key.
- `cell_counts`: one row per cell population per sample, in long format. Rather than storing each cell type as a separate column, each row has a `cell_type` and a `count` field, so adding a new cell population means inserting new rows, not altering the table schema.
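As a sketch, the four tables described above might be created like this (column names beyond those mentioned in the text are assumptions; the actual DDL lives in `load_data.py` and may differ):

```python
import sqlite3

# Hypothetical schema sketch for illustration only; exact column names
# and types in load_data.py may differ.
SCHEMA = """
CREATE TABLE projects (
    project_id TEXT PRIMARY KEY
);
CREATE TABLE subjects (
    subject_id TEXT PRIMARY KEY,
    project_id TEXT NOT NULL REFERENCES projects(project_id),
    condition  TEXT,
    age        INTEGER,
    sex        TEXT,
    treatment  TEXT,
    response   TEXT
);
CREATE TABLE samples (
    sample_id   TEXT PRIMARY KEY,
    subject_id  TEXT NOT NULL REFERENCES subjects(subject_id),
    sample_type TEXT,
    time_from_treatment_start INTEGER
);
-- Long format: one row per (sample, cell population) pair.
CREATE TABLE cell_counts (
    sample_id TEXT NOT NULL REFERENCES samples(sample_id),
    cell_type TEXT NOT NULL,
    count     INTEGER NOT NULL,
    PRIMARY KEY (sample_id, cell_type)
);
-- Indexes on foreign keys keep joins fast as the tables grow.
CREATE INDEX idx_subjects_project ON subjects(project_id);
CREATE INDEX idx_samples_subject  ON samples(subject_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```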
Scalability rationale:
- Efficiently querying across hundreds of projects: the `projects` table is a lightweight lookup that filters by `project_id` without scanning unrelated data, and indexes on foreign keys keep joins fast.
- Organizing large numbers of samples: normalizing subject metadata means it is stored once per subject, not once per sample, keeping the `subjects` table concise even as `samples` grows.
- Appending new data for analytics: the long-format `cell_counts` table supports arbitrary aggregations by cell type, timepoint, condition, treatment, or response using standard SQL joins, with no schema changes. New cell types, sample types, or subject-level covariates can be added by inserting rows rather than altering the table structure.
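To make the long-format point concrete, here is a minimal sketch of a cross-table aggregation (mean count per cell type split by response) against a toy in-memory database with made-up rows; a new cell type would simply be more rows in `cell_counts`, with the query unchanged:

```python
import sqlite3

# Toy in-memory database with invented rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subjects (subject_id TEXT PRIMARY KEY, response TEXT);
CREATE TABLE samples  (sample_id TEXT PRIMARY KEY, subject_id TEXT);
CREATE TABLE cell_counts (sample_id TEXT, cell_type TEXT, count INTEGER);
""")
conn.executemany("INSERT INTO subjects VALUES (?, ?)",
                 [("s1", "yes"), ("s2", "no")])
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [("a", "s1"), ("b", "s2")])
conn.executemany("INSERT INTO cell_counts VALUES (?, ?, ?)",
                 [("a", "b_cell", 100), ("a", "t_cell", 300),
                  ("b", "b_cell", 200), ("b", "t_cell", 200)])

# Because cell_counts is long format, grouping by any combination of
# columns is a plain GROUP BY over joins; no schema change is needed.
rows = conn.execute("""
    SELECT sub.response, cc.cell_type, AVG(cc.count) AS avg_count
    FROM cell_counts cc
    JOIN samples  sa  ON sa.sample_id  = cc.sample_id
    JOIN subjects sub ON sub.subject_id = sa.subject_id
    GROUP BY sub.response, cc.cell_type
    ORDER BY sub.response, cc.cell_type
""").fetchall()
```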
load_data.py # Part 1: initializes the SQLite database and loads cell-count.csv
data_analysis.py # Parts 2-4: computes frequencies, runs statistics, generates outputs
dashboard.py # Streamlit dashboard displaying all results interactively
cell-count.csv # Input data
requirements.txt # Python dependencies
Makefile # Defines setup, pipeline, and dashboard targets
results/ # Generated outputs (created by make pipeline)
load_data.py is kept separate from the analysis, as required by the specifications. Re-running it deletes and rebuilds the database from scratch, so results are always reproducible.
data_analysis.py is structured as a set of independent functions, one per part. This makes each step easy to test and to reason about in isolation. All outputs are written to results/ so the dashboard can read from files rather than re-running potentially expensive queries.
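As an illustration, the Part 2 frequency computation reduces to a per-sample total plus a ratio. The sketch below is a standalone reimplementation, not the actual `data_analysis.py` code; the function name and row layout are assumptions:

```python
from collections import defaultdict

def relative_frequencies(counts):
    """Hypothetical sketch of the Part 2 computation.

    counts: list of (sample_id, cell_type, count) rows in long format.
    Returns {(sample_id, cell_type): percentage of that sample's total}.
    """
    totals = defaultdict(int)
    for sample_id, _, count in counts:
        totals[sample_id] += count
    return {(s, ct): 100.0 * c / totals[s] for s, ct, c in counts}

# Example: one sample with 25 B cells and 75 T cells.
freqs = relative_frequencies([("a", "b_cell", 25), ("a", "t_cell", 75)])
```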
dashboard.py reads pre-generated files from results/ where possible, and only queries the database directly for lightweight lookups (the Part 3 cohort filter and Part 4 subset breakdowns). This avoids memory issues from loading large DataFrames at dashboard startup. Results are organized into three labeled tabs corresponding to Parts 2, 3, and 4.
The dashboard runs locally via Streamlit. To view it, run make dashboard and navigate to the URL shown in the terminal. In GitHub Codespaces, a prompt will appear to open the forwarded port URL, which leads to the live dashboard.