E-commerce Recommendation Intelligence Lab

A production-style ML recommender system demonstrating multiple recommendation algorithms, offline evaluation, A/B experimentation, and a Plotly Dash dashboard — designed to run fully on a 2-core GitHub Codespaces environment.


Quick Start

# 1. Install dependencies
make install

# 2. Generate synthetic data (~100k users, ~10k products, ~3M interactions)
make generate-data

# 3. Train all models
make train-models

# 4. Generate batch recommendations
make generate-recommendations

# 5. Evaluate models
make evaluate

# 6. Run A/B experiment
make run-experiments

# 7. Launch dashboard
make dashboard
# Open http://localhost:8050

Or run the entire pipeline in one command (skipping dashboard):

make all

What Is a Recommender System?

When you open a streaming service and see "Recommended for you", or shop online and notice "Customers also bought", you are interacting with a recommender system. Its job is to predict which items — products, videos, articles — a specific user is most likely to find relevant, out of potentially millions of options.

A recommender system learns patterns from past behaviour (clicks, purchases, time spent) and uses them to rank items for each user. The better the ranking, the more likely users are to engage, which benefits both users (less searching) and businesses (more conversions).


Two Types of Items: Products and Editorial Content

This project models two distinct kinds of items that a real e-commerce platform might serve to users.

Products

~10,000 items, each described by category, subcategory, brand, and price_tier. These are the goods for sale. Recommendations here drive direct revenue. All five recommendation algorithms are trained exclusively on product interactions, and evaluation metrics (Precision, Recall, NDCG) are computed against product interactions only.

Editorial Content

~1,000 items representing non-transactional content: articles (50%), posts (20%), collections (15%), and buying guides (15%). Each piece of editorial carries a category (aligned with the product taxonomy) and 2–3 topic_tags such as "trending", "how_to", "gift_guide", or "seasonal".

Users interact with editorial content through clicks and reads — not purchases. Its role in the system is:

  • Discovery and warm-up: a user who reads a "best running shoes" guide is signalling interest in the sports category before making any purchases. These signals populate the interaction log and contribute to a more realistic user journey.
  • Realism in the interaction funnel: without editorial content, every interaction would imply purchase intent. In reality, a large share of platform activity is content consumption — browsing, reading, and researching. The 20% editorial / 80% product split in interaction generation reflects this.
  • Separate popularity dynamics: editorial items use a slightly flatter power-law popularity curve (exponent 1.1) than products (exponent 1.2), meaning editorial content has somewhat more uniform readership — a niche guide can accumulate reads more easily than a niche product accumulates purchases.
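The power-law popularity curves above can be sketched in a few lines. This is a minimal illustration of the stated exponents (1.2 for products, 1.1 for editorial), not the project's actual generator; the function name `powerlaw_weights` and the sampling call are illustrative.

```python
import random

def powerlaw_weights(n_items: int, exponent: float) -> list[float]:
    """Unnormalised power-law weights: the item at popularity rank k gets 1 / k**exponent."""
    return [1.0 / (k ** exponent) for k in range(1, n_items + 1)]

# Products use a steeper curve (exponent 1.2) than editorial (1.1), so
# product popularity concentrates more mass in the head of the catalog.
product_w = powerlaw_weights(10_000, 1.2)
editorial_w = powerlaw_weights(1_000, 1.1)

# Sampling an item id for an interaction is then a weighted draw.
random.seed(42)
item_id = random.choices(range(10_000), weights=product_w, k=1)[0]
```

A flatter exponent spreads probability mass further down the ranking, which is exactly why a niche editorial item accumulates reads more easily than a niche product accumulates purchases.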

How the synthetic data is designed to produce learnable patterns

A naive synthetic dataset — where every user draws items uniformly from a global popularity distribution — gives collaborative filtering nothing to learn. The popularity model would always win because all users look identical: the most popular items are relevant for everyone.

This dataset uses three mechanisms to create genuine personalisation signal:

1. Two-level user affinity: every user has a primary_category (e.g. electronics) and a preferred subcategory within it (e.g. laptops). 70% of their product interactions come from their primary category, and 60% of those concentrate on their preferred subcategory. Users who share (primary_category, subcategory) build heavily overlapping interaction histories: exactly the co-occurrence pattern that item-CF and ALS need to learn "users like you also bought X".

2. Item lifecycle + seasonality: each item has a random peak month and an exponential decay, so items trend and then fade. Category-level seasonal multipliers boost relevant items at the right time of year (toys in November, garden products in spring). This means item popularity shifts over time, rewarding models that learn temporal patterns rather than just counting total interactions.

3. Power-law session sizes via a Zipf distribution: session slots are drawn from a Zipf(1.5) distribution, so most interactions land in a user's "main session" (slot 0), with progressively fewer in subsequent sessions. The resulting session-size distribution has the heavy tail of real e-commerce behaviour: a few very long browsing sessions and many quick check-ins.
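A stdlib sketch of the first and third mechanisms, assuming the stated probabilities (70% primary category, 60% preferred subcategory, Zipf(1.5) session slots). The category taxonomy, helper names, and the bounded-Zipf draw are all simplified stand-ins for the real generator.

```python
import random

random.seed(7)

# Tiny stand-in taxonomy; the real catalog has many categories/subcategories.
CATEGORIES = {"electronics": ["laptops", "phones"], "sports": ["running", "cycling"]}

def make_user() -> dict:
    """Assign each user a primary category and a preferred subcategory within it."""
    primary = random.choice(list(CATEGORIES))
    return {"primary": primary, "subcat": random.choice(CATEGORIES[primary])}

def draw_interaction_category(user: dict) -> tuple[str, str]:
    """70% of product interactions hit the primary category; 60% of those the preferred subcategory."""
    if random.random() < 0.70:
        cat = user["primary"]
        sub = user["subcat"] if random.random() < 0.60 else random.choice(CATEGORIES[cat])
    else:
        cat = random.choice([c for c in CATEGORIES if c != user["primary"]])
        sub = random.choice(CATEGORIES[cat])
    return cat, sub

def draw_session_slot(max_slot: int = 20, s: float = 1.5) -> int:
    """Bounded Zipf(1.5) draw: slot 0 is the user's main session, later slots are rarer."""
    weights = [1.0 / (k ** s) for k in range(1, max_slot + 1)]
    return random.choices(range(max_slot), weights=weights, k=1)[0]

user = make_user()
events = [draw_interaction_category(user) for _ in range(1000)]
in_primary = sum(1 for cat, _ in events if cat == user["primary"]) / len(events)
```

Running this, `in_primary` lands near 0.70, and slot 0 dominates the session-slot histogram, giving the heavy-tailed session sizes described above.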

The n_editorial parameter

n_editorial in config/config.yaml controls the size of the editorial catalog. It is intentionally small (1,000 vs 10,000 products) for two reasons:

  1. Content turns over slowly. A buying guide or seasonal collection stays relevant for months; a product listing can change weekly. A smaller catalog reflects the slower publication cadence of editorial teams.
  2. Item IDs are shared in the interaction log. Each interaction event records an item_id and an item_type. For editorial events, item_id refers to a content_id (0–999); for product events, it refers to a product_id (0–9,999). Keeping editorial IDs in their own smaller range makes the two namespaces easy to distinguish during analysis.

If you increase n_editorial, the editorial popularity weights are recalculated automatically — no other code changes are needed.


Recommendation Algorithms

This project implements five algorithms that represent the main families used in industry today.

Popularity Baseline

The simplest possible recommender: rank all products by how often they have been interacted with recently, and show the same top list to every user. It does not personalise at all, but it sets a useful floor — any smarter model should beat it.

Strength: Simple, always has recommendations, works for new users with no history. Weakness: Everyone gets the same list; popular items dominate and niche products are never surfaced.
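A minimal sketch of a recency-weighted popularity ranker. The exponential half-life decay is an assumption for illustration (the project only states "recency-weighted counts"); the 30-day half-life and the toy log are invented.

```python
from collections import Counter

def popularity_scores(interactions, now: float, half_life_days: float = 30.0) -> Counter:
    """Recency-weighted counts: an event `age` days old contributes 0.5 ** (age / half_life)."""
    scores: Counter = Counter()
    for item_id, timestamp in interactions:
        age_days = (now - timestamp) / 86_400
        scores[item_id] += 0.5 ** (age_days / half_life_days)
    return scores

def recommend_popular(interactions, now: float, k: int = 10) -> list[int]:
    """Same top-k list for every user: no personalisation at all."""
    return [item for item, _ in popularity_scores(interactions, now).most_common(k)]

now = 100 * 86_400  # day 100, in seconds
log = [(1, now), (1, now),                       # item 1: 2 fresh interactions
       (2, now - 60 * 86_400)] * 1 + \
      [(2, now - 60 * 86_400), (2, now - 60 * 86_400)]  # item 2: 3 stale ones
top = recommend_popular(log, now, k=2)
# Item 2 has more raw interactions, but item 1 wins because its events are fresher.
```

With a 30-day half-life, item 2's three 60-day-old events score 3 × 0.25 = 0.75, below item 1's 2.0, so the ranking is [1, 2].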

Why popularity is the baseline (control) model

In any experiment, the control is the thing you are already doing — or the simplest thing you could do. Popularity fits that role for three reasons.

  1. It always works. A personalised model that has not seen a user before (cold start) cannot produce recommendations. Popularity can always return something, making it a practical fallback in production and an honest lower bound in experiments.

  2. It is not trivially bad. Showing popular products is genuinely useful: popular items have broad appeal, tend to be in stock, and are often well-reviewed. Any new model that cannot consistently beat popularity in both offline metrics and live experiments does not justify the added complexity.

  3. It is easy to understand and explain. When a model beats popularity by 40% in CTR, stakeholders can immediately grasp what that means. A baseline that is hard to reason about makes experiment results harder to interpret and trust.

This is the standard industry practice: Netflix, Spotify, and Amazon all use popularity-based recommendations as the default for new users and as the benchmark every new algorithm must beat before it is promoted to production.

Collaborative Filtering — ALS (Alternating Least Squares)

Collaborative filtering is based on the idea that users who agreed in the past will agree in the future. ALS is a matrix factorisation technique: it compresses the giant user-item interaction matrix into two smaller matrices of latent factors (hidden taste dimensions), then reconstructs scores by multiplying them together.

"Alternating" refers to how it trains: it fixes user factors and solves for item factors, then fixes item factors and solves for user factors, repeating until convergence. This project uses the implicit library, which is optimised for implicit feedback (clicks and purchases rather than explicit star ratings).

Strength: Captures subtle taste patterns; scales well; purely data-driven. Weakness: Cold-start problem — cannot recommend to brand-new users with no history.
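To make the "alternating" idea concrete, here is a rank-1 ALS toy in pure Python. This is a didactic sketch, not what the `implicit` library does: it uses a single latent factor, treats zeros as plain negatives rather than confidence-weighted implicit feedback, and the matrix is invented.

```python
# Toy interaction matrix: rows = users, columns = items.
# Users 0-2 share a taste cluster (items 0-1); user 3 likes items 2-3.
R = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

def als_rank1(R, n_iters: int = 50, lam: float = 0.01):
    """Alternate closed-form least-squares updates for one latent factor.
    Fixing v, the ridge solution for each user factor is
    u_i = sum_j R[i][j]*v[j] / (sum_j v[j]**2 + lam), and symmetrically for v."""
    n_users, n_items = len(R), len(R[0])
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(n_iters):
        denom = sum(x * x for x in v) + lam          # fix item factors, solve users
        u = [sum(R[i][j] * v[j] for j in range(n_items)) / denom for i in range(n_users)]
        denom = sum(x * x for x in u) + lam          # fix user factors, solve items
        v = [sum(R[i][j] * u[i] for i in range(n_users)) / denom for j in range(n_items)]
    return u, v

u, v = als_rank1(R)
score = lambda i, j: u[i] * v[j]  # reconstructed preference score
```

After convergence the single factor latches onto the dominant taste cluster: user 0's score for item 0 is near 1, while cross-cluster scores collapse towards 0.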

Collaborative Filtering — Item-CF (Item-based)

Instead of finding similar users, Item-CF finds similar items: if many users who bought item A also bought item B, then A and B are similar. To generate recommendations for a user, it looks at the items they have already interacted with and returns the most similar items they have not yet seen.

This project uses cosine similarity on the user-item matrix columns, again via the implicit library.

Strength: Recommendations are easy to explain ("because you liked X"); item similarities are stable over time. Weakness: Less able to surface surprising or cross-category discoveries.
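The item-CF mechanics fit in a short stdlib sketch: cosine similarity between the columns of a toy user-item matrix, then scoring unseen items by similarity to the user's history. The matrix and helper names are invented for illustration.

```python
import math

# User-item matrix: rows = users, columns = items (1 = interacted).
R = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
n_items = len(R[0])

def column(j):
    return [row[j] for row in R]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Item-item similarity from co-occurrence in user histories.
sim = [[cosine(column(i), column(j)) for j in range(n_items)] for i in range(n_items)]

def recommend(user_row, k=2):
    """Score unseen items by summed similarity to the user's interacted items."""
    seen = [j for j, x in enumerate(user_row) if x]
    scores = {j: sum(sim[j][s] for s in seen)
              for j in range(n_items) if j not in seen}
    return sorted(scores, key=scores.get, reverse=True)[:k]

recs = recommend([1, 0, 0, 0])  # user who only interacted with item 0
```

Item 1 co-occurs with item 0 in two histories and item 2 in one, so the user gets [1, 2]; item 3, bought by a disjoint user, never surfaces, which is exactly the "less able to surface cross-category discoveries" weakness.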

Content-Based Filtering

Rather than looking at what other users did, content-based filtering looks at the properties of items themselves. Each product is described by its category, subcategory, brand, and price tier. These are converted into a TF-IDF vector (a weighted word-count representation), and a user's profile is built by averaging the vectors of products they have interacted with. Recommendations are items whose vectors are most similar (via cosine similarity) to that profile.

Strength: Works for new items immediately; no need for other users' data; naturally explainable. Weakness: Tends to recommend more of the same — if you only ever bought electronics, it will only ever suggest electronics.
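A simplified content-based sketch: the project uses TF-IDF over the attribute fields, but binary attribute vectors are enough to show the profile-averaging and cosine-ranking steps. The item attributes and brand names here are invented.

```python
import math

# Simplified item descriptions; each attribute becomes a binary feature.
ITEMS = {
    0: {"cat:electronics", "sub:laptops", "brand:acme", "price:high"},
    1: {"cat:electronics", "sub:laptops", "brand:zeta", "price:mid"},
    2: {"cat:electronics", "sub:phones",  "brand:acme", "price:mid"},
    3: {"cat:garden",      "sub:tools",   "brand:gro",  "price:low"},
}
VOCAB = sorted(set().union(*ITEMS.values()))

def vec(attrs):
    return [1.0 if t in attrs else 0.0 for t in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recommend(history, k=2):
    """Average the vectors of interacted items into a user profile,
    then rank unseen items by cosine similarity to that profile."""
    vectors = [vec(ITEMS[i]) for i in history]
    profile = [sum(col) / len(vectors) for col in zip(*vectors)]
    scores = {i: cosine(profile, vec(a)) for i, a in ITEMS.items() if i not in history}
    return sorted(scores, key=scores.get, reverse=True)[:k]

recs = recommend([0])  # user who bought the acme laptop
```

The electronics items (sharing category, subcategory, or brand) rank ahead of the garden item, illustrating both the strength (works with zero other users) and the weakness (more of the same).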

Hybrid Recommender

The hybrid model blends all four other algorithms — popularity, ALS, item-CF, and content-based — in a single ranked list. Each model first produces its own scored candidate list; those scores are independently normalised to [0, 1] and then combined with configurable weights (defaults: popularity 10%, ALS 40%, item-CF 30%, content-based 20%).

By blending all four signals, the hybrid can leverage global popularity as a fallback, collaborative filtering for personalisation, and content similarity for new items — all at the same time.

Strength: More robust than any single approach; gracefully degrades when one signal is weak (e.g. cold-start users). Weakness: More complex to maintain, tune, and debug; errors in any sub-model propagate into the blend.
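The blending step described above (per-model min-max normalisation to [0, 1], then a weighted sum) can be sketched directly; the item ids and score values are invented, while the weights are the stated defaults.

```python
def minmax(scores: dict) -> dict:
    """Normalise one model's scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo or 1.0
    return {item: (s - lo) / span for item, s in scores.items()}

def hybrid(model_scores: dict, weights: dict, k: int = 3) -> list:
    """Weighted sum of independently normalised per-model scores."""
    blended: dict = {}
    for name, scores in model_scores.items():
        for item, s in minmax(scores).items():
            blended[item] = blended.get(item, 0.0) + weights[name] * s
    return sorted(blended, key=blended.get, reverse=True)[:k]

# Default weights from the project description.
weights = {"popularity": 0.10, "als": 0.40, "item_cf": 0.30, "content_based": 0.20}
model_scores = {
    "popularity":    {"a": 900, "b": 800, "c": 100},   # raw counts
    "als":           {"a": 0.1, "b": 0.9, "c": 0.5},   # latent-factor scores
    "item_cf":       {"a": 0.2, "b": 0.8, "c": 0.3},
    "content_based": {"a": 0.6, "b": 0.2, "c": 0.9},
}
ranking = hybrid(model_scores, weights)
```

Normalising before blending matters: popularity counts in the hundreds would otherwise drown out the collaborative scores that live in [0, 1].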


How Models Are Evaluated

Offline Metrics

Before deploying a model to real users, we evaluate it on a held-out test set — the most recent interactions for each user that were withheld during training.

Metric               What it measures
──────────────────────────────────────────────────────────────
Precision@K          Of the top K recommendations, what fraction did the user actually interact with?
Recall@K             Of all items the user actually interacted with, what fraction appear in the top K?
NDCG@K               Like Precision@K, but hits ranked higher in the list count more than hits ranked lower
Catalog coverage     What fraction of all products ever appear in any recommendation list? Low coverage signals popularity bias: the same popular items are endlessly recycled
Category diversity   How many different product categories appear within a single recommendation list? A list of 20 same-category items is less useful than a varied one

Guardrails automatically flag models with coverage below 5% or diversity below 0.30.
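The ranking metrics above have standard definitions, sketched here under the usual assumptions (binary relevance, log2 position discount for NDCG); the project's exact implementation may differ in details.

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations the user actually interacted with."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of the user's relevant items that appear in the top k."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(recommended, relevant, k):
    """Like precision, but hits near the top of the list count more (log2 discount)."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

recs = [3, 1, 9, 7, 5]
relevant = {1, 5, 8}
p = precision_at_k(recs, relevant, 5)   # 2 hits out of 5 slots
r = recall_at_k(recs, relevant, 5)      # 2 of 3 relevant items found
n = ndcg_at_k(recs, relevant, 5)        # between 0 and 1; 1 = perfect ordering
```

Note the asymmetry: precision divides by the list length K, recall by the user's relevant-item count, which is why the two can move in opposite directions as K grows.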


Online Testing: Comparing Models with Real Users

Offline metrics tell you which model performs better on historical data, but that does not always predict which model users will prefer in real life. Online testing exposes real users to the models and measures actual behaviour.

A/B Testing (and A/B/C/D/E Testing)

The classic approach is to randomly split your user base into groups and assign each group to one model. After a period of time, you compare the key metrics — click-through rate (CTR), session depth (how many pages a user views), and purchase rate — across groups using a statistical significance test (this project uses the Mann-Whitney U test).

In this project, users are split into five equal groups, one per model. The popularity model is the control group — the reference point every other model is measured against — because it represents the simplest deployable behaviour and requires no user history to produce a result. Every other model is a treatment group that must justify its added complexity by outperforming the control. The experiment report tells you:

  • The mean CTR, session depth, and purchase rate for each group
  • The percentage uplift of each model over the baseline
  • Whether the difference is statistically significant (p < 0.05)
  • Which model has the highest CTR overall
Group           n=8185   CTR     Session Depth   Purchase Rate
──────────────────────────────────────────────────────────────
popularity      8185     0.0099  7.74            0.0025  ← control
als             8185     0.0141  9.33            0.0039  ↑ +43%  p<0.001
item_cf         8185     0.0154  9.07            0.0043  ↑ +56%  p<0.001  ← best CTR
content_based   8185     0.0101  6.79            0.0026  ↑  +2%  p=0.72
hybrid          8185     0.0149  8.81            0.0042  ↑ +51%  p<0.001

Session depth uses a Pareto (power-law) distribution as its base: most users view 2–5 pages per session, but a heavy tail of power users views 30–100 pages. Users who received relevant recommendations (higher CTR) stay significantly longer: the (1 + 50 × ctr) multiplier scales expected session depth roughly in proportion to CTR, so the higher-CTR models also show deeper sessions.

Limitation of A/B testing: each user sees only one model, so you need large user populations and long experiment durations to achieve statistical significance. Between-user variability also adds noise, since the groups consist of inherently different people.
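The project states it uses the Mann-Whitney U test for significance; a stdlib sketch of that test follows, using the normal approximation for the p-value (the variance tie-correction is omitted for brevity, and the CTR samples are invented).

```python
import math

def mann_whitney_u(a, b):
    """Mann-Whitney U statistic for group `a`, with a two-sided p-value
    from the normal approximation. Tied values get average ranks."""
    pooled = sorted([(x, "a") for x in a] + [(x, "b") for x in b])
    ranks, i = {}, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j + 1) / 2          # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    r_a = sum(rank for k, rank in ranks.items() if pooled[k][1] == "a")
    n_a, n_b = len(a), len(b)
    u = r_a - n_a * (n_a + 1) / 2
    mu = n_a * n_b / 2
    sigma = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))    # two-sided
    return u, p

# Illustrative per-user CTR samples for control vs. one treatment arm.
control   = [0.00, 0.01, 0.01, 0.00, 0.02, 0.01, 0.00, 0.01, 0.00, 0.01]
treatment = [0.02, 0.03, 0.01, 0.02, 0.04, 0.02, 0.03, 0.02, 0.01, 0.03]
u_stat, p_value = mann_whitney_u(control, treatment)
```

Mann-Whitney is a sensible choice here because per-user CTR is heavily skewed and zero-inflated; a rank test makes no normality assumption, unlike a t-test.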

Interleaving with Team-Draft Multileaving

Interleaving is a faster, more sensitive comparison method. Instead of assigning each user to one model, the same user sees a merged list built from multiple models at once. The key idea: if a user clicks on an item that was suggested by model A, model A gets "credit".

Team-draft multileaving works like a sports draft:

  1. The models take turns in random order picking items for the merged list
  2. Each model always picks its highest-ranked item not yet in the list
  3. Each item is "owned" by the model that drafted it
  4. After showing the merged list, any clicked items generate credit for the model that owned them

A model "wins" a user interaction if it earned strictly more clicks than every other model. Aggregating win rates over hundreds of users produces a reliable ranking with far fewer users than a standard A/B test — typically 10–100× more sensitive.

Model           Win Rate   Avg Credits
──────────────────────────────────────
popularity      0.010      0.010
als             0.050      0.050   ← winner
item_cf         0.030      0.035
content_based   0.010      0.015
hybrid          0.025      0.035

win_rate vs. avg_credits: avg_credits is the average number of clicked items a model drafted per user — an absolute quality signal. win_rate is the fraction of users where that model earned strictly more credits than every rival. They correlate but diverge when two strong models tie frequently: both accumulate credits but neither wins outright.

Interleaving vs. A/B testing: interleaving detects differences faster and controls for user heterogeneity because each user evaluates all models simultaneously. However, it cannot measure absolute metric levels (only relative preference), and the merged list may not reflect what real users would see in production.


Project Structure

recommender_system_example/
├── config/
│   └── config.yaml           # All tunable parameters (scale, hyperparameters, thresholds)
├── data/
│   ├── raw/                  # Parquet: users, products, editorial, interactions
│   ├── processed/            # Parquet: user/item features
│   ├── models/               # Trained model artifacts
│   ├── recommendations/      # Batch recommendation outputs + eval results
│   └── experiments/          # A/B test report and per-user results
├── src/
│   ├── config.py             # Paths, DataConfig, ModelConfig, EvaluationConfig, ExperimentConfig
│   ├── data_generation/      # Synthetic data generator
│   ├── features/             # Feature engineering pipeline
│   ├── models/               # Model implementations + training script
│   ├── recommenders/         # Batch recommendation generation
│   ├── evaluation/           # Offline metrics (Precision/Recall/NDCG)
│   ├── experiments/          # A/B/C/D/E test + interleaving simulation
│   └── dashboard/            # Plotly Dash app
├── tests/                    # pytest test suite (154 tests)
├── notebooks/                # Exploratory notebooks (not part of pipelines)
├── Makefile
├── pyproject.toml
└── CLAUDE.md

Models at a Glance

Model           Algorithm                           Personalised   Handles New Items   Cold-Start Safe
──────────────────────────────────────────────────────────────────────────────────────────────────────
popularity      Recency-weighted counts             No             Yes                 Yes
als             Matrix factorisation (ALS)          Yes            No                  No
item_cf         Item-item cosine similarity         Yes            No                  No
content_based   TF-IDF + cosine similarity          Yes            Yes                 Partial
hybrid          Weighted blend of all four models   Yes            Partial             Partial

Evaluation Metrics

Computed on a temporal hold-out test split:

  • Precision@K, Recall@K, NDCG@K — at K=10 and K=20
  • Catalog coverage — fraction of product catalog recommended
  • Category diversity — intra-list diversity score

Guardrails flag models with coverage < 5% or diversity < 0.30.


Dashboard Tabs

  1. Recommendation Diagnostics — coverage, diversity, popularity bias charts
  2. User Journey — funnel from impression → click → purchase
  3. Model Comparison — side-by-side metrics for all models
  4. Experiment Results — A/B/C/D/E test summary with CTR uplift, significance indicators, and interleaving win-rate charts

Running Tests

make test          # full suite with coverage report
make test-verbose  # verbose output
make test-ci       # coverage + XML report (for CI)

Tests use small synthetic fixtures and do not require the full data pipeline to be run first.


Configuration

All tunable parameters live in config/config.yaml. Changes take effect on the next pipeline run — no code changes needed.

data:
  n_users: 100_000
  n_products: 10_000
  n_editorial: 1_000
  n_interactions: 3_000_000
  random_seed: 42

model:
  als_factors: 64
  als_iterations: 20
  top_k: 20
  hybrid_pop_weight: 0.10
  hybrid_als_weight: 0.40
  hybrid_itemcf_weight: 0.30
  hybrid_cb_weight: 0.20
  itemcf_k_neighbours: 50

evaluation:
  k_values: [10, 20]
  coverage_guardrail: 0.05
  diversity_guardrail: 0.30

experiment:
  control_model: popularity
  alpha: 0.05
  interleaving_k: 20
  interleaving_n_users: 200

Code Quality

make lint       # format with ruff
make quality    # cyclomatic complexity + maintainability index (radon)
make security   # static security analysis (bandit)
make audit      # dependency vulnerability scan (pip-audit)

Requirements

  • Python 3.11+
  • uv package manager
  • 2+ CPU cores, ~4 GB RAM recommended
