Generate meaningful clusters from browser session data represented as paths
Running an e-commerce site, have you ever wanted automatic, behavior-based segmentation of your customers without manually defining the segments?
Ampelios is a data pipeline and microservice that takes raw browser event data and produces clusters useful for:
- Identifying high-value users
- Finding moderately likely buyers
- Filtering out users with no purchase potential
- Tracking behavioral change over time
Example input events (CSV):

```csv
timestamp,visitorid,event,transactionid
433221332117,257598,view,
433221078505,158091,addtocart,
433221999827,111014,view,
433193500981,122686,transaction,11
```

The demo dataset has 2.7 million rows.
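For a quick sanity check of the input before running the pipeline, the file can be inspected with pandas (a sketch; only the column names above come from the project, the rest is illustrative):

```python
import pandas as pd

# Load the raw event log; transactionid is only set for purchase events.
events = pd.read_csv("init-data/events.csv")

# Distribution of event types across the dataset.
print(events["event"].value_counts())

# Events per visitor gives a first feel for journey lengths.
print(events.groupby("visitorid").size().describe())
```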
Example output clusters (JSON):

```json
[
{
"id": 0,
"users": [0, 6, 7, 13, 22],
"centroid": [
1.5968122,
0.03344515,
0.008624283,
1.1687442,
1.3817313,
0.0044482
]
},
{
"id": 1,
"users": [302, 588, 904, 914, 159],
"centroid": [
0.9970998,
0.012757817,
0.0033784304,
1.0007713,
0.9967424,
0.0033321506
]
},
{
"id": 2,
"users": [1722, 1879, 2019, 2114, 2194],
"centroid": [
13.875828,
0.6650781,
0.22663489,
2.8512046,
7.307564,
0.012281195
]
},
{
"id": 3,
"users": [2, 37, 51, 54, 64],
"centroid": [
5.205463,
0.10490605,
0.029401531,
1.4163188,
4.3710628,
0.0054033287
]
},
{
"id": 4,
"users": [1, 3, 4, 5, 8],
"centroid": [
1,
0,
0,
1,
1,
0
]
}
]
```

Note that the `users` fields have been truncated for easy viewing.
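Once you have this JSON, picking out the interesting clusters is straightforward. A sketch, with one loud assumption: the six centroid dimensions are taken here to follow the feature order listed further down (views, add-to-carts, transactions, path length, views per session, purchase ratio), which should be verified against the pipeline code before relying on it.

```python
import json

# Assumed index of the transactions dimension in the centroid -- verify first.
TRANSACTIONS = 2

with open("clusters.json") as f:  # e.g. the saved /view response
    clusters = json.load(f)

# Rank clusters by average transactions to surface high-value segments.
for c in sorted(clusters, key=lambda c: c["centroid"][TRANSACTIONS], reverse=True):
    print(f"cluster {c['id']}: avg transactions = {c['centroid'][TRANSACTIONS]:.3f}")
```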
This project is designed to work with a Kafka consumer for incremental real-time data, but for initial development and as a proof of concept you can load from a CSV dataset.
Use the Retailrocket recommender dataset, which is licensed under CC BY-NC-SA 4.0. Please comply with the dataset license; it is intended here only as a demo for Ampelios. Note: the dataset is not included in this repository.
- Download the dataset and place `events.csv` in `init-data/events.csv` relative to the project root.
- Start Ampelios with Docker:

  ```sh
  cd infra
  docker compose up
  ```

- Call the trigger endpoint to start a pipeline run. Specify a `source_id` to uniquely identify this dataset and its derived clusters within Ampelios. Note: with the full sample dataset (~2.7M rows), a run currently takes ~10 minutes. In production with incremental data, the pipeline could be polled every minute for smooth flow 🌊.
  ```sh
  curl -X POST http://127.0.0.1:1032/trigger \
    -H "Content-Type: application/json" \
    -d '{
      "source_id": 1,
      "cluster_count": 5,
      "events_path": "./init-data/events.csv",
      "is_initial_flow": true
    }'
  ```

- Call the view clusters endpoint to see results for the specified source.
  ```sh
  curl -X GET "http://127.0.0.1:1032/view?source_id=1"
  ```
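The same two calls from Python, for scripting around the API (a sketch using requests; the endpoints and payload fields are exactly those shown above):

```python
import requests

BASE = "http://127.0.0.1:1032"

# Kick off a pipeline run for source 1 (same payload as the curl example).
run = requests.post(f"{BASE}/trigger", json={
    "source_id": 1,
    "cluster_count": 5,
    "events_path": "./init-data/events.csv",
    "is_initial_flow": True,
})
run.raise_for_status()

# Fetch the resulting clusters once the run has finished.
clusters = requests.get(f"{BASE}/view", params={"source_id": 1})
clusters.raise_for_status()
print(clusters.json())
```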
The DAG handles all clustering steps from source data:

- `save_raw_events`: loads raw CSV events into the pipeline
- `save_raw_events_sessions`: annotates events with session numbers for analysis
- `load_journeys`: loads user session paths from the `events` table in Postgres
- `cluster_journeys`: performs clustering, storing updated labels and centroids
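The repository does not spell out how `save_raw_events_sessions` numbers sessions; a common convention, used here purely as an assumption, is to start a new session after a fixed inactivity gap (30 minutes in this sketch, with millisecond timestamps):

```python
import pandas as pd

SESSION_GAP_MS = 30 * 60 * 1000  # assumed inactivity threshold

events = pd.read_csv("init-data/events.csv").sort_values(["visitorid", "timestamp"])

# A new session starts when the gap to the visitor's previous event
# exceeds the threshold (or when the visitor changes).
gap = events.groupby("visitorid")["timestamp"].diff()
new_session = gap.isna() | (gap > SESSION_GAP_MS)

# Cumulative count of session starts per visitor yields a session number.
events["session"] = new_session.groupby(events["visitorid"]).cumsum()
```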
A microservice wraps the pipeline as a Python module:

- `POST /trigger` → start a pipeline run
- `GET /view` → view output clusters
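A minimal sketch of what such a wrapper can look like, assuming FastAPI and a hypothetical `run_pipeline`/`load_clusters` module interface (neither name is taken from the actual codebase):

```python
from fastapi import FastAPI
from pydantic import BaseModel

# Hypothetical pipeline interface -- the real module layout may differ.
from ampelios.pipeline import run_pipeline, load_clusters

app = FastAPI()

class TriggerRequest(BaseModel):
    source_id: int
    cluster_count: int
    events_path: str
    is_initial_flow: bool

@app.post("/trigger")
def trigger(req: TriggerRequest):
    run_pipeline(**req.model_dump())
    return {"status": "started"}

@app.get("/view")
def view(source_id: int):
    return load_clusters(source_id)
```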
Each user journey is represented by:
- Final state vector: page views, add-to-carts, transactions
- Path length (number of sessions)
- Views per session
- Purchase ratio (buys per session / views per session)
Clustering is performed using K-Means (MiniBatchKMeans) over these features.
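A condensed sketch of those two steps with pandas and scikit-learn. MiniBatchKMeans is named above; the exact feature computation below is an approximation of the listed features, assuming the sessionized `events` frame from the earlier sketch:

```python
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

def journey_features(events: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-visitor journey features from sessionized events."""
    # Final state vector: total views, add-to-carts and transactions.
    counts = (events.pivot_table(index="visitorid", columns="event",
                                 aggfunc="size", fill_value=0)
                    .reindex(columns=["view", "addtocart", "transaction"],
                             fill_value=0))
    sessions = events.groupby("visitorid")["session"].nunique()
    feats = counts.copy()
    feats["path_length"] = sessions
    feats["views_per_session"] = counts["view"] / sessions
    # Purchase ratio: buys per session relative to views per session.
    feats["purchase_ratio"] = (counts["transaction"] / sessions) / \
                              (counts["view"] / sessions).replace(0, 1)
    return feats

features = journey_features(events)
model = MiniBatchKMeans(n_clusters=5, random_state=0)
labels = model.fit_predict(features)  # cluster label per visitor
centroids = model.cluster_centers_    # one 6-dimensional centroid per cluster
```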
- Endpoints & Pipeline Jobs – API and job reference
- Architecture – System design and data flow
- Local setup – How to run the project locally
- Code: Apache 2.0
- Sample Dataset: CC BY-NC-SA 4.0 (attribution required, non-commercial, share-alike)
Please respect dataset licensing when using it for examples or testing.