The official repository of the MOSTLY AI prize, a 2x $50,000 Synthetic Data competition from May to July '25.
| Challenge | Dataset Details | Stage 1 | Stage 2 |
|---|---|---|---|
| The FLAT DATA Challenge | 100,000 records<br>80 data columns<br>60 numeric, 20 categorical | Training Data<br>Holdout Data | Training Data<br>Holdout Data<br>Evaluation Runs |
| The SEQUENTIAL DATA Challenge | 20,000 groups<br>5–10 records each<br>10 data columns<br>7 numeric, 3 categorical | Training Data<br>Holdout Data | Training Data<br>Holdout Data<br>Evaluation Runs |
A BIG thank you to all participants for pushing the boundaries of synthetic data further and achieving new state-of-the-art accuracy for large-scale synthetic datasets with open-source solutions! ⭐
And a HUGE congratulations to Gandagorn 🥇 for winning the total prize of $100,000! 🎉
During Stage 1, anyone with a GitHub account was invited to participate. All submissions were automatically evaluated via the Synthetic Data Quality Assurance toolkit. All results can be found here: stage1-results.csv.
*Figure panels: FLAT, SEQUENTIAL.*
During Stage 2, the top-performing leaders (Gandagorn, Tecnarca, muellermarkus, Benels, EugenioTL, filomba01) were invited to submit their code, which was then evaluated on a slightly modified version of the Stage 1 datasets. With the support of a fantastic jury board of synthetic data experts (suhaskowshik, adivekar-utexas, shree-gade, mplatzer, scriminaci, psitronic), each code submission was run on dedicated GPU instances a total of 6 times. The generated synthetic data was captured and then assessed with the same metrics as in Stage 1. All results can be found here: stage2-results.csv.
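Since each Stage 2 submission was run several times, readers may want to summarize the repeated runs per team and challenge. The following is a minimal sketch of one way to do that, not the jury's actual procedure. The column names `team`, `challenge`, and `score` are assumptions for illustration and may not match the header of stage2-results.csv, and the example rows are placeholders rather than real results.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_runs(rows):
    """Group per-run scores by (team, challenge); return mean, spread, and run count.

    `rows` is an iterable of dicts with the hypothetical keys
    "team", "challenge", and "score".
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["team"], row["challenge"])].append(float(row["score"]))

    summary = {}
    for key, scores in grouped.items():
        # Sample standard deviation needs at least two runs.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[key] = {"mean": mean(scores), "stdev": spread, "runs": len(scores)}
    return summary


# Placeholder rows for illustration only, not actual competition results.
example_rows = [
    {"team": "team_a", "challenge": "FLAT", "score": "0.90"},
    {"team": "team_a", "challenge": "FLAT", "score": "0.92"},
    {"team": "team_b", "challenge": "FLAT", "score": "0.85"},
]
summary = summarize_runs(example_rows)
```

To apply this to the published file, the rows could be read with `csv.DictReader` from the standard library, after adjusting the three key names to whatever the real header uses.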




