Commit e3330b5

provide docs (#26)
1 parent 57fdbbd commit e3330b5

37 files changed: +962 -1123 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -8,3 +8,4 @@ examples/*.py
 LICENSE_HEADER
 model-report.html
 data-report.html
+/site/

.pre-commit-config.yaml

Lines changed: 0 additions & 15 deletions
@@ -49,18 +49,3 @@ repos:
       - id: ruff
         args: [ --fix ]
       - id: ruff-format
-  - repo: https://github.com/pre-commit/mirrors-prettier
-    rev: v3.0.2
-    hooks:
-      - id: prettier
-        entry: prettier --write --list-different --ignore-unknown
-        types: [markdown]
-  - repo: https://github.com/codespell-project/codespell
-    rev: v2.2.6
-    hooks:
-      - id: codespell
-        name: codespell
-        description: Checks for common misspellings in text files.
-        entry: codespell -w
-        language: python
-        types: [markdown]

Makefile

Lines changed: 4 additions & 0 deletions
@@ -97,3 +97,7 @@ clean-dist: ## remove "volatile" directory dist
 	@echo "Step: cleaning dist directory"
 	@rm -rf dist
 	@echo "Cleaned up dist directory"
+
+docs: ## Update docs site
+	@mkdocs gh-deploy
+	@echo "Deployed docs"

README.md

Lines changed: 10 additions & 114 deletions
@@ -1,14 +1,16 @@
-# Synthetic Data - Quality Assurance
+# Synthetic Data - Quality Assurance 🔎

-[![](https://pepy.tech/badge/mostlyai-qa)](https://pypi.org/project/mostlyai-qa/) ![](https://img.shields.io/github/license/mostly-ai/mostlyai-qa) ![GitHub Release](https://img.shields.io/github/v/release/mostly-ai/mostlyai-qa) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mostlyai-qa)
+[![Documentation](https://img.shields.io/badge/docs-latest-green)](https://mostly-ai.github.io/mostlyai-qa/) [![stats](https://pepy.tech/badge/mostlyai-qa)](https://pypi.org/project/mostlyai-qa/) ![license](https://img.shields.io/github/license/mostly-ai/mostlyai-qa) ![GitHub Release](https://img.shields.io/github/v/release/mostly-ai/mostlyai-qa) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mostlyai-qa)
+
+[Documentation](https://mostly-ai.github.io/mostlyai-qa/) | [Sample Reports](#sample-reports) | [Technical White Paper](https://raw.githubusercontent.com/mostly-ai/mostlyai-qa/refs/heads/main/docs/mostlyai-qa-technical-white-paper.pdf)

 Assess the fidelity and novelty of synthetic samples with respect to original samples:

 1. calculate a rich set of accuracy, similarity and distance metrics
 2. visualize statistics for easy comparison to training and holdout samples
 3. generate a standalone, easy-to-share, easy-to-read HTML summary report

-...all with a single line of Python code 💥.
+...all with a few lines of Python code 💥.

 ## Installation

@@ -18,12 +20,11 @@ The latest release of `mostlyai-qa` can be installed via pip:
 pip install -U mostlyai-qa
 ```

-## Quick start
+## Quick Start

 ```python
 import pandas as pd
 import webbrowser
-import json
 from mostlyai import qa

 # fetch original + synthetic data
@@ -47,7 +48,7 @@ print(metrics.model_dump_json(indent=4))
 webbrowser.open(f"file://{report_path.absolute()}")
 ```

-## Basic usage
+## Basic Usage

 ```python
 from mostlyai import qa
@@ -80,115 +81,10 @@ report_path, metrics = qa.report(
 )
 ```

-Note, that due to the calculation of embeddings the function call might take a while. Embedding 10k samples on a Mac M2 take for example about 40secs. Limit the size of the passed DataFrames, or use the `max_sample_size_embeddings` parameter to speed up the report.
-
-## Function signature
-
-```python
-def report(
-    *,
-    syn_tgt_data: pd.DataFrame,
-    trn_tgt_data: pd.DataFrame,
-    hol_tgt_data: pd.DataFrame | None = None,
-    syn_ctx_data: pd.DataFrame | None = None,
-    trn_ctx_data: pd.DataFrame | None = None,
-    hol_ctx_data: pd.DataFrame | None = None,
-    ctx_primary_key: str | None = None,
-    tgt_context_key: str | None = None,
-    report_path: str | Path | None = "model-report.html",
-    report_title: str = "Model Report",
-    report_subtitle: str = "",
-    report_credits: str = REPORT_CREDITS,
-    report_extra_info: str = "",
-    max_sample_size_accuracy: int | None = None,
-    max_sample_size_embeddings: int | None = None,
-    statistics_path: str | Path | None = None,
-    on_progress: ProgressCallback | None = None,
-) -> tuple[Path, Metrics | None]:
-    """
-    Generate HTML report and metrics for comparing synthetic and original data samples.
-
-    Args:
-        syn_tgt_data: Synthetic samples
-        trn_tgt_data: Training samples
-        hol_tgt_data: Holdout samples
-        syn_ctx_data: Synthetic context samples
-        trn_ctx_data: Training context samples
-        hol_ctx_data: Holdout context samples
-        ctx_primary_key: Column within the context data that contains the primary key
-        tgt_context_key: Column within the target data that contains the key to link to the context
-        report_path: Path of where to store the HTML report
-        report_title: Title of the HTML report
-        report_subtitle: Subtitle of the HTML report
-        report_credits: Credits of the HTML report
-        report_extra_info: Extra details to be included to the HTML report
-        max_sample_size_accuracy: Max sample size for accuracy
-        max_sample_size_embeddings: Max sample size for embeddings (similarity & distances)
-        statistics_path: Path of where to store the statistics to be used by `report_from_statistics`
-        on_progress: A custom progress callback
-    Returns:
-        1. Path to the HTML report
-        2. Pydantic Metrics:
-        - `accuracy`: # Accuracy is defined as (100% - Total Variation Distance), for each distribution, and then averaged across.
-          - `overall`: Overall accuracy of synthetic data, i.e. average across univariate, bivariate and coherence.
-          - `univariate`: Average accuracy of discretized univariate distributions.
-          - `bivariate`: Average accuracy of discretized bivariate distributions.
-          - `coherence`: Average accuracy of discretized coherence distributions. Only applicable for sequential data.
-          - `overall_max`: Expected overall accuracy of a same-sized holdout. Serves as reference for `overall`.
-          - `univariate_max`: Expected univariate accuracy of a same-sized holdout. Serves as reference for `univariate`.
-          - `bivariate_max`: Expected bivariate accuracy of a same-sized holdout. Serves as reference for `bivariate`.
-          - `coherence_max`: Expected coherence accuracy of a same-sized holdout. Serves as reference for `coherence`.
-        - `similarity`: # All similarity metrics are calculated within an embedding space.
-          - `cosine_similarity_training_synthetic`: Cosine similarity between training and synthetic centroids.
-          - `cosine_similarity_training_holdout`: Cosine similarity between training and holdout centroids. Serves as reference for `cosine_similarity_training_synthetic`.
-          - `discriminator_auc_training_synthetic`: Cross-validated AUC of a discriminative model to distinguish between training and synthetic samples.
-          - `discriminator_auc_training_holdout`: Cross-validated AUC of a discriminative model to distinguish between training and holdout samples. Serves as reference for `discriminator_auc_training_synthetic`.
-        - `distances`: # All distance metrics are calculated within an embedding space. An equal number of training and holdout samples is considered.
-          - `ims_training`: Share of synthetic samples that are identical to a training sample.
-          - `ims_holdout`: Share of synthetic samples that are identical to a holdout sample. Serves as reference for `ims_training`.
-          - `dcr_training`: Average L2 nearest-neighbor distance between synthetic and training samples.
-          - `dcr_holdout`: Average L2 nearest-neighbor distance between synthetic and holdout samples. Serves as reference for `dcr_training`.
-          - `dcr_share`: Share of synthetic samples that are closer to a training sample than to a holdout sample. This shall not be significantly larger than 50\%.
-    """
-```
-
-## Metrics
-
-Three sets of metrics are calculated to compare synthetic data with the original data.
-
-### Accuracy
-
-The L1 distances between the discretized marginal distributions of the synthetic and the original training data are being calculated for all columns.
-The reported accuracy is expressed as 100% minus the total variational distance (TVD), which is half the L1 distance between the two distributions.
-These accuracies are then averaged to produce a single accuracy score, where higher scores indicate better synthetic data.
-
-1. **Univariate Accuracy**: The accuracy of the univariate distributions for all target columns is measured.
-2. **Bivariate Accuracy**: The accuracy of all pair-wise distributions for target columns, as well as for target columns with respect to the context columns, is measured.
-3. **Coherence Accuracy**: The accuracy of the auto-correlation for all target columns is measured. This is applicable only for sequential data.
-
-An overall accuracy score is calculated as the average of these aggregate-level scores.
-
-### Similarity
-
-All records are embedded into an embedding space to calculate two metrics:
-
-1. **Cosine Similarity**: The cosine similarity between the centroids of the synthetic and the original training data is calculated and compared to the cosine similarity between the centroids of the original training and holdout data. Higher scores indicate better synthetic data.
-2. **Discriminator AUC**: A binary classifier is trained to determine whether synthetic and original training data can be distinguished based on their embeddings. This score is compared to the same metric for the original training and holdout data. A score close to 50% indicates that synthetic samples are indistinguishable from original samples.
-
-### Distances
-
-All records are embedded into an embedding space, and individual-level L2 distances between samples are measured. For each synthetic sample, the distance to the nearest original sample (DCR) is calculated. This is done once with respect to original training records and once with respect to holdout records. These DCRs are then compared. For privacy-safe synthetic data, it is expected that synthetic data is as close to original training data as it is to original holdout data.
-
-## Sample HTML Report
-
-![Metrics](./docs/screenshots/metrics.png)
-![Accuracy Univariates](./docs/screenshots/accuracy_univariates.png)
-![Accuracy Bivariates](./docs/screenshots/accuracy_bivariates.png)
-![Accuracy Coherence](./docs/screenshots/accuracy_coherence.png)
-![Similarity](./docs/screenshots/similarity.png)
-![Distances](./docs/screenshots/distances.png)
+## Sample Reports

-See [here](./examples/) for further examples.
+* **[Baseball Players](https://html-preview.github.io/?url=https://github.com/mostly-ai/mostlyai-qa/blob/main/examples/baseball-players.html)** (Flat Data)
+* **[Baseball Seasons](https://html-preview.github.io/?url=https://github.com/mostly-ai/mostlyai-qa/blob/main/examples/baseball-seasons-with-context.html)** (Sequential Data)

 ## Citation
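
Note on the Accuracy text removed from the README in this hunk: it defines accuracy as 100% minus the total variation distance (TVD), where TVD is half the L1 distance between two discretized distributions. A minimal sketch of that definition for a single categorical column might look like the following. The function name `tvd_accuracy` and the toy inputs are illustrative assumptions, not part of `mostlyai-qa`, and the library's own binning and averaging across columns are not reproduced here.

```python
from collections import Counter


def tvd_accuracy(trn, syn):
    """Return 1 - TVD between the category frequencies of two columns.

    Hypothetical helper for illustration only; not the mostlyai-qa API.
    """
    p, q = Counter(trn), Counter(syn)
    n_p, n_q = len(trn), len(syn)
    # L1 distance over the union of categories (Counter yields 0 if absent).
    l1 = sum(abs(p[c] / n_p - q[c] / n_q) for c in set(p) | set(q))
    # TVD is half the L1 distance; accuracy is its complement.
    return 1.0 - 0.5 * l1


training = ["a", "a", "b", "b"]
synthetic = ["a", "a", "a", "b"]
accuracy = tvd_accuracy(training, synthetic)
print(f"{accuracy:.2%}")
```

In this toy case the relative frequencies differ by 0.25 on each of the two categories, so the L1 distance is 0.5, the TVD is 0.25, and the accuracy is 0.75.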

docs/api.md

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+
+# API Definitions
+
+::: mostlyai.qa.report

docs/examples.md

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
+
+# Usage Examples
+
+## Baseball Players (Flat Data)
+
+### Case 1: Players Table only
+
+```python
+import pandas as pd
+import webbrowser
+from mostlyai.qa import report
+
+repo = "https://github.com/mostly-ai/mostlyai-qa"
+path = "/raw/refs/heads/main/examples/baseball-players"
+
+report_path, metrics = report(
+    syn_tgt_data=pd.read_parquet(f"{repo}/{path}/synthetic-target.pqt"),
+    trn_tgt_data=pd.read_parquet(f"{repo}/{path}/training-target.pqt"),
+    hol_tgt_data=pd.read_parquet(f"{repo}/{path}/holdout-target.pqt"),
+    report_subtitle=" for Baseball Players",
+    report_path="baseball-players.html",
+)
+print(metrics.model_dump_json(indent=4))
+
+webbrowser.open(f"file://{report_path.absolute()}")
+```
+
+👉 [HTML Report](https://html-preview.github.io/?url=https://github.com/mostly-ai/mostlyai-qa/blob/main/examples/baseball-players.html)
+
+### Case 2: Players Table with Context
+
+```python
+import pandas as pd
+import webbrowser
+from mostlyai.qa import report
+
+repo = "https://github.com/mostly-ai/mostlyai-qa"
+path = "/raw/refs/heads/main/examples/baseball-players"
+
+report_path, metrics = report(
+    syn_tgt_data=pd.read_parquet(f"{repo}/{path}/synthetic-target.pqt"),
+    syn_ctx_data=pd.read_parquet(f"{repo}/{path}/synthetic-context.pqt"),
+    trn_tgt_data=pd.read_parquet(f"{repo}/{path}/training-target.pqt"),
+    trn_ctx_data=pd.read_parquet(f"{repo}/{path}/training-context.pqt"),
+    hol_tgt_data=pd.read_parquet(f"{repo}/{path}/holdout-target.pqt"),
+    hol_ctx_data=pd.read_parquet(f"{repo}/{path}/holdout-context.pqt"),
+    tgt_context_key="id",
+    ctx_primary_key="id",
+    report_subtitle=" for Baseball Players",
+    report_path="baseball-players-with-context.html",
+)
+print(metrics.model_dump_json(indent=4))
+
+webbrowser.open(f"file://{report_path.absolute()}")
+```
+👉 [HTML Report](https://html-preview.github.io/?url=https://github.com/mostly-ai/mostlyai-qa/blob/main/examples/baseball-players-with-context.html)
+
+## Baseball Seasons (Sequential Data)
+
+### Case 1: Seasons Table only
+
+```python
+import pandas as pd
+import webbrowser
+from mostlyai.qa import report
+
+repo = "https://github.com/mostly-ai/mostlyai-qa"
+path = "/raw/refs/heads/main/examples/baseball-seasons"
+
+report_path, metrics = report(
+    syn_tgt_data=pd.read_parquet(f"{repo}/{path}/synthetic-target.pqt"),
+    trn_tgt_data=pd.read_parquet(f"{repo}/{path}/training-target.pqt"),
+    hol_tgt_data=pd.read_parquet(f"{repo}/{path}/holdout-target.pqt"),
+    report_subtitle=" for Baseball Seasons",
+    report_path="baseball-seasons.html",
+)
+print(metrics.model_dump_json(indent=4))
+
+webbrowser.open(f"file://{report_path.absolute()}")
+```
+
+👉 [HTML Report](https://html-preview.github.io/?url=https://github.com/mostly-ai/mostlyai-qa/blob/main/examples/baseball-seasons.html)
+
+### Case 2: Seasons Table with Context
+
+```python
+import pandas as pd
+import webbrowser
+from mostlyai.qa import report
+
+repo = "https://github.com/mostly-ai/mostlyai-qa"
+path = "/raw/refs/heads/main/examples/baseball-seasons"
+
+report_path, metrics = report(
+    syn_tgt_data=pd.read_parquet(f"{repo}/{path}/synthetic-target.pqt"),
+    syn_ctx_data=pd.read_parquet(f"{repo}/{path}/synthetic-context.pqt"),
+    trn_tgt_data=pd.read_parquet(f"{repo}/{path}/training-target.pqt"),
+    trn_ctx_data=pd.read_parquet(f"{repo}/{path}/training-context.pqt"),
+    hol_tgt_data=pd.read_parquet(f"{repo}/{path}/holdout-target.pqt"),
+    hol_ctx_data=pd.read_parquet(f"{repo}/{path}/holdout-context.pqt"),
+    tgt_context_key="id",
+    ctx_primary_key="id",
+    report_subtitle=" for Baseball Seasons",
+    report_path="baseball-seasons-with-context.html",
+)
+print(metrics.model_dump_json(indent=4))
+
+webbrowser.open(f"file://{report_path.absolute()}")
+```
+
+👉 [HTML Report](https://html-preview.github.io/?url=https://github.com/mostly-ai/mostlyai-qa/blob/main/examples/baseball-seasons-with-context.html)
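
Note on the four examples in this file: they differ only in the dataset name and in whether context tables are passed. As a rough sketch, the parquet URLs could be derived from those two inputs. The helper `example_urls` and the constants below are hypothetical names introduced for illustration; only the repository URL, the directory layout, the file names and the `report` keyword names are taken from the examples. The sketch joins path segments with a single slash, whereas the examples concatenate `repo` with a `path` that already begins with `/`.

```python
REPO = "https://github.com/mostly-ai/mostlyai-qa"
EXAMPLES = "raw/refs/heads/main/examples"
SPLITS = {"syn": "synthetic", "trn": "training", "hol": "holdout"}
TABLES = {"tgt": "target", "ctx": "context"}


def example_urls(dataset: str, with_context: bool = False) -> dict:
    """Map report() keyword names to parquet URLs for one example dataset.

    Hypothetical helper; not part of mostlyai-qa.
    """
    base = f"{REPO}/{EXAMPLES}/{dataset}"
    tables = ["tgt", "ctx"] if with_context else ["tgt"]
    return {
        f"{split}_{table}_data": f"{base}/{label}-{TABLES[table]}.pqt"
        for split, label in SPLITS.items()
        for table in tables
    }


urls = example_urls("baseball-seasons", with_context=True)
for keyword, url in urls.items():
    print(keyword, url)
```

Each URL could then be read with `pd.read_parquet` and the resulting frames passed to `report` by keyword, together with `tgt_context_key` and `ctx_primary_key` when context tables are included.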

docs/index.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+--8<-- "README.md"

docs/logo.png

46.9 KB

docs/metrics.md

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+
+# Metrics Definitions
+
+::: mostlyai.qa.metrics.Metrics
+::: mostlyai.qa.metrics.Accuracy
+::: mostlyai.qa.metrics.Similarity
+::: mostlyai.qa.metrics.Distances
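
Note on the `Distances` metrics referenced here: the docstring removed from the README describes `dcr_share` as the share of synthetic samples that are closer to a training sample than to a holdout sample, based on L2 nearest-neighbor distances. The following is a rough sketch of that idea on plain coordinate tuples rather than on the embeddings the library computes. The free functions `nearest_distance` and `dcr_share` and the way ties are counted are assumptions made for illustration, not the library's implementation.

```python
import math


def nearest_distance(point, others):
    """L2 distance from `point` to its nearest neighbor in `others`."""
    return min(math.dist(point, other) for other in others)


def dcr_share(syn, trn, hol):
    """Share of synthetic points closer to training than to holdout.

    Hypothetical helper; ties are counted as half (an assumption).
    """
    closer = 0.0
    for s in syn:
        d_trn = nearest_distance(s, trn)
        d_hol = nearest_distance(s, hol)
        if d_trn < d_hol:
            closer += 1.0
        elif d_trn == d_hol:
            closer += 0.5
    return closer / len(syn)


synthetic = [(0.0, 0.0), (1.0, 1.0)]
training = [(0.0, 0.1), (5.0, 5.0)]
holdout = [(1.0, 1.1), (9.0, 9.0)]
share = dcr_share(synthetic, training, holdout)
print(share)
```

Here the first synthetic point lies nearest to a training point and the second nearest to a holdout point, so the share is 0.5, which matches the expectation stated in the docstring that the value should not be significantly larger than 50%.
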
1.22 MB
Binary file not shown.
