Commit 8dbdaeb

feat: use encoded data consistently for similarity & distances (#187)
- use same numerical encoded data for similarity & distance sections
- removed logic for `pull_data_for_embeddings`
- drop similarity from Data Report
- no need to store PCA models to statistics_path anymore
- drop `max_sample_size_embeddings` arg from `report_from_statistics`
- use `HistGradientBoostingClassifiers` for Discriminator AUC (instead of `LogisticRegression`)
- check distances for subsets of columns
- only report the (sub)set with lowest DCR share
- try all columns, subset of correlated columns, and subset of random columns
1 parent 3f956a5 commit 8dbdaeb
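
The column-subset check described in the commit message can be sketched as follows. This is a minimal illustration, not the library's implementation. It assumes that "DCR share" means the fraction of synthetic records whose closest training record is strictly nearer than their closest holdout record, and that distances are Euclidean on already-encoded numeric rows. The commit defines neither. The helper names and the two hand-picked subsets are hypothetical; how the correlated and random subsets are actually chosen is not shown here.

```python
import math


def nearest_distance(record, pool, cols):
    """Euclidean distance from `record` to its closest row in `pool`, using only `cols`."""
    point = [record[c] for c in cols]
    return min(math.dist(point, [row[c] for c in cols]) for row in pool)


def dcr_share(syn, trn, hol, cols):
    """Assumed definition: share of synthetic rows strictly closer to training than to holdout."""
    closer_to_trn = sum(
        nearest_distance(rec, trn, cols) < nearest_distance(rec, hol, cols)
        for rec in syn
    )
    return closer_to_trn / len(syn)


def pick_column_set(syn, trn, hol, candidates):
    """Compute the share for every candidate column set and return the lowest one."""
    shares = {name: dcr_share(syn, trn, hol, cols) for name, cols in candidates.items()}
    return min(shares, key=shares.get), shares


# Toy encoded data: two numeric columns per row (illustrative values only).
trn = [[0.0, 0.0], [4.0, 4.0]]
hol = [[1.0, 3.0], [3.0, 1.0]]
syn = [[0.2, 0.4], [3.4, 1.5], [0.3, 2.8], [3.8, 3.4]]

# Hypothetical candidate sets standing in for "all", "correlated" and "random" columns.
candidates = {"all": [0, 1], "subset_a": [0], "subset_b": [1]}

chosen, shares = pick_column_set(syn, trn, hol, candidates)
print(chosen, shares)
```

On this toy data the three candidate sets yield different shares, and only the one with the lowest share is kept, mirroring the "only report the (sub)set with lowest DCR share" bullet.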

21 files changed, +1319 -1264 lines changed

docs/examples.md

Lines changed: 4 additions & 3 deletions
@@ -17,10 +17,11 @@ from mostlyai import qa
 repo = "https://github.com/mostly-ai/mostlyai-qa"
 path = "/raw/refs/heads/main/examples/baseball-players"
 
+cols = ["birthDate", "deathDate", "weight", "height", "bats", "throws"]
 report_path, metrics = qa.report(
-    syn_tgt_data=pd.read_parquet(f"{repo}/{path}/synthetic-target.pqt"),
-    trn_tgt_data=pd.read_parquet(f"{repo}/{path}/training-target.pqt"),
-    hol_tgt_data=pd.read_parquet(f"{repo}/{path}/holdout-target.pqt"),
+    syn_tgt_data=pd.read_parquet(f"{repo}/{path}/synthetic-target.pqt")[cols],
+    trn_tgt_data=pd.read_parquet(f"{repo}/{path}/training-target.pqt")[cols],
+    hol_tgt_data=pd.read_parquet(f"{repo}/{path}/holdout-target.pqt")[cols],
     report_subtitle=" for Baseball Players",
     report_path="baseball-players.html",
 )

examples/baseball-players-with-context.html

Lines changed: 139 additions & 69 deletions
Large diffs are not rendered by default.

examples/baseball-players.html

Lines changed: 134 additions & 103 deletions
Large diffs are not rendered by default.

examples/baseball-seasons-with-context.html

Lines changed: 196 additions & 121 deletions
Large diffs are not rendered by default.

examples/baseball-seasons.html

Lines changed: 180 additions & 105 deletions
Large diffs are not rendered by default.

mkdocs.yml

Lines changed: 3 additions & 2 deletions
@@ -44,15 +44,16 @@ plugins:
             show_root_heading: false
             show_object_full_path: true
             show_bases: false
-            show_docstring: true
             show_source: false
             show_signature: true
             separate_signature: true
-            show_docstring_examples: true
             docstring_section_style: table
             extensions:
               - griffe_fieldz
             docstring_style: google
+            extra:
+              show_docstring: true
+              show_docstring_examples: true
 
 markdown_extensions:
   - pymdownx.highlight:
