Commit 8dbdaeb
feat: use encoded data consistently for similarity & distances (#187)
- use same numerical encoded data for similarity & distance sections
- removed logic for `pull_data_for_embeddings`
- drop similarity from Data Report
- no need to store PCA models to statistics_path anymore
- drop `max_sample_size_embeddings` arg from `report_from_statistics`
- use a `HistGradientBoostingClassifier` for Discriminator AUC (instead of `LogisticRegression`)
- check distances for subsets of columns
- only report the (sub)set with lowest DCR share
- try all columns, subset of correlated columns, and subset of random columns

1 parent 3f956a5 commit 8dbdaeb
File tree
21 files changed, +1319 -1264 lines changed
- docs
- examples
- mostlyai/qa
- tests
- end_to_end
- unit