feat: add descriptive statistics module #331

LauraBoenchenLB · 2025-02-12T12:09:38Z

Medmodels Pull Request

Description

Added descriptive statistics for different attribute types. Also moved the functions that were previously in the _overview.py module to the newly created module. Fixed and added tests.
This module is for descriptive statistics only; the evaluator will use these functions for analyses.


MarIniOnz

Great PR!

Small naming comments and a couple of performance questions! Excited to use the data comparer!

MarIniOnz · 2025-02-14T08:34:32Z

medmodels/statistic_evaluations/statistical_analysis/descriptive_statistics.py

+    assert isinstance(third_quartile, (int, float))
+
+    return {
+        "type": "Continuous",


Maybe it would be better to use an Enum or a typed variable here, so that we do not have bare strings lying around.
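
To make the suggestion concrete, here is a minimal sketch of what an enum-based type tag could look like. `AttributeType.Temporal` does appear later in this PR, but the other member names, the `str` mixin, and the `describe_continuous` helper are assumptions made for illustration only, not the module's actual code.

```python
from enum import Enum
from typing import Dict, Union


class AttributeType(str, Enum):
    """Kinds of attributes the descriptive statistics distinguish (assumed members)."""

    Categorical = "Categorical"
    Continuous = "Continuous"
    Temporal = "Temporal"


def describe_continuous(
    minimum: float, maximum: float
) -> Dict[str, Union[AttributeType, float]]:
    # Hypothetical helper: tags the result with an enum member instead of a raw string.
    return {"type": AttributeType.Continuous, "min": minimum, "max": maximum}


info = describe_continuous(0.0, 1.0)
```

With the `str` mixin, existing comparisons against the plain string (for example `info["type"] == "Continuous"`) would keep working while new code can compare against the enum member.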

MarIniOnz · 2025-02-14T08:34:59Z

medmodels/statistic_evaluations/statistical_analysis/descriptive_statistics.py

+        TemporalAttributeInfo: Dictionary containg attribute metrics.
+    """
+    return {
+        "type": "Temporal",


same as above

MarIniOnz · 2025-02-14T08:35:07Z

medmodels/statistic_evaluations/statistical_analysis/descriptive_statistics.py

+    assert isinstance(q3, datetime)
+
+    return {
+        "type": "Temporal",


same as above

MarIniOnz · 2025-02-14T08:35:26Z

medmodels/statistic_evaluations/statistical_analysis/descriptive_statistics.py

+        values_string = f"{len(values)} {long_string_suffix}"
+
+    return {
+        "type": "Categorical",


same as above

MarIniOnz · 2025-03-10T08:34:25Z

medmodels/statistic_evaluations/statistical_analysis/attribute_analysis.py

+    """Dictionary for a string attribute and its values."""
+
+    type: Literal["Categorical"]
+    count: int


I would change it to number of categories

MarIniOnz · 2025-03-10T09:02:19Z

medmodels/statistic_evaluations/statistical_analysis/attribute_analysis.py

+
+    for dictionary in attribute_dictionary.values():
+        for key, value in dictionary.items():
+            data.setdefault(key, []).append(value)


I am not sure about this, but it looks like it may not perform optimally. @JabobKrauskopf?
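
For context, the loop in the hunk above regroups a dictionary of per-attribute dictionaries into one list per key, which is linear in the total number of entries. The sketch below shows that loop next to a `collections.defaultdict` variant that avoids creating a throwaway empty list on every `setdefault` call. The sample input and both function names are hypothetical, and whether the difference is measurable here would need profiling.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical input: one inner dictionary of metrics per attribute.
attribute_dictionary: Dict[str, Dict[str, Any]] = {
    "age": {"type": "Continuous", "min": 20, "max": 80},
    "gender": {"type": "Categorical", "count": 2},
}


def collect_with_setdefault(rows: Dict[str, Dict[str, Any]]) -> Dict[str, List[Any]]:
    # Same shape as the loop under review.
    data: Dict[str, List[Any]] = {}
    for dictionary in rows.values():
        for key, value in dictionary.items():
            data.setdefault(key, []).append(value)
    return data


def collect_with_defaultdict(rows: Dict[str, Dict[str, Any]]) -> Dict[str, List[Any]]:
    # Equivalent result; the list is only created the first time a key is seen.
    data: Dict[str, List[Any]] = defaultdict(list)
    for dictionary in rows.values():
        for key, value in dictionary.items():
            data[key].append(value)
    return dict(data)


columns = collect_with_defaultdict(attribute_dictionary)
```

Note that in both variants the lists can end up with different lengths when some inner dictionaries lack a key.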

MarIniOnz · 2025-03-10T09:02:45Z

medmodels/statistic_evaluations/statistical_analysis/attribute_analysis.py

+        if isinstance(data_type, DateTime):
+            return AttributeType.Temporal
+
+    # TODO @Laura: add new string type after PR #325


Create an issue to reference it

MarIniOnz · 2025-03-10T09:07:29Z

medmodels/statistic_evaluations/statistical_analysis/concepts_analysis.py

+    concept_nodes = medrecord.group(concept)["nodes"]
+
+    for concept_node in concept_nodes:
+        concept_counts[concept_node] = len(
+            medrecord.edges_connecting(
+                concept_node,
+                medrecord.group(cohort)["nodes"],
+                directed=EdgesDirected.UNDIRECTED,
+            )
+        )


Optimization problem found when using edges_connecting. Querying for those edges is apparently 100x faster.

Suggested change

-    concept_nodes = medrecord.group(concept)["nodes"]
-
-    for concept_node in concept_nodes:
-        concept_counts[concept_node] = len(
-            medrecord.edges_connecting(
-                concept_node,
-                medrecord.group(cohort)["nodes"],
-                directed=EdgesDirected.UNDIRECTED,
-            )
-        )
+    concept_nodes = medrecord.group(concept)["nodes"]
+
+    def test_one(edge: EdgeOperand):
+        edge.source_node().in_group(cohort)
+        edge.target_node().index().equal_to(concept_node)
+
+    def test_two(edge: EdgeOperand):
+        edge.target_node().in_group(cohort)
+        edge.source_node().index().equal_to(concept_node)
+
+    for concept_node in tqdm(concept_nodes, desc="Getting concept counts"):
+        concept_counts[concept_node] = len(
+            medrecord.select_edges(lambda edge: edge.either_or(test_one, test_two))
+        )

Please adapt the typing and docstrings of the query functions accordingly. @JabobKrauskopf we should probably look into this mismatch in efficiency. Although I do not know whether this could somehow be implemented inside Rust; I ask because I also use it in MTGAN and it takes a lot of time.
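
As a library-independent illustration of what is being counted, the sketch below computes the same per-concept edge counts in a single pass over an edge list, using the same two conditions as `test_one` and `test_two`. It deliberately does not use the MedRecord API: edges are plain `(source, target)` tuples, and the function name and sample data are hypothetical. Whether MedRecord offers an efficient way to iterate edges like this is not established here.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]  # (source, target), an assumed representation


def count_cohort_edges_per_concept(
    edges: Iterable[Edge],
    concept_nodes: Set[Hashable],
    cohort_nodes: Set[Hashable],
) -> Dict[Hashable, int]:
    """Count, per concept node, the edges linking it to a cohort node in either direction."""
    counts: Dict[Hashable, int] = {node: 0 for node in concept_nodes}
    for source, target in edges:
        # Mirrors test_one: cohort node -> concept node.
        if source in cohort_nodes and target in concept_nodes:
            counts[target] += 1
        # Mirrors test_two: concept node -> cohort node (self-loops counted only once).
        if target in cohort_nodes and source in concept_nodes and source != target:
            counts[source] += 1
    return counts


example_edges = [("p1", "c1"), ("c1", "p2"), ("p3", "c1"), ("p1", "c2"), ("x", "c2")]
example_counts = count_cohort_edges_per_concept(
    example_edges, concept_nodes={"c1", "c2", "c3"}, cohort_nodes={"p1", "p2"}
)
```

The point of the single pass is that every edge is inspected once, instead of running one query per concept node.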

MarIniOnz · 2025-03-10T09:10:50Z

medmodels/statistic_evaluations/statistical_analysis/concepts_analysis.py

+def extract_top_k_concepts(
+    concept_counts: Dict[NodeIndex, int], top_k: int
+) -> List[Tuple[NodeIndex, int]]:
+    """Extract the topk concepts from a concept count dictionary.


Suggested change

-    """Extract the topk concepts from a concept count dictionary.
+    """Extract the top k concepts from a concept count dictionary.

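Only the signature and the first docstring line of `extract_top_k_concepts` are visible in this hunk. As a rough sketch of what a function with that signature could do, the version below uses `heapq.nlargest`. The body, the `NodeIndex` stand-in alias, and the sample data are assumptions, and the real implementation may differ, for example in how it handles a `top_k` larger than the number of concepts.

```python
import heapq
from typing import Dict, List, Tuple, Union

# Assumption: stand-in alias; the real NodeIndex type comes from medmodels.
NodeIndex = Union[str, int]


def extract_top_k_concepts(
    concept_counts: Dict[NodeIndex, int], top_k: int
) -> List[Tuple[NodeIndex, int]]:
    """Extract the top k concepts from a concept count dictionary."""
    # nlargest returns (concept, count) pairs sorted by count, descending.
    return heapq.nlargest(top_k, concept_counts.items(), key=lambda item: item[1])


example_counts = {"diagnosis_a": 3, "diagnosis_b": 7, "diagnosis_c": 5}
top_two = extract_top_k_concepts(example_counts, 2)
```

In this sketch, asking for more concepts than exist simply returns all of them in descending order of count.
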
MarIniOnz · 2025-03-10T09:18:13Z

medmodels/treatment_effect/matching/tests/test_matching.py

        ):
            neighbors_matching._check_nodes(
-                medrecord=self.medrecord,
+                medrecord=medrecord_missing,


Why only use it for this specific case?

refactor overview and add tests

7b0c19e

LauraBoenchenLB changed the title ~~refactor overview and add tests~~ feat: add descriptive statistics module Feb 12, 2025

LauraBoenchenLB marked this pull request as ready for review February 12, 2025 12:20

LauraBoenchenLB requested review from a team and JabobKrauskopf as code owners February 12, 2025 12:20

LauraBoenchenLB requested review from FloLimebit, MarIniOnz, OFranke and philippgrosser February 12, 2025 12:20

Merge branch 'epic/157-create-an-evaluatorcomparer' into 215-descript…

5df982f

…ive-statistics

LauraBoenchenLB marked this pull request as draft February 13, 2025 12:29

added datatype from type

ee1e7d9

LauraBoenchenLB marked this pull request as ready for review February 13, 2025 13:29

LauraBoenchenLB added 4 commits February 17, 2025 18:51

update summary functions

5c90531

added concept analysis

d7965e4

fix linting errors

f26c391

added functionality attribute values

932bc46

MarIniOnz suggested changes Mar 10, 2025


LauraBoenchenLB marked this pull request as draft April 7, 2025 11:12

3 participants
