Skip to content

Conversation

@LauraBoenchenLB
Copy link
Contributor

@LauraBoenchenLB LauraBoenchenLB commented Feb 12, 2025

Medmodels Pull Request

Description

Added descriptive statistics for different attribute types. Also moved the functions that were in the _overview.py module before to the newly created module. Fixed and added tests.
This module is for descriptive statistics only, the evaluator will use these functions for analyses.

@LauraBoenchenLB LauraBoenchenLB changed the title refactor overview and add tests feat: add descriptive statistics module Feb 12, 2025
@LauraBoenchenLB LauraBoenchenLB marked this pull request as ready for review February 12, 2025 12:20
@LauraBoenchenLB LauraBoenchenLB marked this pull request as draft February 13, 2025 12:29
@LauraBoenchenLB LauraBoenchenLB marked this pull request as ready for review February 13, 2025 13:29
Copy link
Contributor

@MarIniOnz MarIniOnz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR!

Small naming comments and a couple of performance questions! Excited to use the data comparer!

assert isinstance(third_quartile, (int, float))

return {
"type": "Continuous",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it is better to use a Enum or typed variable in here, so that we do not have strings around.

TemporalAttributeInfo: Dictionary containg attribute metrics.
"""
return {
"type": "Temporal",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

assert isinstance(q3, datetime)

return {
"type": "Temporal",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

values_string = f"{len(values)} {long_string_suffix}"

return {
"type": "Categorical",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

"""Dictionary for a string attribute and its values."""

type: Literal["Categorical"]
count: int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would change it to number of categories


for dictionary in attribute_dictionary.values():
for key, value in dictionary.items():
data.setdefault(key, []).append(value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not know about this, but it looks a bit like it does not have the optimal performance. @JabobKrauskopf ?

if isinstance(data_type, DateTime):
return AttributeType.Temporal

# TODO @Laura: add new string type after PR #325
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create an issue to reference it

Comment on lines +28 to +37
concept_nodes = medrecord.group(concept)["nodes"]

for concept_node in concept_nodes:
concept_counts[concept_node] = len(
medrecord.edges_connecting(
concept_node,
medrecord.group(cohort)["nodes"],
directed=EdgesDirected.UNDIRECTED,
)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optimization problem found when using edges_connecting. Querying for those edges is apparently 100x faster.

Suggested change
concept_nodes = medrecord.group(concept)["nodes"]
for concept_node in concept_nodes:
concept_counts[concept_node] = len(
medrecord.edges_connecting(
concept_node,
medrecord.group(cohort)["nodes"],
directed=EdgesDirected.UNDIRECTED,
)
)
concept_nodes = medrecord.group(concept)["nodes"]
def test_one(edge: EdgeOperand):
edge.source_node().in_group(cohort)
edge.target_node().index().equal_to(concept_node)
def test_two(edge: EdgeOperand):
edge.target_node().in_group(cohort)
edge.source_node().index().equal_to(concept_node)
for concept_node in tqdm(concept_nodes, desc="Getting concept counts"):
concept_counts[concept_node] = len(
medrecord.select_edges(lambda edge: edge.either_or(test_one, test_two))
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please adequate typing and docstrings of the query function accordingly. @JabobKrauskopf we should probably look into this mismatch in efficiency. Although I do not know if this could be someone implemented inside Rust, since I also use it in MTGAN and it takes a lot of time.

def extract_top_k_concepts(
concept_counts: Dict[NodeIndex, int], top_k: int
) -> List[Tuple[NodeIndex, int]]:
"""Extract the topk concepts from a concept count dictionary.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Extract the topk concepts from a concept count dictionary.
"""Extract the top k concepts from a concept count dictionary.

):
neighbors_matching._check_nodes(
medrecord=self.medrecord,
medrecord=medrecord_missing,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only use it for this specific case?

@LauraBoenchenLB LauraBoenchenLB marked this pull request as draft April 7, 2025 11:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants