feat: add descriptive statistics module #331
base: epic/157-create-an-evaluatorcomparer
MarIniOnz
left a comment
Great PR!
Small naming comments and a couple of performance questions! Excited to use the data comparer!
| assert isinstance(third_quartile, (int, float)) | ||
|
|
||
| return { | ||
| "type": "Continuous", |
Maybe it is better to use an Enum or a typed variable here, so that we do not have bare strings scattered around.
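A minimal sketch of what this could look like, assuming an `AttributeType` enum along the lines of the one used further down in this PR (`AttributeType.Temporal`). The other member names and the helper `describe_continuous` are hypothetical and only meant as illustration:

```python
from enum import Enum
from typing import Dict, Union


class AttributeType(Enum):
    """Hypothetical enum; only the `Temporal` member appears in this PR's diff."""

    Categorical = "Categorical"
    Continuous = "Continuous"
    Temporal = "Temporal"


def describe_continuous(
    minimum: float, maximum: float
) -> Dict[str, Union[str, float]]:
    """Illustrative helper: build an info dict without a bare string literal."""
    return {
        # The enum value replaces the hard-coded "Continuous" string.
        "type": AttributeType.Continuous.value,
        "min": minimum,
        "max": maximum,
    }


info = describe_continuous(0.0, 10.0)
```

Using `.value` keeps the returned dictionary compatible with a `Literal["Continuous"]` style annotation at runtime, while the spelling lives in one place.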
| TemporalAttributeInfo: Dictionary containg attribute metrics. | ||
| """ | ||
| return { | ||
| "type": "Temporal", |
same as above
| assert isinstance(q3, datetime) | ||
|
|
||
| return { | ||
| "type": "Temporal", |
same as above
| values_string = f"{len(values)} {long_string_suffix}" | ||
|
|
||
| return { | ||
| "type": "Categorical", |
same as above
| """Dictionary for a string attribute and its values.""" | ||
|
|
||
| type: Literal["Categorical"] | ||
| count: int |
I would change it to number of categories
|
|
||
| for dictionary in attribute_dictionary.values(): | ||
| for key, value in dictionary.items(): | ||
| data.setdefault(key, []).append(value) |
I am not sure about this, but it looks like it may not have optimal performance. @JabobKrauskopf?
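For reference, a sketch of the same transposition written with `collections.defaultdict`, assuming `attribute_dictionary` maps attribute names to flat metric dictionaries. The function name `transpose_metrics` and the sample data are made up for illustration. Both versions visit each key/value pair exactly once, so any difference should be a constant factor rather than a change in complexity:

```python
from collections import defaultdict
from typing import Any, Dict, List


def transpose_metrics(
    attribute_dictionary: Dict[str, Dict[str, Any]],
) -> Dict[str, List[Any]]:
    """Collect the values of every metric key across all attributes."""
    data: Dict[str, List[Any]] = defaultdict(list)
    for dictionary in attribute_dictionary.values():
        for key, value in dictionary.items():
            # defaultdict creates the list on first access.
            data[key].append(value)
    return dict(data)


# Made-up sample input, not taken from the PR.
example = {
    "age": {"type": "Continuous", "min": 20},
    "gender": {"type": "Categorical", "count": 2},
}
columns = transpose_metrics(example)
```

Note that, as in the original loop, keys that are missing for some attributes produce lists of different lengths.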
| if isinstance(data_type, DateTime): | ||
| return AttributeType.Temporal | ||
|
|
||
| # TODO @Laura: add new string type after PR #325 |
Create an issue to reference it
| concept_nodes = medrecord.group(concept)["nodes"] | ||
|
|
||
| for concept_node in concept_nodes: | ||
| concept_counts[concept_node] = len( | ||
| medrecord.edges_connecting( | ||
| concept_node, | ||
| medrecord.group(cohort)["nodes"], | ||
| directed=EdgesDirected.UNDIRECTED, | ||
| ) | ||
| ) |
Performance problem found when using edges_connecting: querying for those edges with select_edges is apparently 100x faster.
| concept_nodes = medrecord.group(concept)["nodes"] | |
| for concept_node in concept_nodes: | |
| concept_counts[concept_node] = len( | |
| medrecord.edges_connecting( | |
| concept_node, | |
| medrecord.group(cohort)["nodes"], | |
| directed=EdgesDirected.UNDIRECTED, | |
| ) | |
| ) | |
| concept_nodes = medrecord.group(concept)["nodes"] | |
| def test_one(edge: EdgeOperand): | |
| edge.source_node().in_group(cohort) | |
| edge.target_node().index().equal_to(concept_node) | |
| def test_two(edge: EdgeOperand): | |
| edge.target_node().in_group(cohort) | |
| edge.source_node().index().equal_to(concept_node) | |
| for concept_node in tqdm(concept_nodes, desc="Getting concept counts"): | |
| concept_counts[concept_node] = len( | |
| medrecord.select_edges(lambda edge: edge.either_or(test_one, test_two)) | |
| ) |
Please adapt the typing and docstrings of the query function accordingly. @JabobKrauskopf we should probably look into this mismatch in efficiency. I do not know whether this could somehow be implemented inside Rust, but I also use it in MTGAN and it takes a lot of time there.
| def extract_top_k_concepts( | ||
| concept_counts: Dict[NodeIndex, int], top_k: int | ||
| ) -> List[Tuple[NodeIndex, int]]: | ||
| """Extract the topk concepts from a concept count dictionary. |
| """Extract the topk concepts from a concept count dictionary. | |
| """Extract the top k concepts from a concept count dictionary. |
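One possible way to implement a function with this signature, as a sketch only and not the code from this PR: `heapq.nlargest` returns the k largest items without sorting the whole dictionary. `NodeIndex` is aliased to `Union[str, int]` here as an assumption, and the error handling is illustrative:

```python
import heapq
from typing import Dict, List, Tuple, Union

# Assumption: node indices are strings or integers.
NodeIndex = Union[str, int]


def extract_top_k_concepts(
    concept_counts: Dict[NodeIndex, int], top_k: int
) -> List[Tuple[NodeIndex, int]]:
    """Extract the top k concepts from a concept count dictionary."""
    if top_k > len(concept_counts):
        # Illustrative check; the behaviour in the PR may differ.
        raise ValueError(
            f"Requested {top_k} concepts, but only {len(concept_counts)} exist."
        )
    # Rank (concept, count) pairs by their count, highest first.
    return heapq.nlargest(top_k, concept_counts.items(), key=lambda item: item[1])


# Made-up counts for illustration.
counts = {"diabetes": 5, "asthma": 2, "flu": 9}
top_two = extract_top_k_concepts(counts, top_k=2)
```

For small dictionaries, `sorted(concept_counts.items(), key=..., reverse=True)[:top_k]` gives the same result and may be easier to read.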
| ): | ||
| neighbors_matching._check_nodes( | ||
| medrecord=self.medrecord, | ||
| medrecord=medrecord_missing, |
Why only use it for this specific case?
Medmodels Pull Request
Description
Added descriptive statistics for different attribute types. Also moved the functions that were previously in the _overview.py module to the newly created module. Fixed and added tests.
This module is for descriptive statistics only; the evaluator will use these functions for analyses.