@LogicFan
Collaborator

@LogicFan LogicFan commented Nov 19, 2025

Contributor: Yongda Fan (yongdaf2@illinois.edu), John Wu

Contribution Type: Datasets

Description
Fix the large memory usage during the dataset's __init__ call.

PR #601 was used as a reference.
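
To illustrate the general idea of keeping `__init__` cheap, here is a minimal, standard-library-only sketch. It is not the code from this PR: `LazyTable`, `rows`, and `peak_bytes` are hypothetical names made up for the example, and the row data is synthetic. It compares peak Python-level allocation when a dataset-like object only records how to build its data against the peak once the data is actually materialized.

```python
import tracemalloc


class LazyTable:
    """Hypothetical stand-in for a dataset that defers loading its rows."""

    def __init__(self, n_rows: int):
        # Only record how to build the data; no rows are allocated here.
        self.n_rows = n_rows
        self._rows = None

    def rows(self) -> list:
        # Materialize on first access and cache the result.
        if self._rows is None:
            self._rows = [(i, str(i)) for i in range(self.n_rows)]
        return self._rows


def peak_bytes(fn) -> int:
    """Return the peak Python-level allocation (in bytes) while running fn."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


init_peak = peak_bytes(lambda: LazyTable(100_000))
load_peak = peak_bytes(lambda: LazyTable(100_000).rows())
print(f"__init__ only: {init_peak} bytes, after rows(): {load_peak} bytes")
```

Note that `tracemalloc` only sees allocations made through Python's allocators. Memory allocated natively by a library such as Polars would not show up here, so a real measurement of the dataset would need a process-level metric instead.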

Co-authored-by: John Wu <54558896+jhnwu3@users.noreply.github.com>
@LogicFan
Collaborator Author

Before this fix:
image

@LogicFan LogicFan requested a review from jhnwu3 November 19, 2025 08:35
@LogicFan
Collaborator Author

After this fix:
image

@LogicFan
Collaborator Author

Note: this does not fix the memory issue when we set a task (that still blows up the memory, and will be fixed in subsequent PRs).

@LogicFan LogicFan marked this pull request as ready for review November 19, 2025 08:36
@LogicFan
Collaborator Author

@jhnwu3, this lower-casing op seems to have no effect?

attribute_columns = [
    pl.col(attr.lower()).alias(f"{table_name}/{attr}") for attr in attribute_cols
]

The alias changes it back to the upper-case name anyway?

Collaborator

@jhnwu3 jhnwu3 left a comment

Thanks for the catch. I did find a new bug: when num_workers > 4, the tqdm progress bar doesn't show for some reason. Will investigate further.

@jhnwu3 jhnwu3 merged commit 28269d9 into sunlabuiuc:master Nov 19, 2025
1 check passed
dalloliogm pushed a commit to dalloliogm/PyHealth that referenced this pull request Nov 26, 2025
* Add script to test memory usage of dataset

Co-authored-by: John Wu <54558896+jhnwu3@users.noreply.github.com>

* Remove collect_schema usage during __init__

Co-authored-by: John Wu <54558896+jhnwu3@users.noreply.github.com>

* Fix some code still accessing upper case column name

Co-authored-by: John Wu <54558896+jhnwu3@users.noreply.github.com>

---------

Co-authored-by: John Wu <54558896+jhnwu3@users.noreply.github.com>