[Memory] Fix large memory usage during __init__ call. #620

LogicFan · 2025-11-19T08:29:53Z

Contributor: Yongda Fan (yongdaf2@illinois.edu), John Wu

Contribution Type: Datasets

Description
Fix the large memory usage during the dataset's __init__ call.

PR #601 was used as a reference.

Co-authored-by: John Wu <54558896+jhnwu3@users.noreply.github.com>

LogicFan · 2025-11-19T08:32:09Z

Before this fix:


LogicFan · 2025-11-19T08:36:03Z

After this fix:

LogicFan · 2025-11-19T08:36:41Z

Note: this does not fix the memory issue when we set a task (that still blows up memory and will be fixed in subsequent PRs).


LogicFan · 2025-11-19T09:20:57Z

@jhnwu3, this lowercasing op seems to have no effect?

attribute_columns = [
    pl.col(attr.lower()).alias(f"{table_name}/{attr}") for attr in attribute_cols
]

The alias changes it back to the upper-case name anyway?
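To make the question concrete, here is a minimal pure-Python sketch of just the name handling in that expression (no Polars involved). The table and attribute names are made up for illustration. It suggests that `.lower()` only affects which input column is read, while the alias is built from the original `attr`, so the emitted column name keeps its original casing.

```python
# Hypothetical values for illustration; not taken from the PR's code.
table_name = "patients"
attribute_cols = ["Gender", "DOB"]

# Map: column name that is looked up (lowercased)
#   -> column name that is emitted (alias built from the original attr).
selected = {attr.lower(): f"{table_name}/{attr}" for attr in attribute_cols}

for source, output in selected.items():
    print(f"reads {source!r} -> emits {output!r}")
```

If downstream code expects lower-case output names, the alias would presumably need `attr.lower()` as well; whether that is the intended behavior is for the PR authors to decide.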

examples/memtest.py

jhnwu3

Thanks for the catch. I also found a new bug: when num_workers > 4, the tqdm progress bar doesn't show for some reason. Will investigate further.

* Add script to test memory usage of dataset
  Co-authored-by: John Wu <54558896+jhnwu3@users.noreply.github.com>
* Remove collect_schema usage during __init__
  Co-authored-by: John Wu <54558896+jhnwu3@users.noreply.github.com>
* Fix some code still accessing upper case column name
  Co-authored-by: John Wu <54558896+jhnwu3@users.noreply.github.com>

---------

Co-authored-by: John Wu <54558896+jhnwu3@users.noreply.github.com>

Add script to test memory usage of dataset

a20ef77

Co-authored-by: John Wu <54558896+jhnwu3@users.noreply.github.com>
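The contents of examples/memtest.py are not reproduced in this conversation. As a rough illustration of what a memory check of this kind can look like, the sketch below uses only the standard library's `tracemalloc`. The names `peak_memory_mib`, `build_eager`, and `build_lazy` are hypothetical stand-ins, not code from this repository. One caveat: `tracemalloc` only sees allocations made through Python's allocator, so buffers allocated by native libraries may not be counted, and a process-level measure such as peak RSS would be needed for those.

```python
import tracemalloc


def peak_memory_mib(fn, *args, **kwargs):
    """Run fn and return (result, peak traced Python memory in MiB)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


def build_eager(n):
    # Stand-in for an __init__ that materializes all data up front.
    return list(range(n))


def build_lazy(n):
    # Stand-in for an __init__ that only records how to produce the data.
    return range(n)


_, eager_peak = peak_memory_mib(build_eager, 200_000)
_, lazy_peak = peak_memory_mib(build_lazy, 200_000)
print(f"eager: {eager_peak:.2f} MiB, lazy: {lazy_peak:.2f} MiB")
```

The same wrapper could be pointed at a dataset constructor to compare peak usage before and after a change, under the caveat above.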

Remove collect_schema usage during __init__

bf2a86d

Co-authored-by: John Wu <54558896+jhnwu3@users.noreply.github.com>

LogicFan requested a review from jhnwu3 November 19, 2025 08:35

LogicFan marked this pull request as ready for review November 19, 2025 08:36

Fix some code still accessing upper case column name

bcb94e2

Co-authored-by: John Wu <54558896+jhnwu3@users.noreply.github.com>

dalloliogm reviewed Nov 19, 2025


examples/memtest.py

jhnwu3 approved these changes Nov 19, 2025


jhnwu3 merged commit 28269d9 into sunlabuiuc:master Nov 19, 2025
1 check passed
