[Memory] Fix large memory usage during __init__ call. #620
Co-authored-by: John Wu <54558896+jhnwu3@users.noreply.github.com>
Note: this does not fix the memory issue when we set a task (that would still blow up memory, and will be fixed in subsequent PRs).
@jhnwu3, this lowercase-name op seems to have no effect?

    attribute_columns = [
        pl.col(attr.lower()).alias(f"{table_name}/{attr}") for attr in attribute_cols
    ]

It changes the name back to upper case anyway?
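To make the question concrete, here is a minimal pure-Python sketch of the naming in the quoted expression. It only mirrors the string handling: `attr.lower()` picks the source column, while the alias is built from the original `attr`. The helper `attribute_column_names` and the sample table and attribute names are hypothetical, and the sketch assumes the source columns were lowercased in an earlier step.

```python
def attribute_column_names(table_name, attribute_cols):
    """Return (source column, output alias) pairs as the quoted expression would name them."""
    return [(attr.lower(), f"{table_name}/{attr}") for attr in attribute_cols]


# Hypothetical table and attribute names, for illustration only.
pairs = attribute_column_names("patients", ["Gender", "DOB"])
for source, alias in pairs:
    print(f"{source} -> {alias}")
```

Under that assumption the lowercasing still decides which column is read, but the output name keeps whatever casing the attribute list uses.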
jhnwu3
left a comment
Thanks for the catch. I did find a new bug: when num_workers > 4, the tqdm progress bar doesn't show for some reason. Will investigate further.
* Add script to test memory usage of dataset
* Remove collect_schema usage during __init__
* Fix some code still accessing upper case column name

Co-authored-by: John Wu <54558896+jhnwu3@users.noreply.github.com>


Contributor: Yongda Fan (yongdaf2@illinois.edu), John Wu
Contribution Type: Datasets
Description
Fix the large memory usage during the dataset's `__init__` call. PR #601 was used as a reference.
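As a rough illustration of how memory use during `__init__` could be checked, here is a minimal sketch using the standard-library `tracemalloc` module. It is not the script added in this PR. `EagerDataset`, `LazyDataset`, and `peak_init_memory` are hypothetical stand-ins, and `tracemalloc` only sees allocations made through Python's allocator, so memory held by native libraries would need a process-level measurement instead.

```python
import tracemalloc


class EagerDataset:
    """Hypothetical stand-in: materializes every row inside __init__."""

    def __init__(self, n_rows):
        self.rows = [(i, str(i)) for i in range(n_rows)]


class LazyDataset:
    """Hypothetical stand-in: __init__ only records how to produce rows."""

    def __init__(self, n_rows):
        self.n_rows = n_rows

    def iter_rows(self):
        # Rows are generated on demand rather than stored on the instance.
        return ((i, str(i)) for i in range(self.n_rows))


def peak_init_memory(factory):
    """Build an object via factory and return (object, peak traced bytes)."""
    tracemalloc.start()
    try:
        obj = factory()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return obj, peak


eager, eager_peak = peak_init_memory(lambda: EagerDataset(100_000))
lazy, lazy_peak = peak_init_memory(lambda: LazyDataset(100_000))
print(f"eager __init__ peak: {eager_peak} bytes")
print(f"lazy __init__ peak:  {lazy_peak} bytes")
```

The comparison shows the general idea behind deferring work out of `__init__`: the lazy variant keeps construction cheap and pays the cost only when rows are actually consumed.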