vignesh14052002

@jbrockmendel
Member

Perf impact? This is here for a reason.

@vignesh14052002
Author

This is the commit that introduces the change
6c31cab

I don't understand why this was added, but reverting it solves the issue below:

import numpy as np
from pandas._libs.parsers import sanitize_objects

# Object array mixing an int, an NA marker, and a bool
values = np.array([1, "NA", True], dtype=object)
print("Values before sanitization:", values)
sanitize_objects(values, na_values={"NA"})  # modifies values in place
print("Values after sanitization:", values)

Output:

Values before sanitization: [1 'NA' True]
Values after sanitization: [1 nan 1]

Even though the sanitization itself works fine (NA -> nan), True is converted to 1, and that is caused by the memo.
I am having trouble setting up an environment to run the performance tests.
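
The True -> 1 conversion reported above can be reproduced in plain Python without pandas. The following is a minimal sketch, assuming the memo is an ordinary dict keyed by the value itself; `memoize_in_place` is a hypothetical helper for illustration, not the actual Cython code in `sanitize_objects`. Because `True == 1` and `hash(True) == hash(1)`, the dict treats them as the same key, so `True` is replaced by the earlier `1`.

```python
def memoize_in_place(values):
    """Replace each item with the first equal object seen (hypothetical helper)."""
    memo = {}
    for i, val in enumerate(values):
        # Dict lookup relies on hash() and ==. True and 1 hash alike and
        # compare equal, so True finds the entry that was stored for 1.
        values[i] = memo.setdefault(val, val)
    return values


result = memoize_in_place([1, "x", True, 0, False])
print([type(v).__name__ for v in result])  # ['int', 'str', 'int', 'int', 'int']
```

The same collision affects `False` and `0`, which is why both pairs come up later in the discussion.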

@jbrockmendel
Member

I don't understand why this was added, but reverting solves the below issue

The commit message was "memoize objects when reading from file to reduce memory footprint", so removing it will likely balloon the memory footprint. Instead of removing it, it might be more effective to check for 0, 1, True, and False explicitly and let other values be memoized?
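
One way to read this suggestion is sketched below in plain Python. It assumes a dict-based memo, and `memoize_skip_ambiguous` is a hypothetical name rather than anything in pandas. Values equal to 0 or 1 bypass the memo and keep their original type, while everything else is still deduplicated.

```python
def memoize_skip_ambiguous(values):
    """Memoize items in place, leaving 0, 1, True and False untouched (sketch)."""
    memo = {}
    for i, val in enumerate(values):
        # Numbers equal to 0 or 1 collide with the booleans as dict keys,
        # so skip the memo for them. As a side effect this also skips 0.0 and 1.0.
        if isinstance(val, (bool, int, float)) and val in (0, 1):
            continue
        values[i] = memo.setdefault(val, val)
    return values


first = "".join(["N", "A"])
second = "".join(["N", "A"])
result = memoize_skip_ambiguous([1, first, True, second, 0, False])
print([type(v).__name__ for v in result])  # ['int', 'str', 'bool', 'str', 'int', 'bool']
print(result[1] is result[3])  # True: the repeated string now shares one object
```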

@vignesh14052002
Author

Thanks, now I understand the memory footprint concern. Skipping memoization for just those 4 values might not be a good approach: if the data contains only a mixture of those 4 values, nothing is memoized and memory usage can blow up.

I have included the type of the value in the memo key as well, which solves this.
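
The idea of a type-aware memo key can be illustrated in plain Python. This is only a sketch under the assumption that the key becomes a `(type, value)` tuple; the actual diff is not shown in this conversation, and `memoize_with_type` is a hypothetical name. Since `bool` and `int` are different types, `(bool, True)` and `(int, 1)` are distinct keys, so both kinds of value are still memoized without being merged.

```python
def memoize_with_type(values):
    """Memoize items in place, keyed by (type, value) so True and 1 stay apart (sketch)."""
    memo = {}
    for i, val in enumerate(values):
        # Tuples compare element by element; int != bool, so the keys differ
        # even though 1 == True.
        key = (type(val), val)
        values[i] = memo.setdefault(key, val)
    return values


result = memoize_with_type([1, True, 0, False, True, 1])
print([type(v).__name__ for v in result])
# ['int', 'bool', 'int', 'bool', 'bool', 'int']
```

Compared with skipping 0, 1, True and False entirely, this keeps those values in the memo, which matters when a column consists mostly of them.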

@vignesh14052002 changed the title from "fix : remove memo usage in sanitization function" to "fix : include datatype in memo key in sanitization function" on Sep 3, 2025
Development

Successfully merging this pull request may close these issues.

BUG: Datatypes not preserved on pd.read_excel