Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 5 additions & 2 deletions csv_detective/parsing/columns.py
Original file line number Diff line number Diff line change
Expand Up @@ -166,7 +166,10 @@ def build_remaining_tests_per_col(return_table: pd.DataFrame) -> dict[str, list[
for col in table.columns
}
handle_empty_columns(return_table)
empty_cols = {col for col in table.columns if table[col].dropna().empty}
empty_cols = (
# if NA values should be considered valid, then we can skip empty columns
{col for col in table.columns if table[col].isna().all()} if skipna else {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty_cols is also used to restore tests for columns that are all-NA in the first chunk but have data in later chunks (lines 222-228). Setting it to {} when skipna=False breaks that restoration mechanism — those columns will never be tested even if later batches contain data.

I have a local test I can push on your branch to expose this bug (and avoid it in the futur?)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I see the problem: setting empty_cols to {} here is the same as considering all columns have non-NA values (which could happen), and the rest of the function works the same afterwards. We'll never enter the loop at line 222, and rebuild remaining_tests_per_col at line 250, at the end of the batch

)
remaining_tests_per_col = build_remaining_tests_per_col(return_table)

# hashing rows to get nb_duplicates
Expand Down Expand Up @@ -217,7 +220,7 @@ def build_remaining_tests_per_col(return_table: pd.DataFrame) -> dict[str, list[
fill_value=0,
)
for col in list(empty_cols):
if not batch[col].dropna().empty:
if not batch[col].isna().all():
Comment thread
Pierlou marked this conversation as resolved.
empty_cols.discard(col)
remaining_tests_per_col[col] = [
fmt_label
Expand Down