feat: using charset_normalizer instead of chardet #4
base: augment_full-combined-benchmark_31122025_base_feat_using_charset_normalizer_instead_of_chardet
Conversation
🤖 Augment PR Summary

Summary: This PR replaces chardet with charset_normalizer for file-encoding detection.

Technical Notes: Encoding detection semantics differ slightly between the two libraries (e.g., confidence/coherence scoring and decode fallbacks), so behavior on ambiguous or invalid byte sequences may change.
```diff
-rawdata = f.read(sample_size)
-return cast(list[dict], chardet.detect_all(rawdata))
+def read_and_detect(filename: str):
+    rst = charset_normalizer.from_path(filename)
```
detect_file_encodings() still accepts sample_size, but the new charset_normalizer.from_path() path ignores it, so large files may be fully read despite the docstring claiming sampling to prevent timeouts. This can increase memory/latency and make the timeout behavior less predictable.
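One way to preserve the sampling behavior would be to read only the first `sample_size` bytes and pass them to `charset_normalizer.from_bytes()` instead of `from_path()`. A minimal sketch, assuming this fits the surrounding code (the `read_sample` helper name is hypothetical):

```python
def read_sample(filename: str, sample_size: int = 1024 * 1024) -> bytes:
    """Read at most sample_size bytes so large files are never fully loaded."""
    with open(filename, "rb") as f:
        return f.read(sample_size)

# Detection on the sample only, instead of charset_normalizer.from_path(),
# which always reads the whole file:
#
#   matches = charset_normalizer.from_bytes(read_sample(filename))
#   best = matches.best()  # a CharsetMatch, or None if nothing decodes
```

Detecting from a bounded sample keeps memory and latency predictable, at the cost of possibly missing encoding shifts later in the file.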
```diff
 encoding = "utf-8"
-return file_content.decode(encoding, errors="ignore")
+return file_content.decode(encoding, errors="strict")
```
Switching plain-text decoding to errors="strict" can cause a UnicodeDecodeError on otherwise mostly-decodable content and then fall back to UTF-8, potentially losing the correctly-detected encoding’s output. This seems like a behavior regression compared to the prior errors="ignore" approach.
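To illustrate the difference with a stdlib-only example: bytes that are valid Latin-1 but not valid UTF-8 decode (with data loss) under `errors="ignore"`, while the same input raises under `errors="strict"`:

```python
data = "café".encode("latin-1")  # b'caf\xe9' — not valid UTF-8

# Previous behavior: the undecodable byte is silently dropped.
assert data.decode("utf-8", errors="ignore") == "caf"

# New behavior: the same input raises, which would trigger
# the UTF-8 fallback path described above.
try:
    data.decode("utf-8", errors="strict")
except UnicodeDecodeError as exc:
    print(f"strict decode failed: {exc.reason}")
```

So with `errors="strict"`, content that a correctly-detected non-UTF-8 encoding would have handled can instead be re-decoded under the UTF-8 fallback, changing the output.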
```ts
  handleToggle,
  className,
}: ToggleButtonProps) => {
  const unusedVar = 'This variable is not used'
```
Benchmark PR from qodo-benchmark#72