

@hila-f-qodo

Benchmark PR from qodo-benchmark#72


augmentcode bot commented Dec 31, 2025

🤖 Augment PR Summary

Summary: This PR replaces chardet with charset-normalizer for encoding detection/decoding.

Changes:

  • Update file encoding detection helper to use charset_normalizer.from_path(...).best()
  • Update web URL reader to detect encoding via from_bytes(...).best() before decoding
  • Update document extractor decoding for plain text/JSON/YAML/CSV to use charset-normalizer
  • Swap the dependency in api/pyproject.toml and api/uv.lock, and adjust the unit tests

Technical Notes: Encoding detection semantics differ slightly (e.g., confidence/coherence and decode fallbacks), so behavior on ambiguous/invalid byte sequences may change.
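To make the behavioral caveat concrete, here is a stdlib-only illustration (not code from the PR) of why swapping detectors can change results on ambiguous or invalid byte sequences: the same bytes are valid in one codec, invalid in another, and lossy in a fallback mode.

```python
# Stdlib-only illustration: the same ambiguous byte sequence behaves
# differently depending on the codec chosen, which is why a new detector
# can change behavior on edge-case inputs.
raw = "café".encode("cp1252")  # b'caf\xe9' — valid cp1252, invalid UTF-8

try:
    raw.decode("utf-8")  # strict UTF-8 rejects the lone 0xE9 byte
except UnicodeDecodeError:
    pass

assert raw.decode("cp1252") == "café"                 # correct codec round-trips
assert raw.decode("utf-8", errors="ignore") == "caf"  # lossy fallback drops data
```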



@augmentcode augmentcode bot left a comment


Review completed. 3 suggestions posted.


Comment augment review to trigger a new review at any time.

```diff
-        rawdata = f.read(sample_size)
-        return cast(list[dict], chardet.detect_all(rawdata))
+def read_and_detect(filename: str):
+    rst = charset_normalizer.from_path(filename)
```

detect_file_encodings() still accepts sample_size, but the new charset_normalizer.from_path() path ignores it, so large files may be fully read despite the docstring claiming sampling to prevent timeouts. This can increase memory/latency and make the timeout behavior less predictable.
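One way to keep the documented sampling behavior is to read at most `sample_size` bytes and hand the sample to a byte-level detector instead of letting the detector open the file itself. A minimal sketch (`read_sample` is a hypothetical helper, not code from the PR):

```python
import tempfile

def read_sample(filename: str, sample_size: int = 1024 * 1024) -> bytes:
    """Read at most sample_size bytes so large files are never fully loaded.

    The sample could then be passed to a byte-level detector, e.g.
    charset_normalizer.from_bytes(sample).best(), instead of
    charset_normalizer.from_path(filename), which reads the whole
    file and ignores sample_size.
    """
    with open(filename, "rb") as f:
        return f.read(sample_size)

# Usage: only the first 10 bytes of a 100-byte file are read.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 100)
sample = read_sample(tmp.name, sample_size=10)
assert len(sample) == 10
```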


```diff
 encoding = "utf-8"

-return file_content.decode(encoding, errors="ignore")
+return file_content.decode(encoding, errors="strict")
```

Switching plain-text decoding to errors="strict" can cause a UnicodeDecodeError on otherwise mostly-decodable content and then fall back to UTF-8, potentially losing the correctly-detected encoding’s output. This seems like a behavior regression compared to the prior errors="ignore" approach.
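A middle ground is to retry with the *detected* encoding in a lossy mode before abandoning it for UTF-8. A sketch of that idea (`decode_with_fallback` is a hypothetical helper, not code from the PR):

```python
def decode_with_fallback(file_content: bytes, encoding: str) -> str:
    """Hypothetical helper: try strict decoding first, then fall back to the
    detected encoding with replacement characters, so a single bad byte does
    not discard an otherwise correct detection result."""
    try:
        return file_content.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return file_content.decode(encoding, errors="replace")

assert decode_with_fallback("café".encode("cp1252"), "cp1252") == "café"
assert decode_with_fallback(b"caf\xe9", "utf-8") == "caf\ufffd"
```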


```tsx
  handleToggle,
  className,
}: ToggleButtonProps) => {
  const unusedVar = 'This variable is not used'
```

unusedVar is declared but never used; depending on TS/ESLint settings this can fail CI or break the build with noUnusedLocals/no-unused-vars. If this was for debugging, consider removing it before merge.

