feat: using charset_normalizer instead of chardet #4
base: augment_full-combined-benchmark_31122025_base_feat_using_charset_normalizer_instead_of_chardet
Conversation
🤖 Augment PR Summary

Summary: This PR replaces chardet with charset_normalizer for file-encoding detection.

Technical Notes: Encoding detection semantics differ slightly between the two libraries (e.g., confidence/coherence scoring and decode fallbacks), so behavior on ambiguous or invalid byte sequences may change.
```diff
-rawdata = f.read(sample_size)
-return cast(list[dict], chardet.detect_all(rawdata))
+def read_and_detect(filename: str):
+    rst = charset_normalizer.from_path(filename)
```
detect_file_encodings() still accepts sample_size, but the new charset_normalizer.from_path() path ignores it, so large files may be fully read despite the docstring claiming sampling to prevent timeouts. This can increase memory/latency and make the timeout behavior less predictable.
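One way to preserve the sampling behavior would be to read only the first `sample_size` bytes and pass them to `charset_normalizer.from_bytes()` instead of `from_path()`. A minimal sketch, assuming this fits the surrounding code (the `read_sample` helper name is hypothetical):

```python
def read_sample(filename: str, sample_size: int = 1024 * 1024) -> bytes:
    """Read at most sample_size bytes so large files are never fully loaded."""
    with open(filename, "rb") as f:
        return f.read(sample_size)

# Detection on the sample only, instead of charset_normalizer.from_path(),
# which always reads the whole file:
#
#   matches = charset_normalizer.from_bytes(read_sample(filename))
#   best = matches.best()  # a CharsetMatch, or None if nothing decodes
```

Detecting from a bounded sample keeps memory and latency predictable, at the cost of possibly missing encoding shifts later in the file.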
```diff
 encoding = "utf-8"
-return file_content.decode(encoding, errors="ignore")
+return file_content.decode(encoding, errors="strict")
```
Switching plain-text decoding to errors="strict" can cause a UnicodeDecodeError on otherwise mostly-decodable content and then fall back to UTF-8, potentially losing the correctly-detected encoding’s output. This seems like a behavior regression compared to the prior errors="ignore" approach.
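To illustrate the difference with a stdlib-only example: bytes that are valid Latin-1 but not valid UTF-8 decode (with data loss) under `errors="ignore"`, while the same input raises under `errors="strict"`:

```python
data = "café".encode("latin-1")  # b'caf\xe9' — not valid UTF-8

# Previous behavior: the undecodable byte is silently dropped.
assert data.decode("utf-8", errors="ignore") == "caf"

# New behavior: the same input raises, which would trigger
# the UTF-8 fallback path described above.
try:
    data.decode("utf-8", errors="strict")
except UnicodeDecodeError as exc:
    print(f"strict decode failed: {exc.reason}")
```

So with `errors="strict"`, content that a correctly-detected non-UTF-8 encoding would have handled can instead be re-decoded under the UTF-8 fallback, changing the output.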
```ts
  handleToggle,
  className,
}: ToggleButtonProps) => {
  const unusedVar = 'This variable is not used'
```
Benchmark PR from qodo-benchmark#72