refactor: Improve maybe_is_text for multilingual document support #1179

takeruhukushima · 2025-11-04T10:11:01Z

I've made the changes carefully, but please let me know if you encounter any issues.

Note

Refactors maybe_is_text to compute entropy via character frequency counts (ignoring spaces) for better multilingual support, and adds a Japanese .docx case to office document parsing tests.

Utils:
- Refactor maybe_is_text to use Counter-based per-character frequencies (ignoring spaces) with early empty check; removes reliance on string.printable.
Tests:
- Extend test_parse_office_doc to include dummy_jap.docx for multilingual coverage.
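
For readers following the summary above, here is a minimal sketch of what a Counter-based per-character entropy with an early empty check might look like. It is not the code from this PR: the function name `char_entropy` and the choice to drop only the ASCII space character are assumptions made for illustration.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character, ignoring spaces.

    Hypothetical helper for illustration; not the paperqa implementation.
    """
    # Assumption: "ignoring spaces" means dropping only U+0020.
    chars = [c for c in text if c != " "]
    if not chars:  # early empty check avoids dividing by zero
        return 0.0
    total = len(chars)
    counts = Counter(chars)  # per-character frequency counts
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    print(char_entropy("ab ab"))   # two equally likely characters
    print(char_entropy("検索拡張生成"))  # works for any Unicode characters
```

Because the counts are taken over whatever characters appear in the string, nothing in this sketch depends on an ASCII-only table such as string.printable.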

Written by Cursor Bugbot for commit 3337b44.

dosubot · 2025-11-04T10:11:15Z

Related Documentation

Checked 1 published document(s) in 1 knowledge base(s). No updates required.

takeruhukushima · 2025-11-04T10:11:56Z

The tests pass.

[success] 42.01% tests/test_paperqa.py::test_parse_office_doc[dummy.xlsx-What is the price of a laptop?]: 19.1426s
[success] 20.97% tests/test_paperqa.py::test_parse_office_doc[dummy_jap.docx-What is the RAG system?]: 9.5578s
[success] 20.05% tests/test_paperqa.py::test_parse_office_doc[dummy.docx-What is the RAG system?]: 9.1372s
[success] 16.97% tests/test_paperqa.py::test_parse_office_doc[dummy.pptx-What is the RAG system?]: 7.7340s

Results (47.95s):
4 passed

src/paperqa/utils.py

takeruhukushima · 2025-11-04T23:34:16Z

Entropy is now computed for bad_text as well. When I inspected the entropy values, bad_text's value was quite high, so I now require the entropy to be 8.0 or below. Please let me know if there are any issues.

DEBUG: Entropy for string (first 50 chars): This is a test. The sample conc. was 1.0 mM (at 24... is 4.405932127238674
DEBUG: Entropy for string (first 50 chars): \C0\C0\B1... is 2.4464393446710155
DEBUG: Entropy for string (first 50 chars): <!DOCTYPE html>
<html class="client-nojs vector-fe... is 5.260502001738697
DEBUG: Entropy for string (first 50 chars): <!DOCTYPE html>
<html class="client-nojs vector-fe... is 6.078409996564037
DEBUG: Entropy for string (first 50 chars):
  ℼ佄呃偙⁅瑨汭ਾ格浴⁬汣獡㵳挢楬湥⵴潮獪瘠捥潴⵲敦瑡牵ⵥ慬杮慵敧椭⵮敨摡牥攭慮汢摥瘠捥潴⵲敦瑡牵ⵥ慬... is 9.00672178581434
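
To make the effect of the 8.0 upper bound concrete, the sketch below applies an entropy band to the values printed in the debug output above. Only the upper bound of 8.0 comes from this thread. The function name `within_entropy_band`, the lower bound of 2.5, and the labels in the dictionary are assumptions chosen for illustration, not the PR's code.

```python
def within_entropy_band(entropy: float, lower: float = 2.5, upper: float = 8.0) -> bool:
    """Return True when the entropy lies in the band treated as plausible text.

    `upper` follows the "8.0 or below" rule mentioned in this thread;
    `lower` is a hypothetical value used only for this illustration.
    """
    return lower < entropy <= upper


# Entropy values copied from the debug output above; the keys are informal labels.
observed = {
    "english_sentence": 4.405932127238674,
    "escaped_bytes": 2.4464393446710155,
    "html_page_1": 5.260502001738697,
    "html_page_2": 6.078409996564037,
    "mis_decoded_html": 9.00672178581434,
}

for label, value in observed.items():
    print(f"{label}: {within_entropy_band(value)}")
```

Under these assumed bounds, the English sentence and both HTML pages fall inside the band, while the escaped-byte string and the mis-decoded HTML fall outside it.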

src/paperqa/utils.py

takeruhukushima · 2025-11-05T01:56:48Z

Japanese and Chinese use thousands of distinct kanji characters, so realistically, couldn't their entropy exceed 8?
Character frequency seems to be a factor too, but I'm not sure whether this is mathematically sound.

If there are any issues, I will consider making further revisions.
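
As background for the question above: the Shannon entropy of a string with K distinct characters is at most log2(K), and that maximum is reached only when every character occurs equally often. Exceeding 8 bits therefore requires more than 256 distinct characters with fairly even frequencies. The sketch below checks that bound on synthetic strings. The helper names are hypothetical, and the uniform samples are artificial, so this says nothing about where real Japanese or Chinese documents land.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character over all characters in text."""
    total = len(text)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def uniform_cjk_sample(k: int) -> str:
    """Hypothetical helper: k distinct code points starting at U+4E00, each used once."""
    return "".join(chr(0x4E00 + i) for i in range(k))


if __name__ == "__main__":
    for k in (64, 256, 1024):
        sample = uniform_cjk_sample(k)
        # For a uniform sample, the measured entropy matches the log2(k) upper bound.
        print(k, shannon_entropy(sample), math.log2(k))

    # Skewed frequencies pull entropy below the bound: same 256 distinct
    # characters, but one of them repeated many extra times.
    skewed = uniform_cjk_sample(256) + chr(0x4E00) * 1000
    print(shannon_entropy(skewed) < math.log2(256))
```

Natural-language text has uneven character frequencies, so its entropy sits below log2(K). How far below is an empirical question that the test cases in this PR probe.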

takeruhukushima · 2025-11-05T01:58:04Z

The test passes.

[success] 100.00% tests/test_paperqa.py::test_maybe_is_text: 1.0962s

Results (4.21s):
1 passed

jamesbraza

Nice work on this 🙏

takeruhukushima added 2 commits November 4, 2025 19:06

edit test_paperqa.py and add dummy_jap.docx

39b8b17

feat: Improve maybe_is_text for multilingual document support

3337b44

dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Nov 4, 2025

dosubot bot added the enhancement New feature or request label Nov 4, 2025

takeruhukushima changed the title ~~feat: Improve maybe_is_text for multilingual document support~~ refactor: Improve maybe_is_text for multilingual document support Nov 4, 2025

jamesbraza reviewed Nov 4, 2025

src/paperqa/utils.py (resolved)

takeruhukushima and others added 2 commits November 5, 2025 08:30

feat: Enhance maybe_is_text for multilingual support and improve robustness

b5a99f9

[pre-commit.ci lite] apply automatic fixes

2151683

takeruhukushima added 2 commits November 5, 2025 08:38

fix: Improve maybe_is_text robustness and resolve linting issues

17943ec

Merge branch 'multilingual_is_text' of https://github.com/takeruhukushima/paper-qa into multilingual_is_text

b8f6d3f

dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Nov 4, 2025

jamesbraza reviewed Nov 5, 2025

src/paperqa/utils.py (resolved)

takeruhukushima added 2 commits November 5, 2025 10:52

feat: Add high-entropy test cases for maybe_is_text

186a09a

delete #

4cb08cd

jamesbraza reviewed Nov 5, 2025

tests/test_paperqa.py (outdated, resolved)

tests/test_paperqa.py (resolved)

takeruhukushima added 2 commits November 5, 2025 13:49

refactor: Move random and string imports to module level

a4ff499

fix(tests): Update expected counts and imports for new test cases

b5c1220

jamesbraza reviewed Nov 5, 2025

src/paperqa/utils.py (resolved)

tests/test_paperqa.py (resolved)

tests/test_paperqa.py (outdated, resolved)

takeruhukushima and others added 4 commits November 6, 2025 07:25

Refactor: Refresh test_maybe_is_text HTTP cassette

ab015bd

Refactor: Minimize comments and refresh HTTP cassette

909d59b

Restoring comments

917b5f1

Further adjusting expected counts in test_get_directory_index

f4d81b5

jamesbraza approved these changes Nov 6, 2025

dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 6, 2025

mskarlin approved these changes Nov 6, 2025

jamesbraza merged commit 1b57725 into Future-House:main Nov 6, 2025
3 of 7 checks passed
