
Conversation

@takeruhukushima
Contributor

@takeruhukushima takeruhukushima commented Nov 4, 2025

I've made the changes carefully, but please let me know if you encounter any issues.


Note

Refactors maybe_is_text to compute entropy via character frequency counts (ignoring spaces) for better multilingual support, and adds a Japanese .docx case to office document parsing tests.

  • Utils:
    • Refactor maybe_is_text to use Counter-based per-character frequencies (ignoring spaces) with early empty check; removes reliance on string.printable.
  • Tests:
    • Extend test_parse_office_doc to include dummy_jap.docx for multilingual coverage.

Written by Cursor Bugbot for commit 3337b44.
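A minimal sketch of the Counter-based approach summarized above, not the merged implementation. The names char_entropy and looks_like_text and the lower bound of 2.5 are assumptions made for illustration; the 8.0 upper bound is the value discussed later in this conversation.

```python
import math
from collections import Counter


def char_entropy(s: str) -> float:
    """Shannon entropy in bits per character, ignoring spaces."""
    chars = s.replace(" ", "")
    if not chars:
        return 0.0
    total = len(chars)
    counts = Counter(chars)  # per-character frequency counts
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_text(s: str, lower: float = 2.5, upper: float = 8.0) -> bool:
    """Hypothetical check: entropy must fall between two bounds.

    `lower` is an assumed value; `upper` follows the 8.0 bound from the PR.
    """
    if not s:  # early empty check
        return False
    return lower < char_entropy(s) <= upper


print(looks_like_text("The quick brown fox jumps over the lazy dog"))
print(looks_like_text("aaaaaaaaaaaaaaaa"))  # a single repeated symbol has zero entropy
```

Because the calculation counts whatever characters appear in the string, it makes no assumption about the script, which is the reason for dropping string.printable.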

@dosubot dosubot bot added the size:S label (This PR changes 10-29 lines, ignoring generated files.) Nov 4, 2025
@dosubot

dosubot bot commented Nov 4, 2025

Related Documentation

Checked 1 published document(s) in 1 knowledge base(s). No updates required.


@takeruhukushima
Contributor Author

The tests pass:

[success] 42.01% tests/test_paperqa.py::test_parse_office_doc[dummy.xlsx-What is the price of a laptop?]: 19.1426s
[success] 20.97% tests/test_paperqa.py::test_parse_office_doc[dummy_jap.docx-What is the RAG system?]: 9.5578s
[success] 20.05% tests/test_paperqa.py::test_parse_office_doc[dummy.docx-What is the RAG system?]: 9.1372s
[success] 16.97% tests/test_paperqa.py::test_parse_office_doc[dummy.pptx-What is the RAG system?]: 7.7340s

Results (47.95s):
4 passed

@dosubot dosubot bot added the enhancement label (New feature or request) Nov 4, 2025
@takeruhukushima takeruhukushima changed the title from "feat: Improve maybe_is_text for multilingual document support" to "refactor: Improve maybe_is_text for multilingual document support" Nov 4, 2025
@takeruhukushima
Contributor Author

The entropy is now calculated for bad_text as well. When I printed the entropy values, bad_text came out quite high (about 9.0), so I added an upper bound: a string is treated as text only if its entropy is 8.0 or below. Please let me know if there are any issues.

DEBUG: Entropy for string (first 50 chars): This is a test. The sample conc. was 1.0 mM (at 24... is 4.405932127238674
DEBUG: Entropy for string (first 50 chars): \C0\C0\B1... is 2.4464393446710155
DEBUG: Entropy for string (first 50 chars): <!DOCTYPE html>
<html class="client-nojs vector-fe... is 5.260502001738697
DEBUG: Entropy for string (first 50 chars): <!DOCTYPE html>
<html class="client-nojs vector-fe... is 6.078409996564037
DEBUG: Entropy for string (first 50 chars):
  ℼ佄呃偙⁅瑨汭ਾ格浴汣獡㵳挢楬湥⵴潮獪瘠捥潴⵲敦瑡牵ⵥ慬杮慵敧椭⵮敨摡牥攭慮汢摥瘠捥潴⵲敦瑡牵ⵥ慬... is 9.00672178581434

@dosubot dosubot bot added the size:M label (This PR changes 30-99 lines, ignoring generated files.) and removed the size:S label (This PR changes 10-29 lines, ignoring generated files.) Nov 4, 2025
@takeruhukushima
Contributor Author

Japanese and Chinese use thousands of distinct kanji characters, so couldn't the entropy of legitimate text realistically exceed 8?
Character frequency also seems to be a factor, so I am not sure whether the 8.0 bound is mathematically sound.

If there are any issues, I will consider making further revisions.
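On the mathematical side, Shannon entropy over k distinct characters is at most log2(k), and it reaches that value only when every character is equally frequent. A value above 8 therefore requires more than 256 distinct characters used close to uniformly. The sketch below illustrates the bound with a hypothetical helper, char_entropy, and synthetic strings built from the CJK Unified Ideographs block; these strings are assumptions for illustration, not real Japanese or Chinese text, so the sketch does not measure real documents.

```python
import math
from collections import Counter


def char_entropy(s: str) -> float:
    """Shannon entropy in bits per character, ignoring spaces."""
    chars = s.replace(" ", "")
    if not chars:
        return 0.0
    total = len(chars)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(chars).values()
    )


# Synthetic strings: each distinct ideograph appears exactly once (uniform).
uniform_256 = "".join(chr(0x4E00 + i) for i in range(256))
uniform_512 = "".join(chr(0x4E00 + i) for i in range(512))

# Same 512 distinct characters, but one of them dominates the frequency counts.
skewed = uniform_512 + chr(0x4E00) * 2048

print(char_entropy(uniform_256))  # 8.0, the maximum for 256 distinct characters
print(char_entropy(uniform_512))  # 9.0, the maximum for 512 distinct characters
print(char_entropy(skewed))       # well below 8 despite 512 distinct characters
```

By the same bound, the value of about 9.007 reported for bad_text implies more than 514 distinct non-space characters in that string. Whether long, legitimate kanji-heavy documents stay under 8.0 depends on how skewed their character frequencies are, which would need to be checked on real samples.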

@takeruhukushima
Contributor Author

The test passes:

[success] 100.00% tests/test_paperqa.py::test_maybe_is_text: 1.0962s

Results (4.21s):
       1 passed

Collaborator

@jamesbraza jamesbraza left a comment


Nice work on this 🙏

@dosubot dosubot bot added the lgtm label (This PR has been approved by a maintainer) Nov 6, 2025
@jamesbraza jamesbraza merged commit 1b57725 into Future-House:main Nov 6, 2025
3 of 7 checks passed
