refactor: Improve maybe_is_text for multilingual document support #1179
|
The test passes.
[success] 42.01% tests/test_paperqa.py::test_parse_office_doc[dummy.xlsx-What is the price of a laptop?]: 19.1426s
Results (47.95s): |
|
Entropy is now computed for bad_text as well. When I printed the values, bad_text's entropy was quite high, so I now require the entropy to be 8.0 or below. Please let me know if there are any issues.
DEBUG: Entropy for string (first 50 chars): This is a test. The sample conc. was 1.0 mM (at 24... is 4.405932127238674
DEBUG: Entropy for string (first 50 chars): \C0\C0\B1... is 2.4464393446710155
DEBUG: Entropy for string (first 50 chars): <!DOCTYPE html>
<html class="client-nojs vector-fe... is 5.260502001738697
DEBUG: Entropy for string (first 50 chars): <!DOCTYPE html>
<html class="client-nojs vector-fe... is 6.078409996564037
DEBUG: Entropy for string (first 50 chars):
ℼ佄呃偙⁅瑨汭ਾ格浴汣獡㵳挢楬湥潮獪瘠捥潴敦瑡牵ⵥ慬杮慵敧椭敨摡牥攭慮汢摥瘠捥潴敦瑡牵ⵥ慬... is 9.00672178581434 |
|
Japanese and Chinese have thousands of kanji characters, but realistically, wouldn't their entropy exceed 8? If there are any issues, I will consider making further revisions. |
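To make the question about the 8.0 limit concrete, here is a rough sketch (not the code in this PR) of how per-character entropy relates to the number of distinct characters. The helper names `char_entropy` and `uniform_cjk_sample` are made up for illustration. Entropy can never exceed log2 of the number of distinct characters, so a string needs more than 256 distinct characters, used fairly evenly, to go above 8.0. Whether real Japanese or Chinese documents reach that is an empirical question this sketch does not answer.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character, ignoring spaces (hypothetical helper)."""
    chars = [c for c in text if c != " "]
    if not chars:
        return 0.0
    total = len(chars)
    counts = Counter(chars)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def uniform_cjk_sample(n_distinct: int) -> str:
    """Synthetic string with n_distinct different CJK ideographs, each used once."""
    return "".join(chr(0x4E00 + i) for i in range(n_distinct))


if __name__ == "__main__":
    for n in (64, 256, 512):
        # For a uniform sample the entropy equals the upper bound log2(n).
        print(n, char_entropy(uniform_cjk_sample(n)), math.log2(n))
```

A uniform sample is the worst case. Repeating any character makes the distribution less even and pulls the entropy below the log2 bound.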
|
The test passes.
[success] 100.00% tests/test_paperqa.py::test_maybe_is_text: 1.0962s
Results (4.21s):
1 passed |
Nice work on this 🙏
I've made the changes carefully, but please let me know if you encounter any issues.
Note
Refactors maybe_is_text to compute entropy via character frequency counts (ignoring spaces) for better multilingual support, and adds a Japanese .docx case to office document parsing tests.
- Updates maybe_is_text to use Counter-based per-character frequencies (ignoring spaces) with an early empty check; removes reliance on string.printable.
- Extends test_parse_office_doc to include dummy_jap.docx for multilingual coverage.
Written by Cursor Bugbot for commit 3337b44.
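For readers skimming the thread, the approach described in the note could look roughly like the sketch below. This is an illustration under assumptions, not the code merged in this PR: the function names `char_entropy` and `looks_like_text` are hypothetical, the lower threshold of 2.5 is assumed, and only the 8.0 upper limit comes from the discussion above.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character, counting every character except spaces."""
    chars = [c for c in text if c != " "]
    if not chars:
        return 0.0
    total = len(chars)
    counts = Counter(chars)  # works for any script, unlike iterating string.printable
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_text(text: str, lower: float = 2.5, upper: float = 8.0) -> bool:
    """Hypothetical check: entropy must fall in (lower, upper].

    lower is an assumed default; upper mirrors the 8.0 limit from this thread.
    """
    if not text:  # early empty check
        return False
    return lower < char_entropy(text) <= upper


if __name__ == "__main__":
    print(looks_like_text("The quick brown fox jumps over the lazy dog"))
    print(looks_like_text("aaaa"))
```

Counting whatever characters actually occur, instead of looping over a fixed ASCII alphabet, is what lets non-Latin text contribute to the entropy at all.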