-
Notifications
You must be signed in to change notification settings - Fork 100
Lots of new country and corporate type country coverage from the 'disco' fork #96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Starts out #93 |
|
So @psolin what do you think? This should be backwards compatible but add a lot of regional coverage. |
psolin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be beneficial to sort alphabetically moving forward so we can see if there are any other inconsistencies. Please edit now and I'll review again. Thanks!
|
@psolin I made the changes you requested except... you want me to sort everything alphabetically? Let me see if Claude Code can do that... |
Great, yes, it should be able to do that, but it'll probably take a lot of tokens. |
- Fixed comments to match PEP8 standard. - classify.py: Changed variable "matches" to matched to better fit outside function. - clean.py: Removed dead import "from collections import OrderedDict". - clean.py: Redundant character escape '\.' in RegExp - All tests passed.
- Added comprehensive test suite with 22 tests organized into 3 categories - All tests follow consistent established format with test data dictionaries and loop-based assertions - Documented known issues: exclamation point removal and empty string results
Okay, I think we're good to go for this PR! The unit tests all pass and things are alphabetized. Once this is in, I'll work on some of the performance improvements. I don't suppose you can take a look and see what you think you'd like to pull over? I didn't make the changes, @marekmodry did but they made
I can make the changes to the rest of the code, but @psolin I'm just wondering if you aren't the better person to do them? :) |
|
@rjurney given it appears you're more recently familiar with the performance issues & the improvements, it would be great if you could submit those as PR(s). Using LRU cache seems a nice improvement, although I never went as far as that. Did you actually profile the code to find the hotspots or how did you choose what to improve? IIRC there were some obvious things, too, the last I worked on it long time ago. |
|
The test job failure (see below) is caused by an assertion error in tests/test_cleanname.py at line 478, specifically in a test checking unicode non-Latin script handling. The log indicates: AssertionError: unicode non-Latin script test for Greek alphabet failed assert '' == 'Εταιρεία Περ...μένης Ευθύνης' The greek text means "Limited Liability Company" — the Greek legal form equivalent to an LLC/Ltd. Transliteration: Etaireía Periorisménis Efthýnis, or, as often written, Etaireia Periorismenis Efthynis. The |
|
Okay, I'll take a look. |
We found the slow spots and improved them, although I've no idea how to fix the UTF issue... |
|
I think that the testing may be off. The test should preserve the entire Greek string "Εταιρεία Περιορισμένης Ευθύνης", but since it is a term now, it removes it, creating a blank string and failing the test. I think that the test needs to read something like: |
Switched 'Εταιρεία Περιορισμένης Ευθύνης' to 'Ελληνική Επιχείρηση'.
|
Sounds good, thanks. |
Any update? I'd like to get this integrated. |
|
@psolin I am basing this off my other PR, to see if it fixes it. Can you approve my runs plz? Of course at the moment there is duplicate code, but merging the other one first would fix that. I'm not sure how else to do it. |
|
Okay, I think this is ready to go? cc @psolin |
|
@petri I think it needs to run through the test? I am not sure how to do this. |
|
Please approve the workflows so the tests can run. Or make me a contributor. |
|
Ping :) |
|
@psolin the tests pass, can you please merge? If you will review and approve in a timely manner I will work on the library. Otherwise can you give me permission as Contributor? |
If you will approve this PR and merge it, I will move onto the performance enhancements and Chinese company name support. |
|
God bless you every one, sirs! |
I added a bunch of country and company types from the
discofork at https://github.com/rjurney/disco