Conversation

@rjurney
Contributor

@rjurney rjurney commented Nov 12, 2025

I added a bunch of country and company types from the disco fork at https://github.com/rjurney/disco

@rjurney
Contributor Author

rjurney commented Nov 12, 2025

Starts out #93

@rjurney
Contributor Author

rjurney commented Nov 13, 2025

So @psolin what do you think? This should be backwards compatible but add a lot of regional coverage.

Owner

@psolin psolin left a comment

I think it would be beneficial to sort alphabetically moving forward so we can see if there are any other inconsistencies. Please edit now and I'll review again. Thanks!
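To illustrate the alphabetical-sort suggestion, here is a minimal sketch that sorts a term table and reports duplicated entries, which is one kind of inconsistency sorting tends to expose. The `terms_by_type` data and the `sort_and_report` helper are hypothetical and are not taken from the cleanco source.

```python
from collections import Counter

# Hypothetical excerpt of a term table; the real cleanco data is much larger.
terms_by_type = {
    "Limited": ["ltd.", "ltd", "limited", "ltd"],
    "Corporation": ["inc.", "corp.", "inc", "corporation"],
}


def sort_and_report(table):
    """Return a copy with sorted keys and de-duplicated, sorted values,
    plus a mapping of any terms that appeared more than once."""
    sorted_table = {
        key: sorted(set(values), key=str.casefold)
        for key, values in sorted(table.items())
    }
    duplicates = {}
    for key, values in table.items():
        repeated = sorted(t for t, n in Counter(values).items() if n > 1)
        if repeated:
            duplicates[key] = repeated
    return sorted_table, duplicates


sorted_table, duplicates = sort_and_report(terms_by_type)
print(sorted_table)
print(duplicates)
```

A deterministic script like this can be rerun whenever terms are added, so later diffs stay small and easy to review.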

@rjurney
Contributor Author

rjurney commented Nov 20, 2025

@psolin I made the changes you requested except... you want me to sort everything alphabetically? Let me see if Claude Code can do that...

@psolin
Owner

psolin commented Nov 20, 2025

> @psolin I made the changes you requested except... you want me to sort everything alphabetically? Let me see if Claude Code can do that...

Great, yes, it should be able to do that, but it'll probably take a lot of tokens.

psolin and others added 3 commits November 20, 2025 19:00
- Fixed comments to match PEP8 standard.
- classify.py: Changed variable "matches" to matched to better fit outside function.
- clean.py: Removed dead import "from collections import OrderedDict".
- clean.py: Redundant character escape '\.' in RegExp
- All tests passed.
  - Added comprehensive test suite with 22 tests organized into 3 categories
  - All tests follow consistent established format with test data dictionaries and loop-based assertions
  - Documented known issues: exclamation point removal and empty string results
@rjurney
Contributor Author

rjurney commented Nov 21, 2025

> > @psolin I made the changes you requested except... you want me to sort everything alphabetically? Let me see if Claude Code can do that...
>
> Great, yes, it should be able to do that, but it'll probably take a lot of tokens.

Okay, I think we're good to go for this PR! The unit tests all pass and things are alphabetized. Once this is in, I'll work on some of the performance improvements. I don't suppose you can take a look and see what you think you'd like to pull over?

I didn't make the changes, @marekmodry did, but they made cleanco (renamed disco) much faster. See for example:

I can make the changes to the rest of the code, but @psolin I'm just wondering if you aren't the better person to do them? :)

@petri
Collaborator

petri commented Nov 21, 2025

@rjurney given that you appear to be more recently familiar with the performance issues and the improvements, it would be great if you could submit those as PR(s). Using an LRU cache seems a nice improvement, although I never went as far as that. Did you actually profile the code to find the hotspots, or how did you choose what to improve? IIRC there were some obvious things, too, the last time I worked on it, a long time ago.
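For readers unfamiliar with the LRU-cache idea mentioned here, the following is a minimal sketch of how `functools.lru_cache` can memoize both a compiled regex and per-name results. The `SUFFIXES` tuple and the `suffix_pattern` and `strip_suffix` functions are hypothetical names for illustration only. This is not the change made in the disco fork, whose details are not shown in this thread.

```python
import re
from functools import lru_cache

# Hypothetical suffix list; not cleanco's actual term data.
SUFFIXES = ("inc", "ltd", "llc", "corp")


@lru_cache(maxsize=None)
def suffix_pattern(suffixes):
    """Compile the suffix regex once per distinct tuple of suffixes."""
    longest_first = sorted(suffixes, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in longest_first)
    return re.compile(r"[\s,]+(?:" + alternation + r")\.?$", re.IGNORECASE)


@lru_cache(maxsize=10_000)
def strip_suffix(name):
    """Remove one trailing legal suffix; repeated names are served from cache."""
    return suffix_pattern(SUFFIXES).sub("", name.strip())


names = ["Acme Inc.", "Widgets, LLC", "Acme Inc."]
cleaned = [strip_suffix(n) for n in names]
info = strip_suffix.cache_info()
print(cleaned, info)
```

Caching like this only pays off when the same names recur in the input. The standard-library `cProfile` module is one way to check where time is actually spent before deciding what to cache.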

@petri
Collaborator

petri commented Nov 21, 2025

The test job failure (see below) is caused by an assertion error in tests/test_cleanname.py at line 478, specifically in a test checking unicode non-Latin script handling. The log indicates:

AssertionError: unicode non-Latin script test for Greek alphabet failed assert '' == 'Εταιρεία Περ...μένης Ευθύνης'

The Greek text means "Limited Liability Company", the Greek legal form equivalent to an LLC/Ltd.

Transliteration: Etaireía Periorisménis Efthýnis, or, as often written, Etaireia Periorismenis Efthynis.
Abbreviation: Ε.Π.Ε. (E.P.E.)

The basename function is returning an empty string instead of the expected Greek text. The other tests pass, so Greek seems to be the key issue.

@rjurney
Contributor Author

rjurney commented Nov 21, 2025

Okay, I'll take a look.

@rjurney
Contributor Author

rjurney commented Nov 22, 2025

> @rjurney given that you appear to be more recently familiar with the performance issues and the improvements, it would be great if you could submit those as PR(s). Using an LRU cache seems a nice improvement, although I never went as far as that. Did you actually profile the code to find the hotspots, or how did you choose what to improve? IIRC there were some obvious things, too, the last time I worked on it, a long time ago.

We found the slow spots and improved them, although I've no idea how to fix the UTF issue...

@psolin
Owner

psolin commented Nov 22, 2025

I think that the test itself may be off. It expects the entire Greek string "Εταιρεία Περιορισμένης Ευθύνης" to be preserved, but since that string is a term now, it gets removed, leaving a blank string and failing the test. I think that the test needs to read something like: "Greek alphabet": ('Ελληνική Επιχείρηση', 'Ελληνική Επιχείρηση'), since "Εταιρεία Περιορισμένης Ευθύνης" is literally a term now and not just a generic name for a business. I can update that now, and then this should pass.
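The failure mode described here can be reproduced with a toy example: when the whole input is itself a legal term, stripping the term leaves nothing. The `toy_basename` function and `LEGAL_TERMS` list below are hypothetical, heavily simplified stand-ins and are not cleanco's actual `basename` implementation. The test data follows the dictionary-plus-loop format mentioned in the commit notes.

```python
# Hypothetical, simplified stand-in for cleanco's basename();
# the real function and term list differ.
LEGAL_TERMS = ["Εταιρεία Περιορισμένης Ευθύνης", "Ε.Π.Ε.", "Ltd"]


def toy_basename(name):
    """Strip one trailing legal term from a name, if present."""
    stripped = name.strip()
    for term in LEGAL_TERMS:
        if stripped.endswith(term):
            return stripped[: len(stripped) - len(term)].rstrip(" ,")
    return stripped


# label: (input, expected output)
unicode_tests = {
    # The input is only a legal term, so nothing is left after stripping.
    "term only": ("Εταιρεία Περιορισμένης Ευθύνης", ""),
    # A name that is not a legal term passes through unchanged.
    "Greek alphabet": ("Ελληνική Επιχείρηση", "Ελληνική Επιχείρηση"),
    # A name followed by the abbreviation keeps only the name.
    "name plus term": ("Ελληνική Επιχείρηση Ε.Π.Ε.", "Ελληνική Επιχείρηση"),
}

for label, (raw, expected) in unicode_tests.items():
    assert toy_basename(raw) == expected, f"{label} failed"
```

Under these assumptions the original test input behaves like the "term only" case, which is why swapping in a name that is not a registered term makes the expectation valid again.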

Switched 'Εταιρεία Περιορισμένης Ευθύνης' to 'Ελληνική Επιχείρηση'.
@rjurney
Contributor Author

rjurney commented Nov 23, 2025

Sounds good, thanks.

@rjurney
Contributor Author

rjurney commented Dec 6, 2025

> I think that the test itself may be off. It expects the entire Greek string "Εταιρεία Περιορισμένης Ευθύνης" to be preserved, but since that string is a term now, it gets removed, leaving a blank string and failing the test. I think that the test needs to read something like: "Greek alphabet": ('Ελληνική Επιχείρηση', 'Ελληνική Επιχείρηση'), since "Εταιρεία Περιορισμένης Ευθύνης" is literally a term now and not just a generic name for a business. I can update that now, and then this should pass.

Any update? I'd like to get this integrated.

@rjurney
Contributor Author

rjurney commented Dec 6, 2025

@psolin I am basing this off my other PR, to see if it fixes it. Can you approve my runs plz? Of course at the moment there is duplicate code, but merging the other one first would fix that. I'm not sure how else to do it.

@rjurney rjurney requested a review from psolin December 6, 2025 19:13
@rjurney
Contributor Author

rjurney commented Dec 6, 2025

Okay, I think this is ready to go? cc @psolin

@psolin
Owner

psolin commented Dec 7, 2025

@petri I think it needs to run through the test? I am not sure how to do this.

@rjurney
Contributor Author

rjurney commented Dec 7, 2025

Please approve the workflows so the tests can run. Or make me a contributor.

@rjurney
Contributor Author

rjurney commented Dec 14, 2025

Ping :)

@rjurney
Contributor Author

rjurney commented Dec 15, 2025

@psolin the tests pass, can you please merge? If you will review and approve in a timely manner I will work on the library. Otherwise can you give me permission as Contributor?

@rjurney
Contributor Author

rjurney commented Dec 15, 2025

> @rjurney given that you appear to be more recently familiar with the performance issues and the improvements, it would be great if you could submit those as PR(s). Using an LRU cache seems a nice improvement, although I never went as far as that. Did you actually profile the code to find the hotspots, or how did you choose what to improve? IIRC there were some obvious things, too, the last time I worked on it, a long time ago.

If you will approve this PR and merge it, I will move on to the performance enhancements and Chinese company name support.

@psolin psolin merged commit cb70b70 into psolin:master Dec 15, 2025
7 checks passed
@rjurney
Contributor Author

rjurney commented Dec 15, 2025

God bless you every one, sirs!
