SOLR-18208: Replace abandoned langdetect dependency with maintained fork #4326

Merged
janhoy merged 7 commits into apache:main from janhoy:feature/SOLR-18208-replace-langdetect-dependency
Apr 28, 2026

janhoy commented Apr 23, 2026

Replace com.cybozu.labs:langdetect (abandoned since 2012) with io.github.azagniotov:language-detection:12.5.2, a maintained fork with an active release history.

The new library bundles its own language profiles, so the 53 profile files previously shipped in the langid module resources are removed. The factory no longer loads profiles at startup; it creates a shared LanguageDetectionOrchestrator instead. The processor converts the field-content Reader to a String and calls orchestrator.detectAll().
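
As a rough illustration of the wiring described above (not the code merged in this PR), the sketch below shows a factory handing one shared detector to every processor it creates, and a processor draining a Reader into a String before detection. `TextDetector`, `Factory`, `Processor`, and `readFully` are hypothetical names introduced for this example. The real `LanguageDetectionOrchestrator` API is not reproduced here, so the `detectAll` signature and return type shown are assumptions.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.List;

/** Illustrative sketch only; all names here are hypothetical. */
final class LangIdSketch {

  /** Hypothetical stand-in for the library's detector; the signature is assumed. */
  interface TextDetector {
    List<String> detectAll(String text);
  }

  /** Drains the Reader into a String. The caller keeps ownership of the Reader. */
  static String readFully(Reader reader) {
    try {
      StringWriter out = new StringWriter();
      reader.transferTo(out); // available since Java 10
      return out.toString();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Per-request processor: holds a reference to the shared detector. */
  static final class Processor {
    private final TextDetector detector;

    Processor(TextDetector detector) {
      this.detector = detector;
    }

    List<String> detect(Reader content) {
      return detector.detectAll(readFully(content));
    }
  }

  /** Factory: builds the detector once and reuses it for every processor. */
  static final class Factory {
    private final TextDetector shared;

    Factory(TextDetector shared) {
      this.shared = shared;
    }

    Processor newProcessor() {
      return new Processor(shared);
    }
  }
}
```

Sharing one detector instance is only safe if the detector is thread-safe; that property is assumed here and should be checked against the library's documentation.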

commons-io was only used for profile loading and is also removed from the langid module dependencies.

Some tests needed reworking to pass because the two libraries behave differently, and because the new library supports more languages, which introduces some ambiguity.

https://issues.apache.org/jira/browse/SOLR-18208

Implemented entirely by Claude Code

Copilot AI left a comment

Pull request overview

This PR migrates Solr’s langid module from the abandoned com.cybozu.labs:langdetect to the maintained fork io.github.azagniotov:language-detection (SOLR-18208), removing the legacy bundled profile resources and updating processor wiring, dependencies, tests, and license metadata accordingly.

Changes:

  • Replace com.cybozu.labs:langdetect usage with io.github.azagniotov:language-detection and update the update-processor factory/processor integration.
  • Remove shipped langdetect-profiles/* resources and the related RAT exclusion.
  • Update Gradle dependency catalogs/lockfiles and add the new dependency’s LICENSE/NOTICE/SHA1 files; adjust tests for behavior differences.

Reviewed changes

Copilot reviewed 45 out of 69 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| solr/modules/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessorFactory.java | Builds and supplies a shared LanguageDetectionOrchestrator to processor instances; removes old profile-loading code. |
| solr/modules/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java | Switches detection to orchestrator.detectAll() and maps results to Solr’s DetectedLanguage. |
| solr/modules/langid/src/test/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessorFactoryTest.java | Adjusts test samples/expectations for the new detector’s behavior and adds a replacement multivalue test. |
| solr/modules/langid/build.gradle | Drops old deps (commons-io, cybozu langdetect) and adds the new language-detection dependency alias. |
| solr/modules/langid/gradle.lockfile | Removes old locked artifacts and adds io.github.azagniotov:language-detection:12.5.2. |
| gradle/libs.versions.toml | Adds version + catalog entry for io.github.azagniotov:language-detection. |
| gradle/validation/rat-sources.gradle | Removes the langdetect-profiles/* exclusion now that those resources are deleted. |
| solr/licenses/language-detection-LICENSE-ASL.txt | Adds the Apache 2.0 license text for the new dependency. |
| solr/licenses/language-detection-NOTICE.txt | Adds NOTICE metadata for the new dependency. |
| solr/licenses/language-detection-12.5.2.jar.sha1 | Adds checksum for the new dependency jar. |
| solr/licenses/langdetect-NOTICE.txt | Removes NOTICE metadata for the old dependency. |
| solr/licenses/langdetect-LICENSE-ASL.txt | Removes license file for the old dependency. |
| solr/licenses/langdetect-1.1-20120112.jar.sha1 | Removes checksum for the old dependency jar. |
| solr/licenses/jsonic-NOTICE.txt | Removes NOTICE metadata for jsonic (previously pulled in by old langdetect). |
| solr/licenses/jsonic-1.2.7.jar.sha1 | Removes checksum for jsonic. |
| changelog/unreleased/SOLR-18208-replace-langdetect.yml | Adds an unreleased changelog entry for the dependency replacement. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/af | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/gu | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/id | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/it | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/ko | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/so | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/sq | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/sw | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/tl | Removes legacy bundled profile resource. |
| solr/modules/langid/src/resources/org/apache/solr/update/processor/langdetect-profiles/vi | Removes legacy bundled profile resource. |
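
The summary mentions mapping detector results to Solr's DetectedLanguage. A minimal sketch of what such a mapping step might look like follows. `RawResult` and `Detected` are hypothetical stand-ins defined locally so the example is self-contained; neither the library's actual result type nor Solr's `DetectedLanguage` class is reproduced here, and ordering the output by descending score is an assumption, not something this PR states.

```java
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; all types here are hypothetical stand-ins. */
final class ResultMappingSketch {

  /** Hypothetical shape of one detector result: a language code and a score. */
  record RawResult(String code, double probability) {}

  /** Hypothetical stand-in for a language-code plus certainty pair. */
  record Detected(String langCode, double certainty) {}

  /** Converts raw results to Detected entries, highest score first (assumed ordering). */
  static List<Detected> toDetected(List<RawResult> raw) {
    return raw.stream()
        .sorted(Comparator.comparingDouble(RawResult::probability).reversed())
        .map(r -> new Detected(r.code(), r.probability()))
        .toList();
  }
}
```

With more supported languages, closely related languages can score near each other, so code consuming a list like this would typically compare the top entries rather than trust the first one blindly.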

github-actions bot added the documentation label Apr 23, 2026
epugh left a comment

A joy to read the code!

I think my one request is that some of the javadocs make sense in explaining to a reader the changes you are making, but once merged, are confusing. Maybe rework the javadocs to explain the nuances of the library, but without referring to the previous version? And any "previous version/current version" notes should go in the Major Changes? There is good stuff there to educate someone upgrading.

janhoy added 6 commits April 26, 2026 20:28
Replace com.cybozu.labs:langdetect (abandoned since 2012) with
io.github.azagniotov:language-detection:12.5.2, a maintained fork
with an active release history.

The new library bundles its own language profiles, so the 53 profile
files previously shipped in the langid module resources are removed.
The factory no longer loads profiles at startup; it creates a shared
LanguageDetectionOrchestrator instead. The processor converts the
field-content Reader to a String and calls orchestrator.detectAll().

commons-io was only used for profile loading and is also removed from
the langid module dependencies.
…adoc

- Rename _parser → parser in base test class and all subclasses
- Remove verbose PR-context Javadoc from test overrides; keep only
  what is useful for a new developer reading the code
janhoy force-pushed the feature/SOLR-18208-replace-langdetect-dependency branch from c8fff37 to f7880dd April 26, 2026 18:51

janhoy commented Apr 28, 2026

> I think my one request is that some of the javadocs make sense in explaining to a reader the changes you are making, but once merged, are confusing.

Thanks, that was eager LLM documentation going on. Fixed it, will merge soon.

janhoy merged commit db36284 into apache:main Apr 28, 2026
5 checks passed
janhoy deleted the feature/SOLR-18208-replace-langdetect-dependency branch April 28, 2026 09:16
Labels

cat:index, dependencies (Dependency upgrades), documentation (Improvements or additions to documentation), module:langid, tests, tool:build