refactor: Replace hardcoded URIs with ontology lookups and remove whitespaces in GenderExtractor #824

Open: vaibhav45sktech wants to merge 1 commit into dbpedia:master from vaibhav45sktech:fix-gender-extractor

@vaibhav45sktech (Contributor) commented Jan 24, 2026

Replaces hardcoded URI strings with context.ontology lookups and improves code quality.

Changes:

  • Use context.ontology.properties() and context.ontology.classes() instead of raw URIs
  • Fix pronoun regex: word boundaries + case-insensitive + proper escaping
  • Pre-instantiate langStringDatatype at class level
  • Handle division-by-zero in gender ratio calculation
  • Clean up whitespace and formatting
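
The regex fix in the list above (word boundaries, case-insensitivity, escaping) could look roughly like the following. This is a minimal sketch, not the code from this PR: the object name `PronounCountSketch` and the helper `countPronoun` are hypothetical, and it relies only on `java.util.regex.Pattern` from the JDK.

```scala
import java.util.regex.Pattern

object PronounCountSketch {
  // Hypothetical helper (not from the PR): counts whole-word,
  // case-insensitive occurrences of `pronoun` in `text`.
  def countPronoun(text: String, pronoun: String): Int = {
    val pattern = Pattern.compile(
      // \b anchors the match to word boundaries; Pattern.quote escapes
      // any regex metacharacters that the pronoun string may contain.
      "\\b" + Pattern.quote(pronoun) + "\\b",
      Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE
    )
    val matcher = pattern.matcher(text)
    var count = 0
    while (matcher.find()) count += 1
    count
  }
}
```

Without the word boundaries, a pronoun such as "he" would also be counted inside "She", "The", or "sheriff", which is presumably the kind of miscount the fix targets.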

Resolves issue #825

Summary by CodeRabbit

Release Notes

  • Improvements
    • Enhanced gender extraction with improved validation for entity recognition
    • Better language-specific output formatting in results

@coderabbitai (bot) commented Jan 24, 2026

Walkthrough

Refactors GenderExtractor to use ontology-derived properties instead of hard-coded strings, introduces explicit language code extraction and gender mapping from configuration, adds Person-type verification, reworks pronoun counting with case-insensitive word-boundary matching, and implements threshold-based gender determination with ratio validation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Gender Extractor Logic Refactoring: core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala | Replaces string-constant predicates with context.ontology lookups (foaf:gender, rdf:type); adds language code extraction and pronoun mapping from GenderExtractorConfig; introduces Person-type verification check; reworks text analysis to count pronouns with case-insensitive word-boundary matching; replaces pairwise max/min comparisons with sorted-count approach; implements explicit threshold logic (minDifference, minCount) for gender determination; outputs single Quad with langString datatype only when thresholds are met |
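
The sorted-count and threshold steps summarized above can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: `GenderDecisionSketch` and `decide` are hypothetical names, the threshold values are made up (the real ones live in GenderExtractorConfig), and it assumes minCount is a lower bound on the top count and minDifference a lower bound on the ratio between the top count and the runner-up.

```scala
object GenderDecisionSketch {
  // Hypothetical thresholds for illustration only.
  val minCount = 5
  val minDifference = 2.0

  // Returns the dominant gender only when both thresholds are met.
  def decide(counts: Map[String, Int]): Option[String] = {
    // Sort descending by count instead of comparing pairs by hand.
    val sorted = counts.toSeq.sortBy { case (_, n) => -n }
    sorted.headOption.flatMap { case (gender, top) =>
      val second = sorted.drop(1).headOption.map(_._2).getOrElse(0)
      // Guard against division by zero: with no runner-up occurrences,
      // the ratio check is treated as satisfied (an assumption here).
      val ratioOk = second == 0 || top.toDouble / second >= minDifference
      if (top >= minCount && ratioOk) Some(gender) else None
    }
  }
}
```

Returning an `Option` mirrors the described behavior of emitting a Quad only when the thresholds are met; whether the comparisons are strict or inclusive in the actual code is not stated in this conversation.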

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately reflects the main refactoring changes: replacing hardcoded URIs with ontology lookups and improving regex patterns with proper word boundaries. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@vaibhav45sktech (Contributor, Author) commented

Greetings @jimkont, kindly review my PR whenever you are available.
