Fix template text extraction for Lang, Native name, and Nihongo templates #828

Open
vaibhav45sktech wants to merge 3 commits into dbpedia:master from vaibhav45sktech:fix-template-text-extraction

Conversation

vaibhav45sktech (Contributor) commented Jan 27, 2026

Problem

Templates like {{lang|nap|Abbrùzzu}} and {{Nihongo2|東京都}} in Wikipedia infoboxes
were not being extracted, resulting in missing text content in DBpedia.

Root Cause

The Lang rule was configured to extract parameter 3, but {{lang}} takes only two positional parameters (language code, then text), so the extracted text was always empty.
Additionally, the Native name, Nihongo, and Nihongo2 templates had no rules configured at all.

Fix

Updated templatetransform.json:

  • Lang: Extract param 2 (was incorrectly param 3)
  • Native name|native_name: Added - extracts param 2
  • Nihongo2: Added - extracts param 1
  • Nihongo: Added - extracts param 2
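
The effect of the index change can be sketched in isolation. The snippet below is a minimal illustration and not the framework's actual parser or the real format of templatetransform.json: `TemplateParamSketch`, `paramIndex`, and `extractText` are hypothetical names, and the plain split on `|` assumes a template with no nested templates, links, or named parameters.

```scala
object TemplateParamSketch {
  // Hypothetical mirror of the rules listed above:
  // lower-cased template name -> 1-based positional parameter holding the text.
  val paramIndex: Map[String, Int] = Map(
    "lang"        -> 2,
    "native name" -> 2,
    "native_name" -> 2,
    "nihongo2"    -> 1,
    "nihongo"     -> 2
  )

  // Returns the configured positional parameter, or None when the input is not
  // a template, the template has no rule, or the parameter is missing or empty.
  def extractText(wikiText: String, rules: Map[String, Int] = paramIndex): Option[String] = {
    val trimmed = wikiText.trim
    if (!trimmed.startsWith("{{") || !trimmed.endsWith("}}")) None
    else {
      // parts(0) is the template name, parts(1) the first parameter, and so on.
      val parts = trimmed.substring(2, trimmed.length - 2).split("\\|", -1).map(_.trim).toVector
      for {
        index <- rules.get(parts.head.toLowerCase)
        value <- parts.lift(index)
        if value.nonEmpty
      } yield value
    }
  }
}
```

Under these assumptions, `TemplateParamSketch.extractText("{{lang|nap|Abbrùzzu}}")` yields `Some("Abbrùzzu")`, while passing the old setting as `Map("lang" -> 3)` yields `None`, because a two-parameter template has nothing at position 3.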

Examples

Template                   Before     After
{{lang|nap|Abbrùzzu}}      (empty)    Abbrùzzu
{{Nihongo2|東京都}}         (empty)    東京都

Testing

  • Added test cases to TemplateTransformParserTest.scala
  • Verified configuration with standalone validation script

fixes issue #747

Summary by CodeRabbit

  • New Features

    • Enhanced template parsing to support additional language template formats for improved localization data extraction.
  • Tests

    • Added test coverage for new language template parsing scenarios.


coderabbitai (bot) commented Jan 27, 2026

Walkthrough

Template transformation rules in the configuration are updated to handle new wiki template patterns for native names and Japanese text. Corresponding test cases are added to verify text extraction from these newly supported templates.

Changes

Cohort / File(s): Template Transformation Configuration (core/src/main/resources/templatetransform.json)
Summary: Modified Lang replacement rule pattern; added three new public key entries (native_name, Nihongo2, Nihongo) with textNode transformers and corresponding replacement patterns

Cohort / File(s): Template Parser Tests (core/src/test/scala/org/dbpedia/extraction/wikiparser/TemplateTransformParserTest.scala)
Summary: Added four test cases verifying text extraction from lang, native_name, Nihongo2, and Nihongo wiki template variants

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 3 passed

Check name           Status      Explanation
Description Check    ✅ Passed   Check skipped - CodeRabbit’s high-level summary is enabled.
Title check          ✅ Passed   The title directly describes the main change: fixing template text extraction for three specific template types that are central to the changeset.
Docstring Coverage   ✅ Passed   No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

coderabbitai (bot) left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala`:
- Around line 41-42: The Datatype constructor is being called with only one
argument for langStringDatatype which causes a compile error; update the
initialization of langStringDatatype (the private val langStringDatatype) to
supply the required three parameters (name, labels, comments) or retrieve the
existing datatype from the ontology; for example, use
context.ontology.datatypes.getOrElse("rdf:langString", new
Datatype("rdf:langString", Map.empty[String,String], Map.empty[String,String]))
so Datatype is constructed with the proper arguments or the ontology-provided
instance is used.
- Line 80: The check "if (genderCounts.isEmpty) return Seq.empty" is incorrect
because genderCounts may contain zero-valued entries even when no pronouns
matched; update the early-exit to check actual matched counts instead—either
remove the check entirely and rely on the later "maxCount >
GenderExtractorConfig.minCount" guard, or replace it with a concrete check such
as "if (genderCounts.values.forall(_ == 0)) return Seq.empty" (or compute
maxCount here and return when maxCount == 0) to ensure we only exit when no
pronouns were matched; reference variables: genderCounts, pronounMap, and
GenderExtractorConfig.minCount within class GenderExtractor.
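
The second point can be reproduced without the extractor itself. The sketch below assumes the counting pattern from the suggested change (an immutable Map with `withDefaultValue(0)`, updated once per pronoun); `GenderCountSketch`, its two-entry `pronounMap`, and `hasMatches` are illustrative stand-ins, not code from GenderExtractor.

```scala
import scala.util.matching.Regex

object GenderCountSketch {
  // Illustrative stand-in for the extractor's pronoun table (assumption).
  val pronounMap: Map[String, String] = Map("he" -> "male", "she" -> "female")

  // Counts whole-word, case-insensitive pronoun matches per gender.
  // Every gender in pronounMap gets an entry, even when its count is 0.
  def countGenders(text: String): Map[String, Int] =
    pronounMap.foldLeft(Map.empty[String, Int].withDefaultValue(0)) {
      case (counts, (pronoun, gender)) =>
        val regex = new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b")
        counts.updated(gender, counts(gender) + regex.findAllIn(text).size)
    }

  // True only when at least one pronoun actually matched.
  def hasMatches(counts: Map[String, Int]): Boolean = counts.values.exists(_ > 0)
}
```

For a text with no pronouns, such as "The tower was built in 1889.", `countGenders` returns a map with two zero-valued entries, so `isEmpty` is false even though nothing matched; a check on the values, as in `hasMatches`, is what distinguishes the two cases.
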
🧹 Nitpick comments (1)
core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala (1)

66-78: Consider a functional approach for pronoun counting.

The mutable reassignment pattern can be replaced with a more idiomatic functional approach using foldLeft or groupMapReduce.

♻️ Suggested functional alternative
-   var genderCounts: Map[String, Int] =
-     Map.empty.withDefaultValue(0)
-
-   for ((pronoun, gender) <- pronounMap) {
-     val regex =
-       new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b")
-
-     val count =
-       regex.findAllIn(wikiText).size
-
-     genderCounts =
-       genderCounts.updated(gender, genderCounts(gender) + count)
+   val genderCounts: Map[String, Int] =
+     pronounMap.foldLeft(Map.empty[String, Int].withDefaultValue(0)) {
+       case (counts, (pronoun, gender)) =>
+         val regex = new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b")
+         val count = regex.findAllIn(wikiText).size
+         counts.updated(gender, counts(gender) + count)
    }
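
The comment also mentions groupMapReduce without showing it. A minimal sketch of that variant follows, assuming Scala 2.13 or later (where `groupMapReduce` was added to the collections library); `GroupMapReduceSketch` and `countMatches` are hypothetical names, and the pronoun map is passed in as a parameter rather than taken from the extractor.

```scala
import scala.util.matching.Regex

object GroupMapReduceSketch {
  // Whole-word, case-insensitive match count for a single pronoun.
  private def countMatches(pronoun: String, text: String): Int =
    new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b").findAllIn(text).size

  // Group (pronoun, gender) pairs by gender, map each pair to its match count,
  // and sum the counts within each gender.
  def countGenders(pronounMap: Map[String, String], wikiText: String): Map[String, Int] =
    pronounMap.toSeq.groupMapReduce(_._2)(entry => countMatches(entry._1, wikiText))(_ + _)
}
```

Unlike the foldLeft version, the resulting map carries no default value, so looking up a gender that is absent from `pronounMap` throws instead of returning 0; callers would need `getOrElse(gender, 0)` in that case.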

vaibhav45sktech (Contributor, Author) commented

Greetings @TallTed, kindly review my PR whenever you are available.

TallTed (Contributor) commented Jan 28, 2026

@vaibhav45sktech — Your PR is beyond my scope. Please look into CODEOWNERS and the like.

vaibhav45sktech (Contributor, Author) commented

@vaibhav45sktech — Your PR is beyond my scope. Please look into CODEOWNERS and the like.

Thanks @TallTed

vaibhav45sktech (Contributor, Author) commented

Greetings @jimkont, could you kindly review my PR whenever you are available?
