Fix template text extraction for Lang, Native name, and Nihongo templates #828
vaibhav45sktech wants to merge 3 commits into dbpedia:master
📝 Walkthrough

Template transformation rules in the configuration are updated to handle new wiki template patterns for native names and Japanese text. Corresponding test cases are added to verify text extraction from these newly supported templates.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala`:
- Around line 41-42: The Datatype constructor is being called with only one
argument for langStringDatatype which causes a compile error; update the
initialization of langStringDatatype (the private val langStringDatatype) to
supply the required three parameters (name, labels, comments) or retrieve the
existing datatype from the ontology; for example, use
context.ontology.datatypes.getOrElse("rdf:langString", new
Datatype("rdf:langString", Map.empty[String,String], Map.empty[String,String]))
so Datatype is constructed with the proper arguments or the ontology-provided
instance is used.
- Line 80: The check "if (genderCounts.isEmpty) return Seq.empty" is incorrect
because genderCounts may contain zero-valued entries even when no pronouns
matched; update the early-exit to check actual matched counts instead—either
remove the check entirely and rely on the later "maxCount >
GenderExtractorConfig.minCount" guard, or replace it with a concrete check such
as "if (genderCounts.values.forall(_ == 0)) return Seq.empty" (or compute
maxCount here and return when maxCount == 0) to ensure we only exit when no
pronouns were matched; reference variables: genderCounts, pronounMap, and
GenderExtractorConfig.minCount within class GenderExtractor.
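
The second point can be reproduced in isolation. The following is a minimal sketch, not the actual `GenderExtractor` code: the object name, the two-entry `pronounMap`, and the helper `noPronounsMatched` are assumptions made for illustration. It shows that the counting loop leaves one zero-valued entry per gender, so an `isEmpty` check never fires, while a check on the values does.

```scala
import scala.util.matching.Regex

object GenderCountSketch {
  // Hypothetical pronoun table; the real pronounMap is not shown in this thread.
  val pronounMap: Map[String, String] = Map("he" -> "male", "she" -> "female")

  // Counts whole-word, case-insensitive pronoun matches per gender.
  // Every gender gets an entry, even when its count is 0.
  def countGenders(wikiText: String): Map[String, Int] =
    pronounMap.foldLeft(Map.empty[String, Int].withDefaultValue(0)) {
      case (counts, (pronoun, gender)) =>
        val regex = new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b")
        counts.updated(gender, counts(gender) + regex.findAllIn(wikiText).size)
    }

  // The early-exit condition suggested in the review comment.
  def noPronounsMatched(counts: Map[String, Int]): Boolean =
    counts.values.forall(_ == 0)
}
```

For a text without pronouns, `countGenders` returns a non-empty map whose values are all zero, so only `noPronounsMatched` detects the case.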
🧹 Nitpick comments (1)
core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala (1)
66-78: Consider a functional approach for pronoun counting.

The mutable reassignment pattern can be replaced with a more idiomatic functional approach using `foldLeft` or `groupMapReduce`.

♻️ Suggested functional alternative

    -    var genderCounts: Map[String, Int] =
    -      Map.empty.withDefaultValue(0)
    -
    -    for ((pronoun, gender) <- pronounMap) {
    -      val regex =
    -        new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b")
    -
    -      val count =
    -        regex.findAllIn(wikiText).size
    -
    -      genderCounts =
    -        genderCounts.updated(gender, genderCounts(gender) + count)
    +    val genderCounts: Map[String, Int] =
    +      pronounMap.foldLeft(Map.empty[String, Int].withDefaultValue(0)) {
    +        case (counts, (pronoun, gender)) =>
    +          val regex = new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b")
    +          val count = regex.findAllIn(wikiText).size
    +          counts.updated(gender, counts(gender) + count)
         }
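
The nitpick also names `groupMapReduce` as an option. A rough sketch of that variant follows, assuming Scala 2.13 or later (where `groupMapReduce` is available) and a hypothetical four-entry `pronounMap`; the object and helper names are illustrative only.

```scala
import scala.util.matching.Regex

object GroupMapReduceSketch {
  // Hypothetical pronoun table; the real pronounMap is not shown in this thread.
  val pronounMap: Map[String, String] =
    Map("he" -> "male", "him" -> "male", "she" -> "female", "her" -> "female")

  // Whole-word, case-insensitive occurrence count of one word.
  private def occurrences(word: String, text: String): Int =
    new Regex("(?i)\\b" + Regex.quote(word) + "\\b").findAllIn(text).size

  // Group entries by gender, map each pronoun to its count, and sum per group.
  def countGenders(wikiText: String): Map[String, Int] =
    pronounMap.groupMapReduce(_._2)(entry => occurrences(entry._1, wikiText))(_ + _)
}
```

Unlike the `foldLeft` version, the resulting map has no default value, so looking up a gender that is absent from `pronounMap` throws instead of returning 0.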
Greetings @TallTed, kindly review my PR whenever you are available.

@vaibhav45sktech, your PR is beyond my scope. Please look into CODEOWNERS and the like.

Thanks @TallTed.

Greetings @jimkont, could you kindly review my PR whenever you are available?



Problem

Templates like `{{lang|nap|Abbrùzzu}}` and `{{Nihongo2|東京都}}` in Wikipedia infoboxes were not being extracted, resulting in missing text content in DBpedia.

Root Cause

The `Lang` template was configured to extract parameter 3, but `{{lang}}` only has 2 parameters. Additionally, the `Native name`, `Nihongo`, and `Nihongo2` templates were not configured.

Fix

Updated templatetransform.json:

Examples

`{{lang|nap|Abbrùzzu}}`
`{{Nihongo2|東京都}}`

Testing

Fixes issue #747
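
The root cause can be illustrated with a small standalone sketch. This is not the DBpedia extraction code and does not reflect the format of templatetransform.json; `TemplateParamSketch`, `parse`, and `extract` are hypothetical names, and the parser assumes flat templates with positional parameters only. Treating parameter 1 as the text of `Nihongo2` is likewise an assumption based on the example above.

```scala
object TemplateParamSketch {
  // Splits "{{name|p1|p2}}" into the template name and its positional parameters.
  // Assumes no nested templates and no named parameters.
  def parse(template: String): Option[(String, Vector[String])] = {
    val t = template.trim
    if (t.startsWith("{{") && t.endsWith("}}")) {
      val parts = t.substring(2, t.length - 2).split('|').map(_.trim).toVector
      parts.headOption.filter(_.nonEmpty).map(name => (name, parts.tail))
    } else None
  }

  // Returns the 1-based positional parameter, or None if it does not exist.
  def extract(template: String, paramIndex: Int): Option[String] =
    parse(template).flatMap { case (_, params) => params.lift(paramIndex - 1) }
}
```

With this sketch, asking for parameter 3 of `{{lang|nap|Abbrùzzu}}` yields nothing, while parameter 2 yields the text, which matches the behavior described under Root Cause.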
Summary by CodeRabbit
New Features
Tests