Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions core/src/main/resources/datasetdefinitions.json
100644 → 100755
Original file line number Diff line number Diff line change
Expand Up @@ -461,6 +461,12 @@
"traits":"LinkedData, Published",
"desc": "Dataset containing links between a language-locale DBpedia instance and DBpedia Commons.",
"defaultgraph": "namespace"
},
"sisterprojects_page_links": {
"name": "Sister Projects Links",
"traits":"LinkedData, Published",
"desc": "Dataset containing links between a language-locale DBpedia instance and Sister projects.",
"defaultgraph": "namespace"
}
},
"wikidata":{
Expand Down
1 change: 1 addition & 0 deletions core/src/main/scala/org/dbpedia/extraction/config/provenance/DBpediaDatasets.scala
100644 → 100755
Original file line number Diff line number Diff line change
Expand Up @@ -230,6 +230,7 @@ object DBpediaDatasets extends java.io.Serializable
val FreebaseLinks: Dataset = datasets("freebase_links")
val GeonamesLinks: Dataset = datasets("geonames_links")
val CommonsLink: Dataset = datasets("commons_page_links")
val SisterProjectLink: Dataset = datasets("sisterprojects_page_links")


/**
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,221 @@
package org.dbpedia.extraction.mappings

import org.dbpedia.extraction.config.provenance.DBpediaDatasets
import org.dbpedia.extraction.transform.Quad

import org.dbpedia.extraction.wikiparser._
import org.dbpedia.extraction.dataparser._

import org.dbpedia.extraction.util.RichString.wrapString
import org.dbpedia.extraction.ontology.Ontology
import org.dbpedia.extraction.util._
import org.dbpedia.extraction.config.mappings.InfoboxExtractorConfig

import scala.collection.mutable.ArrayBuffer
import org.dbpedia.extraction.config.dataparser.DataParserConfig
import org.dbpedia.iri.UriUtils

import scala.language.reflectiveCalls

/**
* This extractor extracts all properties from all infoboxes.
* Extracted information is represented using properties in the http://xx.dbpedia.org/property/
* namespace (where xx is the language code).
Comment on lines +22 to +23
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Extracted information is represented using properties in the http://xx.dbpedia.org/property/
* namespace (where xx is the language code).
* Extracted information is represented using properties in the
* <http://xx.dbpedia.org/property/> namespace (where xx is the language code).

* The names of these properties directly reflect the names of the Wikipedia infobox properties.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The names of the these properties directly reflect the name of the Wikipedia infobox property.
* The names of the these properties directly reflect the names of the Wikipedia infobox properties.

* Property names are not cleaned or merged.
* Property types are not part of a subsumption hierarchy and there is no consistent ontology for the infobox dataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Property types are not part of a subsumption hierarchy and there is no consistent ontology for the infobox dataset.
* Property types are not part of a subsumption hierarchy, and there is no consistent ontology for the infobox dataset.

* The infobox extractor performs only a minimal amount of property value clean-up, e.g., by converting a value like “June 2009” to the XML Schema format “2009-06”.
* You should therefore use the infobox dataset only if your application requires complete coverage of all Wikipedia properties and you are prepared to accept relatively noisy data.
*/
Comment on lines +20 to +29
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Update class documentation to reflect the extractor's actual purpose.

The documentation appears to be copied from InfoboxExtractor and describes infobox property extraction rather than sister project link extraction. It should be updated to explain that this extractor retrieves links to sister projects (Commons, Wiktionary, Wikiquote, etc.) from infobox templates.

🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around lines 20 to 29, the class Scaladoc currently describes infobox property
extraction (copied from InfoboxExtractor) but this extractor actually extracts
links to sister projects (Commons, Wiktionary, Wikiquote, etc.) from infobox
templates; update the class documentation to accurately describe that the
extractor locates and converts sister-project interwiki/infobox links into
DBpedia properties (naming the target sister projects, expected input
templates/fields, output property namespace, and any value-normalization
performed), replacing the erroneous infobox-focused text with a concise
explanation of purpose and behavior.


class SisterProjectExtractor(
context : {
def ontology : Ontology
def language : Language
def redirects : Redirects
}
)
extends PageNodeExtractor
{
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// Configuration
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

private val ontology = context.ontology

private val language = context.language

// Language code (e.g. "en", "fr", "de") used to select language-specific configuration below.
private val wikiCode = language.wikiCode


// Template and property blacklists shared with InfoboxExtractor.
private val ignoreTemplates = InfoboxExtractorConfig.ignoreTemplates

private val ignoreTemplatesRegex = InfoboxExtractorConfig.ignoreTemplatesRegex

private val ignoreProperties = InfoboxExtractorConfig.ignoreProperties

// Datatype used for language-tagged string literals produced by extractValue.
private val rdfLangStrDt = ontology.datatypes("rdf:langString")

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// Regexes
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

// TODO: i18n

private val SplitWordsRegex = InfoboxExtractorConfig.SplitWordsRegex

private val TrailingNumberRegex = InfoboxExtractorConfig.TrailingNumberRegex


// Language-specific regex used to split infobox property values, falling back to the
// English configuration when no dedicated entry exists for this language.
// (getOrElse replaces the previous contains-check + Option.get anti-pattern; behavior is identical.)
private val splitPropertyNodeRegexInfobox =
  DataParserConfig.splitPropertyNodeRegexInfobox.getOrElse(wikiCode, DataParserConfig.splitPropertyNodeRegexInfobox("en"))
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// Parsers
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// NOTE(review): owl:sameAs asserts strict co-reference, but a sister-project page (e.g. a
// Wiktionary entry) is not the same entity as the Wikipedia article. Consider skos:related
// or dedicated per-project properties instead — TODO confirm with ontology maintainers.
private val sameAsProperty = context.ontology.properties("owl:sameAs")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Replace owl:sameAs with an appropriate predicate.

As discussed in the PR comments, owl:sameAs is semantically incorrect for sister project links because it denotes strict co-reference (subject and object identify the same entity). A Wiktionary page is not the same entity as a Wikipedia article.

Consider using skos:related or creating dedicated properties for each sister project type (e.g., dbo:wiktionaryPage, dbo:commonsPage), as suggested by datalogism in the PR discussion.

🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around line 76, the code currently uses
context.ontology.properties("owl:sameAs") which is semantically incorrect for
sister project links; replace this with an appropriate predicate such as
"skos:related" or, preferably, add and use dedicated properties per project
(e.g., "dbo:wiktionaryPage", "dbo:commonsPage") in the ontology and then
reference them via context.ontology.properties("<predicate>"); update the
variable name to reflect the new predicate (e.g., relatedProperty or
wiktionaryProperty), change any downstream uses to emit the new predicate, and
add/update tests/fixtures and ontology configuration to include the chosen
predicates.



/// NEED TO BE EXTENDED TO OTHER LANGUAGES
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// NEED TO BE EXTENDED TO OTHERS LANGUAGES
/// NEED TO BE EXTENDED TO OTHER LANGUAGES

// Template-name patterns (lower-cased, matched against resolved template titles) that
// identify "sister projects" infoboxes, per language.
// TODO: extend to other languages.
private val regexMap = Map(
  "fr" -> List("autres projet.*".r),
  "en" -> List("sister project.*".r),
  "de" -> List("schwesterprojekte.*".r)
)

// Fall back to an empty list so the extractor is simply a no-op for languages without
// configured patterns, instead of crashing with NoSuchElementException at construction.
private val currentRegexList = regexMap.getOrElse(wikiCode, List.empty)
Comment on lines +80 to +88
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard against unsupported language codes.

Line 88 accesses regexMap(wikiCode) directly, which will throw NoSuchElementException if the language code is not in the map. Since only three languages are currently supported (fr, en, de), this extractor will fail for all other languages.

Apply this diff to fail gracefully:

-  private val currentRegexList = regexMap(wikiCode)
+  private val currentRegexList = regexMap.getOrElse(wikiCode, {
+    throw new IllegalArgumentException(s"SisterProjectExtractor does not support language code: $wikiCode. Supported languages: ${regexMap.keys.mkString(", ")}")
+  })

Alternatively, return an empty list to skip processing for unsupported languages:

-  private val currentRegexList = regexMap(wikiCode)
+  private val currentRegexList = regexMap.getOrElse(wikiCode, List.empty)
🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around lines 80 to 88, the code directly indexes regexMap with wikiCode which
throws NoSuchElementException for unsupported languages; update it to guard
lookup (use Map.get or contains) and either default to an empty List when
wikiCode is not present or handle the missing case by logging/warning and
skipping processing, ensuring currentRegexList is assigned via a safe getOrElse
so the extractor fails gracefully for unsupported languages.


// Primary parser: extracts full wiki-link objects from property values.
private val objectParser = new ObjectParser(context, true)
// Fallback parser used by extractLinks when objectParser does not match every split node.
private val linkParser = new LinkParser(true)
Comment on lines +91 to +92
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove outdated comment.

The comment "USELESS ?" is incorrect since linkParser is used in the extractLinks method at line 197.

🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around lines 91 to 92, remove the outdated "USELESS ?" comment that incorrectly
claims linkParser is useless—linkParser is used by extractLinks at line 197;
delete that comment (or replace it with a short accurate comment describing
linkParser’s purpose) so the code comment no longer misleads.


// Maps interwiki prefix abbreviations (as they appear in template parameter keys)
// to the full sister-project name used as key in mapProjects.
private val mapAbrev = Map(
  "c" -> "commons",
  "wikt" -> "wiktionary",
  "n" -> "wikinews",
  "voy" -> "wikivoyage",
  "q" -> "wikiquote",
  "s" -> "wikisource",
  "b" -> "wikibooks",
  "v" -> "wikiversity",
  "wikispecies" -> "species"
)
// Base URIs per sister project. The project name inside the URI is later replaced with
// "<lang>.<project>" to build a language-specific URL.
private val mapProjects = Map(
  "commons" -> "http://commons.dbpedia.org/resource/",
  "wiktionary" -> "https://wiktionary.org/wiki/",
  "wikinews" -> "https://wikinews.org/wiki/",
  "wikivoyage" -> "https://wikivoyage.org/wiki/",
  "wikiquote" -> "https://wikiquote.org/wiki/",
  "wikisource" -> "https://wikisource.org/wiki/",
  "wikibooks" -> "https://wikibooks.org/wiki/",
  // Fixed: previously pointed to wikibooks.org by mistake.
  "wikiversity" -> "https://wikiversity.org/wiki/",
  "species" -> "https://species.wikimedia.org/wiki/"
)
Comment on lines +105 to +116
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix incorrect URL for Wikiversity.

Line 113 maps "wikiversity" to the Wikibooks URL (https://wikibooks.org/wiki/) instead of the correct Wikiversity URL.

Apply this diff:

     "wikisource" -> "https://wikisource.org/wiki/",
     "wikibooks" -> "https://wikibooks.org/wiki/",
-    "wikiversity" -> "https://wikibooks.org/wiki/",
+    "wikiversity" -> "https://wikiversity.org/wiki/",
     "species" -> "https://species.wikimedia.org/wiki/"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
private val mapProjects = Map(
"commons" -> "http://commons.dbpedia.org/resource/",
"wiktionary" -> "https://wiktionary.org/wiki/",
"wikinews" -> "https://wikinews.org/wiki/",
"wikivoyage" -> "https://wikivoyage.org/wiki/",
"wikiquote" -> "https://wikiquote.org/wiki/",
"wikisource" -> "https://wikisource.org/wiki/",
"wikibooks" -> "https://wikibooks.org/wiki/",
"wikiversity" -> "https://wikibooks.org/wiki/",
"species" -> "https://species.wikimedia.org/wiki/"
)
private val mapProjects = Map(
"commons" -> "http://commons.dbpedia.org/resource/",
"wiktionary" -> "https://wiktionary.org/wiki/",
"wikinews" -> "https://wikinews.org/wiki/",
"wikivoyage" -> "https://wikivoyage.org/wiki/",
"wikiquote" -> "https://wikiquote.org/wiki/",
"wikisource" -> "https://wikisource.org/wiki/",
"wikibooks" -> "https://wikibooks.org/wiki/",
"wikiversity" -> "https://wikiversity.org/wiki/",
"species" -> "https://species.wikimedia.org/wiki/"
)
🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around lines 105 to 116, the mapProjects entry for "wikiversity" incorrectly
points to the Wikibooks URL; update the value for the "wikiversity" key to the
correct Wikiversity base URL "https://wikiversity.org/wiki/". Ensure only that
map entry is changed and the rest of the Map remains intact.


override val datasets = Set(DBpediaDatasets.SisterProjectLink)


override def extract(node : PageNode, subjectUri : String) : Seq[Quad] =
{
if(node.title.namespace != Namespace.Main && !ExtractorUtils.titleContainsCommonsMetadata(node.title)) return Seq.empty
val quads = new ArrayBuffer[Quad]()

/** Retrieve all templates on the page which are not ignored */
for { template <- InfoboxExtractor.collectTemplates(node)
resolvedTitle = context.redirects.resolve(template.title).decoded.toLowerCase
if !ignoreTemplates.contains(resolvedTitle)
if !ignoreTemplatesRegex.exists(regex => regex.unapplySeq(resolvedTitle).isDefined)
if currentRegexList.exists(regex => regex.findFirstMatchIn(resolvedTitle).isDefined)
}
{
val propertyList = template.children.filterNot(property => ignoreProperties.get(wikiCode).getOrElse(ignoreProperties("en")).contains(property.key.toLowerCase))


for(property <- propertyList; if (!property.key.forall(_.isDigit))) {

// TODO clean HTML

val cleanedPropertyNode = NodeUtil.removeParentheses(property)

val splitPropertyNodes = NodeUtil.splitPropertyNode(cleanedPropertyNode, splitPropertyNodeRegexInfobox)


for(splitNode <- splitPropertyNodes; pr <- extractValue(splitNode))
{
val propertyUri = getPropertyUri(property.key)
try
{

if (mapProjects.contains(property.key)) {
val value = mapProjects(property.key).replace(property.key, language.wikiCode + "." + property.key) + WikiUtil.wikiEncode(pr.value)
quads += new Quad(language, DBpediaDatasets.SisterProjectLink, subjectUri, sameAsProperty, value, splitNode.sourceIri, null)
}
if (mapAbrev.contains(property.key)) {
val keyProj=mapAbrev(property.key)
val value = mapProjects(keyProj).replace(keyProj, language.wikiCode + "." + keyProj) + WikiUtil.wikiEncode(pr.value)
quads += new Quad(language, DBpediaDatasets.SisterProjectLink, subjectUri, sameAsProperty, value, splitNode.sourceIri, null)
}
Comment on lines +152 to +160
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Verify and fix URL construction logic.

The URL construction logic appears problematic:

  1. Line 153: mapProjects(property.key).replace(property.key, language.wikiCode + "." + property.key)

    • For "commons" property with URL "http://commons.dbpedia.org/resource/", replacing "commons" with "en.commons" would produce "http://en.commons.dbpedia.org/resource/", which may not be the intended URL structure.
    • For "wiktionary" with URL "https://wiktionary.org/wiki/", the same replace would produce "https://en.wiktionary.org/wiki/", which seems correct but is fragile.
  2. The logic assumes the property key appears exactly once in the URL string at the right position, which is fragile and error-prone.

Consider building the URLs explicitly rather than using string replacement:

if (mapProjects.contains(property.key)) {
  val baseProject = property.key
  val value = baseProject match {
    case "commons" => s"http://commons.dbpedia.org/resource/${WikiUtil.wikiEncode(pr.value)}"
    case _ => s"https://${language.wikiCode}.${baseProject}.org/wiki/${WikiUtil.wikiEncode(pr.value)}"
  }
  quads += new Quad(language, DBpediaDatasets.SisterProjectLink, subjectUri, sameAsProperty, value, splitNode.sourceIri, null)
}
🤖 Prompt for AI Agents
core/src/main/scala/org/dbpedia/extraction/mappings/SisterProjectExtractor.scala
around lines 152 to 160: the current URL construction uses string.replace on
mapProjects(...) which is fragile and can produce incorrect hostnames (e.g.
replacing "commons" produces "en.commons.dbpedia.org"); instead construct the
target URL explicitly based on the project key and language: treat commons (and
any other special-case projects) with their correct fixed base (e.g.
"http://commons.dbpedia.org/resource/") and for standard wiki projects build
"https://{language.wikiCode}.{project}.org/wiki/{WikiUtil.wikiEncode(pr.value)}";
apply the same explicit-construction logic to the mapAbrev branch (use the
mapped keyProj then build the URL), remove the string.replace usage, and ensure
WikiUtil.wikiEncode(pr.value) is appended to the constructed base.

}
catch
{
case ex : IllegalArgumentException => println(ex)
}
}
}
//}
}

quads
}


/**
 * Extracts the value of a property node: prefers parsed links, and falls back to a
 * plain language-tagged string literal when no links were found.
 */
private def extractValue(node: PropertyNode): List[ParseResult[String]] = {
  val parsedLinks = extractLinks(node)
  if (parsedLinks.nonEmpty)
    parsedLinks
  else
    StringParser.parse(node).map(result => ParseResult(result.value, None, Some(rdfLangStrDt))).toList
}

/**
 * Attempts to parse every part of a property value as a link.
 *
 * The value is split into candidate nodes first; a parse is only accepted when *every*
 * split node yields a link, so that partially-parseable values are not silently truncated.
 *
 * Fixes a latent scala.MatchError: the previous first `match` had a single guarded case
 * and no fall-through, so any value where the object parser did not match every split
 * node crashed instead of falling back to the plain link parser.
 */
private def extractLinks(node : PropertyNode) : List[ParseResult[String]] =
{
  val splitNodes = NodeUtil.splitPropertyNode(node, """\s*\W+\s*""")

  // First attempt: full object (wiki-link) parsing.
  val objectLinks = splitNodes.flatMap(splitNode => objectParser.parse(splitNode))
  if (objectLinks.size == splitNodes.size) {
    objectLinks
  } else {
    // Fallback: plain link parsing with URI clean-up; unparseable links are discarded.
    val rawLinks = splitNodes.flatMap(splitNode => linkParser.parse(splitNode))
    if (rawLinks.size == splitNodes.size)
      rawLinks.map(link => UriUtils.cleanLink(link.value)).collect { case Some(cleaned) => ParseResult(cleaned) }
    else
      List.empty
  }
}


/**
 * Builds the DBpedia property URI for a raw infobox property key.
 *
 * The key is lower-cased and camelCased, trailing digits are stripped so properties
 * like LeaderName1, LeaderName2, ... all collapse to LeaderName, and residual
 * whitespace is cleaned before appending to the language's property namespace.
 */
private def getPropertyUri(key : String) : String =
{
  val camelCased = key.toLowerCase(language.locale).trim.toCamelCase(SplitWordsRegex, language.locale)
  val withoutTrailingNumber = TrailingNumberRegex.replaceFirstIn(camelCased, "")
  language.propertyUri.append(WikiUtil.cleanSpace(withoutTrailingNumber))
}


}