Tutorial

jlareck edited this page Sep 17, 2020 · 7 revisions

How to start developing a new Extractor

Wikidata extractors overview

Wikidata pages are represented in XML+JSON format. To parse the data of Wikidata entities, the Extraction Framework uses the Wikidata Toolkit library. It is the main library for the Wikidata extractors because it contains the methods needed to get data from each entity.

First steps

First, create a class that extends JsonNodeExtractor and implement its extract method, e.g. override def extract(page: JsonNode, subjectUri: String): Seq[Quad]. The extract method is responsible for the data extraction. Its first parameter has the type JsonNode. The JsonNode class has two fields: wikiPage and wikiDataDocument. The JsonNode class:

class JsonNode  (
                  val wikiPage : WikiPage,
                  val wikiDataDocument : JsonDeserializer
                  )
  extends Node(List.empty, 0) {
  def toPlainText: String = ""
  def toWikiText: String = ""
}
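The first step described above can be sketched as follows. This is a minimal, self-contained illustration, not the framework's code: Quad, WikiPage, JsonNode and JsonNodeExtractor are simplified stand-ins defined here so the snippet compiles on its own, and TitleExtractor, the dataset name and the predicate URI are hypothetical.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified stand-ins for the framework types (assumptions, not the real classes).
final case class Quad(dataset: String, subject: String, predicate: String, value: String)
final case class WikiPage(title: String, source: String)
final case class JsonNode(wikiPage: WikiPage)

trait JsonNodeExtractor {
  def extract(page: JsonNode, subjectUri: String): Seq[Quad]
}

// Hypothetical extractor: emits a single quad that carries the page title.
class TitleExtractor extends JsonNodeExtractor {
  override def extract(page: JsonNode, subjectUri: String): Seq[Quad] = {
    val quads = new ArrayBuffer[Quad]()
    quads += Quad("example-dataset", subjectUri, "http://example.org/title", page.wikiPage.title)
    quads.toSeq
  }
}
```

A real extractor follows the same shape: collect quads in a buffer inside extract and return them at the end.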

The WikiPage class represents a Wikipedia, Wikidata, or Wikimedia Commons page. It has the following fields: id, title, redirect, revision, timestamp, contributorID, contributorName, source, format. The source field holds the wikitext of the page from which you want to extract data.

So, to get data from the wikitext, you need to deserialize the source. Before doing so, check what type of entity you have: if the input is a Wikidata Item but you deserialize it as a Wikidata Lexeme, you will get an error. To prevent this error, each Wikidata extractor checks the entity type by namespace. Let's look at the Wikidata Property Extractor:

  override def extract(page: JsonNode, subjectUri: String): Seq[Quad] = {
    val quads = new ArrayBuffer[Quad]()

    val subject = WikidataUtil.getWikidataNamespace(subjectUri).replace("Property:", "")

    quads ++= getAliases(page, subject)
    quads ++= getDescriptions(page, subject)
    quads ++= getLabels(page, subject)
    quads ++= getStatements(page, subject)


    quads
  }

  private def getAliases(document: JsonNode, subjectUri: String): Seq[Quad] = {
    val quads = new ArrayBuffer[Quad]()

    if (document.wikiPage.title.namespace == Namespace.WikidataProperty) {
      val page = document.wikiDataDocument.deserializePropertyDocument(document.wikiPage.source)
      for ((lang, value) <- page.getAliases) {
        val alias = WikidataUtil.replacePunctuation(value.toString, lang)
        Language.get(lang) match {
          case Some(dbpedia_lang) => {
            quads += new Quad(dbpedia_lang, DBpediaDatasets.WikidataProperty, subjectUri, aliasProperty, alias,
              document.wikiPage.sourceIri, context.ontology.datatypes("rdf:langString"))
          }
          case _ =>
        }
      }
    }
    quads
  }

The condition if (document.wikiPage.title.namespace == Namespace.WikidataProperty) checks by namespace whether the input page is a Wikidata Property.
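The idea behind the namespace check can be shown in isolation. The sketch below does not use the framework: Namespace, Page and deserializeProperty are hypothetical stand-ins, and the "document" is just a string. It only illustrates guarding a type-specific deserializer so that pages of another entity type are skipped instead of causing an error.

```scala
object NamespaceGuard {
  // Stand-ins for the framework's namespaces (assumptions for illustration).
  sealed trait Namespace
  object Namespace {
    case object WikidataProperty extends Namespace
    case object WikidataLexeme extends Namespace
    case object Main extends Namespace
  }

  final case class Page(namespace: Namespace, source: String)

  // Hypothetical deserializer that is only valid for property pages.
  private def deserializeProperty(source: String): String = s"PropertyDocument($source)"

  // Returns None instead of failing when the page is not a property.
  def propertyDocument(page: Page): Option[String] =
    if (page.namespace == Namespace.WikidataProperty) Some(deserializeProperty(page.source))
    else None
}
```

Calling NamespaceGuard.propertyDocument on a page from any other namespace yields None, which mirrors how the extractor above produces no quads for pages that are not properties.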

The line val page = document.wikiDataDocument.deserializePropertyDocument(document.wikiPage.source) deserializes the input wikitext. After that, all the data is available through the methods of the resulting PropertyDocument object.

The line for ((lang, value) <- page.getAliases) shows that the aliases can be read from the deserialized page.
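The loop together with the Language.get(lang) match keeps only aliases whose language is known and silently skips the rest. A rough, framework-free sketch of that pattern follows; lookup is a hypothetical stand-in for Language.get, the language table is invented for the example, and the aliases are plain strings rather than Wikidata Toolkit values.

```scala
object AliasSketch {
  // Hypothetical language registry standing in for Language.get.
  private val knownLanguages = Map("en" -> "English", "de" -> "German")

  def lookup(code: String): Option[String] = knownLanguages.get(code)

  // Keeps only the aliases whose language code is known; others are dropped,
  // like the `case _ =>` branch in the extractor. Sorted by code for stable output.
  def knownAliases(aliases: Map[String, String]): Seq[(String, String)] =
    aliases.toSeq.sortBy(_._1).flatMap { case (code, alias) =>
      lookup(code).map(lang => (lang, alias))
    }
}
```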

To save a triple, create a Quad object whose 2nd parameter is the dataset where you want to save the triple, the 3rd is the subject, the 4th is the predicate, and the 5th is the value. In the example from the extractor above, the 1st argument is the language, the 6th is the source IRI of the page, and the 7th is the datatype: quads += new Quad(dbpedia_lang, DBpediaDatasets.WikidataProperty, subjectUri, aliasProperty, alias, document.wikiPage.sourceIri, context.ontology.datatypes("rdf:langString"))

Here we have:

dataset - DBpediaDatasets.WikidataProperty
subject - subjectUri
predicate - aliasProperty
object - alias
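The positional mapping listed above is easier to read with named arguments. The sketch below uses a simplified stand-in for Quad: the field names are assumptions chosen to match the argument order seen in the extractor, and all values are made-up strings rather than real framework objects.

```scala
object QuadSketch {
  // Simplified stand-in for the framework's Quad; field names are assumptions.
  final case class Quad(
    language: String,
    dataset: String,
    subject: String,
    predicate: String,
    value: String,
    context: String,
    datatype: String
  )

  // Same argument order as in the extractor, spelled out by name.
  val aliasQuad: Quad = Quad(
    language = "en",
    dataset = "wikidata-property",          // where the triple is saved
    subject = "http://example.org/P31",     // subjectUri
    predicate = "http://example.org/alias", // aliasProperty
    value = "is a",                         // alias
    context = "http://example.org/source",  // document.wikiPage.sourceIri
    datatype = "rdf:langString"
  )
}
```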

Useful links

  1. WikidataToolkit JsonDeserializer: https://github.com/Wikidata/Wikidata-Toolkit/blob/9d17dd7e1b199189145088e5220ed6f05d9e31e5/wdtk-datamodel/src/main/java/org/wikidata/wdtk/datamodel/helpers/JsonDeserializer.java
  2. Wikidata entities data model: https://www.mediawiki.org/wiki/Wikibase/DataModel
