dbpedia · JJ-Author · May 23, 2021 · May 24, 2021 · May 27, 2021 · May 27, 2021
diff --git a/.github/ISSUE_TEMPLATE/data.md b/.github/ISSUE_TEMPLATE/data.md
@@ -7,55 +7,32 @@ assignees: ''
 
 ---
 
-# Issue still valid?
-> DBpedia updates frequently in this order: 1. DIEF software (extracts data from wikidata), 2. monthly dumps, 3. online services loaded from dumps.
-> We update http://dief.tools.dbpedia.org/server/extraction/ on a daily basis from the git and it reflects the current state. 
-> 
-> **Disclaimer:** The public SPARQL endpoints (e.g., http://dbpedia.org/sparql) and other applications build based on DBpedia's data are not in sync yet with the latest monthly extracted data. 
->
-> Therefore, you can use this tool to extract an example page and check if the error persists in the latest software version, and add the link you used for verification, e.g., http://dief.tools.dbpedia.org/server/extraction/en/extract?title=United+States
+# Issue validity
+> Some explanation: DBpedia Snapshot is produced every three months, see [Release Frequency & Schedule](https://www.dbpedia.org/blog/snapshot-2021-06-release/#anchor1), which is loaded into http://dbpedia.org/sparql . During these three months, Wikipedia changes and also the DBpedia Information Extraction Framework receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/  we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. `Berlin` or `Joe_Biden` here: http://dief.tools.dbpedia.org/server/extraction/en/
+> If the issue persists, please post the link from your browser here: 
 
-# Source
-> Where did you find the data issue? Pick one, remove the others.
-
-### Web / SPARQL 
-> State the service (e.g. http://dbpedia.org/sparql) and the SPARQL query  
-> give a link to the web / linked data pages (e.g. http://dbpedia.org/resource/Berlin)
-
-### Release Dumps
-> DBpedia provides monthly release dumps, cf. release-dashboard.dbpedia.org
-> provide artifact & version or download link
-
-### Running the DBpedia Extraction (DIEF) software 
-> Please include all necessary information.
-
-
-# Classification
-> If you have some familiarity with DBpedia, please use the classification tags at (link) to correctly file this issue.  Otherwise skip this step. 
-
-
-
-### Error Description
+# Error Description
 > Please state the nature of your technical emergency: 
 
+# Pinpointing the source of the error
+> Where did you find the data issue? Non-exhaustive options are:
+* Web/SPARQL, e.g. http://dbpedia.org/sparql or http://dbpedia.org/resource/Berlin, please **provide query or link**
+* Dumps: dumps are managed by the Databus. Please **provide artifact & version or download link**
+* DIEF: you ran the software and the error occurred then, please **include all necessary information such as the extractor or log**. If you had problems running the software, use [another issue template](https://github.com/dbpedia/extraction-framework/issues/new/choose)
 
-### Error specification
-> Pick the appropriate:
+# Details
+> please post the details
 
-- Affected extraction artifacts (Databus artifact version or file identifiers):
-	- https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects/mappingbased-objects_lang=en_disjointDomain.ttl.bz2
-	- 
-- Example DBpedia resource URL(s) having the error (one full IRI per line): 
-	- http://dbpedia.org/resource/Leipzig 
-	- 
-- Erroneous triples RDF snippet (NTRIPLES): 
+> Wrong triples RDF snippet 
   ``` 
 
   ``` 
-- Expected / corrected RDF outcome snippet (NTRIPLES): 
+> Expected / corrected RDF outcome snippet 
   ``` 
 
   ```
+>Example DBpedia resource URL(s)
+```
 
-### Additional context
-> Add any other context about the problem here.
+```
+> Other
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@
 **Homepage**: http://dbpedia.org <br/>
 **Documentation**: http://dev.dbpedia.org/Extraction  <br/>
 **Get in touch with DBpedia**: https://wiki.dbpedia.org/join/get-in-touch <br/>
-**Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack workspace](https://dbpedia-slack.herokuapp.com/) - the main point for [developement updates](https://github.com/dbpedia/extraction-framework/blob/master/.github/workflows/maven.yml) and discussions <br/>
+**Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) Slack channel within the [DBpedia Slack workspace](https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for development updates and discussions <br/>
 
 ## Contents
 
@@ -61,7 +61,7 @@ The DBpedia extraction framework is structured into different modules
 
 ### Core Module
 
-![http://www4.wiwiss.fu-berlin.de/dbpedia/wiki/DataFlow.png](http://www4.wiwiss.fu-berlin.de/dbpedia/wiki/DataFlow.png "http://www4.wiwiss.fu-berlin.de/dbpedia/wiki/DataFlow.png")
+![Data flow](https://web.archive.org/web/20111109084216/http://www4.wiwiss.fu-berlin.de/dbpedia/wiki/DataFlow.png)
 
 <a name="p27582-10"></a>
 
@@ -76,9 +76,9 @@ The DBpedia extraction framework is structured into different modules
 
 In addition to the core components, a number of utility packages offers essential functionality to be used by the extraction code:
 
-* **Ontology** Classes used to represent an ontology. Methods for both, reading and writing ontologies are provided. All classes are located in the namespace [org.dbpedia.extraction.ontology](tree/master/core/src/main/scala/org/dbpedia/extraction/ontology)
-* **DataParser** Parsers to extract data from nodes in the abstract syntax tree. All classes are located in the namespace [org.dbpedia.extraction.dataparser](tree/master/core/src/main/scala/org/dbpedia/extraction/dataparser)
-* **Util** Various utility classes. All classes are located in the namespace [org.dbpedia.extraction.util](tree/master/core/src/main/scala/org/dbpedia/extraction/util)
+* **Ontology** Classes used to represent an ontology. Methods for both reading and writing ontologies are provided. All classes are located in the namespace `org.dbpedia.extraction.ontology`.
+* **DataParser** Parsers to extract data from nodes in the abstract syntax tree. All classes are located in the namespace `org.dbpedia.extraction.dataparser`.
+* **Util** Various utility classes. All classes are located in the namespace `org.dbpedia.extraction.util`.
 
 <a name="dump-extraction-module"></a>
 ### Dump extraction Module
@@ -104,25 +104,25 @@ Please make sure you have read the Developer's Certificate of Origin, further do
 8. Send a pull request from your branch into `extraction-framework/dev` via GitHub.
-  * In the description, reference the associated commit (for example, _"Fixes #123 by ..."_ for issue number 123).
+  * In the description, reference the associated issue (for example, _"Fixes #123 by ..."_ for issue number 123).
   * Your changes will be reviewed and discussed on GitHub.
-  * In addition, [Travis-CI](http://about.travis-ci.org/) will test if the merged version passes the build.
+  * In addition, [Travis-CI](https://www.travis-ci.com/about-us/) will test whether the merged version passes the build.
   * If there are further changes you need to make, because Travis said the build fails or because somebody caught something you overlooked, go back to item 4. Stay on the same branch (if it is still related to the same issue). GitHub will add the new commits to the same pull request.
   * When everything is fine, your changes will be merged into `extraction-framework/dev`, finally the `dev` together with your improvements will be merged with the `master` branch.
 
 Please keep in mind:
 - Try *not* to modify the indentation. If you want to re-format, use a separate "formatting" commit in which no functionality changes are made.
-- **Never** rebase the master onto a development branch (i.e. _never_ call `rebase` from `extraction-framework/master`). Only rebase your branch onto the dev branch, *if and only if* nobody already pulled from the development branch!
+- **Never** rebase the master onto a development branch (i.e., _never_ call `rebase` from `extraction-framework/master`). Only rebase your branch onto the dev branch, *if and only if* nobody already pulled from the development branch!
 - If you already pushed a branch to GitHub, later rebased the master onto this branch and then tried to push again, GitHub won't let you saying _"To prevent you from losing history, non-fast-forward updates were rejected"_. If _(and only if)_ you are sure that nobody already pulled from this branch, add `--force` to the push command.  
-[_"Don’t rebase branches you have shared with another developer."_](http://www.jarrodspillers.com/2009/08/19/git-merge-vs-git-rebase-avoiding-rebase-hell/)  
-[_"Rebase is awesome, I use rebase exclusively for everything local. Never for anything that I've already pushed."_](http://jeffkreeftmeijer.com/2010/the-magical-and-not-harmful-rebase/#comment-87479247)  
-[_"Never ever rebase a branch that you pushed, or that you pulled from another person_"](http://blog.experimentalworks.net/2009/03/merge-vs-rebase-a-deep-dive-into-the-mysteries-of-revision-control/)
+  - _"[Don’t rebase branches you have shared with another developer.](http://www.jarrodspillers.com/2009/08/19/git-merge-vs-git-rebase-avoiding-rebase-hell/)"_
+  - _"[Rebase is awesome, I use rebase exclusively for everything local. Never for anything that I've already pushed.](http://jeffkreeftmeijer.com/2010/the-magical-and-not-harmful-rebase/#comment-87479247)"_
+  - _"[Never ever rebase a branch that you pushed, or that you pulled from another person](https://web.archive.org/web/20150622064245/http://blog.experimentalworks.net/2009/03/merge-vs-rebase-a-deep-dive-into-the-mysteries-of-revision-control/)"_
 - In general, we prefer Scala over Java.
 
 More tips:
 - Guides to setup your development environment for [Intellij](Setting up IntelliJ IDEA) or [Eclipse](Setting up eclipse).
-- Get help with the [Maven build](Build-from-Source-with-Maven) or another form of [installation](Installation).
-- [Download](Downloads) some data to work with.
-- How to run [from Scala/Java](Run-from-Java-or-Scala) or [from a JAR](Run-from-a-JAR).
-- Having different troubles? Check the [troubleshooting page](Troubleshooting) or post on https://forum.dbpedia.org.
+- Get help with the [Maven build](https://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html) or another form of [installation](https://maven.apache.org/install.html).
+- [Download](https://dumps.wikimedia.org/) some data to work with.
+- How to run [from Scala/Java](https://docs.scala-lang.org/tutorials/scala-with-maven.html) or [from a JAR](https://docs.oracle.com/javase/tutorial/deployment/jar/run.html).
+- Having different troubles? Check the [troubleshooting page](https://maven.apache.org/users/getting-help.html) or post on <https://forum.dbpedia.org>.
 
 ### Important: Developer's Certificate of Origin
 By sending a pull request to the [extraction-framework repository](https://github.com/dbpedia/extraction-framework) on GitHub, you implicitly accept the [Developer's Certificate of Origin 1.1](https://github.com/dbpedia/extraction-framework/blob/master/documentation/DeveloperCertificateOfOrigin.md)

diff --git a/core/doc/HowTo-release-DBpedia.txt b/core/doc/HowTo-release-DBpedia.txt
@@ -22,7 +22,7 @@ release. It might not be complete. Please also consult with the others!
    - Commit the files to the hg repository
    - Don't change the files anymore. The whole extraction should use the same version.
 
- - for AbstractExtractor: insert Wikipedia dumps into a local MySQL database using ...dump.sql.Import.scala
+ - for PlainAbstractExtractor: insert Wikipedia dumps into a local MySQL database using ...dump.sql.Import.scala
    - adjust the LocalSettings.php of mw-modified: specify username+password for the database and the database prefix
    TODO: more in-depth explanations about abstract extraction
 

diff --git a/core/src/main/scala/org/dbpedia/extraction/config/Config.scala b/core/src/main/scala/org/dbpedia/extraction/config/Config.scala
@@ -277,7 +277,8 @@ class Config(val configPath: String) extends
       shortAbstractsProperty = this.getProperty("short-abstracts-property", "rdfs:comment").trim,
       longAbstractsProperty = this.getProperty("long-abstracts-property", "abstract").trim,
       shortAbstractMinLength = this.getProperty("short-abstract-min-length", "200").trim.toInt,
-      abstractTags = this.getProperty("abstract-tags", "query,pages,page,extract").trim
+      abstractTags = this.getProperty("abstract-tags", "query,pages,page,extract").trim,
+      removeBrokenBracketsProperty = this.getProperty("remove-broken-brackets-plain-abstracts", "false").trim.toBoolean
     )
   } match{
     case Success(s) => s
@@ -293,7 +294,8 @@ class Config(val configPath: String) extends
       writeAnchor = this.getProperty("nif-write-anchor", "false").trim.toBoolean,
       writeLinkAnchor = this.getProperty("nif-write-link-anchor", "true").trim.toBoolean,
       abstractsOnly = this.getProperty("nif-extract-abstract-only", "true").trim.toBoolean,
-      cssSelectorMap = this.getClass.getClassLoader.getResource("nifextractionconfig.json") //static config file in core/src/main/resources
+      cssSelectorMap = this.getClass.getClassLoader.getResource("nifextractionconfig.json"), //static config file in core/src/main/resources
+      removeBrokenBracketsProperty = this.getProperty("remove-broken-brackets-html-abstracts", "false").trim.toBoolean
     )
   } match{
     case Success(s) => s
@@ -348,7 +350,8 @@ object Config{
     writeAnchor: Boolean,
     writeLinkAnchor: Boolean,
     abstractsOnly: Boolean,
-    cssSelectorMap: URL
+    cssSelectorMap: URL,
+    removeBrokenBracketsProperty: Boolean
   )
 
   /**
@@ -369,11 +372,12 @@ object Config{
   )
 
   case class AbstractParameters(
-    abstractQuery: String,
-    shortAbstractsProperty: String,
-    longAbstractsProperty: String,
-    shortAbstractMinLength: Int,
-    abstractTags: String
+                                 abstractQuery: String,
+                                 shortAbstractsProperty: String,
+                                 longAbstractsProperty: String,
+                                 shortAbstractMinLength: Int,
+                                 abstractTags: String,
+                                 removeBrokenBracketsProperty: Boolean
   )
 
   case class SlackCredentials(
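
Review note: both new flags reuse the file's existing `getProperty(key, default).trim.toBoolean` lookup. The sketch below illustrates that lookup in isolation. It assumes a plain `java.util.Properties` in place of the real `Config` class, and `BrokenBracketsConfigSketch` and `readFlag` are hypothetical names that do not exist in the repository.

```scala
import java.util.Properties

// Hypothetical stand-in for Config: only the flag lookup from the diff is shown.
object BrokenBracketsConfigSketch {

  // Same pattern as in the diff: default "false", trim, then parse as Boolean.
  def readFlag(props: Properties, key: String): Boolean =
    props.getProperty(key, "false").trim.toBoolean

  // The two property keys introduced by this change.
  val plainKey = "remove-broken-brackets-plain-abstracts"
  val htmlKey  = "remove-broken-brackets-html-abstracts"
}
```

Under this pattern an absent key yields `false`, so existing extraction property files keep their current behaviour. A value other than `true` or `false` makes `toBoolean` throw an `IllegalArgumentException`.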

diff --git a/...mappings/AbstractExtractorWikipedia.scala → ...tion/mappings/HtmlAbstractExtractor.scala b/...mappings/AbstractExtractorWikipedia.scala → ...tion/mappings/HtmlAbstractExtractor.scala
@@ -13,7 +13,7 @@ import scala.language.reflectiveCalls
  * Created: 5/19/14 9:21 AM
  */
 
-class AbstractExtractorWikipedia(
+class HtmlAbstractExtractor(
   context : {
     def ontology : Ontology
     def language : Language

diff --git a/core/src/main/scala/org/dbpedia/extraction/mappings/MissingAbstractsExtractor.scala b/core/src/main/scala/org/dbpedia/extraction/mappings/MissingAbstractsExtractor.scala
@@ -54,7 +54,7 @@ extends PageNodeExtractor
 
     private val language = context.language.wikiCode
 
-    private val logger = Logger.getLogger(classOf[AbstractExtractor].getName)
+    private val logger = Logger.getLogger(classOf[PlainAbstractExtractor].getName)
 
     //private val apiParametersFormat = "uselang="+language+"&format=xml&action=parse&prop=text&title=%s&text=%s"
     private val apiParametersFormat = "uselang="+language+"&format=xml&action=query&prop=extracts&exintro=&explaintext=&titles=%s"

diff --git a/core/src/main/scala/org/dbpedia/extraction/mappings/NifExtractor.scala b/core/src/main/scala/org/dbpedia/extraction/mappings/NifExtractor.scala
@@ -14,7 +14,7 @@ import scala.language.reflectiveCalls
 /**
   * Extracts page html.
   *
-  * Based on AbstractExtractor, major difference is the parameter
+  * Based on PlainAbstractExtractor, major difference is the parameter
   * apiParametersFormat = "action=parse&prop=text&section=0&format=xml&page=%s"
   *
   * This class produces all nif related datasets for the abstract as well as the short-, long-abstracts datasets.
@@ -69,7 +69,7 @@ class NifExtractor(
 
 object NifExtractor{
   //TODO check if this function is still relevant
-  //copied from AbstractExtractor
+  //copied from PlainAbstractExtractor
   def postProcessExtractedHtml(pageTitle: WikiTitle, text: String): String =
   {
     val startsWithLowercase =

diff --git a/...traction/mappings/AbstractExtractor.scala → ...ion/mappings/PlainAbstractExtractor.scala b/...traction/mappings/AbstractExtractor.scala → ...ion/mappings/PlainAbstractExtractor.scala
@@ -1,13 +1,13 @@
 package org.dbpedia.extraction.mappings
 
 import java.util.logging.Logger
-
 import org.dbpedia.extraction.annotations.ExtractorAnnotation
 import org.dbpedia.extraction.config.Config
 import org.dbpedia.extraction.config.provenance.DBpediaDatasets
 import org.dbpedia.extraction.ontology.Ontology
 import org.dbpedia.extraction.transform.{Quad, QuadBuilder}
-import org.dbpedia.extraction.util.{Language, MediaWikiConnector}
+import org.dbpedia.extraction.util.abstracts.AbstractUtils
+import org.dbpedia.extraction.util.{Language, MediaWikiConnector, WikiUtil}
 import org.dbpedia.extraction.wikiparser._
 
 import scala.language.reflectiveCalls
@@ -30,7 +30,7 @@ import scala.language.reflectiveCalls
 
 @deprecated("replaced by NifExtractor.scala: which will extract the whole page content including the abstract", "2016-10")
 @ExtractorAnnotation("abstract extractor")
-class AbstractExtractor(
+class PlainAbstractExtractor(
   context : {
     def ontology : Ontology
     def language : Language
@@ -39,7 +39,7 @@ class AbstractExtractor(
 )
 extends WikiPageExtractor
 {
-  protected val logger = Logger.getLogger(classOf[AbstractExtractor].getName)
+  protected val logger = Logger.getLogger(classOf[PlainAbstractExtractor].getName)
   this.getClass.getClassLoader.getResource("myproperties.properties")
 
 
@@ -50,6 +50,8 @@ extends WikiPageExtractor
     //private val apiParametersFormat = "uselang="+language+"&format=xml&action=parse&prop=text&title=%s&text=%s"
   protected val apiParametersFormat = context.configFile.abstractParameters.abstractQuery
 
+  protected val removeBrokenBrackets = context.configFile.abstractParameters.removeBrokenBracketsProperty
+
     // lazy so testing does not need ontology
   protected lazy val shortProperty = context.ontology.properties(context.configFile.abstractParameters.shortAbstractsProperty)
 
@@ -63,7 +65,6 @@ extends WikiPageExtractor
 
   private val mwConnector = new MediaWikiConnector(context.configFile.mediawikiConnection, context.configFile.abstractParameters.abstractTags.split(","))
 
-
     override def extract(pageNode : WikiPage, subjectUri: String): Seq[Quad] =
     {
         //Only extract abstracts for pages from the Main namespace
@@ -79,16 +80,22 @@ extends WikiPageExtractor
         // if(abstractWikiText == "") return Seq.empty
 
         //Retrieve page text
-        val text = mwConnector.retrievePage(pageNode.title, apiParametersFormat, pageNode.isRetry) match{
-          case Some(t) => AbstractExtractor.postProcessExtractedHtml(pageNode.title, replacePatterns(t))
+        val text = mwConnector.retrievePage(pageNode.title, apiParametersFormat, pageNode.isRetry) match {
+          case Some(t) => PlainAbstractExtractor.postProcessExtractedHtml(pageNode.title, replacePatterns(t))
           case None => return Seq.empty
         }
 
+        val modifiedText = if (removeBrokenBrackets) {
+          AbstractUtils.removeBrokenBracketsInAbstracts(text)
+        } else {
+          text
+        }
+
         //Create a short version of the abstract
-        val shortText = short(text)
+        val shortText = short(modifiedText)
 
         //Create statements
-        val quadLong = longQuad(pageNode.uri, text, pageNode.sourceIri)
+        val quadLong = longQuad(pageNode.uri, modifiedText, pageNode.sourceIri)
         val quadShort = shortQuad(pageNode.uri, shortText, pageNode.sourceIri)
 
         if (shortText.isEmpty)
@@ -140,7 +147,7 @@ extends WikiPageExtractor
 
     private def replacePatterns(abst: String): String= {
       var ret = abst
-      for ((regex, replacement) <- AbstractExtractor.patternsToRemove) {
+      for ((regex, replacement) <- PlainAbstractExtractor.patternsToRemove) {
         val matches = regex.pattern.matcher(ret)
         if (matches.find()) {
           ret = matches.replaceAll(replacement)
@@ -205,15 +212,15 @@ extends WikiPageExtractor
                 .filter(renderNode)
                 .map(_.toWikiText)
                 .mkString("").trim
-        
+
         // decode HTML entities - the result is plain text
         decodeHtml(text)
     }
     */
 
 }
 
-object AbstractExtractor {
+object PlainAbstractExtractor {
 
   //TODO check if this function is still relevant
   def postProcessExtractedHtml(pageTitle: WikiTitle, text: String): String =
@@ -243,6 +250,7 @@ object AbstractExtractor {
 
   val patternsToRemove = List(
     """<div style=[^/]*/>""".r -> " ",
-    """</div>""".r -> " "
+    """</div>""".r -> " ",
+    """<normalized>.*<\/normalized>""".r -> ""
   )
 }