21 commits
c03cd94
added removing broken information parentheses e.g (; born November 20…
jlareck May 23, 2021
eaecc68
fixed removing brackets in abstracts
jlareck May 24, 2021
110616a
added removing broken paranthesis
jlareck May 27, 2021
fa4ba54
removed normalized tags
jlareck May 27, 2021
e5f5c5a
implemented shacl test for abstracts
jlareck May 27, 2021
d0b9a27
made choosing the removing of brackets in properties file
jlareck May 27, 2021
33c43b5
Merge pull request #698 from jlareck/dbpedia-abstracts
Vehnem Jun 1, 2021
3135600
Github action minidumpdoc update
Vehnem Jun 1, 2021
f88669a
Update dbp_abstract.ttl
kurzum Jun 7, 2021
5dcc0a4
Implement construct validation tests selection (#704)
jlareck Jun 15, 2021
c2843c6
Fix abstract extraction (#705)
jlareck Jun 24, 2021
d81fe9e
Implement handling of right and left validators (#706)
jlareck Jun 28, 2021
078e419
implement unit test for removing bracket method and fixed construct v…
jlareck Jul 8, 2021
6f2c149
add more unit tests for removing broken brackets function
jlareck Jul 12, 2021
9ffe0ef
move remove brackets function to AbstractUtils
jlareck Jul 13, 2021
3b9822a
move AbstractUtils to abstracts package and implement test for issue …
jlareck Aug 16, 2021
570e0d9
implement construct validation test for issue #617
jlareck Aug 24, 2021
41678fd
fix build error and create a test for issue #598
jlareck Sep 2, 2021
22d089e
Fix merging multiple infoboxes (#710)
mubashar1199 Sep 17, 2021
7464414
Github action minidumpdoc update
mubashar1199 Sep 17, 2021
b88ab6b
Fixing broken links in README (#772)
tech0priyanshu Feb 19, 2025
57 changes: 17 additions & 40 deletions .github/ISSUE_TEMPLATE/data.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,55 +7,32 @@ assignees: ''

---

# Issue still valid?
> DBpedia updates frequently in this order: 1. DIEF software (extracts data from wikidata), 2. monthly dumps, 3. online services loaded from dumps.
> We update http://dief.tools.dbpedia.org/server/extraction/ on a daily basis from the git and it reflects the current state.
>
> **Disclaimer:** The public SPARQL endpoints (e.g., http://dbpedia.org/sparql) and other applications build based on DBpedia's data are not in sync yet with the latest monthly extracted data.
>
> Therefore, you can use this tool to extract an example page and check if the error persists in the latest software version, and add the link you used for verification, e.g., http://dief.tools.dbpedia.org/server/extraction/en/extract?title=United+States
# Issue validity
> Some explanation: DBpedia Snapshot is produced every three months, see [Release Frequency & Schedule](https://www.dbpedia.org/blog/snapshot-2021-06-release/#anchor1), which is loaded into http://dbpedia.org/sparql . During these three months, Wikipedia changes and also the DBpedia Information Extraction Framework receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/ we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. `Berlin` or `Joe_Biden` here: http://dief.tools.dbpedia.org/server/extraction/en/

⚠️ Potential issue | 🟡 Minor

Wrap bare URLs in angle brackets or Markdown links.

Multiple bare URLs appear in the template. For better Markdown compliance and clickability, wrap them in angle brackets <URL> or use Markdown link syntax [text](URL).

Based on static analysis hints.

Apply this diff:

-> Some explanation: DBpedia Snapshot is produced every three months, see [Release Frequency & Schedule](https://www.dbpedia.org/blog/snapshot-2021-06-release/#anchor1), which is loaded into http://dbpedia.org/sparql . During these three months, Wikipedia changes and also the DBpedia Information Extraction Framework receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/  we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. `Berlin` or `Joe_Biden` here: http://dief.tools.dbpedia.org/server/extraction/en/
+> Some explanation: DBpedia Snapshot is produced every three months, see [Release Frequency & Schedule](https://www.dbpedia.org/blog/snapshot-2021-06-release/#anchor1), which is loaded into <http://dbpedia.org/sparql>. During these three months, Wikipedia changes and also the DBpedia Information Extraction Framework receives patches. At <http://dief.tools.dbpedia.org/server/extraction/en/> we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. `Berlin` or `Joe_Biden` here: <http://dief.tools.dbpedia.org/server/extraction/en/>
-* Web/SPARQL, e.g. http://dbpedia.org/sparql or http://dbpedia.org/resource/Berlin, please **provide query or link**
+* Web/SPARQL, e.g. <http://dbpedia.org/sparql> or <http://dbpedia.org/resource/Berlin>, please **provide query or link**

Also applies to: 19-19

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

11-11: Bare URL used

(MD034, no-bare-urls)


11-11: Bare URL used

(MD034, no-bare-urls)


11-11: Bare URL used

(MD034, no-bare-urls)

🤖 Prompt for AI Agents
In .github/ISSUE_TEMPLATE/data.md around lines 11 and 19, there are multiple
bare URLs that should be wrapped for proper Markdown rendering; replace each
bare URL (e.g. https://www.dbpedia.org/blog/snapshot-2021-06-release/#anchor1,
http://dbpedia.org/sparql, http://dief.tools.dbpedia.org/server/extraction/en/)
with either angle-bracketed form <URL> or convert to Markdown links like
[Release Frequency &
Schedule](https://www.dbpedia.org/blog/snapshot-2021-06-release/#anchor1) and
similarly for the SPARQL and DIEF extractor URLs so they render and are
clickable.

> If the issue persists, please post the link from your browser here:

# Source
> Where did you find the data issue? Pick one, remove the others.

### Web / SPARQL
> State the service (e.g. http://dbpedia.org/sparql) and the SPARQL query
> give a link to the web / linked data pages (e.g. http://dbpedia.org/resource/Berlin)

### Release Dumps
> DBpedia provides monthly release dumps, cf. release-dashboard.dbpedia.org
> provide artifact & version or download link

### Running the DBpedia Extraction (DIEF) software
> Please include all necessary information.


# Classification
> If you have some familiarity with DBpedia, please use the classification tags at (link) to correctly file this issue. Otherwise skip this step.



### Error Description
# Error Description
> Please state the nature of your technical emergency:

# Pinpointing the source of the error
> Where did you find the data issue? Non-exhaustive options are:
* Web/SPARQL, e.g. http://dbpedia.org/sparql or http://dbpedia.org/resource/Berlin, please **provide query or link**
* Dumps: dumps are managed by the Databus. Please **provide artifact & version or download link**
* DIEF: you ran the software and the error occurred then, please **include all necessary information such as the extractor or log**. If you had problems running the software, use [another issue template](https://github.com/dbpedia/extraction-framework/issues/new/choose)

### Error specification
> Pick the appropriate:
# Details
> please post the details

⚠️ Potential issue | 🟡 Minor

Remove blank line inside blockquote.

Line 25 contains a blank line within a blockquote, which breaks Markdown formatting. Remove the blank line or replace it with > to maintain the blockquote.

Based on static analysis hints.

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

25-25: Blank line inside blockquote

(MD028, no-blanks-blockquote)

🤖 Prompt for AI Agents
.github/ISSUE_TEMPLATE/data.md around line 25: there is a blank line inside a
blockquote which breaks Markdown rendering; remove the blank line (or replace it
with a '>' prefix) so the blockquote lines are contiguous and properly
formatted, ensuring the blockquote marker is present on the next line if you
want an empty quoted line.

- Affected extraction artifacts (Databus artifact version or file identifiers):
- https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects/mappingbased-objects_lang=en_disjointDomain.ttl.bz2
-
- Example DBpedia resource URL(s) having the error (one full IRI per line):
- http://dbpedia.org/resource/Leipzig
-
- Erroneous triples RDF snippet (NTRIPLES):
> Wrong triples RDF snippet
```

```
- Expected / corrected RDF outcome snippet (NTRIPLES):
> Expected / corrected RDF outcome snippet
```

```
Comment on lines 27 to 33

⚠️ Potential issue | 🟡 Minor

Add language specifiers to fenced code blocks.

Fenced code blocks should specify a language for proper syntax highlighting. For RDF/SPARQL snippets, use turtle or sparql; for generic examples, use text.

Based on static analysis hints.

Apply this diff:

 > Wrong triples RDF snippet 
-  ``` 
+  ```turtle
   

Expected / corrected RDF outcome snippet

Example DBpedia resource URL(s)
- +text

- +



Also applies to: 35-37

<details>
<summary>🧰 Tools</summary>

<details>
<summary>🪛 markdownlint-cli2 (0.18.1)</summary>

27-27: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

31-31: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

.github/ISSUE_TEMPLATE/data.md around lines 27-33 (and similarly lines 35-37):
fenced code blocks lack language specifiers; update the opening triple-backtick
lines to include appropriate languages (use turtle for RDF/SPARQL snippets and text for generic examples) so the blocks become turtle or text
respectively, keeping the closing ``` unchanged.


</details>

<!-- This is an auto-generated comment by CodeRabbit -->

>Example DBpedia resource URL(s)
```

### Additional context
> Add any other context about the problem here.
```
> Other
26 changes: 13 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
**Homepage**: http://dbpedia.org <br/>
**Documentation**: http://dev.dbpedia.org/Extraction <br/>
**Get in touch with DBpedia**: https://wiki.dbpedia.org/join/get-in-touch <br/>
**Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack workspace](https://dbpedia-slack.herokuapp.com/) - the main point for [developement updates](https://github.com/dbpedia/extraction-framework/blob/master/.github/workflows/maven.yml) and discussions <br/>
**Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack workspace]( https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for developement updates and discussions <br/>

Suggested change
**Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack workspace]( https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for developement updates and discussions <br/>
**Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) Slack channel within the the [DBpedia Slack workspace](https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) the main point for development updates and discussions <br/>


## Contents

Expand Down Expand Up @@ -61,7 +61,7 @@ The DBpedia extraction framework is structured into different modules

### Core Module

![http://www4.wiwiss.fu-berlin.de/dbpedia/wiki/DataFlow.png](http://www4.wiwiss.fu-berlin.de/dbpedia/wiki/DataFlow.png "http://www4.wiwiss.fu-berlin.de/dbpedia/wiki/DataFlow.png")
![Data flow](https://web.archive.org/web/20111109084216/http://www4.wiwiss.fu-berlin.de/dbpedia/wiki/DataFlow.png)

<a name="p27582-10"></a>

Expand All @@ -76,9 +76,9 @@ The DBpedia extraction framework is structured into different modules

In addition to the core components, a number of utility packages offers essential functionality to be used by the extraction code:

* **Ontology** Classes used to represent an ontology. Methods for both, reading and writing ontologies are provided. All classes are located in the namespace [org.dbpedia.extraction.ontology](tree/master/core/src/main/scala/org/dbpedia/extraction/ontology)
* **DataParser** Parsers to extract data from nodes in the abstract syntax tree. All classes are located in the namespace [org.dbpedia.extraction.dataparser](tree/master/core/src/main/scala/org/dbpedia/extraction/dataparser)
* **Util** Various utility classes. All classes are located in the namespace [org.dbpedia.extraction.util](tree/master/core/src/main/scala/org/dbpedia/extraction/util)
* **Ontology** Classes used to represent an ontology. Methods for both, reading and writing ontologies are provided. All classes are located in the namespace `org.dbpedia.extraction.ontology`.
* **DataParser** Parsers to extract data from nodes in the abstract syntax tree. All classes are located in the namespace `org.dbpedia.extraction.dataparser`.
* **Util** Various utility classes. All classes are located in the namespace `org.dbpedia.extraction.util`.
Comment on lines +79 to +81

Suggested change
* **Ontology** Classes used to represent an ontology. Methods for both, reading and writing ontologies are provided. All classes are located in the namespace `org.dbpedia.extraction.ontology`.
* **DataParser** Parsers to extract data from nodes in the abstract syntax tree. All classes are located in the namespace `org.dbpedia.extraction.dataparser`.
* **Util** Various utility classes. All classes are located in the namespace `org.dbpedia.extraction.util`.
* **Ontology** Classes used to represent an ontology. Methods for both reading and writing ontologies are provided. All classes are located in the namespace `org.dbpedia.extraction.ontology`.
* **DataParser** Parsers to extract data from nodes in the abstract syntax tree. All classes are located in the namespace `org.dbpedia.extraction.dataparser`.
* **Util** Various utility classes. All classes are located in the namespace `org.dbpedia.extraction.util`.


<a name="dump-extraction-module"></a>
### Dump extraction Module
Expand All @@ -104,25 +104,25 @@ Please make sure you have read the Developer's Certificate of Origin, further do
8. Send a pull request from your branch into `extraction-framework/dev` via GitHub.
* In the description, reference the associated commit (for example, _"Fixes #123 by ..."_ for issue number 123).

Suggested change
* In the description, reference the associated commit (for example, _"Fixes #123 by ..."_ for issue number 123).
* In the description, reference the associated issue (for example, _"Fixes #123 by ..."_ for issue number 123).

* Your changes will be reviewed and discussed on GitHub.
* In addition, [Travis-CI](http://about.travis-ci.org/) will test if the merged version passes the build.
* In addition, [Travis-CI](https://www.travis-ci.com/about-us/) will test if the merged version passes the build.

Suggested change
* In addition, [Travis-CI](https://www.travis-ci.com/about-us/) will test if the merged version passes the build.
* In addition, [Travis-CI](https://www.travis-ci.com/about-us/) will test whether the merged version passes the build.

* If there are further changes you need to make, because Travis said the build fails or because somebody caught something you overlooked, go back to item 4. Stay on the same branch (if it is still related to the same issue). GitHub will add the new commits to the same pull request.
* When everything is fine, your changes will be merged into `extraction-framework/dev`, finally the `dev` together with your improvements will be merged with the `master` branch.

Please keep in mind:
- Try *not* to modify the indentation. If you want to re-format, use a separate "formatting" commit in which no functionality changes are made.
- **Never** rebase the master onto a development branch (i.e. _never_ call `rebase` from `extraction-framework/master`). Only rebase your branch onto the dev branch, *if and only if* nobody already pulled from the development branch!

Suggested change
- **Never** rebase the master onto a development branch (i.e. _never_ call `rebase` from `extraction-framework/master`). Only rebase your branch onto the dev branch, *if and only if* nobody already pulled from the development branch!
- **Never** rebase the master onto a development branch (i.e., _never_ call `rebase` from `extraction-framework/master`). Only rebase your branch onto the dev branch, *if and only if* nobody already pulled from the development branch!

- If you already pushed a branch to GitHub, later rebased the master onto this branch and then tried to push again, GitHub won't let you saying _"To prevent you from losing history, non-fast-forward updates were rejected"_. If _(and only if)_ you are sure that nobody already pulled from this branch, add `--force` to the push command.
[_"Don’t rebase branches you have shared with another developer."_](http://www.jarrodspillers.com/2009/08/19/git-merge-vs-git-rebase-avoiding-rebase-hell/)
[_"Rebase is awesome, I use rebase exclusively for everything local. Never for anything that I've already pushed."_](http://jeffkreeftmeijer.com/2010/the-magical-and-not-harmful-rebase/#comment-87479247)
[_"Never ever rebase a branch that you pushed, or that you pulled from another person_"](http://blog.experimentalworks.net/2009/03/merge-vs-rebase-a-deep-dive-into-the-mysteries-of-revision-control/)
- _"[Don’t rebase branches you have shared with another developer.](http://www.jarrodspillers.com/2009/08/19/git-merge-vs-git-rebase-avoiding-rebase-hell/)"_
- _"[Rebase is awesome, I use rebase exclusively for everything local. Never for anything that I've already pushed.](http://jeffkreeftmeijer.com/2010/the-magical-and-not-harmful-rebase/#comment-87479247)"_
- _"[Never ever rebase a branch that you pushed, or that you pulled from another person](https://web.archive.org/web/20150622064245/http://blog.experimentalworks.net/2009/03/merge-vs-rebase-a-deep-dive-into-the-mysteries-of-revision-control/)"_
- In general, we prefer Scala over Java.

More tips:
- Guides to setup your development environment for [Intellij](Setting up IntelliJ IDEA) or [Eclipse](Setting up eclipse).
- Get help with the [Maven build](Build-from-Source-with-Maven) or another form of [installation](Installation).
- [Download](Downloads) some data to work with.
- How to run [from Scala/Java](Run-from-Java-or-Scala) or [from a JAR](Run-from-a-JAR).
- Having different troubles? Check the [troubleshooting page](Troubleshooting) or post on https://forum.dbpedia.org.
- Get help with the [Maven build](https://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html) or another form of [installation](https://maven.apache.org/install.html).
- [Download](https://dumps.wikimedia.org/) some data to work with.
- How to run [from Scala/Java](https://docs.scala-lang.org/tutorials/scala-with-maven.html) or [from a JAR](https://docs.oracle.com/javase/tutorial/deployment/jar/run.html).
- Having different troubles? Check the [troubleshooting page](https://maven.apache.org/users/getting-help.html) or post on https://forum.dbpedia.org.

⚠️ Potential issue | 🟡 Minor

Wrap bare URL in angle brackets or Markdown link.

Line 125 contains a bare URL (https://forum.dbpedia.org) that should be wrapped for better Markdown compliance.

Based on static analysis hints.

Apply this diff:

-- Having different troubles? Check the [troubleshooting page](https://maven.apache.org/users/getting-help.html) or post on https://forum.dbpedia.org.
+- Having different troubles? Check the [troubleshooting page](https://maven.apache.org/users/getting-help.html) or post on <https://forum.dbpedia.org>.
📝 Committable suggestion

Suggested change
- Having different troubles? Check the [troubleshooting page](https://maven.apache.org/users/getting-help.html) or post on https://forum.dbpedia.org.
- Having different troubles? Check the [troubleshooting page](https://maven.apache.org/users/getting-help.html) or post on <https://forum.dbpedia.org>.
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

125-125: Bare URL used

(MD034, no-bare-urls)

🤖 Prompt for AI Agents
In README.md around line 125, there's a bare URL (https://forum.dbpedia.org)
that needs to be wrapped for proper Markdown formatting; replace the bare URL
with either a Markdown link text like [DBpedia forum](https://forum.dbpedia.org)
or wrap it in angle brackets <https://forum.dbpedia.org> so the link is
rendered/clickable and compliant with Markdown linting.


### Important: Developer's Certificate of Origin
By sending a pull request to the [extraction-framework repository](https://github.com/dbpedia/extraction-framework) on GitHub, you implicitly accept the [Developer's Certificate of Origin 1.1](https://github.com/dbpedia/extraction-framework/blob/master/documentation/DeveloperCertificateOfOrigin.md)
Expand Down
2 changes: 1 addition & 1 deletion core/doc/HowTo-release-DBpedia.txt
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ release. It might not be complete. Please also consult with the others!
- Commit the files to the hg repository
- Don't change the files anymore. The whole extraction should use the same version.

- for AbstractExtractor: insert Wikipedia dumps into a local MySQL database using ...dump.sql.Import.scala
- for PlainAbstractExtractor: insert Wikipedia dumps into a local MySQL database using ...dump.sql.Import.scala
- adjust the LocalSettings.php of mw-modified: specify username+password for the database and the database prefix
TODO: more in-depth explanations about abstract extraction

Expand Down
20 changes: 12 additions & 8 deletions core/src/main/scala/org/dbpedia/extraction/config/Config.scala
Original file line number Diff line number Diff line change
Expand Up @@ -277,7 +277,8 @@ class Config(val configPath: String) extends
shortAbstractsProperty = this.getProperty("short-abstracts-property", "rdfs:comment").trim,
longAbstractsProperty = this.getProperty("long-abstracts-property", "abstract").trim,
shortAbstractMinLength = this.getProperty("short-abstract-min-length", "200").trim.toInt,
abstractTags = this.getProperty("abstract-tags", "query,pages,page,extract").trim
abstractTags = this.getProperty("abstract-tags", "query,pages,page,extract").trim,
removeBrokenBracketsProperty = this.getProperty("remove-broken-brackets-plain-abstracts", "false").trim.toBoolean
)
} match{
case Success(s) => s
Expand All @@ -293,7 +294,8 @@ class Config(val configPath: String) extends
writeAnchor = this.getProperty("nif-write-anchor", "false").trim.toBoolean,
writeLinkAnchor = this.getProperty("nif-write-link-anchor", "true").trim.toBoolean,
abstractsOnly = this.getProperty("nif-extract-abstract-only", "true").trim.toBoolean,
cssSelectorMap = this.getClass.getClassLoader.getResource("nifextractionconfig.json") //static config file in core/src/main/resources
cssSelectorMap = this.getClass.getClassLoader.getResource("nifextractionconfig.json"), //static config file in core/src/main/resources
removeBrokenBracketsProperty = this.getProperty("remove-broken-brackets-html-abstracts", "false").trim.toBoolean
)
} match{
case Success(s) => s
Expand Down Expand Up @@ -348,7 +350,8 @@ object Config{
writeAnchor: Boolean,
writeLinkAnchor: Boolean,
abstractsOnly: Boolean,
cssSelectorMap: URL
cssSelectorMap: URL,
removeBrokenBracketsProperty: Boolean
)

/**
Expand All @@ -369,11 +372,12 @@ object Config{
)

case class AbstractParameters(
abstractQuery: String,
shortAbstractsProperty: String,
longAbstractsProperty: String,
shortAbstractMinLength: Int,
abstractTags: String
abstractQuery: String,
shortAbstractsProperty: String,
longAbstractsProperty: String,
shortAbstractMinLength: Int,
abstractTags: String,
removeBrokenBracketsProperty: Boolean
)

case class SlackCredentials(
Expand Down
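The new `removeBrokenBracketsProperty` fields follow the same `getProperty(key, default).trim.toBoolean` pattern as the neighbouring settings. A minimal sketch of that pattern, assuming a plain `java.util.Properties` source; `BracketFlagSketch`, `AbstractFlags`, and `readFlags` are hypothetical names for illustration and are not part of `Config.scala`:

```scala
import java.util.Properties
import scala.util.Try

// Hypothetical stand-in for the relevant part of Config; not the real class.
object BracketFlagSketch {
  final case class AbstractFlags(removeBrokenBrackets: Boolean)

  // Reads the flag with a default of "false", trimming whitespace before parsing.
  // String.toBoolean throws on anything other than "true"/"false" (case-insensitive),
  // so the whole read is wrapped in Try, as the surrounding Config code does.
  def readFlags(props: Properties): Try[AbstractFlags] = Try {
    AbstractFlags(
      removeBrokenBrackets =
        props.getProperty("remove-broken-brackets-plain-abstracts", "false").trim.toBoolean
    )
  }
}
```

With this shape, a missing key leaves bracket removal disabled, and a malformed value such as `yes` surfaces as a `Failure` instead of silently defaulting.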
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ import scala.language.reflectiveCalls
* Created: 5/19/14 9:21 AM
*/

class AbstractExtractorWikipedia(
class HtmlAbstractExtractor(
context : {
def ontology : Ontology
def language : Language
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@ extends PageNodeExtractor

private val language = context.language.wikiCode

private val logger = Logger.getLogger(classOf[AbstractExtractor].getName)
private val logger = Logger.getLogger(classOf[PlainAbstractExtractor].getName)

//private val apiParametersFormat = "uselang="+language+"&format=xml&action=parse&prop=text&title=%s&text=%s"
private val apiParametersFormat = "uselang="+language+"&format=xml&action=query&prop=extracts&exintro=&explaintext=&titles=%s"
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ import scala.language.reflectiveCalls
/**
* Extracts page html.
*
* Based on AbstractExtractor, major difference is the parameter
* Based on PlainAbstractExtractor, major difference is the parameter
* apiParametersFormat = "action=parse&prop=text&section=0&format=xml&page=%s"
*
* This class produces all nif related datasets for the abstract as well as the short-, long-abstracts datasets.
Expand Down Expand Up @@ -69,7 +69,7 @@ class NifExtractor(

object NifExtractor{
//TODO check if this function is still relevant
//copied from AbstractExtractor
//copied from PlainAbstractExtractor
def postProcessExtractedHtml(pageTitle: WikiTitle, text: String): String =
{
val startsWithLowercase =
Expand Down
Original file line number Diff line number Diff line change
@@ -1,13 +1,13 @@
package org.dbpedia.extraction.mappings

import java.util.logging.Logger

import org.dbpedia.extraction.annotations.ExtractorAnnotation
import org.dbpedia.extraction.config.Config
import org.dbpedia.extraction.config.provenance.DBpediaDatasets
import org.dbpedia.extraction.ontology.Ontology
import org.dbpedia.extraction.transform.{Quad, QuadBuilder}
import org.dbpedia.extraction.util.{Language, MediaWikiConnector}
import org.dbpedia.extraction.util.abstracts.AbstractUtils
import org.dbpedia.extraction.util.{Language, MediaWikiConnector, WikiUtil}
import org.dbpedia.extraction.wikiparser._

import scala.language.reflectiveCalls
Expand All @@ -30,7 +30,7 @@ import scala.language.reflectiveCalls

@deprecated("replaced by NifExtractor.scala: which will extract the whole page content including the abstract", "2016-10")
@ExtractorAnnotation("abstract extractor")
class AbstractExtractor(
class PlainAbstractExtractor(
context : {
def ontology : Ontology
def language : Language
Expand All @@ -39,7 +39,7 @@ class AbstractExtractor(
)
extends WikiPageExtractor
{
protected val logger = Logger.getLogger(classOf[AbstractExtractor].getName)
protected val logger = Logger.getLogger(classOf[PlainAbstractExtractor].getName)
this.getClass.getClassLoader.getResource("myproperties.properties")


Expand All @@ -50,6 +50,8 @@ extends WikiPageExtractor
//private val apiParametersFormat = "uselang="+language+"&format=xml&action=parse&prop=text&title=%s&text=%s"
protected val apiParametersFormat = context.configFile.abstractParameters.abstractQuery

protected val removeBrokenBrackets = context.configFile.abstractParameters.removeBrokenBracketsProperty

// lazy so testing does not need ontology
protected lazy val shortProperty = context.ontology.properties(context.configFile.abstractParameters.shortAbstractsProperty)

Expand All @@ -63,7 +65,6 @@ extends WikiPageExtractor

private val mwConnector = new MediaWikiConnector(context.configFile.mediawikiConnection, context.configFile.abstractParameters.abstractTags.split(","))


override def extract(pageNode : WikiPage, subjectUri: String): Seq[Quad] =
{
//Only extract abstracts for pages from the Main namespace
Expand All @@ -79,16 +80,22 @@ extends WikiPageExtractor
// if(abstractWikiText == "") return Seq.empty

//Retrieve page text
val text = mwConnector.retrievePage(pageNode.title, apiParametersFormat, pageNode.isRetry) match{
case Some(t) => AbstractExtractor.postProcessExtractedHtml(pageNode.title, replacePatterns(t))
val text = mwConnector.retrievePage(pageNode.title, apiParametersFormat, pageNode.isRetry) match {
case Some(t) => PlainAbstractExtractor.postProcessExtractedHtml(pageNode.title, replacePatterns(t))
case None => return Seq.empty
}

val modifiedText = if (removeBrokenBrackets) {
AbstractUtils.removeBrokenBracketsInAbstracts(text)
} else {
text
}

//Create a short version of the abstract
val shortText = short(text)
val shortText = short(modifiedText)

//Create statements
val quadLong = longQuad(pageNode.uri, text, pageNode.sourceIri)
val quadLong = longQuad(pageNode.uri,modifiedText, pageNode.sourceIri)
val quadShort = shortQuad(pageNode.uri, shortText, pageNode.sourceIri)

if (shortText.isEmpty)
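The hunk above applies `AbstractUtils.removeBrokenBracketsInAbstracts` only when the flag is set, and the diff does not show that function's body. The sketch below illustrates the kind of cleanup the commit messages describe (for example `(; born November 20…`). The two rules and the names `BrokenBracketSketch`, `cleanBrackets`, and `prepare` are assumptions for illustration, not the actual implementation:

```scala
// Hypothetical sketch; the real AbstractUtils logic is not shown in this diff.
object BrokenBracketSketch {
  // Assumed rule 1: parentheses holding only whitespace or separators, e.g. "( )" or "(; )".
  private val emptyParens = """\s*\([\s,;:]*\)""".r
  // Assumed rule 2: stray separators right after an opening parenthesis, e.g. "(; born".
  private val leadingSeparators = """\(\s*[,;:]+\s*""".r

  def cleanBrackets(text: String): String = {
    val withoutEmpty = emptyParens.replaceAllIn(text, "")
    leadingSeparators.replaceAllIn(withoutEmpty, "(")
  }

  // Mirrors the conditional in extract: clean only when the flag is enabled.
  def prepare(text: String, enabled: Boolean): String =
    if (enabled) cleanBrackets(text) else text
}
```

Keeping the cleanup behind a boolean, as the PR does, means existing extractions are unchanged unless the property is switched on.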
Expand Down Expand Up @@ -140,7 +147,7 @@ extends WikiPageExtractor

private def replacePatterns(abst: String): String= {
var ret = abst
for ((regex, replacement) <- AbstractExtractor.patternsToRemove) {
for ((regex, replacement) <- PlainAbstractExtractor.patternsToRemove) {
val matches = regex.pattern.matcher(ret)
if (matches.find()) {
ret = matches.replaceAll(replacement)
Expand Down Expand Up @@ -205,15 +212,15 @@ extends WikiPageExtractor
.filter(renderNode)
.map(_.toWikiText)
.mkString("").trim

// decode HTML entities - the result is plain text
decodeHtml(text)
}
*/

}

object AbstractExtractor {
object PlainAbstractExtractor {

//TODO check if this function is still relevant
def postProcessExtractedHtml(pageTitle: WikiTitle, text: String): String =
Expand Down Expand Up @@ -243,6 +250,7 @@ object AbstractExtractor {

val patternsToRemove = List(
"""<div style=[^/]*/>""".r -> " ",
"""</div>""".r -> " "
"""</div>""".r -> " ",
"""<normalized>.*<\/normalized>""".r -> ""
)
}
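To see what the added `<normalized>` pattern does, the three entries of `patternsToRemove` can be exercised in isolation. This is a sketch only: `PatternSketch` is a hypothetical wrapper, and the fold is assumed to be equivalent in effect to the `for` loop in `replacePatterns`:

```scala
import scala.util.matching.Regex

// Hypothetical wrapper; the patterns are copied from the updated patternsToRemove list.
object PatternSketch {
  val patternsToRemove: List[(Regex, String)] = List(
    """<div style=[^/]*/>""".r -> " ",
    """</div>""".r -> " ",
    """<normalized>.*<\/normalized>""".r -> ""
  )

  // Applies each pattern in list order to the running result.
  def replacePatterns(abst: String): String =
    patternsToRemove.foldLeft(abst) { case (acc, (regex, replacement)) =>
      regex.replaceAllIn(acc, replacement)
    }
}
```

One thing worth checking during review: `.*` is greedy, so if two `<normalized>` elements appear on the same line, the text between them is removed as well. A reluctant `.*?` would limit each match to a single element, if that is the intended behaviour.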