22 commits
c03cd94
added removing broken information parentheses e.g (; born November 20…
jlareck May 23, 2021
eaecc68
fixed removing brackets in abstracts
jlareck May 24, 2021
110616a
added removing broken paranthesis
jlareck May 27, 2021
fa4ba54
removed normalized tags
jlareck May 27, 2021
e5f5c5a
implemented shacl test for abstracts
jlareck May 27, 2021
d0b9a27
made choosing the removing of brackets in properties file
jlareck May 27, 2021
33c43b5
Merge pull request #698 from jlareck/dbpedia-abstracts
Vehnem Jun 1, 2021
3135600
Github action minidumpdoc update
Vehnem Jun 1, 2021
f88669a
Update dbp_abstract.ttl
kurzum Jun 7, 2021
5dcc0a4
Implement construct validation tests selection (#704)
jlareck Jun 15, 2021
c2843c6
Fix abstract extraction (#705)
jlareck Jun 24, 2021
d81fe9e
Implement handling of right and left validators (#706)
jlareck Jun 28, 2021
078e419
implement unit test for removing bracket method and fixed construct v…
jlareck Jul 8, 2021
6f2c149
add more unit tests for removing broken brackets function
jlareck Jul 12, 2021
9ffe0ef
move remove brackets function to AbstractUtils
jlareck Jul 13, 2021
3b9822a
move AbstractUtils to abstracts package and implement test for issue …
jlareck Aug 16, 2021
570e0d9
implement construct validation test for issue #617
jlareck Aug 24, 2021
41678fd
fix build error and create a test for issue #598
jlareck Sep 2, 2021
22d089e
Fix merging multiple infoboxes (#710)
mubashar1199 Sep 17, 2021
7464414
Github action minidumpdoc update
mubashar1199 Sep 17, 2021
b88ab6b
Fixing broken links in README (#772)
tech0priyanshu Feb 19, 2025
df0b449
Fix #804: Support multiple template namespace prefixes for Macedonian…
DhanashreePetare Jan 21, 2026
57 changes: 17 additions & 40 deletions .github/ISSUE_TEMPLATE/data.md
@@ -7,55 +7,32 @@ assignees: ''

---

-# Issue still valid?
-> DBpedia updates frequently in this order: 1. DIEF software (extracts data from wikidata), 2. monthly dumps, 3. online services loaded from dumps.
-> We update http://dief.tools.dbpedia.org/server/extraction/ on a daily basis from the git and it reflects the current state.
->
-> **Disclaimer:** The public SPARQL endpoints (e.g., http://dbpedia.org/sparql) and other applications build based on DBpedia's data are not in sync yet with the latest monthly extracted data.
->
-> Therefore, you can use this tool to extract an example page and check if the error persists in the latest software version, and add the link you used for verification, e.g., http://dief.tools.dbpedia.org/server/extraction/en/extract?title=United+States
+# Issue validity
+> Some explanation: DBpedia Snapshot is produced every three months, see [Release Frequency & Schedule](https://www.dbpedia.org/blog/snapshot-2021-06-release/#anchor1), which is loaded into http://dbpedia.org/sparql . During these three months, Wikipedia changes and also the DBpedia Information Extraction Framework receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/ we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. `Berlin` or `Joe_Biden` here: http://dief.tools.dbpedia.org/server/extraction/en/
+> If the issue persists, please post the link from your browser here:

-# Source
-> Where did you find the data issue? Pick one, remove the others.
-
-### Web / SPARQL
-> State the service (e.g. http://dbpedia.org/sparql) and the SPARQL query
-> give a link to the web / linked data pages (e.g. http://dbpedia.org/resource/Berlin)
-
-### Release Dumps
-> DBpedia provides monthly release dumps, cf. release-dashboard.dbpedia.org
-> provide artifact & version or download link
-
-### Running the DBpedia Extraction (DIEF) software
-> Please include all necessary information.
-
-
-# Classification
-> If you have some familiarity with DBpedia, please use the classification tags at (link) to correctly file this issue. Otherwise skip this step.
-
-
-
-### Error Description
+# Error Description
> Please state the nature of your technical emergency:

+# Pinpointing the source of the error
+> Where did you find the data issue? Non-exhaustive options are:
+* Web/SPARQL, e.g. http://dbpedia.org/sparql or http://dbpedia.org/resource/Berlin, please **provide query or link**
+* Dumps: dumps are managed by the Databus. Please **provide artifact & version or download link**
+* DIEF: you ran the software and the error occured then, please **include all necessary information such as the extractor or log**. If you had problems running the software use [another issue template](https://github.com/dbpedia/extraction-framework/issues/new/choose)

-### Error specification
-> Pick the appropriate:
+# Details
+> please post the details

-- Affected extraction artifacts (Databus artifact version or file identifiers):
-- https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects/mappingbased-objects_lang=en_disjointDomain.ttl.bz2
--
-- Example DBpedia resource URL(s) having the error (one full IRI per line):
-- http://dbpedia.org/resource/Leipzig
--
-- Erroneous triples RDF snippet (NTRIPLES):
+> Wrong triples RDF snippet
```

```
-- Expected / corrected RDF outcome snippet (NTRIPLES):
+> Expected / corrected RDF outcome snippet
```

```
+>Example DBpedia resource URL(s)
+```

-### Additional context
-> Add any other context about the problem here.
+```
+> Other
Comment on lines +10 to +38

⚠️ Potential issue | 🟡 Minor

Resolve markdownlint issues: bare URLs + missing code‑fence languages.

Line 11 and Line 19 include bare URLs, and the fenced blocks at Lines 27/31/35 are missing language identifiers. There’s also a blank line inside a blockquote (Line 25). These are easy markdownlint failures to fix.

✅ Suggested cleanup
-# Issue validity
-> Some explanation: DBpedia Snapshot is produced every three months, see [Release Frequency & Schedule](https://www.dbpedia.org/blog/snapshot-2021-06-release/#anchor1), which is loaded into http://dbpedia.org/sparql . During these three months, Wikipedia changes and also the DBpedia Information Extraction Framework receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/  we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. `Berlin` or `Joe_Biden` here: http://dief.tools.dbpedia.org/server/extraction/en/
+# Issue validity
+> Some explanation: DBpedia Snapshot is produced every three months, see [Release Frequency & Schedule](https://www.dbpedia.org/blog/snapshot-2021-06-release/#anchor1), which is loaded into <http://dbpedia.org/sparql>. During these three months, Wikipedia changes and also the DBpedia Information Extraction Framework receives patches. At <http://dief.tools.dbpedia.org/server/extraction/en/> we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. `Berlin` or `Joe_Biden` here: <http://dief.tools.dbpedia.org/server/extraction/en/>
@@
-> Please state the nature of your technical emergency: 
+> Please state the nature of your technical emergency:
@@
-# Details
-> please post the details
+# Details
+> please post the details
+>
@@
-> Wrong triples RDF snippet 
-  ``` 
+> Wrong triples RDF snippet 
+  ```ttl
@@
-> Expected / corrected RDF outcome snippet 
-  ``` 
+> Expected / corrected RDF outcome snippet 
+  ```ttl
@@
->Example DBpedia resource URL(s)
-```
+>Example DBpedia resource URL(s)
+```text
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

11-11: Bare URL used

(MD034, no-bare-urls)


11-11: Bare URL used

(MD034, no-bare-urls)


11-11: Bare URL used

(MD034, no-bare-urls)


19-19: Bare URL used

(MD034, no-bare-urls)


19-19: Bare URL used

(MD034, no-bare-urls)


25-25: Blank line inside blockquote

(MD028, no-blanks-blockquote)


27-27: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


31-31: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


35-35: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In @.github/ISSUE_TEMPLATE/data.md around lines 10 - 38, Fix the markdownlint
issues in the template sections (headers "Issue validity", "Error Description",
"Pinpointing the source of the error", "Details") by replacing bare URLs with
proper link syntax (e.g., [text](http://...)) for the links currently on the
"Issue validity" and "Pinpointing the source of the error" lines, remove the
stray blank line inside the blockquote in the "Details" section, and add
language identifiers to the fenced code blocks under "Wrong triples RDF snippet"
and "Expected / corrected RDF outcome snippet" (use ttl) and to the "Example
DBpedia resource URL(s)" block (use text) so the fences become ```ttl and
```text respectively. Ensure the blockquote prefixes (>) remain consistent for
each paragraph line.
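
The two fence and URL rules cited in this comment (MD034, MD040) can be approximated with a short script when checking a template locally. This is only a rough sketch under simplifying assumptions, not markdownlint's actual implementation: the names `lint_markdown`, `URL`, and `FENCE` are hypothetical, only triple-backtick fences are recognized, and a URL counts as "bare" unless it is directly preceded by `<`, `[`, or `](`, or sits inside an inline code span.

```python
import re

# Three backticks, built indirectly so this example can itself sit in a fence.
FENCE = "`" * 3

# Matches http(s) URLs; whether a match is "bare" is decided from its context.
URL = re.compile(r"https?://[^\s<>()\]]+")


def lint_markdown(text):
    """Return (line_number, rule_id) pairs for two simplified checks."""
    findings = []
    in_fence = False
    for number, line in enumerate(text.splitlines(), start=1):
        # Ignore blockquote markers and indentation when looking for fences.
        stripped = line.lstrip("> ").strip()
        if stripped.startswith(FENCE):
            # An opening fence with nothing after the backticks has no language.
            if not in_fence and stripped == FENCE:
                findings.append((number, "MD040"))
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # URLs inside code blocks are not flagged
        for match in URL.finditer(line):
            before = line[:match.start()]
            # Skip autolinks (<url>), link text ([url]) and link targets (](url)).
            if before.endswith(("<", "[", "](")):
                continue
            # Skip URLs inside an inline code span (odd number of backticks so far).
            if before.count("`") % 2 == 1:
                continue
            findings.append((number, "MD034"))
    return findings
```

Calling `lint_markdown` on the template text returns one entry per finding, so a line with three bare URLs is reported three times, mirroring the repeated entries in the tool output above.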

26 changes: 13 additions & 13 deletions README.md
@@ -4,7 +4,7 @@
**Homepage**: http://dbpedia.org <br/>
**Documentation**: http://dev.dbpedia.org/Extraction <br/>
**Get in touch with DBpedia**: https://wiki.dbpedia.org/join/get-in-touch <br/>
-**Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack workspace](https://dbpedia-slack.herokuapp.com/) - the main point for [developement updates](https://github.com/dbpedia/extraction-framework/blob/master/.github/workflows/maven.yml) and discussions <br/>
+**Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack workspace]( https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for developement updates and discussions <br/>

⚠️ Potential issue | 🟡 Minor

Fix typos and casing in the Slack line.

Small wording issues in user-facing docs.

✏️ Proposed fix
-**Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack workspace]( https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for developement updates and discussions <br/>
+**Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) Slack channel within the [DBpedia Slack workspace](https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for development updates and discussions <br/>
🧰 Tools
🪛 LanguageTool

[uncategorized] ~7-~7: Did you mean the communication tool “Slack” (= proper noun, capitalized)?
Context: ...//dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack w...

(ON_SKYPE)


[grammar] ~7-~7: Ensure spelling is correct
Context: ...oeWBlolJb_lKifEqA) - the main point for developement updates and discussions
## Conte...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
In `@README.md` at line 7, Fix the Slack line in the README: remove the duplicate
"the", correct "developement" to "development", fix casing/punctuation for
clarity, and remove the stray space inside the Slack invite URL so the link
works; keep the channel mention "#dev-team" and the DBpedia Slack workspace link
but rewrite the sentence to read smoothly (e.g., "Slack: join the `#dev-team`
channel in the DBpedia Slack workspace — the main point for development updates
and discussions").


⚠️ Potential issue | 🟡 Minor

Fix typos and capitalization in the Slack sentence.
Line 7 has a duplicated “the” and “developement” typo.

✅ Suggested edit
-**Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack workspace]( https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for developement updates and discussions <br/>
+**Slack**: join the [**#dev-team**](https://dbpedia.slack.com/archives/C0L9MJFU7) Slack channel within the [DBpedia Slack workspace](https://join.slack.com/t/dbpedia/shared_invite/zt-nffbn1ra-dRoi8oeWBlolJb_lKifEqA) - the main point for development updates and discussions <br/>
🧰 Tools
🪛 LanguageTool

[uncategorized] ~7-~7: Did you mean the communication tool “Slack” (= proper noun, capitalized)?
Context: ...//dbpedia.slack.com/archives/C0L9MJFU7) slack channel within the the [DBpedia Slack w...

(ON_SKYPE)


[grammar] ~7-~7: Ensure spelling is correct
Context: ...oeWBlolJb_lKifEqA) - the main point for developement updates and discussions
## Conte...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
In `@README.md` at line 7, Fix the typos in the Slack sentence that starts
"**Slack**: join the [**#dev-team**]...": remove the duplicated "the", correct
"developement" to "development", ensure "Slack workspace" capitalization is
consistent, and remove the extra space before the URL so the sentence reads
cleanly and without double words or misspellings.


## Contents

@@ -61,7 +61,7 @@ The DBpedia extraction framework is structured into different modules

### Core Module

-![http://www4.wiwiss.fu-berlin.de/dbpedia/wiki/DataFlow.png](http://www4.wiwiss.fu-berlin.de/dbpedia/wiki/DataFlow.png "http://www4.wiwiss.fu-berlin.de/dbpedia/wiki/DataFlow.png")
+![Data flow](https://web.archive.org/web/20111109084216/http://www4.wiwiss.fu-berlin.de/dbpedia/wiki/DataFlow.png)

<a name="p27582-10"></a>

@@ -76,9 +76,9 @@ The DBpedia extraction framework is structured into different modules

In addition to the core components, a number of utility packages offers essential functionality to be used by the extraction code:

-* **Ontology** Classes used to represent an ontology. Methods for both, reading and writing ontologies are provided. All classes are located in the namespace [org.dbpedia.extraction.ontology](tree/master/core/src/main/scala/org/dbpedia/extraction/ontology)
-* **DataParser** Parsers to extract data from nodes in the abstract syntax tree. All classes are located in the namespace [org.dbpedia.extraction.dataparser](tree/master/core/src/main/scala/org/dbpedia/extraction/dataparser)
-* **Util** Various utility classes. All classes are located in the namespace [org.dbpedia.extraction.util](tree/master/core/src/main/scala/org/dbpedia/extraction/util)
+* **Ontology** Classes used to represent an ontology. Methods for both, reading and writing ontologies are provided. All classes are located in the namespace `org.dbpedia.extraction.ontology`.
+* **DataParser** Parsers to extract data from nodes in the abstract syntax tree. All classes are located in the namespace `org.dbpedia.extraction.dataparser`.
+* **Util** Various utility classes. All classes are located in the namespace `org.dbpedia.extraction.util`.

<a name="dump-extraction-module"></a>
### Dump extraction Module
@@ -104,25 +104,25 @@ Please make sure you have read the Developer's Certificate of Origin, further do
8. Send a pull request from your branch into `extraction-framework/dev` via GitHub.
* In the description, reference the associated commit (for example, _"Fixes #123 by ..."_ for issue number 123).
* Your changes will be reviewed and discussed on GitHub.
-* In addition, [Travis-CI](http://about.travis-ci.org/) will test if the merged version passes the build.
+* In addition, [Travis-CI](https://www.travis-ci.com/about-us/) will test if the merged version passes the build.
* If there are further changes you need to make, because Travis said the build fails or because somebody caught something you overlooked, go back to item 4. Stay on the same branch (if it is still related to the same issue). GitHub will add the new commits to the same pull request.
* When everything is fine, your changes will be merged into `extraction-framework/dev`, finally the `dev` together with your improvements will be merged with the `master` branch.

Please keep in mind:
- Try *not* to modify the indentation. If you want to re-format, use a separate "formatting" commit in which no functionality changes are made.
- **Never** rebase the master onto a development branch (i.e. _never_ call `rebase` from `extraction-framework/master`). Only rebase your branch onto the dev branch, *if and only if* nobody already pulled from the development branch!
- If you already pushed a branch to GitHub, later rebased the master onto this branch and then tried to push again, GitHub won't let you saying _"To prevent you from losing history, non-fast-forward updates were rejected"_. If _(and only if)_ you are sure that nobody already pulled from this branch, add `--force` to the push command.
-[_"Don’t rebase branches you have shared with another developer."_](http://www.jarrodspillers.com/2009/08/19/git-merge-vs-git-rebase-avoiding-rebase-hell/)
-[_"Rebase is awesome, I use rebase exclusively for everything local. Never for anything that I've already pushed."_](http://jeffkreeftmeijer.com/2010/the-magical-and-not-harmful-rebase/#comment-87479247)
-[_"Never ever rebase a branch that you pushed, or that you pulled from another person_"](http://blog.experimentalworks.net/2009/03/merge-vs-rebase-a-deep-dive-into-the-mysteries-of-revision-control/)
+- _"[Don’t rebase branches you have shared with another developer.](http://www.jarrodspillers.com/2009/08/19/git-merge-vs-git-rebase-avoiding-rebase-hell/)"_
+- _"[Rebase is awesome, I use rebase exclusively for everything local. Never for anything that I've already pushed.](http://jeffkreeftmeijer.com/2010/the-magical-and-not-harmful-rebase/#comment-87479247)"_
+- _"[Never ever rebase a branch that you pushed, or that you pulled from another person](https://web.archive.org/web/20150622064245/http://blog.experimentalworks.net/2009/03/merge-vs-rebase-a-deep-dive-into-the-mysteries-of-revision-control/)"_
- In general, we prefer Scala over Java.

More tips:
- Guides to setup your development environment for [Intellij](Setting up IntelliJ IDEA) or [Eclipse](Setting up eclipse).
-- Get help with the [Maven build](Build-from-Source-with-Maven) or another form of [installation](Installation).
-- [Download](Downloads) some data to work with.
-- How to run [from Scala/Java](Run-from-Java-or-Scala) or [from a JAR](Run-from-a-JAR).
-- Having different troubles? Check the [troubleshooting page](Troubleshooting) or post on https://forum.dbpedia.org.
+- Get help with the [Maven build](https://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html) or another form of [installation](https://maven.apache.org/install.html).
+- [Download](https://dumps.wikimedia.org/) some data to work with.
+- How to run [from Scala/Java](https://docs.scala-lang.org/tutorials/scala-with-maven.html) or [from a JAR](https://docs.oracle.com/javase/tutorial/deployment/jar/run.html).
+- Having different troubles? Check the [troubleshooting page](https://maven.apache.org/users/getting-help.html) or post on https://forum.dbpedia.org.
Comment on lines +107 to +125

⚠️ Potential issue | 🟡 Minor

Resolve markdownlint warnings in the guidelines section.

Nested list indentation, emphasis style, and bare URL can be made lint-friendly.

✏️ Proposed fix
-  * In addition, [Travis-CI](https://www.travis-ci.com/about-us/) will test if the merged version passes the build.
-  * If there are further changes you need to make, because Travis said the build fails or because somebody caught something you overlooked, go back to item 4. Stay on the same branch (if it is still related to the same issue). GitHub will add the new commits to the same pull request.
-  * When everything is fine, your changes will be merged into `extraction-framework/dev`, finally the `dev` together with your improvements will be merged with the `master` branch.
+    * In addition, [Travis-CI](https://www.travis-ci.com/about-us/) will test if the merged version passes the build.
+    * If there are further changes you need to make, because Travis said the build fails or because somebody caught something you overlooked, go back to item 4. Stay on the same branch (if it is still related to the same issue). GitHub will add the new commits to the same pull request.
+    * When everything is fine, your changes will be merged into `extraction-framework/dev`, finally the `dev` together with your improvements will be merged with the `master` branch.
@@
-- Try *not* to modify the indentation. If you want to re-format, use a separate "formatting" commit in which no functionality changes are made.
-- **Never** rebase the master onto a development branch (i.e. _never_ call `rebase` from `extraction-framework/master`). Only rebase your branch onto the dev branch, *if and only if* nobody already pulled from the development branch!
+- Try _not_ to modify the indentation. If you want to re-format, use a separate "formatting" commit in which no functionality changes are made.
+- __Never__ rebase the master onto a development branch (i.e. _never_ call `rebase` from `extraction-framework/master`). Only rebase your branch onto the dev branch, _if and only if_ nobody already pulled from the development branch!
@@
-- Having different troubles? Check the [troubleshooting page](https://maven.apache.org/users/getting-help.html) or post on https://forum.dbpedia.org.
+- Having different troubles? Check the [troubleshooting page](https://maven.apache.org/users/getting-help.html) or post on [https://forum.dbpedia.org](https://forum.dbpedia.org).
🧰 Tools
🪛 LanguageTool

[style] ~108-~108: Consider an alternative to strengthen your wording.
Context: ...sion passes the build. * If there are further changes you need to make, because Travis said t...

(CHANGES_ADJUSTMENTS)


[style] ~116-~116: Consider using a more formal and expressive alternative to ‘awesome’.
Context: ...oiding-rebase-hell/)"_ - _"[Rebase is awesome, I use rebase exclusively for everythin...

(AWESOME)

🪛 markdownlint-cli2 (0.18.1)

107-107: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


108-108: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


109-109: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


112-112: Emphasis style
Expected: underscore; Actual: asterisk

(MD049, emphasis-style)


112-112: Emphasis style
Expected: underscore; Actual: asterisk

(MD049, emphasis-style)


113-113: Emphasis style
Expected: underscore; Actual: asterisk

(MD049, emphasis-style)


113-113: Emphasis style
Expected: underscore; Actual: asterisk

(MD049, emphasis-style)


125-125: Bare URL used

(MD034, no-bare-urls)

🤖 Prompt for AI Agents
In `@README.md` around lines 107 - 125, Fix markdownlint warnings in the
guidelines section by correcting nested list indentation (indent the sub-bullets
under "If you already pushed a branch..." and the quoted links block to align as
nested list items), normalize emphasis style (replace mixed *italic* with either
underscores or consistent asterisks throughout occurrences like "Try *not* to
modify the indentation." and the _italic_ examples), and replace bare URLs with
proper Markdown links (turn the bare https://forum.dbpedia.org and any other
bare URLs into [forum.dbpedia.org](https://forum.dbpedia.org) or descriptive
link text). Locate and update the paragraph and bulleted area containing the
lines starting "If you already pushed a branch..." and the "Please keep in
mind:" block to apply these changes.


### Important: Developer's Certificate of Origin
By sending a pull request to the [extraction-framework repository](https://github.com/dbpedia/extraction-framework) on GitHub, you implicitly accept the [Developer's Certificate of Origin 1.1](https://github.com/dbpedia/extraction-framework/blob/master/documentation/DeveloperCertificateOfOrigin.md)
2 changes: 1 addition & 1 deletion core/doc/HowTo-release-DBpedia.txt
Expand Up @@ -22,7 +22,7 @@ release. It might not be complete. Please also consult with the others!
- Commit the files to the hg repository
- Don't change the files anymore. The whole extraction should use the same version.

- for AbstractExtractor: insert Wikipedia dumps into a local MySQL database using ...dump.sql.Import.scala
- for PlainAbstractExtractor: insert Wikipedia dumps into a local MySQL database using ...dump.sql.Import.scala
- adjust the LocalSettings.php of mw-modified: specify username+password for the database and the database prefix
TODO: more in-depth explanations about abstract extraction

20 changes: 12 additions & 8 deletions core/src/main/scala/org/dbpedia/extraction/config/Config.scala
Expand Up @@ -277,7 +277,8 @@ class Config(val configPath: String) extends
shortAbstractsProperty = this.getProperty("short-abstracts-property", "rdfs:comment").trim,
longAbstractsProperty = this.getProperty("long-abstracts-property", "abstract").trim,
shortAbstractMinLength = this.getProperty("short-abstract-min-length", "200").trim.toInt,
abstractTags = this.getProperty("abstract-tags", "query,pages,page,extract").trim
abstractTags = this.getProperty("abstract-tags", "query,pages,page,extract").trim,
removeBrokenBracketsProperty = this.getProperty("remove-broken-brackets-plain-abstracts", "false").trim.toBoolean
)
} match{
case Success(s) => s
Expand All @@ -293,7 +294,8 @@ class Config(val configPath: String) extends
writeAnchor = this.getProperty("nif-write-anchor", "false").trim.toBoolean,
writeLinkAnchor = this.getProperty("nif-write-link-anchor", "true").trim.toBoolean,
abstractsOnly = this.getProperty("nif-extract-abstract-only", "true").trim.toBoolean,
cssSelectorMap = this.getClass.getClassLoader.getResource("nifextractionconfig.json") //static config file in core/src/main/resources
cssSelectorMap = this.getClass.getClassLoader.getResource("nifextractionconfig.json"), //static config file in core/src/main/resources
removeBrokenBracketsProperty = this.getProperty("remove-broken-brackets-html-abstracts", "false").trim.toBoolean
)
} match{
case Success(s) => s
Expand Down Expand Up @@ -348,7 +350,8 @@ object Config{
writeAnchor: Boolean,
writeLinkAnchor: Boolean,
abstractsOnly: Boolean,
cssSelectorMap: URL
cssSelectorMap: URL,
removeBrokenBracketsProperty: Boolean
)

/**
Expand All @@ -369,11 +372,12 @@ object Config{
)

case class AbstractParameters(
abstractQuery: String,
shortAbstractsProperty: String,
longAbstractsProperty: String,
shortAbstractMinLength: Int,
abstractTags: String
abstractQuery: String,
shortAbstractsProperty: String,
longAbstractsProperty: String,
shortAbstractMinLength: Int,
abstractTags: String,
removeBrokenBracketsProperty: Boolean
)

case class SlackCredentials(
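
The two new flags in this hunk follow the same lookup style as the neighbouring settings: a string default of `"false"`, then `trim` and `toBoolean`. A minimal sketch of that behaviour, assuming plain `java.util.Properties` semantics; `BracketFlagSketch` and `readFlag` are hypothetical names used only for illustration and are not part of the PR:

```scala
import java.util.Properties

object BracketFlagSketch {
  // Hypothetical helper mirroring the lookup style in the diff:
  // string default "false", then trim and toBoolean.
  def readFlag(props: Properties, key: String): Boolean =
    props.getProperty(key, "false").trim.toBoolean

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Only the plain-abstract flag is set; trim handles the stray whitespace.
    props.setProperty("remove-broken-brackets-plain-abstracts", " true ")

    println(readFlag(props, "remove-broken-brackets-plain-abstracts")) // true
    println(readFlag(props, "remove-broken-brackets-html-abstracts"))  // false (key absent, default applies)
  }
}
```

Because both keys default to `"false"`, existing properties files keep their current behaviour unless a flag is set explicitly. A value that is neither `true` nor `false` makes `toBoolean` throw an `IllegalArgumentException`.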
Expand Up @@ -13,7 +13,7 @@ import scala.language.reflectiveCalls
* Created: 5/19/14 9:21 AM
*/

class AbstractExtractorWikipedia(
class HtmlAbstractExtractor(
context : {
def ontology : Ontology
def language : Language
Expand Up @@ -54,7 +54,7 @@ extends PageNodeExtractor

private val language = context.language.wikiCode

private val logger = Logger.getLogger(classOf[AbstractExtractor].getName)
private val logger = Logger.getLogger(classOf[PlainAbstractExtractor].getName)

//private val apiParametersFormat = "uselang="+language+"&format=xml&action=parse&prop=text&title=%s&text=%s"
private val apiParametersFormat = "uselang="+language+"&format=xml&action=query&prop=extracts&exintro=&explaintext=&titles=%s"
Expand Up @@ -14,7 +14,7 @@ import scala.language.reflectiveCalls
/**
* Extracts page html.
*
* Based on AbstractExtractor, major difference is the parameter
* Based on PlainAbstractExtractor, major difference is the parameter
* apiParametersFormat = "action=parse&prop=text&section=0&format=xml&page=%s"
*
* This class produces all nif related datasets for the abstract as well as the short-, long-abstracts datasets.
Expand Down Expand Up @@ -69,7 +69,7 @@ class NifExtractor(

object NifExtractor{
//TODO check if this function is still relevant
//copied from AbstractExtractor
//copied from PlainAbstractExtractor
def postProcessExtractedHtml(pageTitle: WikiTitle, text: String): String =
{
val startsWithLowercase =
@@ -1,13 +1,13 @@
package org.dbpedia.extraction.mappings

import java.util.logging.Logger

import org.dbpedia.extraction.annotations.ExtractorAnnotation
import org.dbpedia.extraction.config.Config
import org.dbpedia.extraction.config.provenance.DBpediaDatasets
import org.dbpedia.extraction.ontology.Ontology
import org.dbpedia.extraction.transform.{Quad, QuadBuilder}
import org.dbpedia.extraction.util.{Language, MediaWikiConnector}
import org.dbpedia.extraction.util.abstracts.AbstractUtils
import org.dbpedia.extraction.util.{Language, MediaWikiConnector, WikiUtil}
import org.dbpedia.extraction.wikiparser._

import scala.language.reflectiveCalls
Expand All @@ -30,7 +30,7 @@ import scala.language.reflectiveCalls

@deprecated("replaced by NifExtractor.scala: which will extract the whole page content including the abstract", "2016-10")
@ExtractorAnnotation("abstract extractor")
class AbstractExtractor(
class PlainAbstractExtractor(
context : {
def ontology : Ontology
def language : Language
Expand All @@ -39,7 +39,7 @@ class AbstractExtractor(
)
extends WikiPageExtractor
{
protected val logger = Logger.getLogger(classOf[AbstractExtractor].getName)
protected val logger = Logger.getLogger(classOf[PlainAbstractExtractor].getName)
this.getClass.getClassLoader.getResource("myproperties.properties")


Expand All @@ -50,6 +50,8 @@ extends WikiPageExtractor
//private val apiParametersFormat = "uselang="+language+"&format=xml&action=parse&prop=text&title=%s&text=%s"
protected val apiParametersFormat = context.configFile.abstractParameters.abstractQuery

protected val removeBrokenBrackets = context.configFile.abstractParameters.removeBrokenBracketsProperty

// lazy so testing does not need ontology
protected lazy val shortProperty = context.ontology.properties(context.configFile.abstractParameters.shortAbstractsProperty)

Expand All @@ -63,7 +65,6 @@ extends WikiPageExtractor

private val mwConnector = new MediaWikiConnector(context.configFile.mediawikiConnection, context.configFile.abstractParameters.abstractTags.split(","))


override def extract(pageNode : WikiPage, subjectUri: String): Seq[Quad] =
{
//Only extract abstracts for pages from the Main namespace
Expand All @@ -79,16 +80,22 @@ extends WikiPageExtractor
// if(abstractWikiText == "") return Seq.empty

//Retrieve page text
val text = mwConnector.retrievePage(pageNode.title, apiParametersFormat, pageNode.isRetry) match{
case Some(t) => AbstractExtractor.postProcessExtractedHtml(pageNode.title, replacePatterns(t))
val text = mwConnector.retrievePage(pageNode.title, apiParametersFormat, pageNode.isRetry) match {
case Some(t) => PlainAbstractExtractor.postProcessExtractedHtml(pageNode.title, replacePatterns(t))
case None => return Seq.empty
}

val modifiedText = if (removeBrokenBrackets) {
AbstractUtils.removeBrokenBracketsInAbstracts(text)
} else {
text
}

//Create a short version of the abstract
val shortText = short(text)
val shortText = short(modifiedText)

//Create statements
val quadLong = longQuad(pageNode.uri, text, pageNode.sourceIri)
val quadLong = longQuad(pageNode.uri, modifiedText, pageNode.sourceIri)
val quadShort = shortQuad(pageNode.uri, shortText, pageNode.sourceIri)

if (shortText.isEmpty)
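
The hunk above only shows the call site: when the flag is on, the text passes through `AbstractUtils.removeBrokenBracketsInAbstracts` before the short and long abstracts are built. The body of that function is not visible here, so the following is a hypothetical stand-in rather than the PR's implementation. It assumes a "broken" group is a parenthesis pair that is empty or starts with leftover punctuation, in the spirit of the commit message example `(; born November 20…`. The names `BrokenBracketSketch`, `removeBroken` and `process` are invented for illustration.

```scala
object BrokenBracketSketch {
  // Assumed rule (not taken from the PR): a group is broken if it is empty
  // or begins with ';' or ',' left behind by stripped markup.
  // The leading \s* also removes the whitespace in front of the group.
  private val brokenGroup = """\s*\(\s*(?:[;,][^()]*)?\)""".r

  def removeBroken(text: String): String = brokenGroup.replaceAllIn(text, "")

  // Same shape as the call site in the diff: clean only when the flag is set.
  def process(text: String, removeBrokenBrackets: Boolean): String =
    if (removeBrokenBrackets) removeBroken(text) else text

  def main(args: Array[String]): Unit = {
    println(process("Joe (; born 1942) is a politician", removeBrokenBrackets = true))
    println(process("Joe (; born 1942) is a politician", removeBrokenBrackets = false))
    println(process("Joe (born 1942) is a politician", removeBrokenBrackets = true))
  }
}
```

Well-formed groups such as `(born 1942)` are left untouched under this rule, and with the flag off the text passes through unchanged.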
Expand Down Expand Up @@ -140,7 +147,7 @@ extends WikiPageExtractor

private def replacePatterns(abst: String): String= {
var ret = abst
for ((regex, replacement) <- AbstractExtractor.patternsToRemove) {
for ((regex, replacement) <- PlainAbstractExtractor.patternsToRemove) {
val matches = regex.pattern.matcher(ret)
if (matches.find()) {
ret = matches.replaceAll(replacement)
Expand Down Expand Up @@ -205,15 +212,15 @@ extends WikiPageExtractor
.filter(renderNode)
.map(_.toWikiText)
.mkString("").trim

// decode HTML entities - the result is plain text
decodeHtml(text)
}
*/

}

object AbstractExtractor {
object PlainAbstractExtractor {

//TODO check if this function is still relevant
def postProcessExtractedHtml(pageTitle: WikiTitle, text: String): String =
Expand Down Expand Up @@ -243,6 +250,7 @@ object AbstractExtractor {

val patternsToRemove = List(
"""<div style=[^/]*/>""".r -> " ",
"""</div>""".r -> " "
"""</div>""".r -> " ",
"""<normalized>.*<\/normalized>""".r -> ""
)
Comment on lines 251 to 255

⚠️ Potential issue | 🟡 Minor

Make <normalized> removal non‑greedy to avoid stripping unrelated text.

.* is greedy and can remove content between the first and last <normalized> tag on the same line. It also won’t match across newlines. Prefer a non‑greedy DOTALL pattern.

🔧 Proposed fix (non-greedy DOTALL)
   val patternsToRemove = List(
     """<div style=[^/]*/>""".r -> " ",
     """</div>""".r -> " ",
-    """<normalized>.*<\/normalized>""".r -> ""
+    """(?s)<normalized>.*?</normalized>""".r -> ""
   )
📝 Committable suggestion

Suggested change
val patternsToRemove = List(
"""<div style=[^/]*/>""".r -> " ",
"""</div>""".r -> " "
"""</div>""".r -> " ",
"""<normalized>.*<\/normalized>""".r -> ""
)
val patternsToRemove = List(
"""<div style=[^/]*/>""".r -> " ",
"""</div>""".r -> " ",
"""(?s)<normalized>.*?</normalized>""".r -> ""
)
🤖 Prompt for AI Agents
In
`@core/src/main/scala/org/dbpedia/extraction/mappings/PlainAbstractExtractor.scala`
around lines 251 - 255, The regex in patternsToRemove inside
PlainAbstractExtractor.scala uses a greedy and non-DOTALL pattern for
"<normalized>.*</normalized>", which can span too much or miss multiline
content; change that entry to a non‑greedy DOTALL pattern (e.g. use a pattern
equivalent to (?s)<normalized>.*?</normalized>) so it matches the smallest block
including newlines between each pair of tags rather than greedily across the
line.
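
To make the difference concrete, here is a small sketch comparing the pattern added in the diff with the one proposed in this comment. The two regexes are copied from the discussion; `NormalizedTagDemo`, `strip` and the sample strings are made up for illustration.

```scala
import scala.util.matching.Regex

object NormalizedTagDemo {
  // Pattern as added in the diff: greedy, and '.' does not match newlines.
  val greedy: Regex = """<normalized>.*<\/normalized>""".r
  // Pattern proposed in the review: non-greedy, DOTALL enabled via (?s).
  val lazyDotAll: Regex = """(?s)<normalized>.*?</normalized>""".r

  def strip(regex: Regex, text: String): String = regex.replaceAllIn(text, "")

  def main(args: Array[String]): Unit = {
    val sameLine = "A<normalized>x</normalized> keep <normalized>y</normalized>B"
    println(strip(greedy, sameLine))     // AB        (" keep " is lost)
    println(strip(lazyDotAll, sameLine)) // A keep B

    val multiLine = "A<normalized>x\ny</normalized>B"
    println(strip(greedy, multiLine) == multiLine) // true (no match across the newline)
    println(strip(lazyDotAll, multiLine))          // AB
  }
}
```

Whether abstracts returned by the API ever contain several `<normalized>` blocks or a block spanning lines is not established in this thread; the sketch only shows how the two patterns differ if they do.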

Comment on lines 251 to 255

⚠️ Potential issue | 🟡 Minor

Use non-greedy matching for <normalized> tag removal.

The regex <normalized>.*<\/normalized> uses greedy matching, which could inadvertently match across multiple <normalized> blocks on the same line, removing more content than intended.

🐛 Proposed fix
   val patternsToRemove = List(
     """<div style=[^/]*/>""".r -> " ",
     """</div>""".r -> " ",
-    """<normalized>.*<\/normalized>""".r -> ""
+    """<normalized>.*?<\/normalized>""".r -> ""
   )
🤖 Prompt for AI Agents
In
`@core/src/main/scala/org/dbpedia/extraction/mappings/PlainAbstractExtractor.scala`
around lines 251 - 255, In PlainAbstractExtractor.scala update the
patternsToRemove entry that removes <normalized> blocks to use non-greedy
matching so it doesn't span multiple tags; specifically change the regex in the
patternsToRemove list (the entry currently """<normalized>.*<\/normalized>""".r
-> "") to a non-greedy form (e.g. use .*? inside the same triple-quoted regex)
so each <normalized>...</normalized> is removed independently.

}