Petro NLP is a comprehensive set of natural language processing and information extraction resources for the oil and gas industry in Portuguese. The paper "Petro NLP: Resources for natural language processing and information extraction for the oil and gas industry." describes all of the resources.
This repository contains the NLP resources that were built with public information and can be openly shared. Some of the datasets, corpora, and knowledge graphs were built with corporate data; they are described in the paper, but we cannot publish them.
The available resources are:
Direct links to the corpora are listed below. More information and other corpora are available on the Petrolês website.
- Petrolês - Consolidated file containing all of the Petrolês O&G domain-specific corpora (Petrobras Technical Bulletins; theses and dissertations from IBICT-BDTD on petroleum-related subjects; ANP technical reports).
- PetroGold - Gold standard treebank in which the automatic annotation of lemmas, POS tags, and syntactic dependencies was revised according to the framework of the Universal Dependencies project. The content is a subset of the Petrolês corpus.
- PetroNER - Gold standard corpus annotated with named entities in the oil & gas domain. It was built from a set of 11 Technical Reports from Petrobras, which are part of the Petrolês corpus and were fully preprocessed and morphosyntactically annotated.
- PetroRE - The relations in PetroRE came from corporate lists; consequently, it cannot be published here.
- Petro KGraph Ontology - The geological ontology used in this work. We borrowed most of the classes and relations from BFO and GeoCore and created some definitions specifically for the Petro KGraph.
- Petro KGraph - We built the Petro KGraph from corporate lists that cannot be openly shared. For academic purposes, we built a public version using entity and relation lists from the Brazilian National Agency for Petroleum, Natural Gas, and Biofuels (ANP). It is a smaller knowledge graph, but we followed the same procedure described in the paper.
- PetroVec - Word2vec models with 100-dimensional vectors, trained on public resources related to the O&G domain (Petrobras Technical Bulletins; theses and dissertations on petroleum-related subjects; ANP technical reports). More models are available on the Petrolês website.
- PetroOntoVec - PetroOntoVec was trained from the corporate version of the Petro KGraph; consequently, it cannot be published here.
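
As a starting point for working with PetroGold, here is a minimal sketch of reading a Universal Dependencies treebank in plain Python. It assumes the treebank is distributed in the standard CoNLL-U format (ten tab-separated columns per token, blank lines between sentences), which this README does not state explicitly. The function name `read_conllu` and the sample sentence are hypothetical and only serve the illustration; they are not taken from the corpus.

```python
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in the order defined by Universal Dependencies.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw_line in text.split("\n"):
        line = raw_line.rstrip("\r")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are ignored here.
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            continue  # skip lines that do not have exactly ten columns
        token = dict(zip(FIELDS, columns))
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token["id"] or "." in token["id"]:
            continue
        sentence.append(token)
    if sentence:
        yield sentence


# Hypothetical sample sentence, NOT taken from PetroGold.
SAMPLE = "\n".join([
    "# text = O poço produz óleo.",
    "\t".join(["1", "O", "o", "DET", "_", "_", "2", "det", "_", "_"]),
    "\t".join(["2", "poço", "poço", "NOUN", "_", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["3", "produz", "produzir", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["4", "óleo", "óleo", "NOUN", "_", "_", "3", "obj", "_", "_"]),
    "\t".join(["5", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"]),
    "",
])

sentences = list(read_conllu(SAMPLE))
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"], token["deprel"])
```

To use it on a real file, read the file as UTF-8 text and pass its contents to `read_conllu`. For anything beyond a quick inspection, a dedicated CoNLL-U library is likely a better choice than this sketch.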
