PetroNLP

Petro NLP is a comprehensive set of natural language processing and information extraction resources for the oil and gas industry in Portuguese. The paper "Petro NLP: Resources for natural language processing and information extraction for the oil and gas industry." describes all of the resources.

This repository contains the NLP resources that were built with public information and can be openly shared. Some of the datasets, corpora, and knowledge graphs were built with corporate data; they are described in the paper, but we cannot publish them.

The available resources are:

Corpora

Here you will find direct links to the corpora. More information and other corpora are available on the Petrolês website.

  • Petrolês - Consolidated file containing all of the Petrolês O&G domain-specific corpora (Petrobras Technical Bulletins; theses and dissertations from IBICT-BDTD on petroleum-related subjects; ANP's technical reports).
  • PetroGold - Gold standard treebank in which the automatic annotation of lemmas, POS tags, and syntactic dependencies was revised according to the framework of the Universal Dependencies project. The content is a subset of the Petrolês corpus.
  • PetroNER - Gold standard corpus annotated with named entities in the oil & gas domain. It was built from a set of 11 Technical Reports from Petrobras, which are part of the Petrolês corpus and were fully preprocessed and morphosyntactically annotated.
  • PetroRE - The relations in PetroRE came from corporate lists; consequently, the corpus cannot be published here.
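
Treebanks annotated under Universal Dependencies are commonly distributed in the CoNLL-U format (ten tab-separated columns per token, `#` comment lines, and a blank line between sentences). The sketch below assumes PetroGold follows that convention; the function name `read_conllu` and the sample sentence are illustrative only and are not taken from the corpus.

```python
from typing import Dict, Iterable, Iterator, List

# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token = dict(zip(CONLLU_FIELDS, cols))
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in token["id"] or "." in token["id"]:
            continue
        sentence.append(token)
    if sentence:
        yield sentence


# Invented sample sentence, NOT from PetroGold.
SAMPLE = (
    "# text = O poço produz óleo.\n"
    "1\tO\to\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tpoço\tpoço\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tproduz\tproduzir\tVERB\t_\t_\t0\troot\t_\t_\n"
    "4\tóleo\tóleo\tNOUN\t_\t_\t3\tobj\t_\t_\n"
    "5\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"], token["head"], token["deprel"])
```

To read a downloaded file instead of the inline sample, pass an open file handle (opened with `encoding="utf-8"`) to `read_conllu`.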

Knowledge Graph

  • Petro KGraph Ontology - The geological ontology used in this work. We borrowed most of the classes and relations from BFO and GeoCore and created some definitions specifically for the Petro KGraph.
  • Petro KGraph - We built the Petro KGraph from corporate lists that cannot be openly shared. For academic purposes, we built a public version using lists of entities and relations from the Brazilian National Agency for Petroleum, Natural Gas, and Biofuels (ANP). It is a smaller knowledge graph, but it was built following the same procedure described in the paper.

Embedding Models

  • PetroVec - Word2vec models with 100-dimensional vectors, trained on public resources related to the O&G domain (Petrobras Technical Bulletins; theses and dissertations on petroleum-related subjects; ANP's technical reports). More models are available on the Petrolês website.
  • PetroOntoVec - PetroOntoVec was trained on the corporate version of the Petro KGraph; consequently, it cannot be published here.
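
Word2vec models are often shared in the plain-text word2vec format: a header line with the vocabulary size and vector dimension, followed by one line per word with its vector components. The sketch below assumes a PetroVec model is available in that text format, which this README does not confirm. The helper names `load_word2vec_text` and `cosine` are illustrative, and the toy vectors are invented 3-dimensional values rather than real 100-dimensional PetroVec vectors.

```python
import io
import math
from typing import Dict, List, TextIO


def load_word2vec_text(handle: TextIO) -> Dict[str, List[float]]:
    """Read embeddings in word2vec text format into a word -> vector dict."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        line = line.rstrip()
        if not line:
            continue  # tolerate trailing blank lines
        parts = line.split(" ")
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} components: {line!r}")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("vocabulary size does not match the header")
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data with invented values, NOT real PetroVec vectors.
TOY = io.StringIO(
    "3 3\n"
    "poço 1.0 0.0 0.0\n"
    "reservatório 0.8 0.6 0.0\n"
    "broca 0.0 0.0 1.0\n"
)

vectors = load_word2vec_text(TOY)
print(cosine(vectors["poço"], vectors["reservatório"]))
print(cosine(vectors["poço"], vectors["broca"]))
```

For a real model file, replace the `io.StringIO` object with a file handle opened with `encoding="utf-8"`; binary word2vec files would need a different reader.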
