GitHub - shaunrong/RParser: An NLP parser to extract structured information from Materials Science Literature

##Code Implementation

Run python setup.py develop after the download the repo, more details about how to use the codes can be found at the project report.

###Architecture

->![](RParse.jpeg =700x)<-

The code architecture is shown in the figure above. The original text is firstly digested by a PreProcess.py module. The existence of Preprocess module is to substitute some special words in the text which are hard to work on for the RParse.py module. For example, all chemical formulas such as Na3MnPO4CO3 is substituted by word COMPOUND. The PreProcess.py module will then the processed text and create a substitute table, in which all such Na3MnPO4CO3 $$$\rightarrow$$$ COMPOUND substitution is recorded. The processed text is passed to RParse.py model, which extracts domain specific substrings, these substrings still contain substituted words such as COMPOUND etc. They are passed together with the sub table to PostProcess.py which will recover the original words from substituted ones. These final results are written in the *.RParse.yaml files.

###Codes

####PreProcess.py

Several main problems solved by PreProcess.py module is

encoding problems: some symbols are not recognizable by normal utf-8 encoding, we resolve this issue in PreProcess module by unicode coding
chemical formula: chemical formulas are hard to run with parsing algorithms, as one single chemical formula will be parsed into multiple word tokens, such as the chemical formula blow. PreProcessor will turn these words into a single word COMPOUND.

$$Na3V2(PO4)2F3 \rightarrow [Na3V2, (, PO4, ), 2F3]$$

Units: some units are hard to run with parsing algorithms, one single unit word gets parsed into multiple tokens, such as the example below. PreProcessor will turn these words into a single word UNIT.

$$mAh/g \rightarrow [mAh, /, g]$$

Number Range: some number ranges are hard to run with parsing algorithms, one single unit word gets parsed into multiple tokens, such as the example below. PreProcessor will turn these words into a single word NUMBER.

$$4-72 \rightarrow [4, -, 72]$$

etc. There are multiple other things PreProcessor.py takes care of. PreProcessor.py mainly uses heuristic rules for detecting these special words and substitute them correspondingly.

####RParse.py

RParse has two key functions:

parse_v_method returns key action verb groups and associated method arguments using heuristic rules
parse_input_output returns noun phrases tagged as input_output

####PostProcess.py

recover the original substrings using substitute table from the output of PreProcess.py.
recover some special symbol and comma, periods position.
- To parse the original sentence correctly, PreProcessor.py will always put an extra space between word and comma or period, PostProcessor.py recovers this change
- In parsing, '(' will be notified as special '-LRB-', and ')' will be notified as special '-RRB-', PostProcessor.py recovers this change
etc.

###Evaluation Scripts Evaluation scripts are written to compare all of the *.RParse.yaml files with corresponding *.gold.yaml annotations to evaluate recall, precision, F1 scores of each specific domains.

evaluate_action_verbs.py

Takes a file folder as input and compare all matched *.RParse.yaml and *.gold.yaml files under that file folder to evaluate the effectiveness of extracting key actions verbs.
example: python evaluate_action_verbs.py -f ../data/verb_method_arg/test

evaluate_method_arg.py

Similar to evaluate_action_verbs.py, but evaluate the effectiveness of extracting associated method arguments.
example: python evaluate_method_arg.py -f ../data/verb_method_arg/train
etc.

###Dependencies This code repo replies heavily on the standford NLP codes, which is interfaced within NLTK package. The way to configure standford codes

download it from link
Create a new folder (stanford_parser_folder in my environ.yaml file). Place the extracted files into this jar folder: stanford-parser-3.x.x-models.jar and stanford-parser.jar.
Open the stanford-parser-3.x.x-models.jar using command jar xf jar-file [archived-file(s)]
Browse inside the jar file; edu/stanford/nlp/models/lexparser. Again, extract the file called 'englishPCFG.ser.gz'.
Copy 'englishPCFG.ser.gz' into a different folder and copy its path to model_path in environ.yaml file.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.idea		.idea
.ipynb_checkpoints		.ipynb_checkpoints
RParser.egg-info		RParser.egg-info
data		data
evaluation_scripts		evaluation_scripts
misc		misc
scratch		scratch
PostProcessor.py		PostProcessor.py
PostProcessor.pyc		PostProcessor.pyc
PreProcessor.py		PreProcessor.py
PreProcessor.pyc		PreProcessor.pyc
README.md		README.md
RParse.jpeg		RParse.jpeg
RParser.py		RParser.py
RParser.pyc		RParser.pyc
__init__.py		__init__.py
construct_ioput_train.py		construct_ioput_train.py
environ.yaml		environ.yaml
ioput_maxent_classifier.pickle		ioput_maxent_classifier.pickle
ioput_train_db.yaml		ioput_train_db.yaml
setup.py		setup.py
test.ipynb		test.ipynb
train_ioput_maxent_unigram.py		train_ioput_maxent_unigram.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages