This code implements data augmentation for natural language processing tasks. Data augmentation expands the training data by generating pseudo-sentences from supervised sentences.
This code depends on the following:
- python>=3.6.5
```
git clone https://github.com/tkmaroon/data-augmentation-for-nlp.git
cd data-augmentation-for-nlp
pip install -r requirements.txt
```

You can choose a data augmentation strategy by combining a sampling strategy (`--sampling-strategy`) and a generation strategy (`--augmentation-strategy`).
This option decides how token positions are sampled in the original sentence pairs.

| strategy | description |
|---|---|
| random | Randomly sample token positions. |
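For intuition, the random strategy amounts to drawing token positions uniformly at random, as in this minimal sketch (the function name and sampling ratio are illustrative assumptions, not the repository's API):

```python
import random

def sample_positions(tokens, ratio=0.15):
    # Hypothetical sketch: pick ~15% of positions uniformly at random.
    k = max(1, int(len(tokens) * ratio))
    return sorted(random.sample(range(len(tokens)), k))

tokens = "the quick brown fox jumps".split()
print(sample_positions(tokens))  # e.g. [2]
```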
This option decides how new tokens are generated at the sampled positions.

| strategy | description |
|---|---|
| dropout | Drop a token [1, 2]. |
| blank | Replace a token with a placeholder token [3]. |
| unigram | Replace a token with a sample from the unigram frequency distribution over the vocabulary [3]. Please set the option --unigram-frequency-for-generation. |
| bigramkn | Replace a token using bigram Kneser-Ney smoothing [3]. Please set the option --bigram-frequency-for-generation. |
| wordnet | Replace a token with a WordNet synonym. Please set the option --lang-for-wordnet. |
| ppdb | Replace a token with a paraphrase from a given paraphrase database. Please set the option --ppdb-file. |
| word2vec | Replace a token with a token whose word2vec vector is similar. Please set the option --w2v-file. |
| bert | Replace a token using the output probabilities of BERT masked-token prediction. Please set the option --model-name-or-path; it must be a shortcut name in Hugging Face's pytorch-transformers. Note that the option --vocab-file must match the vocabulary file of the BERT tokenizer. |
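To make the simpler strategies concrete, the sketch below shows how dropout, blank, and unigram might rewrite the sampled positions. It is a minimal illustration with assumed names, not the repository's implementation:

```python
import random

def generate(tokens, positions, strategy, unigram_dist=None, placeholder="<blank>"):
    # Hypothetical sketch of the dropout / blank / unigram strategies.
    out = []
    for i, tok in enumerate(tokens):
        if i not in positions:
            out.append(tok)
        elif strategy == "dropout":
            continue  # drop the sampled token entirely
        elif strategy == "blank":
            out.append(placeholder)  # replace with a placeholder token
        elif strategy == "unigram":
            # draw a replacement from the unigram frequency distribution
            words, freqs = zip(*unigram_dist.items())
            out.append(random.choices(words, weights=freqs)[0])
    return out

tokens = "the quick brown fox".split()
print(generate(tokens, {1}, "blank"))  # ['the', '<blank>', 'brown', 'fox']
```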
Example of generating pseudo-data with the bert strategy:

```
python generate.py \
    --input ./data/sample/sample.txt \
    --augmentation-strategy bert \
    --model-name-or-path bert-base-multilingual-uncased \
    --temparature 1.0
```
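For intuition, the masked-token prediction behind the bert strategy can be reproduced outside this repository with Hugging Face's fill-mask pipeline. The sketch below is an independent illustration using the current transformers API, not this repository's code path:

```python
from transformers import pipeline

# Load the same checkpoint used in the command above.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-uncased")

# Mask one sampled position and let BERT propose replacements.
for pred in fill_mask("the quick brown [MASK] jumps over the lazy dog")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```

Sampling a replacement from this output distribution, presumably sharpened or flattened by the temperature option, yields the pseudo-token.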