This code implements data augmentation for natural language processing tasks. Data augmentation expands the training data by generating pseudo-sentences from supervised sentences.
This code depends on the following:
- python>=3.6.5
```
git clone https://github.com/tkmaroon/data-augmentation-for-nlp.git
cd data-augmentation-for-nlp
pip install -r requirements.txt
```

You can choose a data augmentation strategy by combining a sampling strategy (`--sampling-strategy`) and a generation strategy (`--augmentation-strategy`).
This option decides how token positions are sampled in the original sentence pairs.

| strategy | description |
|---|---|
| random | Randomly sample token positions. |
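For intuition, the random strategy amounts to drawing token positions uniformly at random, as in this minimal sketch (the function name and sampling ratio are illustrative assumptions, not the repository's API):

```python
import random

def sample_positions(tokens, ratio=0.15):
    # Hypothetical sketch: pick ~15% of positions uniformly at random.
    k = max(1, int(len(tokens) * ratio))
    return sorted(random.sample(range(len(tokens)), k))

tokens = "the quick brown fox jumps".split()
print(sample_positions(tokens))  # e.g. [2]
```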
This option decides how new tokens are generated at the sampled positions.

| strategy | description |
|---|---|
| dropout | Drop a token [1, 2]. |
| blank | Replace a token with a placeholder token [3]. |
| unigram | Replace a token with a sample from the unigram frequency distribution over the vocabulary [3]. Please set the option --unigram-frequency-for-generation. |
| bigramkn | Replace a token using bigram Kneser-Ney smoothing [3]. Please set the option --bigram-frequency-for-generation. |
| wordnet | Replace a token with a WordNet synonym. Please set the option --lang-for-wordnet. |
| ppdb | Replace a token with a paraphrase from a given paraphrase database. Please set the option --ppdb-file. |
| word2vec | Replace a token with a token whose word2vec vector is similar. Please set the option --w2v-file. |
| bert | Replace a token using the output probabilities of BERT masked-token prediction. Please set the option --model-name-or-path; it must be a shortcut name in Hugging Face's pytorch-transformers. Note that the option --vocab-file must match the vocabulary file of the BERT tokenizer. |
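To make the simpler strategies concrete, the sketch below shows how dropout, blank, and unigram might rewrite the sampled positions. It is a minimal illustration with assumed names, not the repository's implementation:

```python
import random

def generate(tokens, positions, strategy, unigram_dist=None, placeholder="<blank>"):
    # Hypothetical sketch of the dropout / blank / unigram strategies.
    out = []
    for i, tok in enumerate(tokens):
        if i not in positions:
            out.append(tok)
        elif strategy == "dropout":
            continue  # drop the sampled token entirely
        elif strategy == "blank":
            out.append(placeholder)  # replace with a placeholder token
        elif strategy == "unigram":
            # draw a replacement from the unigram frequency distribution
            words, freqs = zip(*unigram_dist.items())
            out.append(random.choices(words, weights=freqs)[0])
    return out

tokens = "the quick brown fox".split()
print(generate(tokens, {1}, "blank"))  # ['the', '<blank>', 'brown', 'fox']
```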
Example of generating pseudo-data with the bert strategy:

```
python generate.py \
    --input ./data/sample/sample.txt \
    --augmentation-strategy bert \
    --model-name-or-path bert-base-multilingual-uncased \
    --temparature 1.0
```
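For intuition, the masked-token prediction behind the bert strategy can be reproduced outside this repository with Hugging Face's fill-mask pipeline. The sketch below is an independent illustration using the current transformers API, not this repository's code path:

```python
from transformers import pipeline

# Load the same checkpoint used in the command above.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-uncased")

# Mask one sampled position and let BERT propose replacements.
for pred in fill_mask("the quick brown [MASK] jumps over the lazy dog")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```

Sampling a replacement from this output distribution, presumably sharpened or flattened by the temperature option, yields the pseudo-token.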