Pre-processing: Process the input corpus for ELMo training. Call the script with either --pre_process or --gen_vocab.
usage: pre_process.py [-h] [--pre_process PRE_PROCESS] [--gen_vocab GEN_VOCAB]
                      [--train_prefix TRAIN_PREFIX] [--vocab_file VOCAB_FILE]
                      [--heldout_prefix HELDOUT_PREFIX]
                      [--min_count MIN_COUNT]

optional arguments:
  -h, --help            show this help message and exit
  --pre_process PRE_PROCESS
                        The corpus to pre-process.
  --gen_vocab GEN_VOCAB
                        Only generate the vocabulary of this corpus, no
                        training data.
  --train_prefix TRAIN_PREFIX
                        Prefix for train files
  --vocab_file VOCAB_FILE
                        Vocabulary file
  --heldout_prefix HELDOUT_PREFIX
                        The path and prefix for heldout files.
  --min_count MIN_COUNT
                        The minimal count for a vocabulary item.
Example calls:
python3 bin/pre_process.py --pre_process /home/data/corpora/corpus_a \
    --train_prefix '/home/data/pre/corpus_a_training/train.corpus_a*' \
    --heldout_prefix '/home/data/pre/corpus_a_heldout/heldout.corpus_a*' \
    --vocab_file /home/data/pre/corpus_a.100.vocab \
    --min_count 100

python3 bin/pre_process.py --gen_vocab /home/data/corpora/corpus_a \
    --vocab_file /home/data/pre/corpus_a.50.vocab \
    --min_count 50
--pre_process: The corpus to pre-process.
    The corpus is split into 100 parts, which are saved under the train_prefix.
    A heldout portion (1%) is saved in a separate directory and split into 50 parts there.
    (A sketch of this splitting and vocabulary scheme follows after this list.)
--train_prefix: The file prefix for training files.
    '.$slice_number' is appended to this prefix.
--heldout_prefix: The path and file prefix for heldout files.
    '.$heldout_slice_number' is appended to this prefix.
    The first training slice is also saved into the directory given by this prefix.
--vocab_file: The file in which the vocabulary is stored.
--min_count: The minimum number of occurrences a token needs to be included in the vocabulary.
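For illustration, the following is a minimal sketch of the splitting and vocabulary scheme described above. It is not the actual pre_process.py implementation: it assumes one sentence per line and whitespace tokenization, and all function names, parameters, and defaults are placeholders.

# Illustrative sketch only -- not the actual pre_process.py implementation.
# Assumes one sentence per line and whitespace tokenization.
from collections import Counter
from itertools import cycle


def split_corpus(corpus_path, train_prefix, heldout_prefix,
                 n_train_slices=100, n_heldout_slices=50, heldout_fraction=0.01):
    """Distribute lines round-robin over 100 training slices; ~1% of lines go to 50 heldout slices."""
    train_files = [open(f"{train_prefix}.{i}", "w", encoding="utf-8") for i in range(n_train_slices)]
    heldout_files = [open(f"{heldout_prefix}.{i}", "w", encoding="utf-8") for i in range(n_heldout_slices)]
    train_cycle, heldout_cycle = cycle(train_files), cycle(heldout_files)
    heldout_every = round(1 / heldout_fraction)  # every 100th line -> ~1% heldout
    with open(corpus_path, encoding="utf-8") as corpus:
        for line_no, line in enumerate(corpus):
            target = next(heldout_cycle) if line_no % heldout_every == 0 else next(train_cycle)
            target.write(line)
    for f in train_files + heldout_files:
        f.close()


def build_vocab(corpus_path, vocab_path, min_count):
    """Write every token that occurs at least min_count times, most frequent first."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as corpus:
        for line in corpus:
            counts.update(line.split())
    with open(vocab_path, "w", encoding="utf-8") as vocab:
        # A real ELMo vocabulary typically also lists special tokens such as <S>, </S> and <UNK> first.
        for token, count in counts.most_common():
            if count >= min_count:
                vocab.write(token + "\n")

In terms of the two modes above, --pre_process corresponds to running both steps (splitting plus vocabulary), while --gen_vocab corresponds to running only the vocabulary step with the given min_count.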
Training: Train the biLM for ELMo embeddings on pre-processed data.
usage: train_elmo_n_gpus.py [-h] [--train_prefix TRAIN_PREFIX]
                            [--save_dir SAVE_DIR] [--vocab_file VOCAB_FILE]
                            [--n_tokens N_TOKENS] [--stats STATS]
                            [--use_gpus USE_GPUS] [--epochs EPOCHS]
                            [--batchsize BATCHSIZE]
                            [--pre_process PRE_PROCESS]
                            [--heldout_prefix HELDOUT_PREFIX]
                            [--min_count MIN_COUNT]

optional arguments:
  -h, --help            show this help message and exit
  --train_prefix TRAIN_PREFIX
                        Prefix for train files
  --save_dir SAVE_DIR   Location of checkpoint files
  --vocab_file VOCAB_FILE
                        Vocabulary file
  --n_tokens N_TOKENS   The number of tokens in the training files
  --stats STATS         Use a .stat file for input data statistics, like token
                        count.
  --use_gpus USE_GPUS   The number of gpus to use
  --epochs EPOCHS       The number of epochs to run
  --batchsize BATCHSIZE
                        The batchsize for each gpu
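If the total token count is not already available (e.g., from a .stat file), a value for --n_tokens can be computed by whitespace-tokenizing the pre-processed training slices. The snippet below is a minimal sketch under that assumption; the function name and the glob pattern (taken from the pre-processing example above) are placeholders.

# Minimal sketch for obtaining the value passed to --n_tokens.
# Assumes whitespace-tokenized plain-text training slices; the glob is only an example.
import glob


def count_tokens(train_glob):
    """Sum the whitespace-separated tokens over all files matching the training prefix."""
    total = 0
    for path in sorted(glob.glob(train_glob)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                total += len(line.split())
    return total


if __name__ == "__main__":
    print(count_tokens("/home/data/pre/corpus_a_training/train.corpus_a*"))

Running wc -w over the same files yields an equivalent count.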
Example calls:
python3 bin/train_elmo_n_gpus.py \
    --train_prefix '/home/public/stoeckel/data/Leipzig40MT2010_raw_training/train.Leipzig40MT2010_raw*' \
    --vocab_file /home/public/stoeckel/data/Leipzig40MT2010_lowered.100.vocab \
    --save_dir /home/public/stoeckel/models/biLM/Leipzig40MT2010_raw/ \
    --n_tokens 1093717542 \