PT2020 Transcription project.
In this repository, we explore different strategies for automatically enriching ASR transcriptions, including tasks such as automatic capitalization (truecasing) and punctuation recovery.
- Multilingual Simultaneous Sentence End and Punctuation Prediction
- Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts
- Automatic truecasing of video subtitles using BERT: a multilingual adaptable approach
To replicate our winning submission to SEPP 2021, please go to the shared-task branch.
This project uses Python >=3.6.
Create a virtual env with (outside the project folder):
virtualenv -p python3.6 caption-env

Activate venv:
source caption-env/bin/activate

Finally, run:
python setup.py install

If you wish to make changes to the code, run:
pip install -r requirements.txt
pip install -e .

Train a model with:
python caption train -f {your_config_file}.yaml

Test a trained model with:
python caption test \
    --checkpoint=some/path/to/your/checkpoint.ckpt \
    --test_csv=path/to/your/testset.csv

Launch tensorboard with:
tensorboard --logdir="experiments/lightning_logs/"

If you are running experiments on a remote server, you can forward your localhost to the server localhost.
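The port forwarding mentioned above can be sketched with a standard SSH local forward. This is a minimal sketch, not part of the project: the login `user@remote-server` is a hypothetical placeholder, and port 6006 is assumed because it is TensorBoard's default. The snippet only builds and prints the command so you can check it before running it.

```shell
# Hypothetical placeholder: replace with your own server login.
REMOTE_HOST="user@remote-server"
# TensorBoard listens on port 6006 unless started with --port.
TB_PORT=6006

# -L maps a local port to a port on the server; -N opens no remote shell.
FORWARD_CMD="ssh -N -L ${TB_PORT}:localhost:${TB_PORT} ${REMOTE_HOST}"

# Print the command for inspection before running it.
echo "${FORWARD_CMD}"
```

Running the printed command in a local terminal keeps the tunnel open while it runs; with TensorBoard started on the server, the dashboard should then be reachable at http://localhost:6006 in a local browser.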
To run the toolkit tests, run the following commands:
cd tests
python -m unittest

To make sure all the code follows the same style, we use Black.
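As an illustration of the style check, the sketch below wraps Black in a small shell function. The function name `check_style` is hypothetical and not part of this repository, and it assumes Black is installed separately (for example with `pip install black`). It uses Black's `--check` and `--diff` options, which report the changes Black would make without modifying any file.

```shell
# check_style: hypothetical helper, not part of the project.
# Reports files that Black would reformat, without rewriting them.
check_style() {
    # Fail early with a hint if Black is not available on PATH.
    if ! command -v black >/dev/null 2>&1; then
        echo "black is not installed; try: pip install black" >&2
        return 127
    fi
    # --check exits non-zero if any file would change; --diff prints the changes.
    black --check --diff "${1:-.}"
}
```

Calling `check_style` from the project root checks the whole tree; running `black .` instead applies the formatting in place.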
