Adding your data to OmniLingo takes three main steps: importing the data into IPFS, indexing the data, and publishing the data.
Import data into your local IPFS node and generate an index:
$ importer.py dataset_dir index_path
e.g.
$ importer.py ./cv-corpus-7.0-2021-07-21/tr/ tr.json
where the dataset_dir is in Common Voice format.
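Before importing, it can help to check that the directory looks like a Common Voice release. The sketch below assumes the usual release layout (a `clips/` directory next to a `validated.tsv` file). The function name `looks_like_common_voice` is hypothetical, and `importer.py` itself may require a different set of files.

```python
from pathlib import Path


def looks_like_common_voice(dataset_dir: str) -> bool:
    """Return True if dataset_dir has the assumed Common Voice layout.

    Assumption: a release directory contains a clips/ folder with the
    audio and a validated.tsv listing the validated clips.
    """
    root = Path(dataset_dir)
    return (root / "clips").is_dir() and (root / "validated.tsv").is_file()


if __name__ == "__main__":
    # Path taken from the example command above.
    print(looks_like_common_voice("./cv-corpus-7.0-2021-07-21/tr/"))
```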
Index the data, extracting a balanced subset of clips by a complexity metric:
$ indexer.py locale index_path
e.g.
$ indexer.py tr tr.json
This will return a CID that looks like QmXpgcavH2shpBbfnFoymPxEw2zpr4MdAgi1aaoZT4Yeho
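If you script these steps, you will need to pick the CID out of the indexer's output. The following is a minimal sketch that assumes the CID is a CIDv0 string like the one above (46 base58 characters starting with `Qm`) and appears somewhere in the text. The exact output format of `indexer.py` is not specified here, and `find_cid` is a hypothetical helper.

```python
import re
from typing import Optional

# CIDv0: "Qm" followed by 44 base58 characters (no 0, O, I or l).
CID_V0 = re.compile(r"\bQm[1-9A-HJ-NP-Za-km-z]{44}\b")


def find_cid(text: str) -> Optional[str]:
    """Return the first CIDv0-looking string in text, or None."""
    match = CID_V0.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(find_cid("QmXpgcavH2shpBbfnFoymPxEw2zpr4MdAgi1aaoZT4Yeho"))
```

The resulting string can then be passed as the cid argument when publishing.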
Publish data to the global index in OmniLingo on IPFS:
$ publisher.py locale cid
e.g.
$ publisher.py tr QmXpgcavH2shpBbfnFoymPxEw2zpr4MdAgi1aaoZT4Yeho
Publish to a name using the local node ID:
ipfs name publish cid
e.g.
ipfs name publish QmXpgcavH2shpBbfnFoymPxEw2zpr4MdAgi1aaoZT4Yeho
To publish model files (e.g. for the pronunciation assistance) you need a directory containing two files:
models/LOCALE.tflite: The binary for the ASR model
models/LOCALE.json: Metadata for the model
The metadata file, e.g. pt.json for Portuguese, should look like:
{"format": "coqui", "type": "acoustic", "licence": "AGPL-3.0", "src": "https://itml.cl.indiana.edu/models/"}
You can publish using:
python3 publisher.py --merge QmXMp1Dv1Sf7ZHXcH6puqbudBhDNkqngopadzcy8Qikuqt --with-model models/pt.tflite pt QmbWXcHWVdRFh3ZmXEbf4tXTk6nqp8zkaNa4aAxaeQ9VTQ
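The metadata file described above can also be generated with a short script rather than written by hand. This is a minimal sketch: the helper name `write_model_metadata` and its parameters are hypothetical, and the field values are simply those of the Portuguese example.

```python
import json
from pathlib import Path


def write_model_metadata(models_dir: str, locale: str, src: str,
                         licence: str = "AGPL-3.0") -> Path:
    """Write models_dir/LOCALE.json with the fields shown in the example."""
    metadata = {
        "format": "coqui",   # value copied from the example metadata
        "type": "acoustic",  # value copied from the example metadata
        "licence": licence,
        "src": src,
    }
    path = Path(models_dir) / f"{locale}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metadata) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    write_model_metadata("models", "pt", "https://itml.cl.indiana.edu/models/")
```

Place the matching models/LOCALE.tflite file in the same directory before publishing.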