BAYƐLƐMABAGA: Parallel French - Bambara Dataset for Machine Learning

Overview

The Bayelemabaga dataset is a collection of 44562 aligned, machine-translation-ready Bambara-French lines originating from the Corpus Bambara de Reference. The dataset consists of text extracted from 264 text files, ranging from periodicals, books, short stories, and blog posts to parts of the Bible and the Quran.

Snapshot: 46976


Lines	46976
French Tokens (spacy)	691312
Bambara Tokens (daba)	660732
French Types	32018
Bambara Types	29382
Avg. Fr line length	77.6
Avg. Bam line length	61.69
Number of text sources	264
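As a rough illustration of how figures like these can be derived, the sketch below computes line, token, and type counts and average line lengths from a list of aligned pairs. It is only an approximation under stated assumptions: the table's token counts come from spacy (French) and daba (Bambara), whereas this sketch uses plain whitespace splitting, and it assumes the average line lengths are measured in characters. The function name `corpus_stats` and the dictionary keys are hypothetical.

```python
def corpus_stats(pairs):
    """Compute simple corpus statistics for aligned (french, bambara) line pairs.

    Assumption: whitespace tokenization stands in for spacy/daba, so the
    numbers will not match the table above exactly.
    """
    fr_tokens, bam_tokens = [], []
    fr_chars = bam_chars = n_lines = 0
    for fr, bam in pairs:
        n_lines += 1
        fr_tokens.extend(fr.split())
        bam_tokens.extend(bam.split())
        fr_chars += len(fr)    # line length assumed to be in characters
        bam_chars += len(bam)
    return {
        "lines": n_lines,
        "fr_tokens": len(fr_tokens),
        "bam_tokens": len(bam_tokens),
        "fr_types": len(set(fr_tokens)),    # distinct surface forms
        "bam_types": len(set(bam_tokens)),
        "avg_fr_len": fr_chars / n_lines if n_lines else 0.0,
        "avg_bam_len": bam_chars / n_lines if n_lines else 0.0,
    }
```

Calling `corpus_stats(pairs)` on a list of `(french_line, bambara_line)` tuples returns a dictionary with one entry per statistic.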

Data Splits


Train	80%	37580
Valid	10%	4698
Test	10%	4698
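The split sizes above can be checked with a little arithmetic. The sketch below rounds 10% of the snapshot to the nearest line for the validation and test sets and gives the remainder to training, which reproduces the counts in the table. The README does not describe the actual splitting procedure, so the rounding rule, the shuffling with a fixed seed, and the names `split_sizes` and `split_pairs` are all assumptions made for illustration.

```python
import random


def split_sizes(n_lines):
    """Return (train, valid, test) sizes for an 80/10/10 split."""
    # Round 10% to the nearest integer (half up) using integer arithmetic.
    valid = (n_lines + 5) // 10
    test = (n_lines + 5) // 10
    train = n_lines - valid - test  # remainder goes to training
    return train, valid, test


def split_pairs(pairs, seed=0):
    """Shuffle aligned pairs with a fixed seed and slice them 80/10/10.

    Assumption: a seeded random shuffle; the original procedure is unknown.
    """
    items = list(pairs)
    random.Random(seed).shuffle(items)
    train_n, valid_n, _ = split_sizes(len(items))
    train = items[:train_n]
    valid = items[train_n:train_n + valid_n]
    test = items[train_n + valid_n:]
    return train, valid, test


print(split_sizes(46976))  # (37580, 4698, 4698)
```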

Remarks

We are working on resolving some last-minute misalignment issues.

Maintenance

This dataset is intended to be actively maintained.

Benchmarks:

Coming soon

Sources

sources

To note:

ʃ => (sh/shy) sound: this symbol has been left in the dataset, although it is part of neither Bambara nor French orthography.
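Users who want to inspect or normalize the ʃ symbol before training could start from a sketch like the one below. It locates lines that contain the character (U+0283, LATIN SMALL LETTER ESH) and optionally replaces it. The helper names and the default replacement "sh" are assumptions for illustration, not part of the dataset tooling; whether to replace the symbol at all depends on the downstream task.

```python
ESH = "\u0283"  # ʃ, LATIN SMALL LETTER ESH


def lines_with_esh(lines):
    """Return the indices of lines that contain the ʃ symbol."""
    return [i for i, line in enumerate(lines) if ESH in line]


def replace_esh(line, replacement="sh"):
    """Replace every ʃ in a line; the default 'sh' is an assumed choice."""
    return line.replace(ESH, replacement)
```

Only the lowercase form is handled here; an uppercase variant, if present, would need a separate check.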

License

CC-BY-SA-4.0

Version

1.0.1

Citation

@misc{bayelemabagamldataset2022,
    title={Machine Learning Dataset Development for Manding Languages},
    author={
        Valentin Vydrin and
        Jean-Jacques Meric and
        Kirill Maslinsky and
        Andrij Rovenchak and
        Allahsera Auguste Tapo and
        Sebastien Diarra and
        Christopher Homan and
        Marco Zampieri and
        Michael Leventhal
    },
    howpublished = {\url{https://github.com/robotsmali-ai/datasets}},
    year={2022}
}

Contacts

sdiarra <at> robotsmali.org
aat3261 <at> rit.edu
