GitHub - maslinych/corbama-build: Build infrastructure for Bambara Reference Corpus

Bambara Reference Corpus Build Infrastructure

This repository binds together all procedures required to build corpus indexes for the Bambara Reference Corpus.

Build process overview

Clone this repository
Install tools
Clone repositories with corpus resources
cd corbama-build
run make

Get tools

The build process depends on the following tools:

GNU Make and UNIX command line environment (bash, coreutils, sed, awk etc.)
NoSketchEngine: you'll need the manatee-open package.
Daba. Clone this repo into the same directory where corbama-build resides.

Get corpus resources

Corpus resources are corpus source files and dictionaries. By default, all resources are expected to reside in the parent directory of the corbama-build copy.

corbama: a corpus repository (private, not shown)
bamadaba: a lexical database; clone it from GitHub.
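
As a quick sanity check before building, the expected layout can be verified with a small shell function. This is only a sketch: the name check_layout is hypothetical and not part of corbama-build, and the directory names are those listed in this README (daba is included because the Get tools section asks for it to sit next to corbama-build).

```shell
# check_layout ROOT: verify that the directories this README expects
# exist side by side under ROOT. (check_layout is a hypothetical helper,
# not part of corbama-build.)
check_layout() {
    root="$1"
    missing=""
    for d in bamadaba corbama corbama-build daba; do
        [ -d "$root/$d" ] || missing="$missing $d"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "layout ok"
}
```

From inside corbama-build, running check_layout .. prints "layout ok" or lists the directories that are missing.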

Run build procedure

Provided that you have the following directory structure,

./
	bamadaba/
	corbama/
	corbama-build/
	daba/

simply run:

$ cd corbama-build
$ make makedirs
$ make resources
$ make compile

The process is time-consuming. It can be sped up with the make -jN option, where N is the number of processors/cores available for a parallel build.
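
For example, on a system with GNU coreutils the core count can be taken from nproc rather than typed by hand. A minimal sketch: the JOBS variable name is an assumption of this example, and the fallback to 1 is just a conservative default.

```shell
# Use the number of available processing units as the job count;
# fall back to a single job if nproc is unavailable.
JOBS="$(nproc 2>/dev/null || echo 1)"

# Print the resulting command; drop the leading "echo" to actually run the build.
echo make -j"$JOBS" compile
```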

Publishing a corpus

To publish a corpus online, you'll need a server set up to run bonito-open, NoSketchEngine's web interface, in a chrooted environment (for details, see the Corpus server setup section below).

Two environments are used for publishing:

testing — for initial upload and experiments;
production — for a stable corpus serving users.

NB: Several corpora may share a single environment (this applies to both testing and production). In that case, make sure that your operations on one corpus do not overwrite other corpora in the same environment.

The following steps need to be performed:

Create a testing environment (if not created before for some other corpus):

$ make create-testing

Set up the bonito-open config files for your corpus:

$ make setup-bonito

Upload your corpus to the testing environment:

$ make install-testing

Start the web server in the testing environment and examine the corpus:

$ make start-testing

When ready, transform the testing environment into production. Remember that if your corpus is not alone in the environment, you must upload all other corpora into the same testing environment prior to performing this step.

$ make production
$ make start-production
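
Because several corpora may share an environment, it can be useful to preview what a target would do before running it. GNU Make's -n (--dry-run) option prints the commands without executing them. A minimal sketch: preview_target is a hypothetical helper name, and note that recipe lines prefixed with + or containing $(MAKE) are still executed by make even under -n.

```shell
# preview_target TARGET...: show the commands make would run for the given
# targets without executing them (hypothetical helper).
preview_target() {
    make -n "$@"
}
```

For instance, preview_target install-testing lists the upload commands so they can be inspected before anything is written to the testing environment.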

Corpus files that are built

Corbama (BRC corpus short name) is subdivided into two subcorpora:

Manually disambiguated subcorpus (corbama-net* files), approx. 0.35M words
Full subcorpus, including disambiguated and non-disambiguated parts (corbama-brut and corbama-nul), approx. 2.1M words.

Both subcorpora come in two variants which differ in the orthography and the amount of tonal marking represented.

corbama-net-non-tonal.vert : Disambiguated subcorpus, tones absent (as in source texts).
corbama-net-tonal.vert : Disambiguated subcorpus, tones automatically added on word and lemma fields.
corbama-brut.vert : Full subcorpus (with non-disambiguated part), tones absent.
corbama-nul.vert : Full subcorpus with simplified orthography (open vowels replaced with their closed counterparts o, e), tones absent.

Annotation scheme

Corpus sources are compiled into the vertical format required by SketchEngine. The fields and structures are documented in the config files for the corresponding corpora (included in the config subdirectory). For the format of config files and the vertical file format, see the SketchEngine docs: http://www.sketchengine.co.uk/documentation/wiki/SkE/PreparingCorpusOverview

Short notes on the semantics of the fields:

word : normalized word form (new Latin Bambara orthography, automatically added tones when applicable).
lemma : automatically generated lemma, also in normalized orthography and with tones where applicable. In non-disambiguated texts, contains all possible interpretations of the word form provided by the rule-based parser Daba, given as alternative lemmas.
tag : part-of-speech tag (one or more) plus grammatical tags of the derivative morphemes.
gloss : French or standardized gloss. In non-disambiguated texts, a list of possible variants.
parts : for derivative composite words, a list of constituent stems.
original : original word form as it appears in the text (not normalized).
tonal : in non-tonal variants, the form with automatically added tones.
polisemy : for polysemous words, alternative glosses.
tagstring : structure of the part-of-speech (ps) tags on the source Gloss object.
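
As an illustration of working with the resulting files, the sketch below counts tokens in a vertical file and lists the most frequent values of the first column. It assumes the usual vertical layout (one token per line, tab-separated attributes, structure tags on lines starting with <) and assumes that word is the first column; the actual attribute order is defined in the config files. The function names are hypothetical.

```shell
# count_tokens FILE: number of token lines, i.e. non-empty lines
# that are not structure tags such as <doc ...> or </s>.
count_tokens() {
    awk '!/^</ && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# top_words FILE [N]: the N most frequent values of the first column
# (assumed here to be the word attribute), most frequent first.
top_words() {
    awk -F'\t' '!/^</ && NF > 0 { print $1 }' "$1" |
        sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

For example, count_tokens corbama-net-tonal.vert gives a rough check against the approximate subcorpus sizes listed above.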

Corpus server setup

To run an online corpus search engine you'll need:

A Sisyphus-based GNU/Linux operating system on the server.
The packages hasher, hasher-priv, and tmux installed on the server.
A regular user on the server with ssh access, set up to run two parallel hasher processes. To enable the latter, run as root:

$ hasher-useradd <your_user>
$ hasher-useradd --number=1 <your_user>

A web server: nginx (recommended) or any other, for proxying HTTP requests to the corpora that will reside in chrooted environments. You will need to set up a proxy for requests to testing and production. The ports on which the corresponding web servers listen are set in the Makefile variables TESTPORT and PRODPORT.
On your local machine, set up ssh access credentials for the corpus user on the server in .ssh/config and put the corresponding host name in the Makefile variable HOST (the default host name is corpora).

You're done. Everything else is handled by the Makefile commands listed in the Publishing a corpus section of this README.
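
If you prefer not to edit the Makefile to change HOST, GNU Make also lets a variable be set on the command line, which takes precedence over an ordinary assignment in the makefile. A minimal sketch, under the assumption that HOST is assigned in the Makefile without the override directive; make_on_host is a hypothetical helper and the alias in the usage line is an example value.

```shell
# make_on_host HOST TARGET...: run make targets with HOST set on the command
# line instead of in the Makefile (hypothetical helper).
make_on_host() {
    host="$1"
    shift
    make HOST="$host" "$@"
}
```

For instance, make_on_host corpora-test install-testing would run the install-testing target against an ssh alias named corpora-test.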

