This project follows the layered architecture in
planning/06_corpus_architecture_and_class_dependencies.md.
- parse_layer/
- corpus_shared_layer/
- corpus_representation_layer/
- corpus_search_layer/
- corpus_computing_engine/
- orchestration_layer/
- UDPipe source in ./udpipe
- UDPipe model in ./models/english-partut-ud-2.5-191206.udpipe
- SQLite3 development library
This repository does not include udpipe/ source or model binaries.
- UDPipe source code:
  - https://github.com/ufal/udpipe
  - Clone/copy into ./udpipe
- Example English UDPipe model:
  - https://lindat.mff.cuni.cz/repository/items/41f05304-629f-4313-b9cf-9eeb0a2ca7c6
  - Place the model file at: ./models/english-partut-ud-2.5-191206.udpipe
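Before building, it can help to confirm that both paths listed above are in place. The following is a minimal sketch, not part of the repository: `check_prereqs` is a hypothetical helper name, and it only checks the two paths named in this README (it does not check for the SQLite3 development library).

```shell
# Hypothetical helper: report prerequisite paths missing under a root directory.
# Returns 0 when both the UDPipe source directory and the model file exist.
check_prereqs() {
  local root="${1:-.}"
  local missing=0
  if [ ! -d "$root/udpipe" ]; then
    echo "missing: $root/udpipe (UDPipe source)"
    missing=1
  fi
  if [ ! -f "$root/models/english-partut-ud-2.5-191206.udpipe" ]; then
    echo "missing: $root/models/english-partut-ud-2.5-191206.udpipe (UDPipe model)"
    missing=1
  fi
  return "$missing"
}
```

Calling `check_prereqs .` from the repository root prints one line per missing item and returns a non-zero status if anything is absent.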
    ./tools/build.sh

This script builds UDPipe (udpipe/src/libudpipe.a) if needed and then
builds corpus_build_pipeline.
    ./corpus_build_pipeline \
      models/english-partut-ud-2.5-191206.udpipe \
      _demo_corpus \
      corpus_output

Outputs include core token binaries, structure files, dictionaries,
indexes, docfreq, and sparse-matrix files in corpus_output/.
The pipeline now emits:
- Core token binaries: word.bin, lemma.bin, pos.bin, head.bin, deprel.bin
- Structure binaries: sentence_bounds.bin, doc_ranges.bin, word_doc.bin
- Dictionaries: word.lexicon.bin, lemma.lexicon.bin, pos.lexicon.bin, deprel.lexicon.bin
- Search/docfreq families: word, lemma, 2gram, 3gram, 4gram
- Sparse matrices: word.spm.*, lemma.spm.*, 2gram.spm.*, 3gram.spm.*, 4gram.spm.*
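A quick way to sanity-check a run is to look for the fixed-name files listed above. This is a sketch under assumptions: `missing_outputs` is a hypothetical helper, and it assumes the files sit directly in the output directory under exactly these names. It covers only the core, structure, and dictionary files, because the exact names of the index and sparse-matrix files are not spelled out here.

```shell
# Hypothetical helper: print the expected fixed-name output files that are absent.
# Returns 0 only when every listed file exists in the given directory.
missing_outputs() {
  local dir="$1" name status=0
  for name in \
    word.bin lemma.bin pos.bin head.bin deprel.bin \
    sentence_bounds.bin doc_ranges.bin word_doc.bin \
    word.lexicon.bin lemma.lexicon.bin pos.lexicon.bin deprel.lexicon.bin
  do
    if [ ! -f "$dir/$name" ]; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}
```

For example, `missing_outputs corpus_output` prints nothing and returns 0 when all twelve files are present.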
Word-based 2gram, 3gram, and 4gram artifacts are generated from
word.bin, not lemma.bin.
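To make "word-based" concrete: the n-grams are sequences of surface forms (for example "cats were"), not of lemmas ("cat be"). The sketch below illustrates the idea on plain text tokens only. `word_ngrams` is a hypothetical helper, it does not read word.bin or any other binary file, and treating each input line as a sentence that n-grams do not cross is an assumption, not a statement about how the pipeline handles sentence boundaries.

```shell
# Hypothetical illustration: print word n-grams of size $1 from stdin.
# Tokens are whitespace-separated; each input line is handled independently.
word_ngrams() {
  awk -v n="$1" '{
    # Slide a window of n tokens across the line.
    for (i = 1; i + n - 1 <= NF; i++) {
      gram = $i
      for (j = 1; j < n; j++) gram = gram " " $(i + j)
      print gram
    }
  }'
}
```

For instance, `printf 'the cats were running\n' | word_ngrams 2` prints "the cats", "cats were", and "were running", one per line.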
Optional semantic mapping rules can be supplied as a 4th argument:

    ./corpus_build_pipeline \
      models/english-partut-ud-2.5-191206.udpipe \
      _demo_corpus \
      corpus_output \
      /path/to/semantic_rules.tsv

When semantic rules are provided, metadata and binary filter artifacts
for document groups are generated (for example: semantic.key.lexicon.bin,
semantic.value.lexicon.bin, semantic.value_doc.*,
semantic.doc_groups.*).
Optional output controls:
- 5th CLI arg: posting/docfreq format, raw or compressed
- 6th CLI arg: emit n-gram positions; any value except false enables it
Example:

    ./corpus_build_pipeline \
      models/english-partut-ud-2.5-191206.udpipe \
      _demo_corpus \
      corpus_output \
      "" \
      compressed \
      true

The JSON mode also accepts:
- postingFormat: "raw" or "compressed"
- emitNgramPositions: true or false
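Because the CLI arguments are positional, the later ones need a placeholder for any earlier one that is skipped, which is why the example passes "" as the 4th argument. The sketch below assembles the argument list with that rule in mind. `pipeline_args` is a hypothetical helper, not something shipped with the repository. It rejects formats other than raw and compressed, and it refuses a 6th argument without a 5th rather than guessing a default format.

```shell
# Hypothetical helper: print corpus_build_pipeline arguments, one per line.
# Usage: pipeline_args MODEL INPUT_DIR OUTPUT_DIR [RULES_TSV] [FORMAT] [EMIT_POSITIONS]
pipeline_args() {
  local model="$1" input="$2" output="$3"
  local rules="${4:-}" format="${5:-}" positions="${6:-}"
  case "$format" in
    ""|raw|compressed) ;;
    *) echo "unknown posting format: $format" >&2; return 2 ;;
  esac
  if [ -n "$positions" ] && [ -z "$format" ]; then
    echo "a posting format is required before the 6th argument" >&2
    return 2
  fi
  set -- "$model" "$input" "$output"
  # The rules slot must be filled (possibly with "") whenever a later argument follows.
  if [ -n "$rules" ] || [ -n "$format" ]; then
    set -- "$@" "$rules"
  fi
  if [ -n "$format" ]; then
    set -- "$@" "$format"
  fi
  if [ -n "$positions" ]; then
    set -- "$@" "$positions"
  fi
  printf '%s\n' "$@"
}
```

Printing one argument per line keeps the empty placeholder visible as a blank line, which makes it easy to inspect the list before passing it to the real command.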
Run all checks manually:

    ./tools/run_ci_checks.sh

Install local git hooks once per clone:

    ./tools/install_git_hooks.sh

This sets core.hooksPath=.githooks and enables the starter hook set:
pre-commit, pre-push, commit-msg, and post-merge.
Optional heavier checks in hooks can be enabled with:
- RUN_SMOKE_CHECKS=1 for a pre-commit smoke pipeline run
- RUN_INTEGRATION_CHECKS=1 for a pre-push integration pipeline run
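These variables are read from the environment, so they can be set for a single command, for example `RUN_SMOKE_CHECKS=1 git commit`. The sketch below shows one way a hook could gate on such a flag. It is an illustration only: `heavy_check_enabled` is a hypothetical name, and treating only the exact value 1 as enabled is an assumption about the hooks, whose contents are not shown here.

```shell
# Hypothetical gate: succeed only when the given value is exactly "1".
heavy_check_enabled() {
  # $1 is the current value of the controlling variable (may be empty).
  [ "${1:-}" = "1" ]
}

# Example use inside a hook script: decide whether to run the smoke checks.
if heavy_check_enabled "${RUN_SMOKE_CHECKS:-}"; then
  echo "smoke checks: enabled"
else
  echo "smoke checks: skipped"
fi
```

With this gate, leaving the variable unset or setting it to any other value skips the heavier run.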