Tibetan analysis plugin for OpenSearch and Elasticsearch, mostly exposing the features of lucene-bo. Created by the Buddhist Digital Resource Center.
Make sure you have Java 21 installed, then run:
mvn package
This will produce zip files that you can use as plugins for OpenSearch or Elasticsearch:
opensearch/target/releases/opensearch-analysis-tibetan-*.zip
elasticsearch/target/releases/elasticsearch-analysis-tibetan-*.zip
Then install the resulting zip on each node:
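A sketch using the standard plugin CLI, assuming a local build (the zip path is illustrative; substitute the version you built, and restart the node afterwards):

bin/opensearch-plugin install file:///path/to/opensearch-analysis-tibetan-<version>.zip

or, on Elasticsearch:

bin/elasticsearch-plugin install file:///path/to/elasticsearch-analysis-tibetan-<version>.zip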
Optionally, copy lucene-bo's synonyms.txt into a directory where OpenSearch / Elasticsearch will find it (e.g. /etc/opensearch/tibetan-synonyms.txt); the synonyms_path setting used below is resolved relative to the configuration directory.
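For example, on a package-based Linux install (the source path of the synonyms file is an assumption, adjust to where you checked out lucene-bo):

cp lucene-bo/synonyms.txt /etc/opensearch/tibetan-synonyms.txt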
You can test the plugin with a simple example:
POST _analyze
{
  "tokenizer" : "tibetan",
  "filter" : ["tibetan"],
  "char_filter" : ["tibetan"],
  "text" : "ཀ་ཁཱ་ག"
}
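If the plugin is installed correctly, the response has the usual _analyze shape, roughly one token per syllable; the exact token forms, offsets, and types depend on the lucene-bo version, so treat this as illustrative rather than exact output:

{
  "tokens" : [
    { "token" : "ཀ", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0 },
    ...
  ]
}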
You can then configure the plugin as you see fit, for example:
PUT /tibetantest/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "tibetan-lenient": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-lenient", "tibetan-synonyms" ],
          "char_filter": [ "tibetan-lenient" ]
        },
        "tibetan-ewts-lenient": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-lenient", "tibetan-synonyms" ],
          "char_filter": [ "tibetan-ewts-lenient" ]
        },
        "tibetan-phonetic": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-lenient", "tibetan-for-tibetan-phonetic" ],
          "char_filter": [ "tibetan-lenient" ]
        },
        "tibetan-for-english-phonetic": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-for-english-phonetic" ],
          "char_filter": [ "tibetan-lenient" ]
        },
        "ewts-phonetic": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-lenient", "tibetan-for-tibetan-phonetic" ],
          "char_filter": [ "tibetan-ewts-lenient" ]
        },
        "ewts-for-english-phonetic": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-for-english-phonetic" ],
          "char_filter": [ "tibetan-ewts-lenient" ]
        },
        "tibetan-english-phonetic": {
          "tokenizer": "tibetan-english-phonetic",
          "char_filter": [ "tibetan-english-phonetic" ]
        }
      },
      "filter": {
        "tibetan-lenient": {
          "type": "tibetan",
          "remove_affixes": true,
          "normalize_paba": true
        },
        "tibetan-for-english-phonetic": {
          "type": "tibetan-for-english-phonetic"
        },
        "tibetan-for-tibetan-phonetic": {
          "type": "tibetan-for-tibetan-phonetic"
        },
        "tibetan-synonyms": {
          "type": "synonym_graph",
          "synonyms_path": "tibetan-synonyms.txt"
        }
      },
      "char_filter": {
        "tibetan-lenient": {
          "type": "tibetan",
          "lenient": true
        },
        "tibetan-english-phonetic": {
          "type": "tibetan-english-phonetic"
        },
        "tibetan-ewts-lenient": {
          "type": "tibetan",
          "lenient": true,
          "input_method": "ewts"
        }
      }
    }
  }
}
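With the index created, you would typically attach one of these analyzers to a text field in its mapping; a minimal sketch, where the field name text_bo and the choice of analyzer are illustrative:

PUT /tibetantest/_mapping
{
  "properties": {
    "text_bo": {
      "type": "text",
      "analyzer": "tibetan-lenient",
      "search_analyzer": "tibetan-lenient"
    }
  }
}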
And you can then test the lenient support (note that ཁཱ is normalized to ཁ):
POST /tibetantest/_analyze
{
  "tokenizer" : "tibetan",
  "filter" : ["tibetan-lenient", "tibetan-synonyms"],
  "char_filter" : ["tibetan-lenient"],
  "text" : "ཀ་ཁཱ་ག་ཀླད་ཀོར"
}
and the EWTS transliteration support:
POST /tibetantest/_analyze
{
  "tokenizer" : "tibetan",
  "filter" : ["tibetan-lenient", "tibetan-synonyms"],
  "char_filter" : ["tibetan-ewts-lenient"],
  "text" : "ka khA ga klad kor"
}
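Assuming the illustrative text_bo field from the mapping sketch above, you can also override the search-time analyzer in a match query, so that EWTS input matches documents indexed in Tibetan script; a hedged example:

POST /tibetantest/_search
{
  "query" : {
    "match" : {
      "text_bo" : {
        "query" : "ka khA ga",
        "analyzer" : "tibetan-ewts-lenient"
      }
    }
  }
}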