Tibetan analysis plugin for OpenSearch and Elasticsearch, mostly exposing the features of lucene-bo. Created by the Buddhist Digital Resource Center.
Make sure you have Java 21 installed, then run:
mvn package
This will produce zip files that you can use as plugins for OpenSearch or Elasticsearch:
opensearch/target/releases/opensearch-analysis-tibetan-*.zip
elasticsearch/target/releases/elasticsearch-analysis-tibetan-*.zip
Then install the resulting zip on each node:
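A sketch using the standard plugin CLI, assuming a local build (the zip path is illustrative; substitute the version you built, and restart the node afterwards):

bin/opensearch-plugin install file:///path/to/opensearch-analysis-tibetan-<version>.zip

or, on Elasticsearch:

bin/elasticsearch-plugin install file:///path/to/elasticsearch-analysis-tibetan-<version>.zip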
Optionally, copy lucene-bo's synonyms.txt into a directory where OpenSearch / Elasticsearch will find it (e.g. /etc/opensearch/tibetan-synonyms.txt); the synonyms_path setting used below is resolved relative to the configuration directory.
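For example, on a package-based Linux install (the source path of the synonyms file is an assumption, adjust to where you checked out lucene-bo):

cp lucene-bo/synonyms.txt /etc/opensearch/tibetan-synonyms.txt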
You can test the plugin with a simple example:
POST _analyze
{
  "tokenizer" : "tibetan",
  "filter" : ["tibetan"],
  "char_filter" : ["tibetan"],
  "text" : "ཀ་ཁཱ་ག"
}
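If the plugin is installed correctly, the response has the usual _analyze shape, roughly one token per syllable; the exact token forms, offsets, and types depend on the lucene-bo version, so treat this as illustrative rather than exact output:

{
  "tokens" : [
    { "token" : "ཀ", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0 },
    ...
  ]
}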
You can then configure the plugin as you see fit, for example:
PUT /tibetantest/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "tibetan-lenient": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-lenient", "tibetan-synonyms" ],
          "char_filter": [ "tibetan-lenient" ]
        },
        "tibetan-ewts-lenient": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-lenient", "tibetan-synonyms" ],
          "char_filter": [ "tibetan-ewts-lenient" ]
        },
        "tibetan-phonetic": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-lenient", "tibetan-for-tibetan-phonetic" ],
          "char_filter": [ "tibetan-lenient" ]
        },
        "tibetan-for-english-phonetic": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-for-english-phonetic" ],
          "char_filter": [ "tibetan-lenient" ]
        },
        "ewts-phonetic": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-lenient", "tibetan-for-tibetan-phonetic" ],
          "char_filter": [ "tibetan-ewts-lenient" ]
        },
        "ewts-for-english-phonetic": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-for-english-phonetic" ],
          "char_filter": [ "tibetan-ewts-lenient" ]
        },
        "tibetan-english-phonetic": {
          "tokenizer": "tibetan-english-phonetic",
          "char_filter": [ "tibetan-english-phonetic" ]
        }
      },
      "filter": {
        "tibetan-lenient": {
          "type": "tibetan",
          "remove_affixes": true,
          "normalize_paba": true
        },
        "tibetan-for-english-phonetic": {
          "type": "tibetan-for-english-phonetic"
        },
        "tibetan-for-tibetan-phonetic": {
          "type": "tibetan-for-tibetan-phonetic"
        },
        "tibetan-synonyms": {
          "type": "synonym_graph",
          "synonyms_path": "tibetan-synonyms.txt"
        }
      },
      "char_filter": {
        "tibetan-lenient": {
          "type": "tibetan",
          "lenient": true
        },
        "tibetan-english-phonetic": {
          "type": "tibetan-english-phonetic"
        },
        "tibetan-ewts-lenient": {
          "type": "tibetan",
          "lenient": true,
          "input_method": "ewts"
        }
      }
    }
  }
}
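With the index created, you would typically attach one of these analyzers to a text field in its mapping; a minimal sketch, where the field name text_bo and the choice of analyzer are illustrative:

PUT /tibetantest/_mapping
{
  "properties": {
    "text_bo": {
      "type": "text",
      "analyzer": "tibetan-lenient",
      "search_analyzer": "tibetan-lenient"
    }
  }
}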
And you can then test the lenient support (note that ཁཱ is normalized to ཁ):
POST /tibetantest/_analyze
{
  "tokenizer" : "tibetan",
  "filter" : ["tibetan-lenient", "tibetan-synonyms"],
  "char_filter" : ["tibetan-lenient"],
  "text" : "ཀ་ཁཱ་ག་ཀླད་ཀོར"
}
and the EWTS transliteration support:
POST /tibetantest/_analyze
{
  "tokenizer" : "tibetan",
  "filter" : ["tibetan-lenient", "tibetan-synonyms"],
  "char_filter" : ["tibetan-ewts-lenient"],
  "text" : "ka khA ga klad kor"
}
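Assuming the illustrative text_bo field from the mapping sketch above, you can also override the search-time analyzer in a match query, so that EWTS input matches documents indexed in Tibetan script; a hedged example:

POST /tibetantest/_search
{
  "query" : {
    "match" : {
      "text_bo" : {
        "query" : "ka khA ga",
        "analyzer" : "tibetan-ewts-lenient"
      }
    }
  }
}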