
analysis-tibetan

Tibetan analysis plugin for OpenSearch and Elasticsearch, mostly exposing the features of lucene-bo. Created by the Buddhist Digital Resource Center.

Compile

Make sure you have Java 21 installed, then run:

mvn package

This will produce zip files that you can use as plugins for OpenSearch or Elasticsearch:

opensearch/target/releases/opensearch-analysis-tibetan-*.zip
elasticsearch/target/releases/elasticsearch-analysis-tibetan-*.zip

You can then install the corresponding zip file with the standard `opensearch-plugin install` / `elasticsearch-plugin install` command.

Getting started

Optionally, copy Lucene-bo's synonyms.txt into a directory where OpenSearch / ElasticSearch will find it (e.g. /etc/opensearch/tibetan-synonyms.txt).
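Put together, the installation steps might look like the following on an OpenSearch node. The paths and version number below are placeholders, not the actual release artifacts; adjust them to your setup:

```shell
# Install the plugin zip produced by `mvn package`
# (the path and version here are examples, not a published release).
sudo /usr/share/opensearch/bin/opensearch-plugin install \
    file:///path/to/opensearch-analysis-tibetan-1.0.0.zip

# Optionally make lucene-bo's synonyms file available to the cluster.
sudo cp synonyms.txt /etc/opensearch/tibetan-synonyms.txt

# Restart the node so the new plugin is picked up.
sudo systemctl restart opensearch
```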

You can test the plugin with a simple example:

POST _analyze
{
  "tokenizer" : "tibetan",
  "filter" : ["tibetan"],
  "char_filter" : ["tibetan"],
  "text" : "ཀ་ཁཱ་ག"
}
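If you are not using a console like OpenSearch Dashboards or Kibana, the same request can be sent with curl. This sketch assumes a node listening on localhost:9200 without authentication:

```shell
# Send the _analyze request directly to the cluster's REST API.
curl -X POST "http://localhost:9200/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{
    "tokenizer": "tibetan",
    "filter": ["tibetan"],
    "char_filter": ["tibetan"],
    "text": "ཀ་ཁཱ་ག"
  }'
```

The response lists the tokens produced by the analysis chain.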

You can then configure the plugin as you see fit, for example:

PUT /tibetantest/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "tibetan-lenient": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-lenient", "tibetan-synonyms" ],
          "char_filter": [ "tibetan-lenient" ]
        },
        "tibetan-ewts-lenient": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-lenient", "tibetan-synonyms" ],
          "char_filter": [ "tibetan-ewts-lenient" ]
        },
        "tibetan-phonetic": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-lenient", "tibetan-for-tibetan-phonetic" ],
          "char_filter": [ "tibetan-lenient" ]
        },
        "tibetan-for-english-phonetic": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-for-english-phonetic" ],
          "char_filter": [ "tibetan-lenient" ]
        },
        "ewts-phonetic": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-lenient", "tibetan-for-tibetan-phonetic" ],
          "char_filter": [ "tibetan-ewts-lenient" ]
        },
        "ewts-for-english-phonetic": {
          "tokenizer": "tibetan",
          "filter": [ "tibetan-for-english-phonetic" ],
          "char_filter": [ "tibetan-ewts-lenient" ]
        },
        "tibetan-english-phonetic": {
          "tokenizer": "tibetan-english-phonetic",
          "char_filter": [ "tibetan-english-phonetic" ]
        }
      },
      "filter": {
        "tibetan-lenient": {
          "type": "tibetan",
          "remove_affixes": true,
          "normalize_paba": true
        },
        "tibetan-for-english-phonetic": {
          "type": "tibetan-for-english-phonetic"
        },
        "tibetan-for-tibetan-phonetic": {
          "type": "tibetan-for-tibetan-phonetic"
        },
        "tibetan-synonyms": {
          "type": "synonym_graph",
          "synonyms_path": "tibetan-synonyms.txt"
        }
      },
      "char_filter": {
        "tibetan-lenient": {
          "type": "tibetan",
          "lenient": true
        },
        "tibetan-english-phonetic": {
          "type": "tibetan-english-phonetic"
        },
        "tibetan-ewts-lenient": {
          "type": "tibetan",
          "lenient": true,
          "input_method": "ewts"
        }
      }
    }
  }
}
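Once the index exists, the analyzers can be attached to fields in a mapping. A minimal sketch, with hypothetical field names chosen for illustration:

```
PUT /tibetantest/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "tibetan-lenient"
    },
    "title_ewts": {
      "type": "text",
      "analyzer": "tibetan-ewts-lenient"
    }
  }
}
```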

And you can then test the lenient support (note that ཁཱ is normalized into ཁ):

POST /tibetantest/_analyze
{
  "tokenizer" : "tibetan",
  "filter" : ["tibetan-lenient", "tibetan-synonyms"],
  "char_filter" : ["tibetan-lenient"],
  "text" : "ཀ་ཁཱ་ག་ཀླད་ཀོར"
}

and the transliteration support:

POST /tibetantest/_analyze
{
  "tokenizer" : "tibetan",
  "filter" : ["tibetan-lenient", "tibetan-synonyms"],
  "char_filter" : ["tibetan-ewts-lenient"],
  "text" : "ka khA ga klad kor"
}
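At query time, one of the configured analyzers can be selected per query, so EWTS input can match documents indexed in Tibetan script. A sketch, assuming a hypothetical text field named `title` analyzed with `tibetan-lenient`:

```
POST /tibetantest/_search
{
  "query": {
    "match": {
      "title": {
        "query": "klad kor",
        "analyzer": "tibetan-ewts-lenient"
      }
    }
  }
}
```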

Sources of inspiration and doc
