A collection of Thai NLP libraries, dictionaries, and corpora. Pull requests are always welcome.
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| TCC | Thai Character Cluster | C | | | Thanaruk et al. |
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Swath | SWATH (Smart Word Analysis for THai) is a word segmentation tool for Thai | C | Longest Matching, Maximal Matching, and Part-of-Speech Bigram | GPL | CMU |
| Lexto | Lexto: Thai Lexeme Tokenizer | Java | | LGPL | NECTEC |
| | | Python 2 | | LGPL | Python2 Wrapper |
| | | Python 3 | | LGPL | Python3 Wrapper |
| Wordcut | Thai word breaker for Node.js | JavaScript, Node.js | | LGPL-3.0 | Veer66, github |
| CutKum | Thai word segmentation with deep learning in TensorFlow | Python | 0.93 recall, 0.92 precision, 0.93 F-measure | MIT | Pucktada, github |
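
Several of the segmenters above list "Longest Matching" as a feature. As a rough illustration of that idea only (not the implementation used by Swath or any other library listed here), the sketch below greedily takes the longest dictionary word at each position. The function name `longest_matching` and the toy dictionary are assumptions made for this example.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation.

    At each position, take the longest substring found in `dictionary`.
    Characters that start no dictionary word are emitted one at a time.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # unknown character: fall back to a single-character token
        tokens.append(match)
        i += len(match)
    return tokens


# Toy dictionary, hypothetical and far smaller than a real lexicon.
toy_dictionary = {"ฉัน", "กิน", "ข้าว", "ข้าวผัด"}
print(longest_matching("ฉันกินข้าวผัด", toy_dictionary))
```

With this toy dictionary, "ข้าวผัด" is preferred over the shorter "ข้าว" because the longest candidate is tried first. Real segmenters typically add handling for unknown words and ambiguity on top of this.
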
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Jitar+NAiST | A simple trigram HMM part-of-speech tagger | Java | | | Veer66, Jitar + NAiST, 1 + NAiST, 2 |
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Chart-parser | Extract syntactic structure from a POS-tagged sentence | C | | Copyright | Thanaruk T. (thanaruk@siit.tu.ac.th) |
| Grammar Processing | Labelled Buckets -> Context-Free Grammar (CFG) | Python | Transform and compute probability | | Thodsaporn C. |
| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| Transliteration Corpus | | 31K pairs | Thai-Eng Translation Pair | CC BY-NC-SA 3.0 TH | NECTEC |
| Lexitron | Open-source Thai-English dictionary | | TH->EN, EN->TH | LGPL | NECTEC |
| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| ORCHID | | 30K sent. | Word Seg., POS Tagged | CC BY-NC-SA 3.0 TH | NECTEC |
| InterBEST 2009/2010 | | 5M words | Word Seg. | CC BY-NC-SA 3.0 TH | NECTEC |
| Thai Wikipedia | Formal Articles | 1.49GB (~213.1 MB compressed) | XML | GFDL | WIKIPEDIA |
| TNC Top-5000 Words | Word frequency | 5,000 words | Frequency of Thai words in various genres, EXCEL | Copyright | CHULA |
| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| Thai National Corpus 2 | | 32M words | Query text by genre, domain | Copyright | CHULA |
| Pre-trained Model | Description | Size | Dimensions | License | Link |
|---|---|---|---|---|---|
| fastText | Skip-Gram model trained on Wikipedia using fastText | | 300 | CC BY-SA 3.0 | Facebook + Bin & Text + Text Only |
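
The text-only fastText download is a plain-text file: a header line with the vocabulary size and vector dimension, followed by one word and its values per line. A minimal sketch of reading that format without extra dependencies is shown below. The helper name `read_vec` and the three-dimensional sample data are assumptions made for illustration; the real Thai model uses 300 dimensions.

```python
import io


def read_vec(lines):
    """Parse fastText text-format vectors from an iterable of lines.

    Returns (dimension, {word: [float, ...]}). Lines whose value count
    does not match the header dimension are skipped.
    """
    lines = iter(lines)
    header = next(lines).split()
    dim = int(header[1])  # header is "<vocabulary size> <dimension>"
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # malformed or unexpected line
        vectors[word] = [float(v) for v in values]
    return dim, vectors


# Tiny made-up sample standing in for a real .vec file.
sample = io.StringIO("2 3\nแมว 0.1 0.2 0.3\nหมา 0.4 0.5 0.6\n")
dim, vectors = read_vec(sample)
print(dim, sorted(vectors))
```

For a real file, pass an open file handle instead, for example `open(path, encoding="utf-8")`, where `path` points to the downloaded text-only model.
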