|
4 | 4 | "cell_type": "markdown", |
5 | 5 | "metadata": {}, |
6 | 6 | "source": [ |
7 | | - "# PyThaiNLP Get Started" |
| 7 | + "# PyThaiNLP Get Started\n", |
| 8 | + "\n", |
| 9 | + "Code examples for basic functions in [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp)."
8 | 10 | ] |
9 | 11 | }, |
10 | 12 | { |
11 | 13 | "cell_type": "markdown", |
12 | 14 | "metadata": {}, |
13 | 15 | "source": [ |
14 | | - "## Collation" |
| 16 | + "## Collation\n", |
| 17 | + "\n", |
| 18 | + "Sort words in Thai dictionary order."
15 | 19 | ] |
16 | 20 | }, |
17 | 21 | { |
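To illustrate what "Thai dictionary order" involves, here is a minimal, simplified sketch in plain Python. It is not PyThaiNLP's implementation. It assumes two rules only: a leading vowel (เ แ โ ใ ไ) is compared after the consonant that follows it, and tone marks are ignored on the first pass. The names `thai_sort_key` and `thai_sort` are hypothetical.

```python
import re

# Leading vowels (U+0E40..U+0E44) are written before the consonant
# (U+0E01..U+0E2E) but are compared after it in dictionary order.
_LEADING_VOWEL = re.compile("([\u0e40-\u0e44])([\u0e01-\u0e2e])")
# Tone marks (U+0E48..U+0E4B) are ignored in the primary comparison.
_TONE_MARKS = re.compile("[\u0e48-\u0e4b]")


def thai_sort_key(word):
    """Build a simplified sort key (hypothetical helper, assumption-based)."""
    swapped = _LEADING_VOWEL.sub(r"\2\1", word)
    primary = _TONE_MARKS.sub("", swapped)
    # The swapped form with tone marks kept is used only to break ties.
    return (primary, swapped)


def thai_sort(words):
    """Return the words sorted with the simplified key above."""
    return sorted(words, key=thai_sort_key)


print(thai_sort(["ไก่", "ไข่", "ก", "ฮา"]))
```

Sorting by raw code points would place every word that starts with ไ after ฮา. The key swap is what moves ไก่ next to the other ก words.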
|
40 | 44 | "cell_type": "markdown", |
41 | 45 | "metadata": {}, |
42 | 46 | "source": [ |
43 | | - "## Date and Time Format" |
| 47 | + "## Date and Time Format\n", |
| 48 | + "\n", |
| 49 | + "Get Thai day and month names, with years in the Buddhist Era."
44 | 50 | ] |
45 | 51 | }, |
46 | 52 | { |
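The idea behind Thai date formatting can be sketched with the standard library alone. This is an illustration, not PyThaiNLP's API: the function name `thai_date` and the output layout are assumptions. The one fixed rule is that a Buddhist Era year is the Gregorian year plus 543.

```python
from datetime import date

# Monday first, to match date.weekday() (Monday == 0).
THAI_DAYS = [
    "วันจันทร์", "วันอังคาร", "วันพุธ", "วันพฤหัสบดี",
    "วันศุกร์", "วันเสาร์", "วันอาทิตย์",
]
THAI_MONTHS = [
    "มกราคม", "กุมภาพันธ์", "มีนาคม", "เมษายน", "พฤษภาคม", "มิถุนายน",
    "กรกฎาคม", "สิงหาคม", "กันยายน", "ตุลาคม", "พฤศจิกายน", "ธันวาคม",
]
BE_OFFSET = 543  # Buddhist Era year = Gregorian year + 543


def thai_date(d):
    """Format a date with Thai day and month names and a Buddhist Era year.

    Hypothetical helper; the layout is an assumption made for illustration.
    """
    day_name = THAI_DAYS[d.weekday()]
    month_name = THAI_MONTHS[d.month - 1]
    return f"{day_name}ที่ {d.day} {month_name} พ.ศ. {d.year + BE_OFFSET}"


print(thai_date(date(2019, 4, 6)))
```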
|
80 | 86 | "cell_type": "markdown", |
81 | 87 | "metadata": {}, |
82 | 88 | "source": [ |
83 | | - "### Thai Character Cluster (TCC) and Extended TCC" |
| 89 | + "### Thai Character Cluster (TCC) and Extended TCC\n", |
| 90 | + "\n", |
| 91 | + "According to [Character Cluster Based Thai Information Retrieval](https://www.researchgate.net/publication/2853284_Character_Cluster_Based_Thai_Information_Retrieval) (Theeramunkong et al. 2004)." |
84 | 92 | ] |
85 | 93 | }, |
86 | 94 | { |
|
167 | 175 | "cell_type": "markdown", |
168 | 176 | "metadata": {}, |
169 | 177 | "source": [ |
170 | | - "### Sentence and Word" |
| 178 | + "### Sentence and Word\n", |
| 179 | + "\n", |
| 180 | + "The default word tokenizer (\"newmm\") uses a maximum matching algorithm."
171 | 181 | ] |
172 | 182 | }, |
173 | 183 | { |
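As a rough intuition for dictionary-based matching, the sketch below implements greedy forward longest matching, one simple member of the maximum matching family. It is not the "newmm" implementation, which is more elaborate. The function name `longest_match_tokenize` and the tiny word set are hypothetical.

```python
def longest_match_tokenize(text, dictionary):
    """Greedy forward longest matching against a set of known words.

    At each position, take the longest dictionary word that starts there.
    If no word of two or more characters matches, emit a single character.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: one character
        # Try the longest candidate first, then shrink.
        for end in range(min(len(text), i + max_len), i + 1, -1):
            if text[i:end] in dictionary:
                match = text[i:end]
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary, for illustration only.
words = {"ฉัน", "รัก", "ภาษา", "ไทย", "ภาษาไทย"}
print(longest_match_tokenize("ฉันรักภาษาไทย", words))

# Changing the dictionary changes the segmentation.
print(longest_match_tokenize("ฉันรักภาษาไทย", words - {"ภาษาไทย"}))
```

The second call shows why a custom dictionary matters: once the compound ภาษาไทย is removed, the same text splits into ภาษา and ไทย.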
|
195 | 205 | "print(\"word_tokenize, without whitespace:\", word_tokenize(text, whitespaces=False))" |
196 | 206 | ] |
197 | 207 | }, |
| 208 | + { |
| 209 | + "cell_type": "markdown", |
| 210 | + "metadata": {}, |
| 211 | + "source": [ |
| 212 | + "Other algorithms can be chosen. We can also create a tokenizer with a custom dictionary."
| 213 | + ] |
| 214 | + }, |
198 | 215 | { |
199 | 216 | "cell_type": "code", |
200 | 217 | "execution_count": 8, |
|
223 | 240 | "print(\"custom:\", custom_tokenizer.word_tokenize(text))" |
224 | 241 | ] |
225 | 242 | }, |
| 243 | + { |
| 244 | + "cell_type": "markdown", |
| 245 | + "metadata": {}, |
| 246 | + "source": [ |
| 247 | + "The default word tokenizer uses a word list from pythainlp.corpus.common.thai_words().\n",
| 248 | + "We can get that list, add or remove words, and create a new tokenizer from the modified list."
| 249 | + ] |
| 250 | + }, |
226 | 251 | { |
227 | 252 | "cell_type": "code", |
228 | 253 | "execution_count": 9, |
|
332 | 357 | "cell_type": "markdown", |
333 | 358 | "metadata": {}, |
334 | 359 | "source": [ |
335 | | - "## Soundex" |
| 360 | + "## Soundex\n", |
| 361 | + "\n", |
| 362 | + "\"Soundex is a phonetic algorithm for indexing names by sound.\" ([Wikipedia](https://en.wikipedia.org/wiki/Soundex)). PyThaiNLP provides three kinds of Thai soundex." |
336 | 363 | ] |
337 | 364 | }, |
338 | 365 | { |
|
344 | 371 | "name": "stdout", |
345 | 372 | "output_type": "stream", |
346 | 373 | "text": [ |
347 | | - "บูรณะ - lk82: บE400 - udom83: บ930000 - metasound: บ550\n", |
348 | | - "บูรณการ - lk82: บE419 - udom83: บ931900 - metasound: บ551\n", |
349 | | - "มัก - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n", |
350 | | - "มัค - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n", |
351 | | - "มรรค - lk82: ม1000 - udom83: ม310000 - metasound: ม551\n", |
352 | | - "ลัก - lk82: ร1000 - udom83: ร100000 - metasound: ล100\n", |
353 | | - "รัก - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n", |
354 | | - "รักษ์ - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n", |
355 | | - " - lk82: - udom83: - metasound: \n" |
| 374 | + "True\n", |
| 375 | + "True\n", |
| 376 | + "True\n" |
356 | 377 | ] |
357 | 378 | } |
358 | 379 | ], |
359 | 380 | "source": [ |
360 | 381 | "from pythainlp.soundex import lk82, metasound, udom83\n", |
361 | 382 | "\n", |
362 | | - "texts = [\"บูรณะ\", \"บูรณการ\", \"มัก\", \"มัค\", \"มรรค\", \"ลัก\", \"รัก\", \"รักษ์\", \"\"]\n", |
363 | | - "for text in texts:\n", |
364 | | - " print(\n", |
365 | | - " \"{} - lk82: {} - udom83: {} - metasound: {}\".format(\n", |
366 | | - " text, lk82(text), udom83(text), metasound(text)\n", |
367 | | - " )\n", |
368 | | - " )" |
| 383 | + "# check equivalence\n", |
| 384 | + "print(lk82(\"รถ\") == lk82(\"รด\"))\n", |
| 385 | + "print(udom83(\"วรร\") == udom83(\"วัน\"))\n", |
| 386 | + "print(metasound(\"นพ\") == metasound(\"นภ\"))" |
369 | 387 | ] |
370 | 388 | }, |
371 | 389 | { |
|
377 | 395 | "name": "stdout", |
378 | 396 | "output_type": "stream", |
379 | 397 | "text": [ |
380 | | - "True\n", |
381 | | - "True\n", |
382 | | - "True\n" |
| 398 | + "บูรณะ - lk82: บE400 - udom83: บ930000 - metasound: บ550\n", |
| 399 | + "บูรณการ - lk82: บE419 - udom83: บ931900 - metasound: บ551\n", |
| 400 | + "มัก - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n", |
| 401 | + "มัค - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n", |
| 402 | + "มรรค - lk82: ม1000 - udom83: ม310000 - metasound: ม551\n", |
| 403 | + "ลัก - lk82: ร1000 - udom83: ร100000 - metasound: ล100\n", |
| 404 | + "รัก - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n", |
| 405 | + "รักษ์ - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n", |
| 406 | + " - lk82: - udom83: - metasound: \n" |
383 | 407 | ] |
384 | 408 | } |
385 | 409 | ], |
386 | 410 | "source": [ |
387 | | - "# check equivalence\n", |
388 | | - "print(lk82(\"รถ\") == lk82(\"รด\"))\n", |
389 | | - "print(udom83(\"วรร\") == udom83(\"วัน\"))\n", |
390 | | - "print(metasound(\"นพ\") == metasound(\"นภ\"))" |
| 411 | + "texts = [\"บูรณะ\", \"บูรณการ\", \"มัก\", \"มัค\", \"มรรค\", \"ลัก\", \"รัก\", \"รักษ์\", \"\"]\n", |
| 412 | + "for text in texts:\n", |
| 413 | + " print(\n", |
| 414 | + " \"{} - lk82: {} - udom83: {} - metasound: {}\".format(\n", |
| 415 | + " text, lk82(text), udom83(text), metasound(text)\n", |
| 416 | + " )\n", |
| 417 | + " )" |
391 | 418 | ] |
392 | 419 | }, |
393 | 420 | { |
|
396 | 423 | "source": [ |
397 | 424 | "## Spellchecking\n", |
398 | 425 | "\n", |
399 | | - "Default spellchecker use Peter Norvig's algorithm together with word frequency from Thai National Corpus (TNC)" |
| 426 | + "The default spellchecker uses [Peter Norvig's algorithm](http://www.norvig.com/spell-correct.html) together with word frequencies from the Thai National Corpus (TNC)."
400 | 427 | ] |
401 | 428 | }, |
402 | 429 | { |
|
603 | 630 | "cell_type": "markdown", |
604 | 631 | "metadata": {}, |
605 | 632 | "source": [ |
606 | | - "## Named-Entity Tagging" |
| 633 | + "## Named-Entity Tagging\n", |
| 634 | + "\n", |
| 635 | + "The tagger uses the BIO scheme:\n",
| 636 | + "- B - beginning of entity\n", |
| 637 | + "- I - inside entity\n", |
| 638 | + "- O - outside entity" |
607 | 639 | ] |
608 | 640 | }, |
609 | 641 | { |
|