Commit cab8230

More descriptions
1 parent 8ec724e commit cab8230

1 file changed: +63 -31 lines changed

notebooks/pythainlp-get-started.ipynb

Lines changed: 63 additions & 31 deletions
@@ -4,14 +4,18 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "# PyThaiNLP Get Started"
+ "# PyThaiNLP Get Started\n",
+ "\n",
+ "Code examples for basic functions in PyThaiNLP https://github.com/PyThaiNLP/pythainlp"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Collation"
+ "## Collation\n",
+ "\n",
+ "Sorting according to Thai dictionary."
  ]
  },
  {
@@ -40,7 +44,9 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Date and Time Format"
+ "## Date and Time Format\n",
+ "\n",
+ "Get Thai day and month names with Buddhist Era."
  ]
  },
  {
@@ -80,7 +86,9 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "### Thai Character Cluster (TCC) and Extended TCC"
+ "### Thai Character Cluster (TCC) and Extended TCC\n",
+ "\n",
+ "According to [Character Cluster Based Thai Information Retrieval](https://www.researchgate.net/publication/2853284_Character_Cluster_Based_Thai_Information_Retrieval) (Theeramunkong et al. 2004)."
  ]
  },
  {
@@ -167,7 +175,9 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "### Sentence and Word"
+ "### Sentence and Word\n",
+ "\n",
+ "Default word tokenizer (\"newmm\") use maximum matching algorithm."
  ]
  },
  {
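
Note on the "maximum matching" description added in this hunk: the idea is to pick, among all dictionary-based segmentations, one with the fewest tokens. The sketch below illustrates that idea in plain Python. It is a simplified illustration under stated assumptions, not PyThaiNLP's actual "newmm" code: the function name `max_match_tokenize`, the tiny word list, and the single-character fallback for unknown text are all hypothetical.

```python
def max_match_tokenize(text, words):
    """Segment text into the fewest tokens, preferring dictionary words.

    Simplified maximal-matching sketch (hypothetical helper, not the
    PyThaiNLP implementation). Characters that no dictionary word covers
    fall back to single-character tokens.
    """
    n = len(text)
    max_len = max((len(w) for w in words), default=1)
    # best[i] holds (token_count, tokens) for the best split of text[:i]
    best = [None] * (n + 1)
    best[0] = (0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in words or len(piece) == 1:
                count = best[start][0] + 1
                if best[end] is None or count < best[end][0]:
                    best[end] = (count, best[start][1] + [piece])
    return best[n][1]


# Hypothetical mini dictionary, for illustration only
thai_words = {"ไป", "หา", "หาม", "เห", "สี", "มเหสี"}
print(max_match_tokenize("ไปหามเหสี", thai_words))
# → ['ไป', 'หา', 'มเหสี']
```

A greedy longest-match scan would take "หาม" first and end up with four tokens here; minimizing the token count recovers the three-word reading.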
@@ -195,6 +205,13 @@
  "print(\"word_tokenize, without whitespace:\", word_tokenize(text, whitespaces=False))"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Other algorithm can be chosen. We can also create a tokenizer with custom dictionary."
+ ]
+ },
  {
  "cell_type": "code",
  "execution_count": 8,
@@ -223,6 +240,14 @@
  "print(\"custom:\", custom_tokenizer.word_tokenize(text))"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Default word tokenizer use a word list from pythainlp.corpus.common.thai_words().\n",
+ "We can get that list, add/remove words, and create new tokenizer from the modified list."
+ ]
+ },
  {
  "cell_type": "code",
  "execution_count": 9,
@@ -332,7 +357,9 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Soundex"
+ "## Soundex\n",
+ "\n",
+ "\"Soundex is a phonetic algorithm for indexing names by sound.\" ([Wikipedia](https://en.wikipedia.org/wiki/Soundex)). PyThaiNLP provides three kinds of Thai soundex."
  ]
  },
  {
@@ -344,28 +371,19 @@
  "name": "stdout",
  "output_type": "stream",
  "text": [
- "บูรณะ - lk82: บE400 - udom83: บ930000 - metasound: บ550\n",
- "บูรณการ - lk82: บE419 - udom83: บ931900 - metasound: บ551\n",
- "มัก - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n",
- "มัค - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n",
- "มรรค - lk82: ม1000 - udom83: ม310000 - metasound: ม551\n",
- "ลัก - lk82: ร1000 - udom83: ร100000 - metasound: ล100\n",
- "รัก - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n",
- "รักษ์ - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n",
- " - lk82:  - udom83:  - metasound: \n"
+ "True\n",
+ "True\n",
+ "True\n"
  ]
  }
  ],
  "source": [
  "from pythainlp.soundex import lk82, metasound, udom83\n",
  "\n",
- "texts = [\"บูรณะ\", \"บูรณการ\", \"มัก\", \"มัค\", \"มรรค\", \"ลัก\", \"รัก\", \"รักษ์\", \"\"]\n",
- "for text in texts:\n",
- "    print(\n",
- "        \"{} - lk82: {} - udom83: {} - metasound: {}\".format(\n",
- "            text, lk82(text), udom83(text), metasound(text)\n",
- "        )\n",
- "    )"
+ "# check equivalence\n",
+ "print(lk82(\"รถ\") == lk82(\"รด\"))\n",
+ "print(udom83(\"วรร\") == udom83(\"วัน\"))\n",
+ "print(metasound(\"นพ\") == metasound(\"นภ\"))"
  ]
  },
  {
@@ -377,17 +395,26 @@
  "name": "stdout",
  "output_type": "stream",
  "text": [
- "True\n",
- "True\n",
- "True\n"
+ "บูรณะ - lk82: บE400 - udom83: บ930000 - metasound: บ550\n",
+ "บูรณการ - lk82: บE419 - udom83: บ931900 - metasound: บ551\n",
+ "มัก - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n",
+ "มัค - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n",
+ "มรรค - lk82: ม1000 - udom83: ม310000 - metasound: ม551\n",
+ "ลัก - lk82: ร1000 - udom83: ร100000 - metasound: ล100\n",
+ "รัก - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n",
+ "รักษ์ - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n",
+ " - lk82:  - udom83:  - metasound: \n"
  ]
  }
  ],
  "source": [
- "# check equivalence\n",
- "print(lk82(\"รถ\") == lk82(\"รด\"))\n",
- "print(udom83(\"วรร\") == udom83(\"วัน\"))\n",
- "print(metasound(\"นพ\") == metasound(\"นภ\"))"
+ "texts = [\"บูรณะ\", \"บูรณการ\", \"มัก\", \"มัค\", \"มรรค\", \"ลัก\", \"รัก\", \"รักษ์\", \"\"]\n",
+ "for text in texts:\n",
+ "    print(\n",
+ "        \"{} - lk82: {} - udom83: {} - metasound: {}\".format(\n",
+ "            text, lk82(text), udom83(text), metasound(text)\n",
+ "        )\n",
+ "    )"
  ]
  },
  {
@@ -396,7 +423,7 @@
  "source": [
  "## Spellchecking\n",
  "\n",
- "Default spellchecker use Peter Norvig's algorithm together with word frequency from Thai National Corpus (TNC)"
+ "Default spellchecker uses [Peter Norvig's algorithm](http://www.norvig.com/spell-correct.html) together with word frequency from Thai National Corpus (TNC)"
  ]
  },
  {
@@ -603,7 +630,12 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Named-Entity Tagging"
+ "## Named-Entity Tagging\n",
+ "\n",
+ "The tagger use BIO scheme:\n",
+ "- B - beginning of entity\n",
+ "- I - inside entity\n",
+ "- O - outside entity"
  ]
  },
  {
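
Note on the BIO scheme described in this hunk: a reader may find it easier to follow with a small decoder that groups tagged tokens back into entities. The sketch below is plain Python and does not call PyThaiNLP. The helper name `bio_to_spans`, the sample tokens, and the tag labels (`PERSON`, `LOCATION`) are assumptions made for illustration and may not match what the tagger actually emits.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs.

    Hypothetical helper for illustration: B- starts an entity, I-
    continues it, and O closes any open entity.
    """
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # Start a new entity; a stray I- tag is treated like B-
            if current_tokens:
                spans.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = label, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O": close the open entity, if any
            if current_tokens:
                spans.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, "".join(current_tokens)))
    return spans


# Made-up tagger output, for illustration only
tokens = ["นาย", "สมชาย", "ไป", "เชียงใหม่"]
tags = ["B-PERSON", "I-PERSON", "O", "B-LOCATION"]
print(bio_to_spans(tokens, tags))
# → [('PERSON', 'นายสมชาย'), ('LOCATION', 'เชียงใหม่')]
```

Tokens are joined without a separator because Thai text is written without spaces between words.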
