Merge pull request #185 from bact/dev

bact · web-flow · commit e19ec2883929 · 2019-04-06T14:12:18.000+07:00
More descriptions for Get Started notebook
diff --git a/notebooks/pythainlp-get-started.ipynb b/notebooks/pythainlp-get-started.ipynb
@@ -4,14 +4,18 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# PyThaiNLP Get Started"
+    "# PyThaiNLP Get Started\n",
+    "\n",
+    "Code examples for basic functions in PyThaiNLP https://github.com/PyThaiNLP/pythainlp"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Collation"
+    "## Collation\n",
+    "\n",
+    "Sorting according to Thai dictionary order."
    ]
   },
   {
@@ -40,7 +44,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Date and Time Format"
+    "## Date and Time Format\n",
+    "\n",
+    "Get Thai day and month names, with years in the Buddhist Era."
    ]
   },
   {
@@ -80,7 +86,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Thai Character Cluster (TCC) and Extended TCC"
+    "### Thai Character Cluster (TCC) and Extended TCC\n",
+    "\n",
+    "According to [Character Cluster Based Thai Information Retrieval](https://www.researchgate.net/publication/2853284_Character_Cluster_Based_Thai_Information_Retrieval) (Theeramunkong et al. 2004)."
    ]
   },
   {
@@ -167,7 +175,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Sentence and Word"
+    "### Sentence and Word\n",
+    "\n",
+    "The default word tokenizer (\"newmm\") uses a maximum matching algorithm."
    ]
   },
   {
@@ -195,6 +205,13 @@
     "print(\"word_tokenize, without whitespace:\", word_tokenize(text, whitespaces=False))"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Other algorithms can be chosen. We can also create a tokenizer with a custom dictionary."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 8,
@@ -223,6 +240,14 @@
     "print(\"custom:\", custom_tokenizer.word_tokenize(text))"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The default word tokenizer uses a word list from pythainlp.corpus.common.thai_words().\n",
+    "We can get that list, add or remove words, and create a new tokenizer from the modified list."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 9,
@@ -332,7 +357,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Soundex"
+    "## Soundex\n",
+    "\n",
+    "\"Soundex is a phonetic algorithm for indexing names by sound.\" ([Wikipedia](https://en.wikipedia.org/wiki/Soundex)). PyThaiNLP provides three Thai soundex algorithms: lk82, udom83, and metasound."
    ]
   },
   {
@@ -344,28 +371,19 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "บูรณะ - lk82: บE400 - udom83: บ930000 - metasound: บ550\n",
-      "บูรณการ - lk82: บE419 - udom83: บ931900 - metasound: บ551\n",
-      "มัก - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n",
-      "มัค - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n",
-      "มรรค - lk82: ม1000 - udom83: ม310000 - metasound: ม551\n",
-      "ลัก - lk82: ร1000 - udom83: ร100000 - metasound: ล100\n",
-      "รัก - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n",
-      "รักษ์ - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n",
-      " - lk82:  - udom83:  - metasound: \n"
+      "True\n",
+      "True\n",
+      "True\n"
      ]
     }
    ],
    "source": [
     "from pythainlp.soundex import lk82, metasound, udom83\n",
     "\n",
-    "texts = [\"บูรณะ\", \"บูรณการ\", \"มัก\", \"มัค\", \"มรรค\", \"ลัก\", \"รัก\", \"รักษ์\", \"\"]\n",
-    "for text in texts:\n",
-    "    print(\n",
-    "        \"{} - lk82: {} - udom83: {} - metasound: {}\".format(\n",
-    "            text, lk82(text), udom83(text), metasound(text)\n",
-    "        )\n",
-    "    )"
+    "# check equivalence\n",
+    "print(lk82(\"รถ\") == lk82(\"รด\"))\n",
+    "print(udom83(\"วรร\") == udom83(\"วัน\"))\n",
+    "print(metasound(\"นพ\") == metasound(\"นภ\"))"
    ]
   },
   {
@@ -377,17 +395,26 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "True\n",
-      "True\n",
-      "True\n"
+      "บูรณะ - lk82: บE400 - udom83: บ930000 - metasound: บ550\n",
+      "บูรณการ - lk82: บE419 - udom83: บ931900 - metasound: บ551\n",
+      "มัก - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n",
+      "มัค - lk82: ม1000 - udom83: ม100000 - metasound: ม100\n",
+      "มรรค - lk82: ม1000 - udom83: ม310000 - metasound: ม551\n",
+      "ลัก - lk82: ร1000 - udom83: ร100000 - metasound: ล100\n",
+      "รัก - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n",
+      "รักษ์ - lk82: ร1000 - udom83: ร100000 - metasound: ร100\n",
+      " - lk82:  - udom83:  - metasound: \n"
      ]
     }
    ],
    "source": [
-    "# check equivalence\n",
-    "print(lk82(\"รถ\") == lk82(\"รด\"))\n",
-    "print(udom83(\"วรร\") == udom83(\"วัน\"))\n",
-    "print(metasound(\"นพ\") == metasound(\"นภ\"))"
+    "texts = [\"บูรณะ\", \"บูรณการ\", \"มัก\", \"มัค\", \"มรรค\", \"ลัก\", \"รัก\", \"รักษ์\", \"\"]\n",
+    "for text in texts:\n",
+    "    print(\n",
+    "        \"{} - lk82: {} - udom83: {} - metasound: {}\".format(\n",
+    "            text, lk82(text), udom83(text), metasound(text)\n",
+    "        )\n",
+    "    )"
    ]
   },
   {
@@ -396,7 +423,7 @@
    "source": [
     "## Spellchecking\n",
     "\n",
-    "Default spellchecker use Peter Norvig's algorithm together with word frequency from Thai National Corpus (TNC)"
+    "The default spellchecker uses [Peter Norvig's algorithm](http://www.norvig.com/spell-correct.html) together with word frequencies from the Thai National Corpus (TNC)."
    ]
   },
   {
@@ -603,7 +630,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Named-Entity Tagging"
+    "## Named-Entity Tagging\n",
+    "\n",
+    "The tagger uses the BIO scheme:\n",
+    "- B - beginning of entity\n",
+    "- I - inside entity\n",
+    "- O - outside entity"
    ]
   },
   {