|[`dev`](https://github.com/PyThaiNLP/pythainlp/tree/dev)| Release Candidate for 2.3 |[Change Log](https://github.com/PyThaiNLP/pythainlp/issues/445)|
Please follow our [PyThaiNLP Facebook page](https://www.facebook.com/pythainlp/) for more updates.
+
## Getting Started with PyThaiNLP
We provide [PyThaiNLP Get Started Tutorial](https://www.thainlp.org/pythainlp/tutorials/notebooks/pythainlp_get_started.html) for exploring features in PyThaiNLP; We also have tutorials for specific tasks. Please visit [our tutorial page](https://www.thainlp.org/pythainlp/tutorials).
@@ -37,27 +37,29 @@ Latest document is available at [https://thainlp.org/pythainlp/docs/2.2/](https:
We try to make the package as easy to use as possible; therefore, some additional data (like word lists and language models) may be downloaded automatically at runtime. PyThaiNLP caches additional data under the directory `~/pythainlp-data` by default, but the user can change this location by setting the environment variable `PYTHAINLP_DATA_DIR`. See the corpus catalog at [PyThaiNLP/pythainlp-corpus](https://github.com/PyThaiNLP/pythainlp-corpus).
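
The lookup described here (a default of `~/pythainlp-data`, overridable through `PYTHAINLP_DATA_DIR`) can be sketched in plain Python. This is only an illustration of the behaviour the README describes, not PyThaiNLP's own implementation; the helper name `resolve_data_dir` is hypothetical, and treating an empty variable like a missing one is an assumption of the sketch.

```python
import os
from typing import Mapping, Optional


def resolve_data_dir(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the directory where additional data would be cached.

    Hypothetical helper: it mirrors the behaviour described in the README
    and is not taken from PyThaiNLP's source code.
    """
    env = os.environ if env is None else env
    default_dir = os.path.join(os.path.expanduser("~"), "pythainlp-data")
    # Assumption: an empty or missing variable falls back to the default.
    return env.get("PYTHAINLP_DATA_DIR") or default_dir


if __name__ == "__main__":
    print(resolve_data_dir())
```

Passing the environment as an argument keeps the function easy to test without modifying the real process environment.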
+
## Capabilities
-PyThaiNLP provides standard NLP functions for Thai, for example part-of-speec tagging, linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available via command-line interface.
+PyThaiNLP provides standard NLP functions for Thai, for example part-of-speech tagging, linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available via command-line interface.
<details>
<summary>List of Features</summary>
- Convenient character and word classes, like Thai consonants (`pythainlp.thai_consonants`), vowels (`pythainlp.thai_vowels`), digits (`pythainlp.thai_digits`), and stop words (`pythainlp.corpus.thai_stopwords`) -- comparable to constants like `string.letters`, `string.digits`, and `string.punctuation`
- Thai linguistic unit segmentation/tokenization, including sentence (`sent_tokenize`), word (`word_tokenize`), and subword segmentations based on Thai Character Cluster (`subword_tokenize`)
-- Thai part-of-speech taggers (`pos_tag`)
+- Thai part-of-speech tagging (`pos_tag`)
- Thai spelling suggestion and correction (`spell` and `correct`)
- Thai transliteration (`transliterate`)
- Thai soundex (`soundex`) with three engines (`lk82`, `udom83`, `metasound`)
-- Thai collation (sort by dictionoary order) (`collate`)
+- Thai collation (sort by dictionary order) (`collate`)
- Read out number to Thai words (`bahttext`, `num_to_thaiword`)
- Command-line interface for basic functions, like tokenization and pos tagging (run `thainlp` in your shell)
</details>
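
To illustrate what a character-class constant comparable to `string.digits` looks like, the sketch below builds the ten Thai digits from their Unicode code points (U+0E50 to U+0E59) using only the standard library. The names `THAI_DIGITS` and `thai_to_arabic` are hypothetical; they stand in for, rather than reproduce, `pythainlp.thai_digits`.

```python
import string

# Thai digits occupy the contiguous Unicode range U+0E50..U+0E59.
THAI_DIGITS = "".join(chr(code) for code in range(0x0E50, 0x0E5A))

# Translation table mapping each Thai digit to its ASCII counterpart.
_THAI_TO_ARABIC = str.maketrans(THAI_DIGITS, string.digits)


def thai_to_arabic(text: str) -> str:
    """Replace Thai digits in `text` with ASCII digits (hypothetical helper)."""
    return text.translate(_THAI_TO_ARABIC)


if __name__ == "__main__":
    print(THAI_DIGITS)
    print(thai_to_arabic("พ.ศ. ๒๕๖๓"))
```

Characters outside the Thai digit range pass through unchanged, so the helper can be applied to mixed text.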
-Please see [our tutorials](https://www.thainlp.org/pythainlp/tutorials) on how to apply these functions to ML problems.
+Please see [our tutorials](https://www.thainlp.org/pythainlp/tutorials) on how to apply these functions to machine-learning problems.
+
## Installation
@@ -66,7 +68,7 @@ pip install --upgrade pythainlp
```
This will install the latest stable release of PyThaiNLP.
-PyThaiNLP uses pip as its package manger and PyPI as its main distribution channel, see [https://pypi.org/project/pythainlp/](https://pypi.org/project/pythainlp/)
+PyThaiNLP uses pip as its package manager and PyPI as its main distribution channel, see [https://pypi.org/project/pythainlp/](https://pypi.org/project/pythainlp/)
For dependency details, look at `extras` variable in [`setup.py`](https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py).
-## Command-line
+## Command-Line Interface
-Some of PyThaiNLP functionalities can be used at command line, using `thainlp`
+Some of PyThaiNLP functionalities can be used at command line, using `thainlp` command.
For example, displaying a catalog of datasets:
```sh
@@ -121,6 +123,7 @@ thainlp help
- [Upgrade ThaiNER from 1.7](https://github.com/PyThaiNLP/pythainlp/wiki/Upgrade-ThaiNER-from-PyThaiNLP-1.7-to-PyThaiNLP-2.0)
- Python 2.7 users can use PyThaiNLP 1.6
+
## Citations
If you use `PyThaiNLP` in your project or publication, please cite the library as follows
@@ -148,6 +151,7 @@ or BibTeX entry:
- Please do fork and create a pull request :)
- For style guide and other information, including references to algorithms we use, please refer to our [contributing](https://github.com/PyThaiNLP/pythainlp/blob/dev/CONTRIBUTING.md) page.
+
## Licenses
|| License |
@@ -157,6 +161,7 @@ or BibTeX entry:
| Language models created by PyThaiNLP |[Creative Commons Attribution 4.0 International Public License (CC-by)](https://creativecommons.org/licenses/by/4.0/)|
| Other corpora and models that may be included with PyThaiNLP | See [Corpus License](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/corpus_license.md)|
+
## Sponsors
[](https://airesearch.in.th/)
@@ -168,3 +173,13 @@ Since 2019, our contributors Korakot Chaovavanich and Lalita Lowphansirikul have
<div align="center">
Made with ❤️ | PyThaiNLP Team 💻 | "We build Thai NLP" 🇹🇭
</div>
+
+------
+
+<div align="center">
+<strong>We have only one official repository at https://github.com/PyThaiNLP/pythainlp and another mirror at https://gitlab.com/pythainlp/pythainlp </strong>
+</div>
+
+<div align="center">
+<strong>Beware of malware if you use code from mirrors other than the official two at GitHub and GitLab.</strong>
+</div>
File: docs/api/tag.rst (53 additions, 11 deletions)
@@ -2,12 +2,12 @@
pythainlp.tag
=====================================
-The :class:`pythainlp.tag` contains functions that are used to tag different parts of a text including
-Part-of-Speech (POS) tags, and Named Entity Recognition (NER) tag.
+The :class:`pythainlp.tag` contains functions that are used to mark linguistic and other annotation to different parts of a text including
+part-of-speech (POS) tag and named entity (NE) tag.
-For the POS tags, there are two set of tags including `Universal Dependencies (UD)<https://universaldependencies.org/>`_ and ORCHID [#Sornlertlamvanich_2000]_POS tags.
+For POS tags, there are three set of available tags: `Universal POS tags<https://universaldependencies.org/>`_, ORCHID POS tags [#Sornlertlamvanich_2000]_, and LST20 POS tags [#Prachya_2020]_.
-The following table shows the list of Part-of-Speech (POS) tags according to Universal Dependencies (UD) POS tags:
+The following table shows Universal POS tags as used in Universal Dependencies (UD):
@@ -93,7 +93,7 @@ Abbreviation Part-of-Speech tag Examples
ORCHID corpus uses different set of POS tags. Thus, we make UD POS tags version for ORCHID corpus.
-The following table shows the mapping of Part-of-Speech (POS) tags from ORCHID POS tags to UD POS tags:
+The following table shows the mapping of POS tags from ORCHID to UD:
=============== =======================
ORCHID POS tags Coresponding UD POS tag
@@ -161,15 +161,54 @@ PUNCT PUNCT
PUNC PUNCT
=============== =======================
-For the NER, we use `Inside-outside-beggining (IOB) <https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)>`_ format to tag NER for each words.
-For instance, given a sentence "บารัค โอบามาเป็นประธานธิปดี", it would be tag the tokens "บารัค", "โอบามา", "เป็น", "ประธานาธิปดี" as "B-PERSON", "I-PERSON", "I-PERSON", "O", and "O" respectively.
+Details about LST20 POS tags are available in [#Prachya_2020]_.

-The *B-* prefix indicates begining token for a chunk of person name, "บารัค โอบามา" and *I-* prefix indicates the intermediate token. However, the term *O* indicates that a token not belong to any NER chunk.
+The following table shows the mapping of POS tags from LST20 to UD:

-The following table shows the list of Named Entity Recognition (NER) tags:
++----------------+-------------------------+
+| LST20 POS tags | Coresponding UD POS tag |
++================+=========================+
+| AJ             | ADJ                     |
++----------------+-------------------------+
+| AV             | ADV                     |
++----------------+-------------------------+
+| AX             | AUX                     |
++----------------+-------------------------+
+| CC             | CCONJ                   |
++----------------+-------------------------+
+| CL             | NOUN                    |
++----------------+-------------------------+
+| FX             | NOUN                    |
++----------------+-------------------------+
+| IJ             | INTJ                    |
++----------------+-------------------------+
+| NN             | NOUN                    |
++----------------+-------------------------+
+| NU             | NUM                     |
++----------------+-------------------------+
+| PA             | PART                    |
++----------------+-------------------------+
+| PR             | PROPN                   |
++----------------+-------------------------+
+| PS             | ADP                     |
++----------------+-------------------------+
+| PU             | PUNCT                   |
++----------------+-------------------------+
+| VV             | VERB                    |
++----------------+-------------------------+
+| XX             | X                       |
++----------------+-------------------------+
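
The LST20-to-UD mapping in the table can be expressed as a plain dictionary lookup. The sketch below copies the fifteen rows of the table; the names `LST20_TO_UD` and `lst20_to_ud` are hypothetical and are not claimed to match anything in `pythainlp.tag`, and the fallback to `X` for unknown tags is an assumption of this example.

```python
from typing import List, Tuple

# Mapping copied row by row from the table (LST20 tag -> UD tag).
LST20_TO_UD = {
    "AJ": "ADJ", "AV": "ADV", "AX": "AUX", "CC": "CCONJ", "CL": "NOUN",
    "FX": "NOUN", "IJ": "INTJ", "NN": "NOUN", "NU": "NUM", "PA": "PART",
    "PR": "PROPN", "PS": "ADP", "PU": "PUNCT", "VV": "VERB", "XX": "X",
}


def lst20_to_ud(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Convert (word, LST20 tag) pairs to (word, UD tag) pairs.

    Assumption: tags that are missing from the table fall back to "X".
    """
    return [(word, LST20_TO_UD.get(tag, "X")) for word, tag in tagged]


if __name__ == "__main__":
    print(lst20_to_ud([("แมว", "NN"), ("กิน", "VV"), ("ปลา", "NN")]))
```

Note that the mapping is many-to-one (`CL`, `FX`, and `NN` all become `NOUN`), so it cannot be inverted to recover LST20 tags from UD tags.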
+
+For the NE, we use `Inside-outside-beggining (IOB) <https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)>`_ format to tag NE for each word.
+
+*B-* prefix indicates the begining token of the chunk. *I-* prefix indicates the intermediate token within the chunk. *O* indicates that the token does not belong to any NE chunk.
+
+For instance, given a sentence "บารัค โอบามาเป็นประธานธิปดี", it would tag the tokens "บารัค", "โอบามา", "เป็น", "ประธานาธิปดี" with "B-PERSON", "I-PERSON", "O", and "O" respectively.
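
As a rough illustration of how IOB tags are read back into named-entity chunks, the sketch below groups tokens by their `B-`/`I-`/`O` tags. The function name `iob_to_chunks` is hypothetical and is not part of PyThaiNLP; dropping an `I-` tag that has no matching open chunk is a simplifying assumption.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, List[str]]]:
    """Group IOB-tagged tokens into (entity type, tokens) chunks."""
    chunks: List[Tuple[str, List[str]]] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk.
            current_type = tag[2:]
            chunks.append((current_type, [token]))
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag extends the open chunk of the same type.
            chunks[-1][1].append(token)
        else:
            # "O", or an I- tag without a matching open chunk, closes the chunk.
            current_type = None
    return chunks


if __name__ == "__main__":
    tokens = ["บารัค", "โอบามา", "เป็น", "ประธานาธิปดี"]
    tags = ["B-PERSON", "I-PERSON", "O", "O"]
    print(iob_to_chunks(tokens, tags))
```

With the tokens and tags from the example sentence, the two name tokens end up in a single `PERSON` chunk and the `O` tokens are left out.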
+
+The following table shows named entity (NE) tags as used PyThaiNLP:
.. [#Sornlertlamvanich_2000] Virach Sornlertlamvanich, Naoto Takahashi and Hitoshi Isahara. (2000).
Building a Thai Part-Of-Speech Tagged Corpus (ORCHID).
The Journal of the Acoustical Society of Japan (E), Vol.20, No.3, pp 189-198, May 1999.
+.. [#Prachya_2020] Prachya Boonkwan and Vorapon Luantangsrisuk and Sitthaa Phaholphinyo and Kanyanat Kriengket and Dhanon Leenoi and Charun Phrombut and Monthika Boriboon and Krit Kosawat and Thepchai Supnithi. (2020).