Commit 1c5818f

Merge pull request #482 from PyThaiNLP/dev (build and deploy docs)
Update 2.2 branche from dev branche
2 parents ea97162 + 008470a commit 1c5818f

44 files changed: +1081 -486 lines changed

.github/workflows/pythainlp-test.yml

Lines changed: 2 additions & 1 deletion
@@ -23,6 +23,7 @@ jobs:
 run: |
 python -m pip install --upgrade pip pytest wheel flake8
 if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+pip install torch==1.5.0 torchvision==0.6.0 -f https://download.pytorch.org/whl/torch_stable.html
 pip install .[full]
 pip install deepcut coverage coveralls
 - name: Lint with flake8
@@ -34,4 +35,4 @@ jobs:
 - name: Test
 run: |
 coverage run -m unittest discover
-CI_BRANCH=${GITHUB_REF#"ref/heads"} COVERALLS_REPO_TOKEN=${{ secrets.COVERALLS_REPO_TOKEN }} coveralls
+CI_BRANCH=${GITHUB_REF#"ref/heads"} COVERALLS_REPO_TOKEN=${{ secrets.COVERALLS_REPO_TOKEN }} coveralls

README.md

Lines changed: 24 additions & 9 deletions
@@ -1,4 +1,3 @@
-
 <div align="center">
 <img src="https://avatars0.githubusercontent.com/u/32934255?s=200&v=4"/>
 <h1>PyThaiNLP: Thai Natural Language Processing in Python</h1>
@@ -24,11 +23,12 @@ PyThaiNLP เป็นไลบารีภาษาไพทอนสำหร
 
 | Version | Description | Status |
 |:------:|:--:|:------:|
-| [2.2.3](https://github.com/PyThaiNLP/pythainlp/releases) | Stable | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/330) |
+| [2.2.4](https://github.com/PyThaiNLP/pythainlp/releases) | Stable | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/330) |
 | [`dev`](https://github.com/PyThaiNLP/pythainlp/tree/dev) | Release Candidate for 2.3 | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/445) |
 
 Please follow our [PyThaiNLP Facebook page](https://www.facebook.com/pythainlp/) for more updates.
 
+
 ## Getting Started with PyThaiNLP
 
 We provide [PyThaiNLP Get Started Tutorial](https://www.thainlp.org/pythainlp/tutorials/notebooks/pythainlp_get_started.html) for exploring features in PyThaiNLP; We also have tutorials for specific tasks. Please visit [our tutorial page](https://www.thainlp.org/pythainlp/tutorials).
@@ -37,27 +37,29 @@ Latest document is available at [https://thainlp.org/pythainlp/docs/2.2/](https:
 
 We try to make the package easy to use as much as possible; therefore, some additional data (like word lists and language models) may get automatically download during runtime. PyThaiNLP caches additional data under the directory `~/pythainlp-data` by default, but the user can change the value by specifying the environment variable `PYTHAINLP_DATA_DIR`. See corpus catalog at [PyThaiNLP/pythainlp-corpus](https://github.com/PyThaiNLP/pythainlp-corpus).
 
+
 ## Capabilities
 
-PyThaiNLP provides standard NLP functions for Thai, for example part-of-speec tagging, linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available via command-line interface.
+PyThaiNLP provides standard NLP functions for Thai, for example part-of-speech tagging, linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available via command-line interface.
 
 <details>
 <summary>List of Features</summary>
 
 - Convenient character and word classes, like Thai consonants (`pythainlp.thai_consonants`), vowels (`pythainlp.thai_vowels`), digits (`pythainlp.thai_digits`), and stop words (`pythainlp.corpus.thai_stopwords`) -- comparable to constants like `string.letters`, `string.digits`, and `string.punctuation`
 - Thai linguistic unit segmentation/tokenization, including sentence (`sent_tokenize`), word (`word_tokenize`), and subword segmentations based on Thai Character Cluster (`subword_tokenize`)
-- Thai part-of-speech taggers (`pos_tag`)
+- Thai part-of-speech tagging (`pos_tag`)
 - Thai spelling suggestion and correction (`spell` and `correct`)
 - Thai transliteration (`transliterate`)
 - Thai soundex (`soundex`) with three engines (`lk82`, `udom83`, `metasound`)
-- Thai collation (sort by dictionoary order) (`collate`)
+- Thai collation (sort by dictionary order) (`collate`)
 - Read out number to Thai words (`bahttext`, `num_to_thaiword`)
 - Thai datetime formatting (`thai_strftime`)
 - Thai-English keyboard misswitched fix (`eng_to_thai`, `thai_to_eng`)
 - Command-line interface for basic functions, like tokenization and pos tagging (run `thainlp` in your shell)
 </details>
 
-Please see [our tutorials](https://www.thainlp.org/pythainlp/tutorials) on how to apply these functions to ML problems.
+Please see [our tutorials](https://www.thainlp.org/pythainlp/tutorials) on how to apply these functions to machine-learning problems.
+
 
 ## Installation
 
@@ -66,7 +68,7 @@ pip install --upgrade pythainlp
 ```
 
 This will install the latest stable release of PyThaiNLP.
-PyThaiNLP uses pip as its package manger and PyPI as its main distribution channel, see [https://pypi.org/project/pythainlp/](https://pypi.org/project/pythainlp/)
+PyThaiNLP uses pip as its package manager and PyPI as its main distribution channel, see [https://pypi.org/project/pythainlp/](https://pypi.org/project/pythainlp/)
 
 Install different releases:
 
@@ -99,9 +101,9 @@ pip install pythainlp[extra1,extra2,...]
 For dependency details, look at `extras` variable in [`setup.py`](https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py).
 
 
-## Command-line
+## Command-Line Interface
 
-Some of PyThaiNLP functionalities can be used at command line, using `thainlp`
+Some of PyThaiNLP functionalities can be used at command line, using `thainlp` command.
 
 For example, displaying a catalog of datasets:
 ```sh
@@ -121,6 +123,7 @@ thainlp help
 - [Upgrade ThaiNER from 1.7](https://github.com/PyThaiNLP/pythainlp/wiki/Upgrade-ThaiNER-from-PyThaiNLP-1.7-to-PyThaiNLP-2.0)
 - Python 2.7 users can use PyThaiNLP 1.6
 
+
 ## Citations
 
 If you use `PyThaiNLP` in your project or publication, please cite the library as follows
@@ -148,6 +151,7 @@ or BibTeX entry:
 - Please do fork and create a pull request :)
 - For style guide and other information, including references to algorithms we use, please refer to our [contributing](https://github.com/PyThaiNLP/pythainlp/blob/dev/CONTRIBUTING.md) page.
 
+
 ## Licenses
 
 | | License |
@@ -157,6 +161,7 @@ or BibTeX entry:
 | Language models created by PyThaiNLP | [Creative Commons Attribution 4.0 International Public License (CC-by)](https://creativecommons.org/licenses/by/4.0/) |
 | Other corpora and models that may included with PyThaiNLP | See [Corpus License](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/corpus_license.md) |
 
+
 ## Sponsors
 
 [![VISTEC-depa Thailand Artificial Intelligence Research Institute](https://airesearch.in.th/assets/img/logo/airesearch-logo.svg)](https://airesearch.in.th/)
@@ -168,3 +173,13 @@ Since 2019, our contributors Korakot Chaovavanich and Lalita Lowphansirikul have
 <div align="center">
 Made with ❤️ | PyThaiNLP Team 💻 | "We build Thai NLP" 🇹🇭
 </div>
+
+------
+
+<div align="center">
+<strong>We have only one official repository at https://github.com/PyThaiNLP/pythainlp and another mirror at https://gitlab.com/pythainlp/pythainlp </strong>
+</div>
+
+<div align="center">
+<strong>Beware of malware if you use code from mirrors other than the official two at GitHub and GitLab.</strong>
+</div>

README_TH.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ PyThaiNLP เป็นไลบารีภาษาไพทอนสำหร
 
 | รุ่น | คำอธิบาย | สถานะ |
 |:------:|:--:|:------:|
-| [2.2.3](https://github.com/PyThaiNLP/pythainlp/releases) | Stable | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/330) |
+| [2.2.4](https://github.com/PyThaiNLP/pythainlp/releases) | Stable | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/330) |
 | [`dev`](https://github.com/PyThaiNLP/pythainlp/tree/dev) | Release Candidate for 2.3 | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/445) |
 
 ติดตามพวกเราบน [PyThaiNLP Facebook page](https://www.facebook.com/pythainlp/) เพื่อรับข่าวสารเพิ่มเติม

docs/api/corpus.rst

Lines changed: 9 additions & 8 deletions
@@ -7,20 +7,21 @@ The :class:`pythainlp.corpus` provides access to corpus that comes with PyThaiNL
 Modules
 -------
 
+.. autofunction:: countries
 .. autofunction:: get_corpus
 .. autofunction:: get_corpus_db
 .. autofunction:: get_corpus_db_detail
 .. autofunction:: get_corpus_path
 .. autofunction:: download
 .. autofunction:: remove
-.. autofunction:: pythainlp.corpus.common.countries
-.. autofunction:: pythainlp.corpus.common.provinces
-.. autofunction:: pythainlp.corpus.common.thai_stopwords
-.. autofunction:: pythainlp.corpus.common.thai_words
-.. autofunction:: pythainlp.corpus.common.thai_syllables
-.. autofunction:: pythainlp.corpus.common.thai_negations
-.. autofunction:: pythainlp.corpus.common.thai_female_names
-.. autofunction:: pythainlp.corpus.common.thai_male_names
+.. autofunction:: provinces
+.. autofunction:: thai_stopwords
+.. autofunction:: thai_words
+.. autofunction:: thai_syllables
+.. autofunction:: thai_negations
+.. autofunction:: thai_family_names
+.. autofunction:: thai_female_names
+.. autofunction:: thai_male_names
 .. autofunction:: pythainlp.corpus.conceptnet.edges
 
 TNC

docs/api/tag.rst

Lines changed: 53 additions & 11 deletions
@@ -2,12 +2,12 @@
 
 pythainlp.tag
 =====================================
-The :class:`pythainlp.tag` contains functions that are used to tag different parts of a text including
-Part-of-Speech (POS) tags, and Named Entity Recognition (NER) tag.
+The :class:`pythainlp.tag` contains functions that are used to mark linguistic and other annotation to different parts of a text including
+part-of-speech (POS) tag and named entity (NE) tag.
 
-For the POS tags, there are two set of tags including `Universal Dependencies (UD) <https://universaldependencies.org/>`_ and ORCHID [#Sornlertlamvanich_2000]_ POS tags.
+For POS tags, there are three set of available tags: `Universal POS tags <https://universaldependencies.org/>`_, ORCHID POS tags [#Sornlertlamvanich_2000]_, and LST20 POS tags [#Prachya_2020]_.
 
-The following table shows the list of Part-of-Speech (POS) tags according to Universal Dependencies (UD) POS tags:
+The following table shows Universal POS tags as used in Universal Dependencies (UD):
 
 ============ ========================== =============================
 Abbreviation Part-of-Speech tag Examples
@@ -29,7 +29,7 @@ Abbreviation Part-of-Speech tag Examples
 VERB Verb เปิด, ให้, ใช้, เผชิญ, อ่าน
 ============ ========================== =============================
 
-The following table shows the list of Part-of-Speech (POS) tags according to ORCHID POS tags from the paper:
+The following table shows POS tags as used in ORCHID:
 
 ============ ================================================= =================================
 Abbreviation Part-of-Speech tag Examples
@@ -93,7 +93,7 @@ Abbreviation Part-of-Speech tag Examples
 
 ORCHID corpus uses different set of POS tags. Thus, we make UD POS tags version for ORCHID corpus.
 
-The following table shows the mapping of Part-of-Speech (POS) tags from ORCHID POS tags to UD POS tags:
+The following table shows the mapping of POS tags from ORCHID to UD:
 
 =============== =======================
 ORCHID POS tags Coresponding UD POS tag
@@ -161,15 +161,54 @@ PUNCT PUNCT
 PUNC PUNCT
 =============== =======================
 
-For the NER, we use `Inside-outside-beggining (IOB) <https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)>`_ format to tag NER for each words.
-For instance, given a sentence "บารัค โอบามาเป็นประธานธิปดี", it would be tag the tokens "บารัค", "โอบามา", "เป็น", "ประธานาธิปดี" as "B-PERSON", "I-PERSON", "I-PERSON", "O", and "O" respectively.
+Details about LST20 POS tags are available in [#Prachya_2020]_.
 
-The *B-* prefix indicates begining token for a chunk of person name, "บารัค โอบามา" and *I-* prefix indicates the intermediate token. However, the term *O* indicates that a token not belong to any NER chunk.
+The following table shows the mapping of POS tags from LST20 to UD:
 
-The following table shows the list of Named Entity Recognition (NER) tags:
+----------------+-------------------------+
+| LST20 POS tags | Coresponding UD POS tag |
++================+=========================+
+| AJ             | ADJ                     |
++----------------+-------------------------+
+| AV             | ADV                     |
++----------------+-------------------------+
+| AX             | AUX                     |
++----------------+-------------------------+
+| CC             | CCONJ                   |
++----------------+-------------------------+
+| CL             | NOUN                    |
++----------------+-------------------------+
+| FX             | NOUN                    |
++----------------+-------------------------+
+| IJ             | INTJ                    |
++----------------+-------------------------+
+| NN             | NOUN                    |
++----------------+-------------------------+
+| NU             | NUM                     |
++----------------+-------------------------+
+| PA             | PART                    |
++----------------+-------------------------+
+| PR             | PROPN                   |
++----------------+-------------------------+
+| PS             | ADP                     |
++----------------+-------------------------+
+| PU             | PUNCT                   |
++----------------+-------------------------+
+| VV             | VERB                    |
++----------------+-------------------------+
+| XX             | X                       |
++----------------+-------------------------+
+
+For the NE, we use `Inside-outside-beggining (IOB) <https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)>`_ format to tag NE for each word.
+
+*B-* prefix indicates the begining token of the chunk. *I-* prefix indicates the intermediate token within the chunk. *O* indicates that the token does not belong to any NE chunk.
+
+For instance, given a sentence "บารัค โอบามาเป็นประธานธิปดี", it would tag the tokens "บารัค", "โอบามา", "เป็น", "ประธานาธิปดี" with "B-PERSON", "I-PERSON", "O", and "O" respectively.
+
+The following table shows named entity (NE) tags as used PyThaiNLP:
 
 ============================ =================================
-Named Entity Recognition tag Examples
+Named Entity tag Examples
 ============================ =================================
 DATE 2/21/2004, 16 ก.พ., จันทร์
 TIME 16.30 น., 5 วัน, 1-3 ปี
@@ -214,3 +253,6 @@ References
 .. [#Sornlertlamvanich_2000] Virach Sornlertlamvanich, Naoto Takahashi and Hitoshi Isahara. (2000).
 Building a Thai Part-Of-Speech Tagged Corpus (ORCHID).
 The Journal of the Acoustical Society of Japan (E), Vol.20, No.3, pp 189-198, May 1999.
+.. [#Prachya_2020] Prachya Boonkwan and Vorapon Luantangsrisuk and Sitthaa Phaholphinyo and Kanyanat Kriengket and Dhanon Leenoi and Charun Phrombut and Monthika Boriboon and Krit Kosawat and Thepchai Supnithi. (2020).
+The Annotation Guideline of LST20 Corpus.
+arXiv:2008.05055
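
Reviewer note: the LST20-to-UD table added to docs/api/tag.rst in this commit is a plain one-to-one lookup, so it can be sketched as a dictionary. The snippet below is only an illustration of the table, not PyThaiNLP's implementation. The names `LST20_TO_UD` and `lst20_to_ud` are hypothetical, and the fallback to `X` for tags missing from the table is an assumption.

```python
# Mapping copied row by row from the LST20-to-UD table in docs/api/tag.rst.
LST20_TO_UD = {
    "AJ": "ADJ",
    "AV": "ADV",
    "AX": "AUX",
    "CC": "CCONJ",
    "CL": "NOUN",
    "FX": "NOUN",
    "IJ": "INTJ",
    "NN": "NOUN",
    "NU": "NUM",
    "PA": "PART",
    "PR": "PROPN",
    "PS": "ADP",
    "PU": "PUNCT",
    "VV": "VERB",
    "XX": "X",
}


def lst20_to_ud(tagged_words):
    """Hypothetical helper: convert (word, LST20 tag) pairs to (word, UD tag) pairs."""
    # Assumption: a tag that is not in the table falls back to the UD tag "X".
    return [(word, LST20_TO_UD.get(tag, "X")) for word, tag in tagged_words]


if __name__ == "__main__":
    print(lst20_to_ud([("แมว", "NN"), ("กิน", "VV"), ("ปลา", "NN")]))
```

Keeping the table as data rather than branching code makes it easy to compare against the documentation when either one changes.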

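
Reviewer note: the IOB description in docs/api/tag.rst (a `B-` tag opens a chunk, `I-` continues it, `O` is outside any chunk) can be illustrated by grouping tagged tokens back into entities. This is a minimal sketch under stated assumptions, not code from PyThaiNLP. The function name `iob_to_chunks` is hypothetical, tokens are joined without a separator because Thai words are written without spaces, and a stray `I-` tag with no open chunk of the same type is treated as a chunk start.

```python
def iob_to_chunks(tagged_tokens):
    """Hypothetical helper: group (token, IOB tag) pairs into (entity_type, text) chunks."""
    chunks = []      # finished (entity_type, text) pairs
    current = None   # [entity_type, [tokens]] for the chunk being built

    for token, tag in tagged_tokens:
        # "B-PERSON" -> ("B", "-", "PERSON"); "O" -> ("O", "", "")
        prefix, _, entity_type = tag.partition("-")

        if prefix == "I" and current is not None and current[0] == entity_type:
            # I- continues the open chunk of the same entity type.
            current[1].append(token)
            continue

        # Any other tag closes the open chunk, if there is one.
        if current is not None:
            chunks.append((current[0], "".join(current[1])))
            current = None

        if prefix in ("B", "I"):
            # B- opens a new chunk. Assumption: a stray I- also opens one.
            current = [entity_type, [token]]

    # Close a chunk that runs to the end of the input.
    if current is not None:
        chunks.append((current[0], "".join(current[1])))
    return chunks


if __name__ == "__main__":
    # Tokens and tags taken from the example sentence in docs/api/tag.rst.
    example = [
        ("บารัค", "B-PERSON"),
        ("โอบามา", "I-PERSON"),
        ("เป็น", "O"),
        ("ประธานาธิปดี", "O"),
    ]
    print(iob_to_chunks(example))
```

With the documentation's example, the two name tokens form a single `PERSON` chunk and the two `O` tokens are dropped.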
docs/api/util.rst

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ Modules
 .. autofunction:: collate
 .. autofunction:: dict_trie
 .. autofunction:: digit_to_text
+.. autofunction:: display_thai_char
 .. autofunction:: eng_to_thai
 .. autofunction:: find_keyword
 .. autofunction:: countthai

pythainlp/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 # -*- coding: utf-8 -*-
-__version__ = "2.2.3"
+__version__ = "2.2.4"
 
 thai_consonants = "กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮ" # 44 chars
 

pythainlp/corpus/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -18,6 +18,7 @@
 "get_corpus_path",
 "provinces",
 "remove",
+"thai_family_names",
 "thai_female_names",
 "thai_male_names",
 "thai_negations",
@@ -86,6 +87,7 @@ def corpus_db_path() -> str:
 from pythainlp.corpus.common import (
 countries,
 provinces,
+thai_family_names,
 thai_female_names,
 thai_male_names,
 thai_negations,
