Commit ca27105

Merge pull request #325 from PyThaiNLP/dev
Update the 2.1 branch from the dev branch.
2 parents df4202c + fbc87a5 commit ca27105

12 files changed: +27 -60 lines changed

README.md

Lines changed: 1 addition & 3 deletions
@@ -24,7 +24,7 @@ PyThaiNLP is a Python package for text processing and linguistic analysis, simil
 **This is a document for development branch (post 2.0). Things will break.**
 
 - The latest stable release is [2.0.7](https://github.com/PyThaiNLP/pythainlp/releases)
-- The latest development release is [2.1.dev7](https://github.com/PyThaiNLP/pythainlp/releases). See the ongoing [2.1 change log](https://github.com/PyThaiNLP/pythainlp/issues/181).
+- The latest development release is [2.1.dev8](https://github.com/PyThaiNLP/pythainlp/releases). See the ongoing [2.1 change log](https://github.com/PyThaiNLP/pythainlp/issues/181).
 - 📫 follow our [PyThaiNLP](https://www.facebook.com/pythainlp/) Facebook page
 
 
@@ -68,7 +68,6 @@ pip install pythainlp[extra1,extra2,...]
 ```
 
 where `extras` can be
-- `artagger` (to support artagger part-of-speech tagger)
 - `attacut` (to support attacut, a fast and accurate tokenizer)
 - `icu` (for ICU, International Components for Unicode, support in transliteration and tokenization)
 - `ipa` (for IPA, International Phonetic Alphabet, support in transliteration)
@@ -177,7 +176,6 @@ pip install pythainlp[extra1,extra2,...]
 ```
 
 โดยที่ `extras` คือ
-- `artagger` (สำหรับตัวติดป้ายกำกับชนิดคำ artagger)
 - `attacut` (ตัวตัดคำที่แม่นกว่า `newmm` เมื่อเทียบกับชุดข้อมูล BEST)
 - `icu` (สำหรับการถอดตัวสะกดเป็นสัทอักษรและการตัดคำด้วย ICU)
 - `ipa` (สำหรับการถอดตัวสะกดเป็นสัทอักษรสากล (IPA))

appveyor.docs.yml

Lines changed: 2 additions & 2 deletions
@@ -42,8 +42,8 @@ install:
 - export LD_LIBRARY_PATH=/usr/local/lib
 - sudo pip3 install -r requirements.txt
 - sudo pip3 install torch==1.2.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
-- sudo pip3 install --upgrade artagger emoji epitran gensim numpy pandas pyicu sklearn-crfsuite ssg
-- sudo pip3 install --upgrade "tensorflow==1.14,<2"deepcut
+- sudo pip3 install --upgrade emoji epitran gensim numpy pandas pyicu sklearn-crfsuite ssg
+- sudo pip3 install --upgrade "tensorflow>=2,<3"deepcut
 - sudo pip3 install --upgrade boto smart_open sphinx sphinx-rtd-theme
 
 #---------------------------------#

appveyor.yml

Lines changed: 0 additions & 2 deletions
@@ -48,7 +48,6 @@ environment:
 PYTHONIOENCODING: "utf-8"
 ICU_VERSION: "64.2"
 DISTUTILS_USE_SDK: "1"
-ARTAGGER_PKG: "https://github.com/franziz/artagger/archive/master.zip"
 PYTHAINLP_DATA_DIR: "%LOCALAPPDATA%/pythainlp-data"
 
 matrix:
@@ -101,7 +100,6 @@ install:
 - pip install "tensorflow>=2,<3" deepcut
 - pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
 - pip install %PYICU_PKG%
-- pip install %ARTAGGER_PKG%
 - pip install -e .[full]
 
 #---------------------------------#

docs/api/tag.rst

Lines changed: 1 addition & 5 deletions
@@ -207,14 +207,10 @@ unigram
 
 Unigram tagger doesn't take the ordering of words in the list into account.
 
-artagger
-++++++++
-
-`artagger <https://github.com/franziz/artagger>`_ is an implementation of `RDRPOSTagger <https://github.com/datquocnguyen/RDRPOSTagger>`_ for tagging POS in Thai language.
 
 References
 ----------
 
 .. [#Sornlertlamvanich_2000] Takahashi, Naoto & Isahara, Hitoshi & Sornlertlamvanich, Virach. (2000).
 Building a Thai part-of-speech tagged corpus (ORCHID).
-ournal of the Acoustical Society of Japan (E). 20. 10.1250/ast.20.189.
+Journal of the Acoustical Society of Japan (E). 20. 10.1250/ast.20.189.
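
On the unigram description kept in this file: a unigram tagger assigns each word a tag independently of its neighbours, which is why word order has no effect. A minimal Python sketch of that idea follows, assuming a tiny hand-written lexicon. `LEXICON` and `unigram_tag` are illustrative names, not PyThaiNLP's implementation.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical word-to-tag lexicon, for illustration only.
LEXICON: Dict[str, str] = {"มี": "VERB", "จำนวน": "NOUN", "3": "NUM"}


def unigram_tag(words: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Tag each word by dictionary lookup; unknown words get None."""
    return [(word, LEXICON.get(word)) for word in words]


words = ["มี", "จำนวน", "3", "ขา"]
tagged = unigram_tag(words)

# Reordering the input only reorders the output; each word keeps its tag.
reordered = unigram_tag(list(reversed(words)))
```

Because every lookup is independent, a word missing from the lexicon is tagged `None` regardless of its context.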

docs/notes/installation.rst

Lines changed: 0 additions & 1 deletion
@@ -14,7 +14,6 @@ For some functionalities, like named entity recognition, extra packages may be n
 pip install pythainlp[extra1,extra2,...]
 
 where ``extras`` can be
-- ``artagger`` (to support artagger part-of-speech tagger)
 - ``attacut`` (to support attacut, a fast and accurate tokenizer)
 - ``icu`` (for ICU, International Components for Unicode, support in transliteration and tokenization)
 - ``ipa`` (for IPA, International Phonetic Alphabet, support in transliteration)

pythainlp/cli/soundex.py

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 class App:
 
     def __init__(self, argv):
-        parser = argparse.ArgumentParser("sounddex")
+        parser = argparse.ArgumentParser("soundex")
         parser.add_argument(
             "--text",
             type=str,
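
For context on why this one-character fix matters: the first positional argument of `argparse.ArgumentParser` is `prog`, the program name printed in usage and help text, so the old spelling would have appeared in the command's help output. A minimal standalone sketch follows. The `--text` option mirrors the diff; the help string and the sample argument are assumptions.

```python
import argparse

# The first positional argument of ArgumentParser is `prog`,
# the name shown in usage and help messages.
parser = argparse.ArgumentParser("soundex")
parser.add_argument(
    "--text",
    type=str,
    help="input text (illustrative help string)",
)

# The usage line starts with the program name given above.
usage = parser.format_usage()

# Parse an explicit argument list instead of sys.argv.
args = parser.parse_args(["--text", "example"])
```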

pythainlp/corpus/words_th.txt

Lines changed: 0 additions & 1 deletion
@@ -15594,7 +15594,6 @@
 ชิงไหวชิงพริบ
 ชิงฮื้อ
 ชิชะ
-ชิชิ
 ชิณณะ
 ชิด
 ชิดขวา

pythainlp/tag/__init__.py

Lines changed: 7 additions & 29 deletions
@@ -104,23 +104,14 @@ def _orchid_to_ud(tag) -> List[Tuple[str, str]]:
     _i = 0
     temp = []
     while _i < len(tag):
-        temp.append((tag[_i][0], _UD_Exception(tag[_i][0], _TAG_MAP_UD[tag[_i][1]])))
+        temp.append(
+            (tag[_i][0], _UD_Exception(tag[_i][0], _TAG_MAP_UD[tag[_i][1]]))
+        )
         _i += 1
 
     return temp
 
 
-def _artagger_tag(words: List[str], corpus: str = None) -> List[Tuple[str, str]]:
-    if not words:
-        return []
-
-    from artagger import Tagger
-
-    words_ = Tagger().tag(" ".join(words))
-
-    return [(word.word, word.tag) for word in words_]
-
-
 def pos_tag(
     words: List[str], engine: str = "perceptron", corpus: str = "orchid"
 ) -> List[Tuple[str, str]]:
@@ -132,7 +123,6 @@ def pos_tag(
     :param str engine:
         * *perceptron* - perceptron tagger (default)
         * *unigram* - unigram tagger
-        * *artagger* - RDR POS tagger
     :param str corpus:
         * *orchid* - annotated Thai academic articles namedly
           `Orchid <https://www.academia.edu/9127599/Thai_Treebank>`_ (default)
@@ -145,10 +135,6 @@ def pos_tag(
     :return: returns a list of labels regarding which part of speech it is
     :rtype: list[tuple[str, str]]
 
-    :Note:
-        * *artagger*, only support one sentence and the sentence must
-          be tokenized beforehand.
-
     :Example:
 
     Tag words with corpus `orchid` (default)::
@@ -187,8 +173,7 @@ def pos_tag(
         # ('ใน', 'ADP'), ('อาคาร', 'NOUN'), ('หลบภัย', 'NOUN'),
         # ('ของ', 'ADP'), ('นายก', 'NOUN'), ('เชอร์ชิล', 'PROPN')]
 
-    Tag words with different engines including *perceptron*, *unigram*,
-    and *artagger*::
+    Tag words with different engines including *perceptron* and *unigram*::
 
         from pythainlp.tag import pos_tag
 
@@ -204,12 +189,6 @@ def pos_tag(
         # output:
         # [('เก้าอี้', None), ('มี', 'VERB'), ('จำนวน', 'NOUN'), ('ขา', None),
         # ('<space>', None), ('<equal>', None), ('3', 'NUM')]
-
-        pos_tag(words, engine='artagger', corpus='orchid')
-        # output:
-        # [('เก้าอี้', 'NCMN'), ('มี', 'VSTA'), ('จำนวน', 'NCMN'),
-        # ('ขา', 'NCMN'), ('<space>', 'PUNC'),
-        # ('<equal>', 'PUNC'), ('3', 'NCNM')]
     """
 
     # NOTE:
@@ -222,8 +201,6 @@ def pos_tag(
 
     if engine == "perceptron":
         from .perceptron import tag as tag_
-    elif engine == "artagger":
-        tag_ = _artagger_tag
     else: # default, use "unigram" ("old") engine
         from .unigram import tag as tag_
     _tag = tag_(words, corpus=corpus)
@@ -235,7 +212,9 @@ def pos_tag(
 
 
 def pos_tag_sents(
-    sentences: List[List[str]], engine: str = "perceptron", corpus: str = "orchid"
+    sentences: List[List[str]],
+    engine: str = "perceptron",
+    corpus: str = "orchid",
 ) -> List[List[Tuple[str, str]]]:
     """
     The function tag multiple list of tokenized words into Part-of-Speech
@@ -245,7 +224,6 @@ def pos_tag_sents(
     :param str engine:
         * *perceptron* - perceptron tagger (default)
         * *unigram* - unigram tagger
-        * *artagger* - RDR POS tagger
     :param str corpus:
         * *orchid* - annotated Thai academic articles namedly\
           `Orchid <https://www.academia.edu/9127599/Thai_Treebank>`_\

pythainlp/tokenize/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -418,7 +418,7 @@ class Tokenizer:
         # 'ผิดปกติ', 'ของ', 'การ', 'พูด']
 
     Tokenizer object instantiated with a file path containing list of
-    word separated with *newline* and explicitly set a new tokeneizer
+    word separated with *newline* and explicitly set a new tokenizer
     after initiation::
 
         PATH_TO_CUSTOM_DICTIONARY = './custom_dictionary.txtt'

setup.py

Lines changed: 1 addition & 3 deletions
@@ -43,8 +43,7 @@
 ]
 
 extras = {
-    "artagger": ["artagger>=0.1.0.3"],
-    "attacut": ["attacut>=1.0.4"],
+    "attacut": ["attacut>=1.0.6"],
     "benchmarks": ["numpy>=1.16", "pandas>=0.24"],
     "icu": ["pyicu>=2.3"],
     "ipa": ["epitran>=1.1"],
@@ -54,7 +53,6 @@
     "thai2fit": ["emoji>=0.5.1", "gensim>=3.2.0", "numpy>=1.16"],
     "thai2rom": ["torch>=1.0.0", "numpy>=1.16"],
     "full": [
-        "artagger>=0.1.0.3",
         "attacut>=1.0.4",
         "emoji>=0.5.1",
         "epitran>=1.1",

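In the `setup.py` hunks, the context lines show `"full"` still listing `attacut>=1.0.4` while the standalone `attacut` extra moves to `>=1.0.6`. One way to avoid that kind of drift, offered here as an assumption rather than as what `setup.py` does, is to derive `"full"` from the other extras. The dictionary below copies only the entries visible in this diff, so it is not the complete set of extras.

```python
# Extras copied from the lines visible in the diff (not the complete set).
extras = {
    "attacut": ["attacut>=1.0.6"],
    "benchmarks": ["numpy>=1.16", "pandas>=0.24"],
    "icu": ["pyicu>=2.3"],
    "ipa": ["epitran>=1.1"],
    "thai2fit": ["emoji>=0.5.1", "gensim>=3.2.0", "numpy>=1.16"],
    "thai2rom": ["torch>=1.0.0", "numpy>=1.16"],
}

# Build "full" as the sorted union of every other extra's requirements,
# so a version bump in one place is picked up automatically.
extras["full"] = sorted({req for reqs in extras.values() for req in reqs})
```

This deduplicates only identical requirement strings. If two extras pinned different versions of the same package, both strings would appear in `"full"` and would still need reconciling by hand.
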
0 commit comments
