Commit 4aaf0c2

Merge pull request #321 from PyThaiNLP/dev
Merge changes from latest dev
2 parents: bdc4a8e + cb27a35

13 files changed: +166 −154 lines

.travis.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -18,7 +18,7 @@ before_install:
   - sudo rm -f /etc/boto.cfg
 
 install:
-  - pip install "tensorflow>=1.14,<2" deepcut
+  - pip install "tensorflow>=2,<3" deepcut
   - pip install -r requirements.txt
   - pip install .[full]
   - pip install coveralls
```

README.md

Lines changed: 4 additions & 4 deletions

```diff
@@ -10,7 +10,7 @@
 [![Build Status](https://travis-ci.org/PyThaiNLP/pythainlp.svg?branch=develop)](https://travis-ci.org/PyThaiNLP/pythainlp)
 [![Build status](https://ci.appveyor.com/api/projects/status/9g3mfcwchi8em40x?svg=true)](https://ci.appveyor.com/project/wannaphongcom/pythainlp-9y1ch)
 [![Codacy Badge](https://api.codacy.com/project/badge/Grade/cb946260c87a4cc5905ca608704406f7)](https://www.codacy.com/app/pythainlp/pythainlp_2?utm_source=github.com&utm_medium=referral&utm_content=PyThaiNLP/pythainlp&utm_campaign=Badge_Grade)
-[![Coverage Status](https://coveralls.io/repos/github/PyThaiNLP/pythainlp/badge.svg?branch=dev)](https://coveralls.io/github/PyThaiNLP/pythainlp?branch=dev) [![Google Colab Badge](https://badgen.net/badge/Launch%20Quick%20Start%20Guide/on%20Google%20Colab/blue?icon=terminal)](https://colab.research.google.com/github/PyThaiNLP/tutorials/blob/master/source/notebooks/pythainlp-get-started.ipynb)
+[![Coverage Status](https://coveralls.io/repos/github/PyThaiNLP/pythainlp/badge.svg?branch=dev)](https://coveralls.io/github/PyThaiNLP/pythainlp?branch=dev) [![Google Colab Badge](https://badgen.net/badge/Launch%20Quick%20Start%20Guide/on%20Google%20Colab/blue?icon=terminal)](https://colab.research.google.com/github/PyThaiNLP/tutorials/blob/master/source/notebooks/pythainlp_get_started.ipynb)
 [![DOI](https://zenodo.org/badge/61813823.svg)](https://zenodo.org/badge/latestdoi/61813823)
 
 Thai Natural Language Processing in Python.
@@ -24,7 +24,7 @@ PyThaiNLP is a Python package for text processing and linguistic analysis, simil
 **This is a document for development branch (post 2.0). Things will break.**
 
 - The latest stable release is [2.0.7](https://github.com/PyThaiNLP/pythainlp/releases)
-- The latest development release is [2.1.dev7](https://github.com/PyThaiNLP/pythainlp/releases). See [2.1 change log](https://github.com/PyThaiNLP/pythainlp/issues/181).
+- The latest development release is [2.1.dev7](https://github.com/PyThaiNLP/pythainlp/releases). See the ongoing [2.1 change log](https://github.com/PyThaiNLP/pythainlp/issues/181).
 - 📫 follow our [PyThaiNLP](https://www.facebook.com/pythainlp/) Facebook page
 
 
@@ -89,7 +89,7 @@ The data location can be changed, using `PYTHAINLP_DATA_DIR` environment variabl
 
 ## Documentation
 
-- [PyThaiNLP Get Started notebook](https://github.com/PyThaiNLP/tutorials/blob/master/source/notebooks/pythainlp-get-started.ipynb)
+- [PyThaiNLP Get Started](https://www.thainlp.org/pythainlp/tutorials/notebooks/pythainlp_get_started.html)
 - More tutorials at [https://www.thainlp.org/pythainlp/tutorials/](https://www.thainlp.org/pythainlp/tutorials/)
 - See full documentation at [https://thainlp.org/pythainlp/docs/2.0/](https://thainlp.org/pythainlp/docs/2.0/)
 
@@ -198,7 +198,7 @@ pip install pythainlp[extra1,extra2,...]
 
 ## เอกสารการใช้งาน
 
-- [notebook เริ่มต้นใช้งาน PyThaiNLP](https://github.com/PyThaiNLP/tutorials/blob/master/source/notebooks/pythainlp-get-started.ipynb)
+- [เริ่มต้นใช้งาน PyThaiNLP](https://www.thainlp.org/pythainlp/tutorials/notebooks/pythainlp_get_started.html)
 - สอนการใช้งานเพิ่มเติม ในรูปแบบ notebook [https://www.thainlp.org/pythainlp/tutorials/](https://www.thainlp.org/pythainlp/tutorials/)
 - เอกสารตัวเต็ม [https://thainlp.org/pythainlp/docs/2.0/](https://thainlp.org/pythainlp/docs/2.0/)
```

(The last hunk updates the Thai-language "เอกสารการใช้งาน" ["Usage documentation"] section, mirroring the change to the English "Documentation" section: the getting-started link now points to the rendered tutorial page instead of the notebook file.)

appveyor.yml

Lines changed: 2 additions & 2 deletions

```diff
@@ -98,8 +98,8 @@ install:
   - pip --version
   - pip install coveralls[yaml]
   - pip install coverage
-  - pip install "tensorflow>=1.14,<2" deepcut
-  - pip install torch==1.2.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
+  - pip install "tensorflow>=2,<3" deepcut
+  - pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
   - pip install %PYICU_PKG%
   - pip install %ARTAGGER_PKG%
   - pip install -e .[full]
```

docs/notes/pythainlp-1_7-2_0.rst

Lines changed: 0 additions & 96 deletions
This file was deleted.

pythainlp/corpus/__init__.py

Lines changed: 6 additions & 6 deletions

```diff
@@ -18,7 +18,7 @@
 _CORPUS_DB_URL = (
     "https://raw.githubusercontent.com/"
     + "PyThaiNLP/pythainlp-corpus/"
-    + "master/db.json"
+    + "2.1/db.json"
 )
 
 _CORPUS_DB_FILENAME = "db.json"
@@ -165,12 +165,12 @@ def _check_hash(dst: str, md5: str) -> NoReturn:
     @param: md5 place to hash the file (MD5)
     """
     if md5 and md5 != "-":
-        f = open(get_full_data_path(dst), "rb")
-        content = f.read()
-        file_md5 = hashlib.md5(content).hexdigest()
+        with open(get_full_data_path(dst), "rb") as f:
+            content = f.read()
+            file_md5 = hashlib.md5(content).hexdigest()
 
-        if md5 != file_md5:
-            raise Exception("Hash does not match expected.")
+            if md5 != file_md5:
+                raise Exception("Hash does not match expected.")
 
 
 def download(name: str, force: bool = False) -> NoReturn:
```
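The `_check_hash()` change above swaps a bare `open()` for a `with` block, so the file handle is closed even if reading fails. A minimal stdlib sketch of the same pattern (`md5_matches` and its plain `path` argument are illustrative stand-ins, not part of the PyThaiNLP API, which resolves paths via `get_full_data_path()` and raises on mismatch):

```python
import hashlib


def md5_matches(path: str, expected_md5: str) -> bool:
    """Return True if the file's MD5 hex digest equals expected_md5."""
    # "with" guarantees the handle is closed even if read() raises,
    # which the old open()/read() version did not guarantee.
    with open(path, "rb") as f:
        file_md5 = hashlib.md5(f.read()).hexdigest()
    return file_md5 == expected_md5
```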

pythainlp/tag/named_entity.py

Lines changed: 3 additions & 3 deletions

```diff
@@ -76,10 +76,10 @@ def __init__(self):
         """
         Thai named-entity recognizer
         """
-        self.__data_path = get_corpus_path("thainer-1-2")
+        self.__data_path = get_corpus_path("thainer-1-3")
         if not self.__data_path:
-            download("thainer-1-2")
-            self.__data_path = get_corpus_path("thainer-1-2")
+            download("thainer-1-3")
+            self.__data_path = get_corpus_path("thainer-1-3")
         self.crf = sklearn_crfsuite.CRF(
             algorithm="lbfgs",
             c1=0.1,
```
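The hunk above only bumps the corpus name from `thainer-1-2` to `thainer-1-3`, but it also shows the get-or-download idiom the tagger uses: look the corpus up locally, download it once if missing, then look it up again. A generic sketch of that idiom (`ensure_corpus` and its two callables are hypothetical stand-ins for `pythainlp.corpus.get_corpus_path()` and `pythainlp.corpus.download()`):

```python
from typing import Callable, Optional


def ensure_corpus(
    name: str,
    get_path: Callable[[str], Optional[str]],
    download: Callable[[str], None],
) -> Optional[str]:
    """Return a local path for a named corpus, fetching it on first use."""
    path = get_path(name)
    if not path:
        download(name)        # fetch once...
        path = get_path(name)  # ...then resolve the path again
    return path
```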

pythainlp/tokenize/__init__.py

Lines changed: 22 additions & 6 deletions

```diff
@@ -33,6 +33,8 @@ def word_tokenize(
     **Options for engine**
         * *newmm* (default) - dictionary-based, Maximum Matching +
           Thai Character Cluster
+        * *newmm-safe* - newmm, with a mechanism to avoid long
+          processing time for some long continuous text without spaces
         * *longest* - dictionary-based, Longest Matching
         * *icu* - wrapper for ICU (International Components for Unicode,
           using PyICU), dictionary-based
@@ -101,10 +103,15 @@ def word_tokenize(
         return []
 
     segments = []
+
     if engine == "newmm" or engine == "onecut":
         from .newmm import segment
 
         segments = segment(text, custom_dict)
+    elif engine == "newmm-safe":
+        from .newmm import segment
+
+        segments = segment(text, custom_dict, safe_mode=True)
     elif engine == "attacut":
         from .attacut import segment
 
@@ -157,6 +164,7 @@ def dict_word_tokenize(
     :param bool keep_whitespace: True to keep whitespaces, a common mark
         for end of phrase in Thai
     :return: list of words
+    :rtype: list[str]
     """
     warnings.warn(
         "dict_word_tokenize is deprecated. Use word_tokenize with a custom_dict argument instead.",
@@ -336,6 +344,7 @@ def syllable_tokenize(text: str, engine: str = "default") -> List[str]:
             tokens.extend(word_tokenize(text=word, custom_dict=trie))
     else:
         from .ssg import segment
+
         tokens = segment(text)
 
     return tokens
@@ -345,9 +354,10 @@ def dict_trie(dict_source: Union[str, Iterable[str], Trie]) -> Trie:
     """
     Create a dictionary trie which will be used for word_tokenize() function.
 
-    :param string/list dict_source: a list of vocaburaries or a path
-        to source file
-    :return: a trie created from a dictionary input
+    :param str|Iterable[str]|pythainlp.tokenize.Trie dict_source: a path to
+        dictionary file or a list of words or a pythainlp.tokenize.Trie object
+    :return: a trie object created from a dictionary input
+    :rtype: pythainlp.tokenize.Trie
     """
     trie = None
 
@@ -359,7 +369,9 @@ def dict_trie(dict_source: Union[str, Iterable[str], Trie]) -> Trie:
             _vocabs = f.read().splitlines()
             trie = Trie(_vocabs)
     elif isinstance(dict_source, Iterable):
-        # Note: Trie and str are both Iterable, Iterable check should be here
+        # Note: Since Trie and str are both Iterable,
+        # so the Iterable check should be here, at the very end,
+        # because it has less specificality
         # Received a sequence type object of vocabs
         trie = Trie(dict_source)
     else:
@@ -435,7 +447,9 @@ class Tokenizer:
     """
 
     def __init__(
-        self, custom_dict: Union[Trie, Iterable[str], str] = None, engine: str = "newmm"
+        self,
+        custom_dict: Union[Trie, Iterable[str], str] = None,
+        engine: str = "newmm",
     ):
         """
         Initialize tokenizer object
@@ -458,7 +472,9 @@ def word_tokenize(self, text: str) -> List[str]:
         :return: list of words, tokenized from the text
         :rtype: list[str]
         """
-        return word_tokenize(text, custom_dict=self.__trie_dict, engine=self.__engine)
+        return word_tokenize(
+            text, custom_dict=self.__trie_dict, engine=self.__engine
+        )
 
     def set_tokenize_engine(self, engine: str) -> None:
         """
```

pythainlp/tokenize/etcc.py

Lines changed: 1 addition & 2 deletions

```diff
@@ -1,6 +1,6 @@
 # -*- coding: utf-8 -*-
 """
-Enhanced Thai Character Cluster (ETCC)
+Enhanced Thai Character Cluster (ETCC) (In progress)
 Python implementation by Wannaphong Phatthiyaphaibun (19 June 2017)
 
 :See Also:
@@ -75,5 +75,4 @@ def segment(text: str) -> str:
         text = re.sub(i, ii + "/", text)
 
     text = re.sub("//", "/", text)
-
     return text.split("/")
```
