Commit eb09bc9

Merge branch 'dev' into ner-tag
2 parents 937aecb + 1d26919, commit eb09bc9


41 files changed: +5012 −2806 lines

.gitignore

Lines changed: 1 addition & 1 deletion

@@ -82,4 +82,4 @@ docs/_build/doctrees/api/
 
 docs/_build/html/
 
-docs/_build/doctrees/
+docs/_build/doctrees/

.travis.yml

Lines changed: 3 additions & 4 deletions

@@ -15,10 +15,9 @@ script: coverage run --source=pythainlp setup.py test
 after_success: coveralls
 deploy:
   provider: pypi
-  distributions: sdist bdist_wheel
+  distributions: bdist_wheel
   user: wannaphong
   password:
-    secure: zX35+8niw5W9H8XbFwacrDAhqyIibdUdC/cARnHlmxLN/2H9IynK0NW04UZwkBlrwrIZrU/g+cqYXFQXu6jE1ozlBKBxUd3xG8d1kixuntI0j9e+erPTs8Ju/KazUZtlknJPvnDMP+/1Dq+RMnMCP3RRlBrH6lvG70OgZ1aBpgx8FxRfs0xHfBIZvo5CVtR/QlDzhDJM1cgEyWkSgnlAhPxpv8qIQbh4/Rw89jXIZqv0bGCVJorrrcTA1oCzkr/4E4u/WZaARnvPjUr2a9U1w7C2IysDHiBfqQWlovdMmpoSLFE56YlG3smbmXfldWjmiMRQoWL+Ifu+smisvOLmR0ja78UMrrhHWP4mdzIeBVVRnT6eHUv0ChmLT2uCkOLE0newhtEJIYToot2TSoLFavXXIQB1fIHt6e74KRTV6WGnm0nFfHuGP+b5SgSPQFgqx8tBpn0rBOeqZ1y3pRISc/drF0F4reWMnlqoQfZZFmLmU1UmDZbvWNvXPu6MWyyuZ1F6fE9jyb3mG+kDuJf1PZ4ejC/sdIvpLlwUGLFGzRMa2TtxXqGq5CWsywPxo8Sx+bpMPCOImuW60PB9K/xKgfLhAtb7gZwndzUGqDbtSJCd5PmTkfEH8fawv/XnydvsssYUpipBCmFDZlNREyAkgOcLlL099Y5fAO8l2gOLyKs=
+    secure: Tj3VA05qopp0mkzWu6EFTlvijAoisd0BN/gD2c/vaaDCUy6fTXBkYk+dTkjbmYkEBl/WrsrW1T/QxCt2uc6bv7QTz+qL243Edv4FFQbBKvMSNlUO+hh1jI9zv3/QzwOaNHXOsI4JGeUaN5cULfxBjsBEFN+v6E0mkgBwJ0Qdb0/yuMybLWZ9dJI8iUKiaWNIr+NQoa9a+Sxw6Ltl/mdCKPppgOYPpVMCsDDdLqZdjkgXmzsjH9+Nfe6R+mYbdmeigy3ePNsIKbPkzZrY+E/I0lPZOVUgrs6gvZwlD3gESJgTROrUH6E2lBP9yYvFUE3KB0O+rdT5GyFq3MH1uD2ocrPCTQku6577wK31FzGoex6vtT4y2b39fLbhRmZDOJW8IFO7MLybazuRsNhaXn9hQU4HBRM2GQZc41bLkiEhsUX9/b2ujcn4PJKDZy91LnBw/93bgZJ7KweDzKywmcZSNeuBsGWgXdPqYiizzcf8DdvJAYytydhf8RxqdemTiS7GE7XBoXhj1/9Vfrt3lZXZbfYpTjNZeyxu7FrUJpm/I23wCw46qaRWzKXv2sRRUleNqQ1jIKEVupIa9sruHvG7DZecErhO9rMkGdsf4CIjolZ0A2BE+eAPEEY6/H1WFUWHxzxuELbUJwxnl1By677hBkLJaVs1YMGc2enGWzOnUYI=
   on:
-    tags: true
-  repo: pythainlp/pythainlp
+    tags: true

README-pypi.md

Lines changed: 16 additions & 13 deletions

@@ -8,20 +8,15 @@ PyThaiNLP includes Thai word tokenizers, transliterators, soundex converters, pa
 
 📫 follow us on Facebook [PyThaiNLP](https://www.facebook.com/pythainlp/)
 
-## What's new in 2.0 ?
+## What's new in 2.1 ?
 
-- Terminate Python 2 support. Remove all Python 2 compatibility code.
 - Improved `word_tokenize` ("newmm" and "mm" engine), a `custom_dict` dictionary can be provided
-- Improved `pos_tag` Part-Of-Speech tagging
-- New `NorvigSpellChecker` spell checker class, which can be initialized with custom dictionary.
-- New `thai2fit` (replacing `thai2vec`, upgrade ULMFiT-related code to fastai 1.0)
-- Updated ThaiNER to 1.0
-- You may need to [update your existing ThaiNER models from PyThaiNLP 1.7](https://github.com/PyThaiNLP/pythainlp/wiki/Upgrade-ThaiNER-from-PyThaiNLP-1.7-to-PyThaiNLP-2.0)
-- Remove old, obsolated, deprecated, duplicated, and experimental code.
-- Sentiment analysis is no longer part of the library, but rather [a text classification example](https://github.com/PyThaiNLP/pythainlp/blob/dev/notebooks/sentiment_analysis.ipynb).
+- Add AttaCut to be options for `word_tokenize` engine.
+- New Thai2rom (PyTorch)
+- New Command Line
+- Add word tokenization benchmark to PyThaiNLP
 - See more examples in [Get Started notebook](https://github.com/PyThaiNLP/pythainlp/blob/dev/notebooks/pythainlp-get-started.ipynb)
-- [Full change log](https://github.com/PyThaiNLP/pythainlp/issues/118)
-- [Upgrading from 1.7](https://thainlp.org/pythainlp/docs/2.0/notes/pythainlp-1_7-2_0.html)
+- [Full change log](https://github.com/PyThaiNLP/pythainlp/issues/181)
 
 ## Install

@@ -40,6 +35,7 @@ pip install pythainlp[extra1,extra2,...]
 where extras can be
 
 - `artagger` (to support artagger part-of-speech tagger)*
+- `attacut` - Wrapper for AttaCut (https://github.com/PyThaiNLP/attacut)
 - `deepcut` (to support deepcut machine-learnt tokenizer)
 - `icu` (for ICU support in transliteration and tokenization)
 - `ipa` (for International Phonetic Alphabet support in transliteration)

@@ -54,8 +50,15 @@ Install it with pip, for example: `pip install marisa_trie‑0.7.5‑cp36‑cp36
 
 ## Links
 
-- User guide: [English](https://github.com/PyThaiNLP/pythainlp/blob/dev/notebooks/pythainlp-get-started.ipynb), [ภาษาไทย](https://colab.research.google.com/drive/1rEkB2Dcr1UAKPqz4bCghZV7pXx2qxf89)
-- Docs: https://thainlp.org/pythainlp/docs/2.0/
+- User guide: [English](https://github.com/PyThaiNLP/pythainlp/blob/dev/notebooks/pythainlp-get-started.ipynb)
+- Docs: https://thainlp.org/pythainlp/docs/2.1/
 - GitHub: https://github.com/PyThaiNLP/pythainlp
 - Issues: https://github.com/PyThaiNLP/pythainlp/issues
 - Facebook: [PyThaiNLP](https://www.facebook.com/pythainlp/)
+
+
+Made with ❤️
+
+We build Thai NLP.
+
+PyThaiNLP Team.
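
The AttaCut and Thai2rom items above can be tried in a few lines. A minimal sketch, assuming the `attacut` extra from the install list above is present (`pip install pythainlp[attacut]`) and that PyTorch is available for the thai2rom romanizer:

# Sketch of two 2.1 additions; assumes pip install pythainlp[attacut]
# and an installed PyTorch for the thai2rom engine.
from pythainlp.tokenize import word_tokenize
from pythainlp.transliterate import romanize

text = "ภาษาไทยง่ายนิดเดียว"
print(word_tokenize(text, engine="newmm"))     # default dictionary-based engine
print(word_tokenize(text, engine="attacut"))   # new AttaCut engine option
print(romanize("ภาษาไทย", engine="thai2rom"))  # PyTorch-based thai2rom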

README.md

Lines changed: 1 addition & 1 deletion

@@ -113,7 +113,7 @@ Made with ❤️
 
 We build Thai NLP.
 
-PyThaiNLP team.
+PyThaiNLP Team.
 
 # ภาษาไทย
 
appveyor.docs.yml (new file)

Lines changed: 62 additions & 0 deletions

@@ -0,0 +1,62 @@
+image: ubuntu1604
+
+branches:
+  only:
+    - /2.*/
+    - dev
+
+skip_commits:
+  message: /(skip ci docs)/  # Skip a new build if message contains '(skip ci docs)'
+
+install:
+  - sudo add-apt-repository ppa:jonathonf/python-3.6 -y
+  - sudo apt-get update
+  - sudo apt install -y python3.6
+  - sudo apt install -y python3.6-dev
+  - sudo apt install -y python3.6-venv
+  - wget https://bootstrap.pypa.io/get-pip.py
+  - sudo python3.6 get-pip.py
+  - sudo ln -s /usr/bin/python3.6 /usr/local/bin/python
+  - sudo apt-get install -y pandoc libicu-dev
+  - python -V
+  - python3 -V
+  - pip -V
+  - sudo pip install -r requirements.txt
+  - export LD_LIBRARY_PATH=/usr/local/lib
+  - sudo pip install torch==1.2.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
+  - sudo pip install sphinx sphinx-rtd-theme typing artagger deepcut epitran keras numpy pyicu sklearn-crfsuite tensorflow ssg emoji pandas
+  - sudo pip install --upgrade gensim smart_open boto
+
+# configuration for deploy mode, commit message with /(build and deploy docs)/
+# 1. build documents and upload HTML files to AppVeyor's storage
+# 2. upload to thainlp.org/pythainlp/docs/<branch_name>
+
+only_commits:
+  message: /(build and deploy docs)/
+
+build_script:
+  - cd ./docs
+  - export CURRENT_BRANCH=$APPVEYOR_REPO_BRANCH
+  - export RELEASE=$(git describe --tags --always)
+  - export RELEASE=$(echo $RELEASE | cut -d'-' -f1)
+  - export TODAY=$(date +'%Y-%m-%d')
+  - make html
+  - echo "Done building HTML files for the branch -- $APPVEYOR_REPO_BRANCH"
+  - echo "Start cleaning the directory /docs/$APPVEYOR_REPO_BRANCH"
+  - sudo bash ./clean_directory.sh $FTP_USER $FTP_PASSWORD $FTP_HOST $APPVEYOR_REPO_BRANCH
+  - echo "Start uploading files to thainlp.org/pythainlp/docs/$APPVEYOR_REPO_BRANCH"
+  - cd ./_build/html
+  - echo "cd to ./_build/html"
+  - find . -type f -name "*" -print -exec curl --ftp-create-dir --ipv4 -T {} ftp://${FTP_USER}:${FTP_PASSWORD}@${FTP_HOST}/public_html/pythainlp/docs/$APPVEYOR_REPO_BRANCH/{} \;
+  - echo "Done uploading"
+  - echo "Done uploading files to -- thainlp.org/pythainlp/docs/$APPVEYOR_REPO_BRANCH"
+
+artifacts:
+  - path: ./docs/_build/html
+    name: document
+
+after_build:
+  - echo "Done build and deploy"
+  - appveyor exit
+
+test: off

bin/pythainlp (file mode changed: 100644 → 100755)

Lines changed: 27 additions & 47 deletions

@@ -1,51 +1,31 @@
-#!python3
+#!/usr/bin/env python
 # -*- coding: utf-8 -*-
 
 import argparse
-from pythainlp import __version__
-parser = argparse.ArgumentParser()
-parser.add_argument("-t", "--text", default=None, help="text", type=str)
-parser.add_argument("-seg", "--segment", help="word segment", action="store_true")
-parser.add_argument("-c", "--corpus", help="mange corpus", action="store_true")
-parser.add_argument("-pos", "--postag", help="postag", action="store_true")
-parser.add_argument("-soundex", "--soundex", help="soundex", default=None)
-parser.add_argument("-e", "--engine", default="newmm", help="the engine", type=str)
-parser.add_argument("-pos-e", "--postag_engine", default="perceptron", help="the engine for word tokenize", type=str)
-parser.add_argument("-pos-c", "--postag_corpus", default="orchid", help="corpus for postag", type=str)
-args = parser.parse_args()
-
-if args.corpus:
-    from pythainlp.corpus import *
-    print("PyThaiNLP Corpus")
-    temp=""
-    while temp!="exit":
-        print("\n1. Install\n2. Remove\n3. Update\n4. Exit\n")
-        temp=input("Choose 1, 2, 3, or 4: ")
-        if temp=="1":
-            name=input("Corpus name:")
-            download(name)
-        elif temp=="2":
-            name=input("Corpus name:")
-            remove(name)
-        elif temp=="3":
-            name=input("Corpus name:")
-            download(name)
-        elif temp=="4":
-            break
-        else:
-            print("Choose 1, 2, 3, or 4:")
-elif args.text!=None:
-    from pythainlp.tokenize import word_tokenize
-    tokens=word_tokenize(args.text, engine=args.engine)
-    if args.segment:
-        print("|".join(tokens))
-    elif args.postag:
-        from pythainlp.tag import pos_tag
-        print("\t".join([i[0]+"/"+i[1] for i in pos_tag(tokens, engine=args.postag_engine, corpus=args.postag_corpus)]))
-elif args.soundex!=None:
-    from pythainlp.soundex import soundex
-    if args.engine=="newmm":
-        args.engine="lk82"
-    print(soundex(args.soundex, engine=args.engine))
+import sys
+
+from pythainlp import cli
+
+
+parser = argparse.ArgumentParser(
+    usage="pythainlp namespace command [options]"
+)
+
+parser.add_argument(
+    "namespace",
+    type=str,
+    default="",
+    nargs="?",
+    help="[%s]" % "|".join(cli.available_namespaces)
+)
+
+args = parser.parse_args(sys.argv[1:2])
+
+cli.exit_if_empty(args.namespace, parser)
+
+if hasattr(cli, args.namespace):
+    namespace = getattr(cli, args.namespace)
+    namespace.App(sys.argv)
 else:
-    print(f"PyThaiNLP {__version__}")
+    print(f"Namespace not available: {args.namespace}\nPlease run with --help for alternatives")
+
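
The new dispatcher delegates everything to `pythainlp.cli`: it looks the requested namespace up as an attribute of `cli` and hands the full `sys.argv` to that namespace's `App`. A minimal sketch of what a namespace module could look like under that contract (the `greet` namespace and its `--name` flag are invented for illustration, not part of the actual `pythainlp.cli` API):

# Hypothetical cli namespace module, sketching the contract implied by
# bin/pythainlp above: the dispatcher calls namespace.App(sys.argv).
# "greet" and "--name" are illustrative, not real pythainlp.cli names.
import argparse

class App:
    def __init__(self, argv):
        parser = argparse.ArgumentParser(
            prog="pythainlp greet",
            usage="pythainlp greet [options]",
        )
        parser.add_argument("--name", type=str, default="world")
        # argv[0] is the script path and argv[1] the namespace name,
        # so everything from argv[2] on belongs to this sub-command.
        args = parser.parse_args(argv[2:])
        print(f"Hello, {args.name}!")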

bin/word-tokenization-benchmark

Lines changed: 50 additions & 56 deletions

@@ -1,121 +1,115 @@
 #!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 
+import argparse
 import json
 import os
-import argparse
-import yaml
 
-from pythainlp.benchmarks import word_tokenisation
+import yaml
+from pythainlp.benchmarks import word_tokenization
 
 parser = argparse.ArgumentParser(
     description="Script for benchmarking tokenizaiton results"
 )
 
 parser.add_argument(
-    "--input",
+    "--input-file",
     action="store",
-    help="path to file that you want to compare against the test file"
+    help="Path to input file to compare against the test file",
 )
 
 parser.add_argument(
     "--test-file",
     action="store",
-    help="path to test file"
+    help="Path to test file i.e. ground truth",
 )
 
 parser.add_argument(
     "--save-details",
     default=False,
-    action='store_true',
-    help="specify whether to save the details of comparisons"
+    action="store_true",
+    help="Save comparison details to files (eval-XXX.json and eval-details-XXX.json)",
 )
 
 args = parser.parse_args()
 
+
 def _read_file(path):
     with open(path, "r", encoding="utf-8") as f:
         lines = map(lambda r: r.strip(), f.readlines())
         return list(lines)
 
 
-print(args.input)
-actual = _read_file(args.input)
+print(args.input_file)
+actual = _read_file(args.input_file)
 expected = _read_file(args.test_file)
 
-assert len(actual) == len(expected), \
-    'Input and test files do not have the same number of samples'
-print('Benchmarking %s against %s with %d samples in total' % (
-    args.input, args.test_file, len(actual)
-))
-
-df_raw = word_tokenisation.benchmark(expected, actual)
-
-df_res = df_raw\
-    .describe()
-df_res = df_res[[
-    'char_level:tp',
-    'char_level:tn',
-    'char_level:fp',
-    'char_level:fn',
-    'char_level:precision',
-    'char_level:recall',
-    'char_level:f1',
-    'word_level:precision',
-    'word_level:recall',
-    'word_level:f1',
-]]
+assert len(actual) == len(
+    expected
+), "Input and test files do not have the same number of samples"
+print(
+    "Benchmarking %s against %s with %d samples in total"
+    % (args.input_file, args.test_file, len(actual))
+)
+
+df_raw = word_tokenization.benchmark(expected, actual)
+
+df_res = df_raw.describe()
+df_res = df_res[
+    [
+        "char_level:tp",
+        "char_level:tn",
+        "char_level:fp",
+        "char_level:fn",
+        "char_level:precision",
+        "char_level:recall",
+        "char_level:f1",
+        "word_level:precision",
+        "word_level:recall",
+        "word_level:f1",
+    ]
+]
 
 df_res = df_res.T.reset_index(0)
 
-df_res['mean±std'] = df_res.apply(
-    lambda r: '%2.2f±%2.2f' % (r['mean'], r['std']),
-    axis=1
+df_res["mean±std"] = df_res.apply(
+    lambda r: "%2.2f±%2.2f" % (r["mean"], r["std"]), axis=1
 )
 
-df_res['metric'] = df_res['index']
+df_res["metric"] = df_res["index"]
 
 print("============== Benchmark Result ==============")
-print(df_res[['metric', 'mean±std', 'min', 'max']].to_string(index=False))
-
+print(df_res[["metric", "mean±std", "min", "max"]].to_string(index=False))
 
 
 if args.save_details:
     data = {}
-    for r in df_res.to_dict('records'):
-        metric = r['index']
-        del r['index']
+    for r in df_res.to_dict("records"):
+        metric = r["index"]
+        del r["index"]
        data[metric] = r
 
-    dir_name = os.path.dirname(args.input)
-    file_name = args.input.split("/")[-1].split(".")[0]
+    dir_name = os.path.dirname(args.input_file)
+    file_name = args.input_file.split("/")[-1].split(".")[0]
 
     res_path = "%s/eval-%s.yml" % (dir_name, file_name)
     print("Evaluation result is saved to %s" % res_path)
 
-    with open(res_path, 'w') as outfile:
+    with open(res_path, "w", encoding="utf-8") as outfile:
        yaml.dump(data, outfile, default_flow_style=False)
 
    res_path = "%s/eval-details-%s.json" % (dir_name, file_name)
    print("Details of comparisons is saved to %s" % res_path)
 
-    with open(res_path, "w") as f:
+    with open(res_path, "w", encoding="utf-8") as f:
        samples = []
        for i, r in enumerate(df_raw.to_dict("records")):
            expected, actual = r["expected"], r["actual"]
            del r["expected"]
            del r["actual"]
 
-            samples.append(dict(
-                metrics=r,
-                expected=expected,
-                actual=actual,
-                id=i
-            ))
-
-    details = dict(
-        metrics=data,
-        samples=samples
-    )
+            samples.append(dict(metrics=r, expected=expected, actual=actual, id=i))
+
+        details = dict(metrics=data, samples=samples)
 
        json.dump(details, f, ensure_ascii=False)
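
To make the script's expected input concrete: it would be invoked as `bin/word-tokenization-benchmark --input-file tokenized.txt --test-file ground-truth.txt [--save-details]`, where both files hold one sample per line. A minimal sketch of calling the benchmark function directly, assuming (as in the project's benchmark data) that tokens within a sample are separated by `|`:

# Minimal sketch of pythainlp.benchmarks.word_tokenization, assuming
# each sample is a "|"-separated segmentation of the same sentence.
from pythainlp.benchmarks import word_tokenization

expected = ["ผม|รัก|ภาษา|ไทย"]  # ground-truth segmentation
actual = ["ผม|รัก|ภาษาไทย"]     # output of the tokenizer under test

df_raw = word_tokenization.benchmark(expected, actual)

# Same aggregation the script performs: per-sample metrics, summarized.
print(df_raw[["char_level:f1", "word_level:f1"]].describe())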
