Feature/3.13 support #17

Open
wants to merge 171 commits into
base: master
from
171 commits
c60f446
modified .gitignore and readme and speed test
rumbc Nov 28, 2020
9f58fe8
added type annotations and better dict iterations
rumbc Dec 2, 2020
291d29f
upgrade setup
icfly2 Dec 11, 2020
09823d4
Create python-package.yml
icfly2 Dec 11, 2020
7da301d
Create python-publish.yml
icfly2 Dec 11, 2020
3f92f7b
skip mongo test
icfly2 Dec 11, 2020
16d87e4
Update python-package.yml
icfly2 Dec 11, 2020
1704ed9
updated tests
icfly2 Dec 11, 2020
9ed76b4
updating tests
icfly2 Dec 11, 2020
4464a67
removed broken tests
icfly2 Dec 11, 2020
290fe05
Update python-package.yml
icfly2 Dec 11, 2020
7164b67
added more tests
icfly2 Dec 11, 2020
fdc100a
Merge branch 'master' of github.com:icfly2/simstring-fast
icfly2 Dec 11, 2020
2294648
moved sentinal char
icfly2 Dec 11, 2020
fa173de
clena up
icfly2 Dec 11, 2020
142ba6d
got rid of some sorted call
icfly2 Dec 11, 2020
5dbe619
small refactor
icfly2 Dec 11, 2020
016ff2f
Fixed bug in overlap_join. Fixed ranked search. Fixed issue for low t…
Nov 3, 2021
104ab2a
removed old lines
Nov 8, 2021
a1fe3d6
Formatting
May 18, 2022
7679b09
more tests
rumbc May 24, 2022
076f453
add type hints where missing
rumbc May 24, 2022
3547d62
disabled ranked search as broken
rumbc May 24, 2022
cde2727
mypycify and improved benchmark
rumbc May 24, 2022
87ad4cc
basic overlap feature implimented
rumbc May 24, 2022
abbb078
Merge pull request #8 from icfly2/add_overlap_measure
icfly2 May 24, 2022
a3ab01e
fixed badges and readme text a bit
rumbc May 24, 2022
ae1235e
replaced map with list comprehension and added types and further testing
rumbc May 25, 2022
d8fdc81
drop 3.5 as not supporting type hints, and also out dated
rumbc May 25, 2022
13783d0
Merge pull request #9 from icfly2/add_mypyc
icfly2 May 25, 2022
c97f426
Merge pull request #10 from icfly2/update_readme
icfly2 May 25, 2022
dd00213
added left overlap and some more tests, also changed minimum size for…
icfly2 Jun 1, 2022
64b2d19
Merge branch 'initial_fixes' of https://github.com/xsimw/xsimwstring …
rumbc Jun 2, 2022
d82f62d
changes ranked signature back
rumbc Jun 2, 2022
d7b6cc8
testing on company names
rumbc Jun 2, 2022
146b441
Merge pull request #11 from icfly2/better_overlap
icfly2 Jun 2, 2022
6f4cb6e
Merge branch 'xsimw-initial_fixes' into merge-xsimw
rumbc Jun 2, 2022
c5bf959
mypycify again
rumbc Jun 2, 2022
b8edd3d
tests pass, added type annotations
rumbc Jun 3, 2022
a4a1d1d
cleanup of typing hints
rumbc Jun 3, 2022
6bcb8a8
Merge pull request #14 from icfly2/merge-xsimw
icfly2 Jun 3, 2022
2f17d94
last changes not saved. Thanks VS code
rumbc Jun 3, 2022
4e27932
Merge pull request #16 from icfly2/merge-xsimw
icfly2 Jun 3, 2022
c2a488a
added docs
rumbc Jun 3, 2022
efe7c53
Merge pull request #17 from icfly2/docs
icfly2 Jun 3, 2022
4665f2f
added excpet for ModuleNotFoundError for optional packages
rumbc Jun 3, 2022
6aa8738
added project urls
rumbc Jun 3, 2022
b7ba56d
added docs build info
rumbc Jun 3, 2022
5c41ca6
Updated signature
Jun 8, 2022
2699649
Merge pull request #1 from xsimw/change_ranked_signature
Jun 8, 2022
3312b5a
changed docs
rumbc Jun 8, 2022
b0fcdca
fixed a few small bugs
rumbc Jun 8, 2022
8bc9af0
rolled back github actions regression
rumbc Jun 8, 2022
b0b2970
updated readme
rumbc Jun 8, 2022
74fdef3
Merge pull request #19 from banking-circle-advanced-analytics/save
icfly2 Jun 8, 2022
b27b2f0
Merge branch 'banking-circle-advanced-analytics:master' into master
Jun 13, 2022
34a51a1
Merge branch 'banking-circle-advanced-analytics:master' into change_r…
Jun 13, 2022
dbcf99f
Updated tests
Jun 13, 2022
bf0337d
Merge pull request #2 from xsimw/change_ranked_signature
Jun 13, 2022
de60ce0
Change type notation
Jun 13, 2022
5ab2c2d
removed 3.6
rumbc Jun 13, 2022
c5f3695
Merge branch 'banking-circle-advanced-analytics:master' into master
Jun 13, 2022
d7fdff2
Merge pull request #20 from xsimw/master
icfly2 Jun 13, 2022
1dd65c0
updated docs and version 0.2.0
rumbc Jun 13, 2022
a91c72a
Merge branch 'master' of https://github.com/icfly2/simstring-fast
rumbc Jun 13, 2022
cacb887
changed workflow
rumbc Jun 13, 2022
e4a5760
reformat and better test github action
rumbc Jun 14, 2022
3344d5f
version bump
rumbc Jun 28, 2022
b713946
fixed links
rumbc Jun 28, 2022
5073499
remved MEcab and made sure tests check build too
rumbc Jun 28, 2022
645c590
Merge pull request #21 from banking-circle-advanced-analytics/fix_build
icfly2 Jun 28, 2022
e559b8c
Update setup.py
icfly2 Jun 28, 2022
d0dc05c
removed MeCab, pymongo and added pytoml file
rumbc Jun 28, 2022
f50fe5c
fixed spelling
icfly2 Oct 23, 2022
e84233b
added hatch as backend
rumbc Jan 16, 2023
dbd0891
updated toml to reflect python version testing
rumbc Jan 16, 2023
2394460
move off setup.py in automated actions
rumbc Jan 16, 2023
76ce888
workflows moved to token and hatch
rumbc Jan 16, 2023
5dc1c9c
Merge pull request #24 from banking-circle-advanced-analytics/move_to…
icfly2 Jan 27, 2023
633880d
Merge branch 'master' of github.com:icfly2/simstring-fast
icfly2 Feb 3, 2023
f218477
intial change, test fails
icfly2 Feb 4, 2023
f83dc00
fixed signature
icfly2 Feb 4, 2023
03160b9
make test match
icfly2 Feb 4, 2023
d39dcb9
no more byte assertion
icfly2 Feb 4, 2023
4e19e66
update test
icfly2 Feb 4, 2023
8e19594
change pickle save
icfly2 Feb 4, 2023
5cebe40
use eval instead of ast
icfly2 Feb 4, 2023
5aa23d9
basic fix
icfly2 Feb 4, 2023
0cfe404
complex comparission
icfly2 Feb 4, 2023
b454a57
small cleanup
rumbc Feb 10, 2023
a079d12
build workaround and docs
rumbc Feb 10, 2023
6dc4720
no matrix, use hatch
rumbc Feb 10, 2023
36a24ee
add matrix back, for stupid reasons
rumbc Feb 10, 2023
a7be7aa
Merge pull request #25 from banking-circle-advanced-analytics/json_save
cmkarsten Feb 10, 2023
3d346a9
no default dict adn extra types, makes things faster
rumbc May 10, 2023
f1dec9a
Updated icon, types and docs
rumbc May 10, 2023
341a204
updated version and docs
rumbc May 11, 2023
254a7bd
Merge pull request #27 from banking-circle-advanced-analytics/faster
icfly2 May 11, 2023
a3a9962
added dist upload
rumbc May 11, 2023
ad93946
intitial publish workflow with wheels
rumbc May 11, 2023
84952e2
upload whls and tar.gz files
rumbc May 11, 2023
8f5b93e
trying the upload off all builds
rumbc May 11, 2023
526b004
updated publish
rumbc May 12, 2023
d439dc0
Merge pull request #28 from banking-circle-advanced-analytics/better_…
icfly2 May 12, 2023
0634527
Create publish.yml
icfly2 May 12, 2023
454937f
Merge pull request #29 from banking-circle-advanced-analytics/icfly2-…
icfly2 May 12, 2023
810f44d
Unlike advertised, the action did not work...
rumbc May 12, 2023
cb5bf10
Merge pull request #30 from banking-circle-advanced-analytics/try_again
icfly2 May 12, 2023
3b8cb2a
arg.
rumbc May 12, 2023
1767f5b
Merge pull request #31 from banking-circle-advanced-analytics/try_again
icfly2 May 12, 2023
d37b414
try and build on more
rumbc May 12, 2023
deb1a1f
ran black and flake8
rumbc May 12, 2023
cb612c6
update version
rumbc May 12, 2023
918d1ef
Merge pull request #32 from banking-circle-advanced-analytics/more_bu…
icfly2 May 12, 2023
ee75c3d
initial change
rumbc May 12, 2023
0d7add0
skip existing
rumbc May 12, 2023
b99e731
upgrade and add black stuff
rumbc May 12, 2023
d568fcd
Merge branch 'master' into new_build
icfly2 May 12, 2023
f125520
Merge pull request #33 from banking-circle-advanced-analytics/new_build
icfly2 May 12, 2023
3da5b6c
add skip
rumbc May 12, 2023
4aaa039
Initial cleanup
rumbc May 12, 2023
96ff0cf
Merge pull request #34 from banking-circle-advanced-analytics/cleanup
icfly2 May 12, 2023
b81b3ab
made diskcache work, add still too slow
rumbc Aug 15, 2023
e617dc5
wip, faster add
rumbc Aug 15, 2023
09dba7c
improved benchmarking
rumbc Aug 15, 2023
83982be
clean up testing
rumbc Aug 16, 2023
835ea68
fix tests
rumbc Aug 16, 2023
55f3e62
ADD WITH
rumbc Aug 30, 2023
5053aa3
teting paralelism
rumbc Aug 30, 2023
9e1b4be
more testing
rumbc Aug 30, 2023
85fc8be
remove the unused max and min feature size from the DBs
rumbc Aug 30, 2023
78e9750
Merge pull request #38 from banking-circle-advanced-analytics/remove_…
icfly2 Aug 30, 2023
cdc00cf
Merge branch 'master' into add_diskcache
icfly2 Aug 30, 2023
cc5d0b7
run tests always
rumbc Sep 1, 2023
9b05938
update to diskcache==5.6.3
rumbc Sep 1, 2023
88680f5
trying to make the
rumbc Sep 1, 2023
810187c
make all files unique
rumbc Sep 1, 2023
8c9c1ad
try again??
rumbc Sep 1, 2023
378a608
fix path so windows is happy
rumbc Sep 1, 2023
9ce1551
fix f string
rumbc Sep 1, 2023
b88e7ed
fix imports
rumbc Sep 1, 2023
87e606b
fix windowns teardown bug
rumbc Sep 1, 2023
42db893
windows not teardown
rumbc Sep 1, 2023
cec5eac
check for avd
rumbc Sep 1, 2023
cbb25f8
go mad
rumbc Sep 1, 2023
e17caae
just ry and except
rumbc Sep 1, 2023
00eb865
Merge pull request #36 from banking-circle-advanced-analytics/add_dis…
icfly2 Sep 1, 2023
2018cc5
clean up
rumbc Sep 1, 2023
5c7e25c
addded option for endchar
rumbc Oct 24, 2023
73982d6
removed old python versions from support
rumbc Oct 24, 2023
9fec8a1
windows is borked
rumbc Oct 24, 2023
93d314d
Merge pull request #39 from banking-circle-advanced-analytics/order_i…
icfly2 Oct 24, 2023
1da313f
adding a 3.12 setuptools requirement for now
rumbc Aug 19, 2024
0e8ffe4
Update test.yml to also test 3.12
icfly2 Aug 21, 2024
7b8e6da
Update publish.yml to also build 3.12
icfly2 Aug 21, 2024
e4142d3
Merge pull request #41 from banking-circle-advanced-analytics/bump/py…
icfly2 Aug 22, 2024
0ee35cc
Update publish.yml
icfly2 Nov 12, 2024
e37b1a6
Merge pull request #42 from banking-circle-advanced-analytics/change-…
icfly2 Nov 12, 2024
292c691
skip 3.8
rumbc Nov 14, 2024
b53f841
skip pypy and python matrix
rumbc Nov 14, 2024
5cb86eb
fix save and load add tests to build
rumbc Nov 14, 2024
8d36a0a
runs locally
rumbc Nov 14, 2024
290cf59
remove the nose test classes
rumbc Nov 14, 2024
fdc231c
migrate last test
rumbc Nov 14, 2024
83201c6
make bublish run normally again
rumbc Nov 14, 2024
39def6e
add save test
rumbc Nov 14, 2024
e60cbde
make save work in 3.13
rumbc Nov 14, 2024
642bb21
adjust test matrix
rumbc Nov 14, 2024
2bbf863
bump version
rumbc Nov 15, 2024
7d56dae
Merge pull request #43 from banking-circle-advanced-analytics/cleanup
icfly2 Nov 15, 2024
2930a98
updating
rumbc May 6, 2025
30 changes: 0 additions & 30 deletions .circleci/config.yml

This file was deleted.

69 changes: 69 additions & 0 deletions .github/workflows/publish.yml
@@ -0,0 +1,69 @@
name: Upload to PyPi

on:
  release:
    types: [created]

jobs:
  build_wheels:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest]
    name: Build on ${{ matrix.os }}

    steps:
      - name: Checkout Code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4

      - name: Install cibuildwheel
        run: python -m pip install cibuildwheel==2.21.3

      - name: Build wheels
        run: python -m cibuildwheel --output-dir wheelhouse
        env:
          CIBW_SKIP: cp38-* pp*
          CIBW_TEST_REQUIRES: pytest faker
          CIBW_TEST_COMMAND: pytest {project}/tests

      - uses: actions/upload-artifact@v4
        with:
          name: cibw-wheels-${{ matrix.os }}-${{ strategy.job-index }}
          path: ./wheelhouse/*.whl

  build_sdist:
    name: Build source distribution
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build sdist
        run: pipx run build --sdist

      - uses: actions/upload-artifact@v4
        with:
          name: cibw-sdist
          path: dist/*.tar.gz

  upload_pypi:
    needs: [build_wheels, build_sdist]
    runs-on: ubuntu-latest
    environment: pypi
    steps:
      - uses: actions/download-artifact@v4
        with:
          # unpacks all CIBW artifacts into dist/
          pattern: cibw-*
          path: dist
          merge-multiple: true

      - name: Publish package
        uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
        with:
          user: __token__
          password: ${{ secrets.PYPI_SIMSTRING_UPLOAD_TOKEN }}
          skip_existing: true
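
Reviewer note on `CIBW_SKIP: cp38-* pp*` in the workflow above: cibuildwheel treats each space-separated entry as a shell-style glob matched against build identifiers. The sketch below approximates that selection with Python's `fnmatch` module. The helper `is_skipped` and the list of candidate identifiers are hypothetical and only illustrate the pattern. This is not cibuildwheel's own code.

```python
from fnmatch import fnmatchcase

# Patterns copied from the CIBW_SKIP value in publish.yml.
SKIP_PATTERNS = "cp38-* pp*".split()


def is_skipped(build_id: str, patterns=SKIP_PATTERNS) -> bool:
    """Return True if the build identifier matches any skip pattern."""
    return any(fnmatchcase(build_id, pattern) for pattern in patterns)


# Illustrative build identifiers (assumed examples, not an exhaustive list).
candidates = [
    "cp38-manylinux_x86_64",
    "cp39-manylinux_x86_64",
    "cp313-win_amd64",
    "pp310-manylinux_x86_64",
]

kept = [build_id for build_id in candidates if not is_skipped(build_id)]
print(kept)  # ['cp39-manylinux_x86_64', 'cp313-win_amd64']
```

Under these assumptions, CPython 3.8 and all PyPy builds are dropped, while CPython 3.9 through 3.13 wheels are still built.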

46 changes: 46 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,46 @@
name: Test

on:
  push:
    branches:
      - '*'
  pull_request:
    branches: [ master ]

jobs:
  build:

    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        # os: [ubuntu-latest, windows-latest]
        os: [ubuntu-latest]
        python-version: [ '3.9', '3.10', '3.11', '3.12','3.13']
    name: Python ${{ matrix.python-version }} on ${{ matrix.os }}

    steps:
      - uses: actions/checkout@v3
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install flake8 pytest hatch
      - name: Lint with flake8
        run: |
          # stop the build if there are Python syntax errors or undefined names
          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
          # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
          flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics

      - name: Test with hatch and pytest
        run: |
          hatch run test:cov

      - name: Install module and test again
        run: |
          hatch run test:build
          pip install --find-links=dist simstring-fast
          hatch run test:cov
17 changes: 17 additions & 0 deletions .gitignore
@@ -0,0 +1,17 @@
env/
cenv/
senv/
mypyenv/

**/__pycache__/**
*.egg-info/
build/
dist/

dev/data/geo_address.csv
dev/geo_matching.py
*.so
.coverage
simstring/site/
tmp*/
addresses.csv
104 changes: 16 additions & 88 deletions README.md
@@ -1,14 +1,15 @@
# simstring
[![PyPI - Status](https://img.shields.io/pypi/status/simstring-pure.svg)](https://pypi.org/project/simstring-pure/)
[![PyPI version](https://badge.fury.io/py/simstring-pure.svg)](https://badge.fury.io/py/simstring-pure)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/simstring-pure.svg)](https://pypi.org/project/simstring-pure/0.0.1/)
[![PyPI - Status](https://img.shields.io/pypi/status/simstring-fast.svg)](https://pypi.org/project/simstring-fast/)
[![PyPI version](https://badge.fury.io/py/simstring-fast.svg)](https://badge.fury.io/py/simstring-fast)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/simstring-fast)
[![MIT License](http://img.shields.io/badge/license-MIT-blue.svg?style=flat)](LICENSE)
[![CircleCI](https://circleci.com/gh/nullnull/simstring.svg?style=svg)](https://circleci.com/gh/nullnull/simstring)
[![Maintainability](https://api.codeclimate.com/v1/badges/66eb2018262f03ece8a3/maintainability)](https://codeclimate.com/github/nullnull/simstring/maintainability)

![icon](simstring/docs/strings_icon.png)

A Python implementation of the [SimString](http://www.chokkan.org/software/simstring/index.html.en), a simple and efficient algorithm for approximate string matching.

Docs are [here](https://banking-circle-advanced-analytics.github.io/simstring-fast/)

## Features
With this library, you can extract strings/texts that have a certain similarity from a large number of strings/texts. It helps when you develop applications related to language processing.

@@ -20,14 +21,13 @@ SimString has the following features:
* 100% exact retrieval. Although some algorithms allow misses (false negatives) for faster query response, SimString is guaranteed to achieve 100% correct retrieval with fast query response.
* Unicode support.
* Extensibility. You can implement your own feature extractor easily.
* Japanese support. Morpheme n-grams using [MeCab](http://taku910.github.io/mecab/) are supported.

* No Japanese support.
[Please see this paper for more details](http://www.aclweb.org/anthology/C10-1096).


## Install
```
pip install simstring-pure
pip install simstring-fast
```

## Usage
@@ -53,10 +53,10 @@ If you want to use other feature, measure, and database, simply replace these cl
```python
from simstring.feature_extractor.word_ngram import WordNgramFeatureExtractor
from simstring.measure.jaccard import JaccardMeasure
from simstring.database.mongo import MongoDatabase
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher

db = MongoDatabase(WordNgramFeatureExtractor(2))
db = DictDatabase(WordNgramFeatureExtractor(2))
db.add('You are so cool.')

searcher = Searcher(db, JaccardMeasure())
@@ -68,83 +68,11 @@ print(results)
- Cosine
- Dice
- Jaccard
- Overlap
- Left Overlap

## Run Tests
```
docker-compose run main bash -c 'source activate simstring && python -m unittest discover tests'
```

## Benchmark
* About 1ms to search strings from 5797 strings(company names).
* About 14ms to search strings from 235544 strings(unabridged dictionary).

#### search from `dev/data/company_names.txt`
```
$ python dev/benchmark.py
benchmark for using dict as database
## benchmarker: release 4.0.1 (for python)
## python version: 3.7.0
## python compiler: GCC 7.2.0
## python platform: Linux-4.9.87-linuxkit-aufs-x86_64-with-debian-9.4
## python executable: /opt/conda/envs/simstring/bin/python
## cpu model: Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz # 3300.000 MHz
## parameters: loop=1, cycle=1, extra=0

## real (total = user + sys)
initialize database(5797 lines) 0.1227 0.1200 0.1200 0.0000
search text(5797 times) 6.9719 6.9400 6.8900 0.0500

## Ranking real
initialize database(5797 lines) 0.1227 (100.0) ********************
search text(5797 times) 6.9719 ( 1.8)

## Matrix real [01] [02]
[01] initialize database(5797 lines) 0.1227 100.0 5680.9
[02] search text(5797 times) 6.9719 1.8 100.0

benchmark for using Mongo as database
## benchmarker: release 4.0.1 (for python)
## python version: 3.7.0
## python compiler: GCC 7.2.0
## python platform: Linux-4.9.87-linuxkit-aufs-x86_64-with-debian-9.4
## python executable: /opt/conda/envs/simstring/bin/python
## cpu model: Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz # 3300.000 MHz
## parameters: loop=1, cycle=1, extra=0

## real (total = user + sys)
initialize database(5797 lines) 4.5762 2.4900 1.9200 0.5700
search text(5797 times) 177.8401 60.9100 47.2500 13.6600

## Ranking real
initialize database(5797 lines) 4.5762 (100.0) ********************
search text(5797 times) 177.8401 ( 2.6) *

## Matrix real [01] [02]
[01] initialize database(5797 lines) 4.5762 100.0 3886.2
[02] search text(5797 times) 177.8401 2.6 100.0
```

#### search from `dev/data/unabridged_dictionary.txt`
```
$ python dev/benchmark.py
benchmark for using dict as database
## benchmarker: release 4.0.1 (for python)
## python version: 3.7.0
## python compiler: GCC 7.2.0
## python platform: Linux-4.9.87-linuxkit-aufs-x86_64-with-debian-9.4
## python executable: /opt/conda/envs/simstring/bin/python
## cpu model: Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz # 3300.000 MHz
## parameters: loop=1, cycle=1, extra=0

## real (total = user + sys)
initialize database(235544 lines) 2.2576 2.2300 2.1200 0.1100
search text(10000 times) 141.0302 140.6400 139.9600 0.6800

## Ranking real
initialize database(235544 lines) 2.2576 (100.0) ********************
search text(10000 times) 141.0302 ( 1.6)

## Matrix real [01] [02]
[01] initialize database(235544 lines) 2.2576 100.0 6246.8
[02] search text(10000 times) 141.0302 1.6 100.0
```
## Supported database backends
- dictionary
- diskcache (sqlite)
- redis (in development #37)
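
Reviewer note on the measures listed in the README diff above (Cosine, Dice, Jaccard, Overlap): the sketch below computes the standard set-based definitions of these four similarities over character bigrams. It is a standalone illustration and does not use the library. The helper `char_ngrams` is hypothetical and does no sentinel padding, so its features may differ from what `CharacterNgramFeatureExtractor` produces. Left Overlap is omitted because its definition is not given in this PR.

```python
from math import sqrt


def char_ngrams(text: str, n: int = 2) -> set[str]:
    """Set of character n-grams of `text` (no padding; a simplification)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def cosine(x: set[str], y: set[str]) -> float:
    """|X & Y| / sqrt(|X| * |Y|); 0.0 if either set is empty."""
    return len(x & y) / sqrt(len(x) * len(y)) if x and y else 0.0


def dice(x: set[str], y: set[str]) -> float:
    """2 * |X & Y| / (|X| + |Y|); 0.0 if either set is empty."""
    return 2 * len(x & y) / (len(x) + len(y)) if x and y else 0.0


def jaccard(x: set[str], y: set[str]) -> float:
    """|X & Y| / |X | Y|; 0.0 if either set is empty."""
    return len(x & y) / len(x | y) if x and y else 0.0


def overlap(x: set[str], y: set[str]) -> float:
    """|X & Y| / min(|X|, |Y|); 0.0 if either set is empty."""
    return len(x & y) / min(len(x), len(y)) if x and y else 0.0


a = char_ngrams("simstring")   # 8 distinct bigrams
b = char_ngrams("simstrings")  # the same 8 plus "gs"

for measure in (cosine, dice, jaccard, overlap):
    print(f"{measure.__name__}: {measure(a, b):.3f}")
```

Because every bigram of the shorter string also occurs in the longer one, the overlap score is 1.0 while the other three stay below 1. That is why a given threshold can admit different result sets depending on the measure.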
47 changes: 36 additions & 11 deletions dev/benchmark.py
@@ -1,40 +1,65 @@
# coding: utf-8

"""Module benchmarking

This is code to benchmark the performance of the module.

Requires benchmarker as an additional dependency. Run from main folder with 'python dev/benchmark.py'

"""

import os, sys

sys.path.append(os.getcwd())
from benchmarker import Benchmarker

from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.measure.cosine import CosineMeasure
from simstring.database.mongo import MongoDatabase
from simstring.measure.overlap import OverlapMeasure, LeftOverlapMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher
from time import time

SEARCH_COUNT_LIMIT = 10**3

SEARCH_COUNT_LIMIT = 10**4

def output_similar_strings_of_each_line(path, Database):
    number_of_lines = len(open(path).readlines())

    with Benchmarker(width=20) as bench:
        db = Database(CharacterNgramFeatureExtractor(2))

        @bench("initialize database({0} lines)".format(number_of_lines))
        def _(bm):
            with open(path, 'r') as lines:
            with open(path, "r") as lines:
                for line in lines:
                    strings = line.rstrip('\r\n')
                    strings = line.rstrip("\r\n")
                    db.add(strings)

        @bench("search text({0} times)".format(min(number_of_lines, SEARCH_COUNT_LIMIT)))
        @bench(
            "search text({0} times)".format(min(number_of_lines, SEARCH_COUNT_LIMIT))
        )
        def _(bm):
            searcher = Searcher(db, CosineMeasure())
            with open(path, 'r') as lines:
            with open(path, "r") as lines:
                for i, line in enumerate(lines):
                    if i >= SEARCH_COUNT_LIMIT:
                        break
                    strings = line.rstrip('\r\n')
                    strings = line.rstrip("\r\n")
                    result = searcher.search(strings, 0.8)

print('benchmark for using dict as database')
output_similar_strings_of_each_line('./dev/data/company_names.txt', DictDatabase)
print('benchmark for using Mongo as database')
output_similar_strings_of_each_line('./dev/data/company_names.txt', MongoDatabase)

print("benchmark for using dict as database")
start = time()
output_similar_strings_of_each_line("./dev/data/company_names.txt", DictDatabase)
print(f"Benchmark took {time()-start:.2f}s.")

try:
    from simstring.database.mongo import MongoDatabase

    print("benchmark for using Mongo as database")
    start = time()
    output_similar_strings_of_each_line("./dev/data/company_names.txt", MongoDatabase)
    print(f"Benchmark took {time()-start:.2f}s.")
except ModuleNotFoundError:
    print("Pymongo not installed, won't benchmark against MongoDB")
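
Reviewer note on the optional Mongo benchmark above: the `try` / `except ModuleNotFoundError` guard and the repeated `start = time()` timing could be factored into two small helpers. The sketch below is one possible shape. The names `load_optional` and `timed` are hypothetical and not part of this PR. It uses `time.perf_counter`, which is generally preferred over `time.time` for measuring elapsed intervals.

```python
import importlib
from time import perf_counter


def load_optional(module_name: str):
    """Import a module by name, returning None if it (or one of its imports) is missing."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        return None


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = perf_counter()
    result = fn(*args, **kwargs)
    return result, perf_counter() - start


# Usage with standard-library stand-ins so the sketch runs anywhere.
json_module = load_optional("json")                        # available
missing = load_optional("a_module_that_is_not_installed")  # None

total, elapsed = timed(sum, range(1000))
print(f"sum took {elapsed:.6f}s, json loaded: {json_module is not None}, missing: {missing}")
```

In the benchmark script, the same guard would wrap `simstring.database.mongo`, so a missing `pymongo` skips only the Mongo run rather than the whole script.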