JamSpell is a spell checking library with the following features:
- accurate - it considers the words' surroundings (context) for better correction
- fast - nearly 5K words per second
- multi-language - it is written in C++ and available for many languages through SWIG bindings

| | Errors | Top 7 Errors | Fix Rate | Top 7 Fix Rate | Broken | Speed (words/second) |
|---|---|---|---|---|---|---|
| JamSpell | 3.25% | 1.27% | 79.53% | 84.10% | 0.64% | 4854 |
| Norvig | 7.62% | 5.00% | 46.58% | 66.51% | 0.69% | 395 |
| Hunspell | 13.10% | 10.33% | 47.52% | 68.56% | 7.14% | 163 |
| Dummy | 13.14% | 13.14% | 0.00% | 0.00% | 0.00% | - |

The model was trained on 300K Wikipedia sentences plus 300K news sentences (English). 95% of the data was used for training and 5% for evaluation. An error model was used to generate errored text from the original text. The JamSpell corrector was compared with Norvig's corrector, Hunspell, and a dummy corrector (no corrections).
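
The setup described above (a 95/5 split plus synthetically errored text) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the lowercase English alphabet, and the four random edit types are hypothetical and are not the error model actually used for the benchmark.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: lowercase English only


def split_train_eval(sentences, eval_fraction=0.05, seed=0):
    """Shuffle sentences and hold out eval_fraction of them for evaluation."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


def add_typo(word, rng):
    """Apply one random edit (delete, insert, replace or swap) to a word."""
    if len(word) < 2:
        return word  # too short to edit meaningfully
    pos = rng.randrange(len(word))
    kind = rng.choice(["delete", "insert", "replace", "swap"])
    if kind == "delete":
        return word[:pos] + word[pos + 1:]
    if kind == "insert":
        return word[:pos] + rng.choice(ALPHABET) + word[pos:]
    if kind == "replace":
        other = rng.choice([c for c in ALPHABET if c != word[pos]])
        return word[:pos] + other + word[pos + 1:]
    # swap two adjacent letters (a no-op if both letters are equal)
    pos = min(pos, len(word) - 2)
    return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]


def make_errored(words, error_rate, seed=0):
    """Return a copy of words where roughly error_rate of them get a typo."""
    rng = random.Random(seed)
    return [add_typo(w, rng) if rng.random() < error_rate else w for w in words]
```

Keeping the original and errored word lists aligned one-to-one makes it straightforward to score a corrector word by word afterwards.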
We used the following metrics:
- Errors - percentage of words with errors after the spell checker has processed the text
- Top 7 Errors - percentage of words missing from the top 7 candidates
- Fix Rate - percentage of errored words fixed by the spell checker
- Top 7 Fix Rate - percentage of errored words fixed by one of the top 7 candidates
- Broken - percentage of non-errored words broken by the spell checker
- Speed - number of words processed per second
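
As a rough illustration of how Errors, Fix Rate and Broken relate to each other, here is a minimal Python sketch over three aligned word lists. The function name `evaluate` and the input layout are assumptions made for this example; it is not the project's evaluation script.

```python
def evaluate(original, errored, corrected):
    """Compute Errors, Fix Rate and Broken (in percent) over aligned word lists.

    original  - the correct words
    errored   - the same words after errors were injected
    corrected - the spell checker's output for the errored words
    """
    assert len(original) == len(errored) == len(corrected)
    total = len(original)
    had_error = [o != e for o, e in zip(original, errored)]
    n_errored = sum(had_error)
    n_clean = total - n_errored

    # Words that are still (or newly) wrong after correction.
    errors_after = sum(o != c for o, c in zip(original, corrected))
    # Errored words that the checker restored to the original.
    fixed = sum(h and o == c for h, o, c in zip(had_error, original, corrected))
    # Correct words that the checker changed into something wrong.
    broken = sum((not h) and o != c for h, o, c in zip(had_error, original, corrected))

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "errors": pct(errors_after, total),
        "fix_rate": pct(fixed, n_errored),
        "broken": pct(broken, n_clean),
    }
```

With these definitions, a dummy corrector that returns its input unchanged has a Fix Rate and a Broken rate of zero, and its Errors value equals the share of errored words in the input.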

To ensure that our model is not too overfitted to Wikipedia + news, we checked it on the text of "The Adventures of Sherlock Holmes":

| | Errors | Top 7 Errors | Fix Rate | Top 7 Fix Rate | Broken | Speed (words/second) |
|---|---|---|---|---|---|---|
| JamSpell | 3.56% | 1.27% | 72.03% | 79.73% | 0.50% | 5524 |
| Norvig | 7.60% | 5.30% | 35.43% | 56.06% | 0.45% | 647 |
| Hunspell | 9.36% | 6.44% | 39.61% | 65.77% | 2.95% | 284 |
| Dummy | 11.16% | 11.16% | 0.00% | 0.00% | 0.00% | - |

More details about reproducing these results are available in the "Train" section.

- Install `swig3` (it is usually available in your distro's package manager)
- Install `jamspell`:
pip install jamspell

import jamspell
corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('model_en.bin')
corrector.FixFragment('I am the begt spell cherken!')
# u'I am the best spell checker!'
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
# (u'best', u'beat', u'belt', u'bet', u'bent', ... )
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)

- Add the `jamspell` and `contrib` dirs to your project
- Use it:
#include <jamspell/spell_corrector.hpp>
int main(int argc, const char** argv) {
NJamSpell::TSpellCorrector corrector;
corrector.LoadLangModel("model.bin");
corrector.FixFragment(L"I am the begt spell cherken!");
// "I am the best spell checker!"
// candidates for the word at position 3 ("begt")
corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 3);
// "best", "beat", "belt", "bet", "bent", ...
// candidates for the word at position 5 ("cherken")
corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 5);
// "checker", "chicken", "checked", "wherein", "coherent", ...
return 0;
}

You can generate extensions for other languages using the SWIG tutorial. The SWIG interface file is `jamspell.i`. Pull requests with build scripts are welcome.

- Runs on port 80 and is open to anyone (not just localhost) by default.
- Expects the model to be in the same folder as `webserver.py` and to be named `medical_model.bin` (since this fork is for medical spell checking).
- Offers a few more options than the C++ server. Specifically, these parameters can be sent with the GET or POST API call:
  - `limit`: limits the number of items per candidate list in the response from the /candidates endpoint, e.g. `/candidates?limit=1&text=blahblah`
  - `html`: if set, returns a human-readable HTML table instead of JSON. Works for /fix and /candidates, e.g. `/fix?html=1&text=blahblah`

python webserver.py
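
For reference, the `limit` and `html` query parameters can be assembled with the Python standard library. This is a minimal sketch: the helper name `build_url` and the default base URL are assumptions for illustration, and only the endpoint and parameter names come from the description above.

```python
from urllib.parse import quote, urlencode


def build_url(endpoint, text, base="http://localhost:80", limit=None, html=False):
    """Build a request URL for the /fix or /candidates endpoint.

    base is a hypothetical default; adjust it to wherever the server runs.
    """
    params = {"text": text}
    if limit is not None:
        params["limit"] = limit  # cap on items per candidate list
    if html:
        params["html"] = 1  # ask for an HTML table instead of JSON
    # quote_via=quote encodes spaces as %20 rather than '+'
    return f"{base}/{endpoint}?{urlencode(params, quote_via=quote)}"
```

For example, `build_url("candidates", "blah blah", limit=1)` yields a URL with the text percent-encoded, which can then be passed to `curl` or `urllib.request.urlopen`.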

- Install `cmake`
- Clone and build medSpellCheck (it includes an HTTP server):
git clone https://github.com/jackneil/medSpellCheck.git
cd medSpellCheck
mkdir build
cd build
cmake ..
make

On Windows, replace the `make` command with:
cmake --build . --target ALL_BUILD --config Release

Then run the HTTP server:
./web_server/web_server en.bin localhost 8080

- GET request example:
$ curl "http://localhost:8080/fix?text=I am the begt spell cherken"
I am the best spell checker

- POST request example:
$ curl -d "I am the begt spell cherken" http://localhost:8080/fix
I am the best spell checker

- Candidates example:
curl "http://localhost:8080/candidates?text=I am the begt spell cherken"
# or
curl -d "I am the begt spell cherken" http://localhost:8080/candidates

{
"results": [
{
"candidates": [
"best",
"beat",
"belt",
"bet",
"bent",
"beet",
"beit"
],
"len": 4,
"pos_from": 9
},
{
"candidates": [
"checker",
"chicken",
"checked",
"wherein",
"coherent",
"cheered",
"cherokee"
],
"len": 7,
"pos_from": 20
}
]
}

Here `pos_from` is the position of the first letter of the misspelled word, and `len` is the length of the misspelled word.
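
To show how `pos_from` and `len` can be used, here is a small Python sketch that replaces each misspelled word with its first candidate. It assumes `pos_from` is a 0-based character offset, which matches the sample response above ("begt" starts at offset 9); the function names are hypothetical.

```python
def misspelled_words(text, response):
    """Return the misspelled substrings referenced by the response."""
    return [text[r["pos_from"]:r["pos_from"] + r["len"]] for r in response["results"]]


def apply_top_candidates(text, response):
    """Replace every misspelled word with its first candidate.

    Replacements run from right to left so that earlier offsets stay valid
    even when a candidate has a different length than the original word.
    """
    results = sorted(response["results"], key=lambda r: r["pos_from"], reverse=True)
    for r in results:
        if not r["candidates"]:
            continue  # nothing to substitute for this word
        start, end = r["pos_from"], r["pos_from"] + r["len"]
        text = text[:start] + r["candidates"][0] + text[end:]
    return text
```

Applied to the sample response, this turns "I am the begt spell cherken" into "I am the best spell checker".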

To train a custom model, you need to:
- Install `cmake`
- Clone and build medSpellCheck:
git clone https://github.com/jackneil/medSpellCheck.git
cd medSpellCheck
mkdir build
cd build
cmake ..
make

- Special Windows instructions for building:
  - You must have Visual Studio 2019 Community Edition (or newer) installed, as well as the Visual Studio 2019 C++ Build Tools.
  - `cmake ..` will produce a broken .exe unless you have followed the step above.
  - Replace the `make` command with the following (note that the jamspell.exe executable will be located in the /build/main/Release/ folder):
cmake --build . --target ALL_BUILD --config Release

- Prepare a UTF-8 text file with sentences to train on (e.g. `sherlockholmes.txt`) and another file with the language alphabet (e.g. `alphabet_en.txt`)
- Train the model:
./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin
- To evaluate the spell checker, you can use the `evaluate/evaluate.py` script:
python evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt
- You can use `evaluate/generate_dataset.py` to generate your train/test data. It supports txt files, the Leipzig Corpora Collection format, and fb2 books.
- Send the server requests like this:
curl "http://localhost:55555/candidates?text=This is a 62 yer old femle with high blod pressur and she has had a lap appendectoy by an aneesthesiologist also she has dibetes mellitus. she takes 50mg of metopfolol per day and an 81mg asprin and 15miligram hydrochlorathiozide plus his mother is a smker and has had a bunch of seezures. they like icee creem and pzza. hx of coranary artery dizease and has had a transeent ishcemic attak"

Here is our hank.ai medical model, pre-trained on a large medical corpus (a few million records):
- medical_model.zip (180 MB)

Here are a few simple models. They were trained on 300K news + 300K Wikipedia sentences. We strongly recommend training your own model on at least a few million sentences to achieve better quality. See the Train section above.