WIP: Use Finite State Transducers (FST) as the backing store for language models #458
FSTs have two nice properties: they compress the n-gram data by exploiting shared prefixes and suffixes, and they can be embedded into the binary in a form that is directly suitable for lookup. This avoids a separate decompression step and indirectly benefits from the memory mappings the operating system already provides for all binaries.
This is still WIP: I do not yet know how to regenerate the language models, and it also seems that the unique models are built elsewhere.
TODO:
Closes #121