WIP: Use Finite State Transducers (FST) as the backing store for language models #458

Draft
wants to merge 1 commit into
base: main
from

Conversation

@adamreichold adamreichold commented Mar 24, 2025

FSTs have two nice properties: they compress the ngram data by exploiting common prefixes and suffixes, and they can be embedded into the binary in a form that is directly suitable for look-up. This avoids the separate decompression step and indirectly uses the memory mappings that the operating system supplies for all binaries.
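
To make the prefix-sharing part concrete, here is a minimal sketch using only the standard library. It is a plain trie, not an FST: it shares prefixes only, whereas an FST also shares suffixes and is serialized into a byte buffer that can be queried in place. The `Trie` type, `build_example`, and the sample ngrams and counts are all hypothetical and are not taken from this PR.

```rust
use std::collections::BTreeMap;

/// A plain prefix trie mapping ngrams to counts.
/// Hypothetical illustration only; an FST would also share suffixes.
#[derive(Default)]
struct Trie {
    children: BTreeMap<u8, Trie>,
    value: Option<u64>,
}

impl Trie {
    /// Walk the key byte by byte, creating nodes only where no shared prefix exists yet.
    fn insert(&mut self, key: &str, value: u64) {
        let mut node = self;
        for &byte in key.as_bytes() {
            node = node.children.entry(byte).or_default();
        }
        node.value = Some(value);
    }

    /// Look up a key by following one edge per byte.
    fn get(&self, key: &str) -> Option<u64> {
        let mut node = self;
        for byte in key.as_bytes() {
            node = node.children.get(byte)?;
        }
        node.value
    }

    /// Number of edges in the trie; each edge stores one key byte.
    fn edges(&self) -> usize {
        self.children.values().map(|child| 1 + child.edges()).sum()
    }
}

/// Build a trie from made-up ngrams and return it with the total length of the raw keys.
fn build_example() -> (Trie, usize) {
    let ngrams = [("the", 5), ("then", 2), ("there", 3), ("these", 1)];
    let mut trie = Trie::default();
    let mut raw_bytes = 0;
    for (key, count) in ngrams {
        trie.insert(key, count);
        raw_bytes += key.len();
    }
    (trie, raw_bytes)
}

fn main() {
    let (trie, raw_bytes) = build_example();
    println!("raw key bytes: {raw_bytes}, trie edges: {}", trie.edges());
    println!("then -> {:?}", trie.get("then"));
}
```

In this toy data set the four keys add up to 17 bytes but need only 8 edges, because `the` is stored once. The trie above still lives in heap-allocated nodes; the point of the FST approach is that the compact form is also the on-disk form, so it can sit in the binary and be queried without a decoding pass.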

This is still WIP since I do not know how to regenerate the language models, and it also seems like the unique ngram models are built elsewhere.

TODO:

  • Integrate unique ngram models
  • Regenerate all language models
  • Drop non-unified models

Closes #121

@adamreichold adamreichold force-pushed the fst-storage branch 3 times, most recently from 0a648a3 to eab2273 Compare March 24, 2025 09:22
Development

Successfully merging this pull request may close these issues.

Reduce resources to load language models
1 participant