fitrat

An NLP library for Uzbek. It includes morphological analysis, transliterators, language identifiers, tokenizers and many more.

It is named after historian and linguist Abdurauf Fitrat, who was one of the creators of Modern Uzbek as well as the first Uzbek professor.

Requirements

Python 3.9 - 3.12

Installation

pip

pip install fitrat

uv

uv add fitrat

Usage

Transliteration

We used hfst library for creating transliterators. This library provides finite-state transducers, a finite-state machines that come very handy for efficient mapping one text to another.

from fitrat import Transliterator, WritingType

t = Transliterator(to=WritingType.LAT)
result = t.convert("Кеча циркка бордим.")
print(result)
# Kecha sirkka bordim.

t2 = Transliterator(to=WritingType.CYR)
result = t2.convert("Kecha sirkka bordim.")
print(result)
# Кеча циркка бордим.

While Cyrillic-Latin conversion is rule-based and simple, the converse is not true. We included special pre-compiled exceptions transducer for Latin-Cyrillic that handles all (to our knowledge) exceptions. We'll continue working on improving on our exceptions list.

If you want to compile the transliterators from source, use the hfst library. The package uses only pre-compiled binaries and hfstol library for efficient lookup.

Language Identification

We can recognize Uzbek text, both Latin or Cyrillic. Additionally, we can recognize other major languages, such as Russian, English, Arabic and etc.

from fitrat import LanguageDetector

lang_detector = LanguageDetector()

print(lang_detector.is_uzbek("bu o'zbekchada yozilgan matn"))
# True

print(lang_detector.is_uzbek("бу нотугри йозилган булсаям, лекин узбекча матн"))
# True

print(lang_detector.is_uzbek("Текст на русском языке"))
# False

Tokenization

from fitrat import word_tokenize

s = "Bugun o'zbekchada gapirishga qaror qildim!"
print(word_tokenize(s))
# ['Bugun', "o'zbekchada", 'gapirishga', 'qaror', 'qildim', '!']

Development

Setup

# uv (recommended)
uv sync --extra dev

# pip
pip install -e ".[dev]"

Running tests

# uv
uv run pytest

# pip
python -m pytest

Linting

# uv
uv run ruff check src/

# pip
python -m ruff check src/

Type checking

# uv
uv run pyright src/

# pip
python -m pyright src/

Authors

Mukhammadsaid Mamasaidov
Jasur Yusupov

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
.github/workflows		.github/workflows
assets		assets
src/fitrat		src/fitrat
tests		tests
.DS_Store		.DS_Store
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

fitrat

Requirements

Installation

pip

uv

Usage

Transliteration

Language Identification

Tokenization

Development

Setup

Running tests

Linting

Type checking

Authors

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

fitrat

Requirements

Installation

pip

uv

Usage

Transliteration

Language Identification

Tokenization

Development

Setup

Running tests

Linting

Type checking

Authors

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages