Turkish Tokenizer

A high-performance Turkish tokenizer with a Rust backend and a Python wrapper, designed for efficient natural language processing of Turkish text.

Features

High Performance: Rust backend for fast tokenization
Turkish Language Support: Optimized for Turkish morphology and grammar
Python Integration: Easy-to-use Python wrapper
Comprehensive Coverage: Handles Turkish roots, suffixes, and BPE tokens
Command Line Interface: CLI tool for batch processing

Installation

pip install turkish-tokenizer

Quick Start

from turkish_tokenizer import TurkishTokenizer

# Initialize the tokenizer
tokenizer = TurkishTokenizer()

# Tokenize text
text = "Merhaba dünya! Bu bir test cümlesidir."
tokens = tokenizer.tokenize(text)
print(tokens)

# Decode tokens back to text
decoded_text = tokenizer.decode(tokens)
print(decoded_text)

Command Line Usage

# Tokenize a text file
turkish-tokenizer tokenize input.txt output.txt

# Decode tokens back to text
turkish-tokenizer decode input_tokens.txt output_text.txt
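
For batch processing from Python instead of the shell, the same idea can be sketched as a small helper. This is a minimal sketch under stated assumptions: `tokenize_file` is a hypothetical helper that is not part of the package, and the output layout (one line of space-separated token IDs per input line) is an assumption for illustration, since the CLI's actual file format is not documented here.

```python
from pathlib import Path
from typing import Callable, List


def tokenize_file(
    tokenize: Callable[[str], List[int]],
    input_path: str,
    output_path: str,
) -> int:
    """Tokenize input_path line by line and write the IDs to output_path.

    Each output line holds the space-separated token IDs of the matching
    input line. This layout is an assumption, not the CLI's documented
    format. Returns the number of lines processed.
    """
    lines = Path(input_path).read_text(encoding="utf-8").splitlines()
    encoded = [" ".join(str(i) for i in tokenize(line)) for line in lines]
    Path(output_path).write_text("\n".join(encoded) + "\n", encoding="utf-8")
    return len(lines)
```

With the package installed, passing `TurkishTokenizer().tokenize` as the first argument should work, given the `tokenize(text: str) -> List[int]` signature listed in the API Reference.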

API Reference

TurkishTokenizer

The main tokenizer class.

Methods

tokenize(text: str) -> List[int]: Tokenize input text into a list of token IDs
decode(tokens: List[int]) -> str: Decode a list of token IDs back to text
encode(text: str) -> List[int]: Alias for the tokenize method
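
The relationship between the three methods can be illustrated with a small round-trip check. This is a sketch, not the library's implementation: `TokenizerLike`, `ToyByteTokenizer`, and `check_round_trip` are hypothetical names introduced here, and the toy class only mimics the documented signatures so the example runs without the package. Whether the real tokenizer restores every input exactly (for example casing or whitespace) is not stated in this README, so treat the check as a test to run rather than a guarantee.

```python
from typing import List, Protocol


class TokenizerLike(Protocol):
    """Structural type covering the three documented methods."""

    def tokenize(self, text: str) -> List[int]: ...
    def decode(self, tokens: List[int]) -> str: ...
    def encode(self, text: str) -> List[int]: ...


class ToyByteTokenizer:
    """Hypothetical stand-in, NOT the real TurkishTokenizer: one ID per UTF-8 byte."""

    def tokenize(self, text: str) -> List[int]:
        return list(text.encode("utf-8"))

    def decode(self, tokens: List[int]) -> str:
        return bytes(tokens).decode("utf-8")

    def encode(self, text: str) -> List[int]:
        # Alias for tokenize, mirroring the documented interface.
        return self.tokenize(text)


def check_round_trip(tokenizer: TokenizerLike, text: str) -> bool:
    """Return True if encode matches tokenize and decode restores the text."""
    ids = tokenizer.tokenize(text)
    return tokenizer.encode(text) == ids and tokenizer.decode(ids) == text
```

Any object with the same three methods can be passed to `check_round_trip`, so swapping the toy class for a `TurkishTokenizer()` instance is enough to run the same check against the real package.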

Development

Setup

Clone the repository
Install development dependencies: pip install -e ".[dev]"
Run tests: pytest

Building

python -m build

License

MIT License - see the LICENSE file for details.

Changelog

0.1.3 (2024-12-19)

FIXED: JSON vocabulary files are now properly included in the package distribution
FIXED: MANIFEST.in corrected to include JSON files from the right directory structure
FIXED: Package data configuration updated to ensure JSON files are bundled

0.1.2 (2024-12-19)

ADDED: Command line interface (CLI) for batch processing
ADDED: Comprehensive test suite
IMPROVED: Better error handling and validation
IMPROVED: Enhanced documentation and examples

0.1.1 (2024-12-19)

FIXED: Package metadata and dependencies
IMPROVED: Better package structure and organization

0.1.0 (2024-12-19)

INITIAL: First release with basic tokenization functionality
FEATURES: Turkish root and suffix matching
FEATURES: BPE tokenization support
FEATURES: Python wrapper for Rust backend

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Author

Ali Bayram - malibayram20@gmail.com

Repository

https://github.com/malibayram/turkish-tokenizer
