A high-performance Turkish tokenizer with a Rust backend and a Python wrapper, designed for efficient natural language processing of Turkish text.
- High Performance: Rust backend for fast tokenization
- Turkish Language Support: Optimized for Turkish morphology and grammar
- Python Integration: Easy-to-use Python wrapper
- Comprehensive Coverage: Handles Turkish roots, suffixes, and BPE tokens
- Command Line Interface: CLI tool for batch processing
pip install turkish-tokenizer

from turkish_tokenizer import TurkishTokenizer
# Initialize the tokenizer
tokenizer = TurkishTokenizer()
# Tokenize text
text = "Merhaba dünya! Bu bir test cümlesidir."
tokens = tokenizer.tokenize(text)
print(tokens)
# Decode tokens back to text
decoded_text = tokenizer.decode(tokens)
print(decoded_text)

# Tokenize a text file
turkish-tokenizer tokenize input.txt output.txt
# Decode tokens back to text
turkish-tokenizer decode input_tokens.txt output_text.txt

TurkishTokenizer: the main tokenizer class.
- tokenize(text: str) -> List[int]: Tokenize input text into token IDs
- decode(tokens: List[int]) -> str: Decode token IDs back to text
- encode(text: str) -> List[int]: Alias for the tokenize method
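The tokenize/decode pair described above can be exercised with a small round-trip check. The following is a minimal sketch, not part of the package: `round_trip`, `TokenizerLike`, and `ToyTokenizer` are hypothetical names introduced here for illustration, and `ToyTokenizer` is only a whitespace-based stand-in so the example runs when `turkish_tokenizer` is not installed. Whether the real tokenizer reproduces the input exactly after decoding is not stated in this README, so the `lossless` flag is something to inspect rather than assume.

```python
from typing import Dict, List, Protocol


class TokenizerLike(Protocol):
    """Any object exposing the tokenize/decode pair listed above."""

    def tokenize(self, text: str) -> List[int]: ...
    def decode(self, tokens: List[int]) -> str: ...


def round_trip(tokenizer: TokenizerLike, text: str) -> dict:
    """Tokenize text, decode the IDs, and report whether the text survived."""
    ids = tokenizer.tokenize(text)
    decoded = tokenizer.decode(ids)
    return {
        "ids": ids,
        "decoded": decoded,
        "num_tokens": len(ids),
        "lossless": decoded == text,
    }


class ToyTokenizer:
    """Hypothetical stand-in (NOT the real tokenizer): one ID per space-separated word."""

    def __init__(self) -> None:
        self._vocab: Dict[str, int] = {}
        self._words: List[str] = []

    def tokenize(self, text: str) -> List[int]:
        ids = []
        for word in text.split(" "):
            if word not in self._vocab:
                # Assign the next free ID to a word seen for the first time.
                self._vocab[word] = len(self._words)
                self._words.append(word)
            ids.append(self._vocab[word])
        return ids

    def decode(self, tokens: List[int]) -> str:
        return " ".join(self._words[i] for i in tokens)

    # Mirrors the documented API, where encode is an alias for tokenize.
    encode = tokenize


if __name__ == "__main__":
    try:
        from turkish_tokenizer import TurkishTokenizer
        tokenizer = TurkishTokenizer()
    except ImportError:
        tokenizer = ToyTokenizer()  # fall back to the stand-in

    report = round_trip(tokenizer, "Merhaba dünya! Bu bir test cümlesidir.")
    print(report["num_tokens"], report["lossless"])
```

Because `round_trip` depends only on the two documented methods, the same helper should work unchanged with a `TurkishTokenizer` instance.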
- Clone the repository
- Install development dependencies:
pip install -e ".[dev]"
- Run tests:
pytest
- Build the package:
python -m build

MIT License - see LICENSE file for details.
- FIXED: JSON vocabulary files are now properly included in the package distribution
- FIXED: MANIFEST.in corrected to include JSON files from the right directory structure
- FIXED: Package data configuration updated to ensure JSON files are bundled
- ADDED: Command line interface (CLI) for batch processing
- ADDED: Comprehensive test suite
- IMPROVED: Better error handling and validation
- IMPROVED: Enhanced documentation and examples
- FIXED: Package metadata and dependencies
- IMPROVED: Better package structure and organization
- INITIAL: First release with basic tokenization functionality
- FEATURES: Turkish root and suffix matching
- FEATURES: BPE tokenization support
- FEATURES: Python wrapper for Rust backend
Contributions are welcome! Please feel free to submit a Pull Request.
Ali Bayram - malibayram20@gmail.com