A comprehensive research project focused on developing advanced tokenization methods for the Turkish language, incorporating linguistic rules, morphological analysis, and semantic understanding.
This project explores multiple approaches to Turkish text tokenization, combining traditional NLP techniques with modern machine learning methods. The goal is to create tokenizers that understand Turkish morphology and semantics, leading to better performance in downstream NLP tasks.
- Semantic Tokenization: Converts Turkish morphological suffixes to language-independent semantic tokens
- Morphological Analysis: Root-suffix separation using Turkish grammatical rules
- Grammatical Event Reversal: Handles Turkish phonological processes (vowel harmony, consonant softening, etc.)
- BPE Integration: Modern subword tokenization with Turkish-specific preprocessing
- Multiple Tokenizer Implementations: Various approaches for different use cases
Semantic Tokenizer (`semantic_tokenizer/`): a semantic tokenization approach that focuses on meaning preservation through grammatical analysis.
Key Components:
- Grammatical Event Revertor (GER) - Reverses Turkish phonological changes
- Semantic Converter - Maps suffixes to language-independent tokens
- Root extraction and phonological process handling
Main Files:
- `guncel_strateji.md` - Current strategy documentation
- `tokenizer_v01.ipynb` - Main implementation notebook
- Various JSON files containing roots, suffixes, and mappings
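As a rough sketch of the GER idea (illustrative only; the actual rule tables live in the JSON files listed above), reversing word-final consonant softening recovers a root's dictionary form:

```python
# Hypothetical sketch: undo word-final consonant softening
# (ünsüz yumuşaması) so a surface root matches the root lexicon.
UNSOFTEN = {"b": "p", "c": "ç", "d": "t", "ğ": "k"}

def revert_softening(surface_root):
    # e.g. "kitab" (as it surfaces in "kitabı") -> dictionary form "kitap"
    if surface_root and surface_root[-1] in UNSOFTEN:
        return surface_root[:-1] + UNSOFTEN[surface_root[-1]]
    return surface_root

print(revert_softening("kitab"))  # kitap
```

In practice this reversal is ambiguous (not every final voiced stop results from softening), so reverted candidates still need to be validated against the root dictionaries.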
Word Segmentation Tool (`tr_tokenizer/`): a Turkish word segmentation tool using morphological analysis with ITU NLP tools integration.
Key Components:
- Root-suffix separation using linguistic rules
- ITU NLP tools integration for morphological analysis
- Frequency-based vocabulary building
- GUI interface for interactive use
Main Files:
- `kelime_bol.py` - Core word segmentation algorithm
- `kokbul.py` - Root finding utilities
- `gui/` - Graphical user interface
- `veri/` - Training data and word lists
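A minimal sketch of the frequency-based vocabulary step (the function name and threshold are illustrative, not the module's actual API):

```python
from collections import Counter

def build_vocab(words, min_freq=5):
    # Count surface-form frequencies across the corpus
    counts = Counter(words)
    # Keep only words that clear the frequency threshold
    return {w for w, c in counts.items() if c >= min_freq}

# e.g. fed with word lists from the veri/ directory
```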
Tokenizer Preparation (`tokenizer_preparation/`): a BPE (Byte Pair Encoding) tokenizer training and preparation pipeline.
Key Components:
- Custom BPE tokenizer training
- Frequency analysis and vocabulary optimization
- Integration with Hugging Face tokenizers
- Performance evaluation tools
Main Files:
- `train_tokenizer.py` - Training pipeline
- `byte_pair_tokenizer.ipynb` - BPE implementation
- Various JSON tokenizer configurations
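Since the pipeline integrates with Hugging Face tokenizers, the core of the training step presumably resembles the sketch below (vocabulary size, file path, and special tokens are placeholders, not the repo's actual configuration):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Empty BPE model with an explicit unknown token
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train on a plain-text Turkish corpus (path is a placeholder)
trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["turkish_corpus.txt"], trainer=trainer)

tokenizer.save("turkish_bpe.json")
print(tokenizer.encode("kitaplarımızdan").tokens)
```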
Install the required dependencies:

```bash
pip install -r requirements.txt
```

To explore the semantic tokenizer:

```bash
# Navigate to the semantic_tokenizer directory
cd semantic_tokenizer

# Open the main notebook
jupyter notebook tokenizer_v01.ipynb
```

To analyze a word with the segmentation tool:

```python
from tr_tokenizer.kelime_bol import kok_tara

# Analyze a Turkish word
word = "kitaplarımızdan"
result = kok_tara(word)
print(f"Analysis: {result}")
```

To train a BPE tokenizer:

```bash
# Navigate to the tokenizer_preparation directory
cd tokenizer_preparation

# Run the training script
python train_tokenizer.py
```

The semantic tokenizer implements a two-step process:
1. Grammatical Event Reversal: identifies and reverses Turkish phonological processes
   - Vowel contraction (Ünlü Daralması)
   - Vowel dropping (Ünlü Düşmesi)
   - Consonant softening (Ünsüz Yumuşaması)
   - Other Turkish-specific sound changes
2. Semantic Conversion: maps morphological suffixes to semantic tokens
   - Language-independent representation
   - Meaning preservation across different surface forms
Example:

```
geldiği → gel (root) + di (past) + ğ (buffer consonant) + i (possessive)
        → gel + <past-suffix> + <possessive-suffix>
```
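A minimal sketch of the semantic conversion step, assuming a toy suffix map (`SEMANTIC_MAP`, `BUFFER_CONSONANTS`, and `to_semantic_tokens` are illustrative names, not the project's actual API):

```python
# Toy subset of a suffix-to-semantic-token map; the real mappings
# live in the project's JSON files.
SEMANTIC_MAP = {
    "di": "<past-suffix>",
    "i": "<possessive-suffix>",
}

# Buffer consonants (per the analysis above) carry no meaning
# of their own and are dropped.
BUFFER_CONSONANTS = {"ğ", "y", "n", "s"}

def to_semantic_tokens(root, suffixes):
    # Map each surface suffix to a language-independent token,
    # skipping buffer consonants entirely.
    tokens = [root]
    for s in suffixes:
        if s in BUFFER_CONSONANTS:
            continue
        tokens.append(SEMANTIC_MAP.get(s, f"<{s}>"))
    return tokens

print(to_semantic_tokens("gel", ["di", "ğ", "i"]))
# ['gel', '<past-suffix>', '<possessive-suffix>']
```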
The word segmentation tool uses rule-based morphological analysis combined with statistical methods (a sketch follows the list):
- Root identification using comprehensive Turkish root dictionaries
- Suffix segmentation based on Turkish morphological rules
- Integration with ITU NLP tools for validation
- Frequency-based vocabulary optimization
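A minimal sketch of longest-root matching against a toy lexicon (the project's real lexicon comes from its comprehensive root dictionaries):

```python
# Toy root lexicon; the project loads these from JSON dictionaries
ROOTS = {"kitap", "gel", "göz"}

def longest_root(word, roots=ROOTS):
    # Try progressively shorter prefixes until one is a known root
    for end in range(len(word), 0, -1):
        if word[:end] in roots:
            return word[:end], word[end:]
    return None, word

print(longest_root("kitaplarımızdan"))  # ('kitap', 'larımızdan')
```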
The BPE pipeline combines traditional BPE with Turkish-specific preprocessing (sketched after the list):
- Morphology-aware subword segmentation
- Custom vocabulary with high-frequency Turkish roots and suffixes
- Integration with modern transformer tokenizers
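One way to make the segmentation morphology-aware, sketched under the assumption that a root/suffix splitter is available (`segment` below is a stand-in; its return format is assumed, not taken from the repo):

```python
def presegment(word, segment):
    # `segment` is any root/suffix splitter, e.g. a wrapper around
    # tr_tokenizer.kelime_bol.kok_tara (return format assumed here)
    root, suffixes = segment(word)
    # Whitespace between morphemes makes them separate pre-tokens,
    # so BPE learns merges inside morphemes, never across them
    return " ".join([root] + suffixes)

# e.g. "kitaplardan" -> "kitap lar dan" before BPE training
```

High-frequency roots and suffixes can additionally be injected into the trainer's vocabulary (for instance via `special_tokens`) so they survive as whole units.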
Data Sources:
- Turkish Wikipedia corpus
- ITU Turkish NLP datasets
- Custom root and suffix dictionaries
- Frequency-analyzed word lists
Key Algorithms:
- Longest root matching for morphological segmentation
- Phonological rule application for surface form generation
- BPE training with linguistic constraints
- Semantic mapping for cross-lingual representation
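For the forward direction (surface-form generation), a simplified consonant-softening rule might look like the sketch below; real Turkish softening is lexically conditioned (kitap → kitabı, but at → atı), so a full system would consult exception lists:

```python
# Simplified consonant softening (ünsüz yumuşaması): a word-final
# voiceless stop becomes voiced before a vowel-initial suffix
SOFTEN = {"p": "b", "ç": "c", "t": "d", "k": "ğ"}
VOWELS = set("aeıioöuü")

def attach(root, suffix):
    # Soften only when the suffix starts with a vowel; lexical
    # exceptions are ignored in this sketch
    if suffix and suffix[0] in VOWELS and root[-1] in SOFTEN:
        root = root[:-1] + SOFTEN[root[-1]]
    return root + suffix

print(attach("kitap", "ı"))    # kitabı
print(attach("kitap", "lar"))  # kitaplar
```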
This tokenizer project supports research in:
- Machine Translation: Better handling of Turkish morphology
- Language Modeling: Improved representation of Turkish text
- Cross-lingual NLP: Semantic token mapping across languages
- Morphological Analysis: Automated Turkish text analysis
This is an active research project with ongoing development in multiple areas:
- ✅ Basic morphological segmentation
- ✅ Semantic token mapping for common suffixes
- ✅ BPE tokenizer training pipeline
- ✅ GUI interface for interactive analysis
- 🔄 Cross-lingual semantic mapping
- 🔄 Performance optimization
- 🔄 Comprehensive evaluation metrics
This is a research project. For collaboration or questions:
- Review the methodology in `semantic_tokenizer/guncel_strateji.md`
- Examine the implementation notebooks
- Check the current development status in individual README files
Requirements:
- Python 3.8+
- Jupyter Notebook
- NumPy, Pandas for data processing
- PyTorch for neural components
- Transformers library for modern tokenizer integration
This project contributes to the field of morphologically rich language processing, specifically addressing challenges in Turkish NLP:
- Agglutinative morphology handling
- Semantic representation across morphological variants
- Integration of linguistic knowledge with statistical methods
- Cross-lingual semantic mapping for multilingual applications
Note: This is a research project focused on advancing Turkish language processing. The implementations are experimental and designed for research purposes.