Welcome to the Multimodal Kazakh Research repo, a work-in-progress collection of tools, scripts, and models aimed at advancing research in multimodal learning (mostly vision) for the Kazakh language.
.
├── projects/
│   └── horde-vision/
│       ├── benchmark/      # Benchmarking apps and model request scripts
│       ├── evaluate/       # Evaluation and metrics calculation
│       ├── scripts/        # Data processing and utility scripts
│       └── train_scripts/  # Training (SFT, RL) and inference scripts
└── results/                # Evaluation results and visualizations
- 🖼️ Build and curate high-quality Kazakh-language multimodal datasets
- 🤖 Train and evaluate multimodal models on Kazakh content
- 🧪 Support downstream tasks such as retrieval, captioning, and VQA
Horde Vision Model Performance Summary
| Model | caption | vqa | ocr | reason | instruct_follow | Avg Rank |
|---|---|---|---|---|---|---|
| horde-vision | 83.5 (12.3%) | 68.1 (5.3%) | 64.7 (2.6%) | 77.4 (5.7%) | 70.5 (5.9%) | #1 |
| Qolda | 75.2 (8.7%) | 61.7 (3.0%) | 60.6 (2.0%) | 70.3 (2.9%) | 62.2 (2.8%) | #2 |
| Qwen3-VL-8B-Instruct | 41.3 (0.5%) | 53.6 (1.1%) | 59.3 (2.1%) | 55.5 (0.7%) | 49.5 (0.9%) | #3 |
| gemma-3-4b-it | 42.0 (0.1%) | 41.8 (0.4%) | 50.3 (2.3%) | 53.0 (0.6%) | 42.5 (0.5%) | #4 |
| Qwen2.5-VL-7B-Instruct | 35.4 (0.0%) | 41.6 (0.4%) | 51.0 (0.9%) | 44.6 (0.3%) | 37.7 (0.3%) | #5 |
| Llama-3.2-11B-Vision | 36.2 (0.1%) | 38.0 (0.3%) | 15.0 (0.1%) | 43.4 (0.3%) | 36.4 (0.3%) | #6 |
| InternVL3-8B | 26.1 (0.6%) | 29.0 (0.0%) | 29.1 (0.3%) | 27.3 (0.0%) | 25.7 (0.0%) | #7 |
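
The README does not state how the `Avg Rank` column is derived. One plausible reading, shown here as a minimal sketch rather than the repo's actual evaluation code, is to rank the models within each task by score and then average those per-task ranks. The names `SCORES`, `average_ranks`, and `leaderboard` are hypothetical and introduced only for illustration; the numbers are the main scores copied from the table above.

```python
# Main scores copied from the table above.
# Column order: caption, vqa, ocr, reason, instruct_follow.
SCORES = {
    "horde-vision":           [83.5, 68.1, 64.7, 77.4, 70.5],
    "Qolda":                  [75.2, 61.7, 60.6, 70.3, 62.2],
    "Qwen3-VL-8B-Instruct":   [41.3, 53.6, 59.3, 55.5, 49.5],
    "gemma-3-4b-it":          [42.0, 41.8, 50.3, 53.0, 42.5],
    "Qwen2.5-VL-7B-Instruct": [35.4, 41.6, 51.0, 44.6, 37.7],
    "Llama-3.2-11B-Vision":   [36.2, 38.0, 15.0, 43.4, 36.4],
    "InternVL3-8B":           [26.1, 29.0, 29.1, 27.3, 25.7],
}


def average_ranks(scores):
    """Return each model's mean per-task rank (1 = best score on that task).

    Assumption: higher scores are better and there are no ties within a task,
    which holds for the table above; ties would need an explicit policy.
    """
    models = list(scores)
    n_tasks = len(next(iter(scores.values())))
    totals = {model: 0 for model in models}
    for task in range(n_tasks):
        # Sort models from best to worst on this task.
        ordered = sorted(models, key=lambda m: scores[m][task], reverse=True)
        for rank, model in enumerate(ordered, start=1):
            totals[model] += rank
    return {model: totals[model] / n_tasks for model in models}


avg = average_ranks(SCORES)
# A lower average rank is better, so sort ascending.
leaderboard = sorted(avg, key=avg.get)
for position, model in enumerate(leaderboard, start=1):
    print(f"#{position} {model}: average rank {avg[model]:.1f}")
```

Under this assumption the resulting order matches the `Avg Rank` column, which suggests the column is at least consistent with a per-task rank average, though the repo may compute it differently.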

