Welcome to SindhiMate, an open-source initiative to preserve and digitize the rich heritage of the Sindhi script through a unique multi-modal dataset of handwritten Sindhi characters and their pronunciations. We are thrilled to announce that the Digital Sindhi Script Dataset is now publicly available on Kaggle! This dataset, developed as part of the final year project of students Shayan Ali Shaikh, Muhammad Hamza Shaikh, and Hamna Rajput under the supervision of Dr. Attaullah Sahito, is designed to empower research in machine learning, optical character recognition (OCR), natural language processing (NLP), computer vision, speech processing, and cultural heritage preservation.
Dataset Link: Digital Sindhi Script on Kaggle
The Digital Sindhi Script Dataset is a carefully curated, multi-modal collection of handwritten Sindhi characters paired with their corresponding pronunciation data. It aims to advance computational research while celebrating the cultural and linguistic significance of the Sindhi language. The dataset is structured to support machine learning tasks, with pre-processed splits for training, testing, and validation.
- Multi-Modal Data: Includes high-resolution images of handwritten Sindhi characters and audio recordings of their pronunciations, enabling cross-modal research in vision and speech.
- Diverse Data: Collected from over 30 schools across Shikarpur, Pakistan, ensuring a wide variety of handwriting styles and speaker demographics.
- High-Quality Annotations: Images are labeled and augmented using Roboflow for precise character annotations; audio is annotated for accurate pronunciation mapping (audio data will be released soon).
- ML-Ready: Image data is processed with OpenCV and pronunciation audio is standardized (audio data will be released soon); the dataset ships with organized train, test, and validation sets.
- Public Access: Hosted on Kaggle in compliance with data-sharing standards, balancing accessibility with contributor rights.
The creation of SindhiMate was a collaborative effort undertaken as a final year project by students Shayan Ali Shaikh, Muhammad Hamza Shaikh, and Hamna Rajput, under the guidance of Dr. Attaullah Sahito and co-supervisor Mr. Asadullah Bhatti. The process involved:
- Data Collection: Gathered handwritten Sindhi character samples and corresponding pronunciation recordings from students and educators across 30+ schools in Shikarpur, ensuring diversity in writing styles and vocal characteristics.
- Annotation: Image annotations were performed using Roboflow for accurate character labeling and augmentation. Audio data (to be released soon) is annotated to map pronunciations to characters, ensuring consistency.
- Preprocessing: Images were processed using OpenCV and split into training, testing, and validation sets. Audio data is standardized for compatibility with speech processing pipelines.
- Images: High-resolution scans of handwritten Sindhi characters, organized by character class.
- Pronunciations: Audio recordings of Sindhi character pronunciations, labeled to correspond with image data.
- Splits: Pre-processed train (70%), test (20%), and validation (10%) sets for both image and audio data, ready for machine learning workflows.
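The dataset already ships with these splits, so the following is only a minimal sketch of how a comparable 70/20/10 split could be reproduced per character class, for example when re-splitting with a different seed. It is not the authors' actual pipeline. The function name `stratified_split`, the file names, and the class labels are all hypothetical, and only the Python standard library is used.

```python
import random
from collections import defaultdict


def stratified_split(samples, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split (path, label) pairs into train/test/validation sets per class.

    `ratios` follows the 70/20/10 proportions described above; the
    validation set receives whatever remains after train and test.
    """
    # Group file paths by their character class label.
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "test": [], "validation": []}
    for label in sorted(by_label):
        paths = sorted(by_label[label])
        rng.shuffle(paths)
        n_train = round(len(paths) * ratios[0])
        n_test = round(len(paths) * ratios[1])
        splits["train"] += [(p, label) for p in paths[:n_train]]
        splits["test"] += [(p, label) for p in paths[n_train:n_train + n_test]]
        splits["validation"] += [(p, label) for p in paths[n_train + n_test:]]
    return splits


# Hypothetical sample list: 3 character classes with 10 images each.
samples = [
    (f"class_{c}/img_{i:03d}.png", f"class_{c}")
    for c in range(3)
    for i in range(10)
]
splits = stratified_split(samples)
print({name: len(items) for name, items in splits.items()})
```

Splitting within each class keeps rare characters represented in every set, which matters when some characters have fewer handwritten samples than others.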
We invite researchers, linguists, AI enthusiasts, and cultural heritage advocates to explore the Digital Sindhi Script Dataset for projects in:
- Natural Language Processing (NLP): Develop models for Sindhi text recognition and speech-to-text systems.
- Computer Vision: Build and test OCR systems for handwritten Sindhi script.
- Speech Processing: Create models for pronunciation recognition and synthesis of Sindhi characters.
- Cultural Heritage Preservation: Contribute to the digitization and preservation of the Sindhi language through multi-modal AI applications.
Get Started: Access the dataset on Kaggle. Please review the terms of use to ensure compliance with contributor rights.
We encourage the community to use this dataset and share their findings! If you create models, visualizations, or research papers using SindhiMate, please:
- Open a pull request to share your code or insights.
- Tag us in your publications or projects to highlight your work.
- Provide feedback or suggest improvements via GitHub Issues.
This project would not have been possible without the incredible support of:
- Schools in Shikarpur: For providing diverse handwritten samples and pronunciation recordings.
- Annotators: For their meticulous work in labeling image and audio data.
- Kaggle Team: For their guidance in making the dataset publicly accessible.
- Supervisors:
  - Dr. Attaullah Sahito: For invaluable guidance and leadership throughout the final year project.
  - Mr. Asadullah Bhatti: For expertise and support as co-supervisor.
- Shayan Ali Shaikh (@shayanalishaikh)
- Muhammad Hamza Shaikh
- Hamna Rajput
Made with contrib.rocks.
The Digital Sindhi Script Dataset is available under the terms specified on the Kaggle dataset page. Please review the license and terms to ensure proper use and attribution.
For questions, collaboration opportunities, or data access requests, please reach out via GitHub Issues or contact the dataset creator on Kaggle (@shayanalishaikh).
Let's celebrate the digital future of the Sindhi script together! Join us in advancing multi-modal AI research and preserving cultural heritage.