Tashkila

Project Pipeline:

Data Pre-Processing:

Cleaning and processing input dataset.
Tokenization using byte pair encoding.
Feature extraction.
Continuous bag of words (CBOW).
Embedding layer.
One hot encoding.

Training our Model:

Manipulating all parameters like number of layers, embedding size, learning rate, number of epochs, batch size, etc.
Saving and loading the model with checkpoints and evaluating them to know how DER is affected.
Calculating loss:
- The loss is calculated after each batch.
- Total cross-entropy is calculated after each epoch.
- Updating weights and repeating.

Evaluation:

Generating the files as stated in the instructions.
Calculating the DER.

Data Pre-Processing:

Deleting Arabic and English numbers.
Deleting and reducing spaces.
Splitting on punctuation marks that clearly split sentences.
Splitting the data into X, X_words, and Y if the data is labeled and saving them as letters and words.
Merging all the text found in the training and validation dataset into one text file for tokenization purposes.

Samples:

Models:

Sequence to Sequence (Word-Letter):
- Description: Encoder and decoder built using LSTM. Encoder input is the sequence of words of the sentence, and the decoder input is the characters of the sentence.
- Best Accuracy: 76%
Sequence to Sequence (Letter-Letter):
- Description: Similar to the previous model but with encoder input as the sequence of letters and decoder input as a sequence of labels.
- Best Accuracy: 75%
Sequence to Sequence with Attention (Word-Letter):
- Description: Used attention mechanism to focus on specific parts of the sentence.
- Not trained due to high computational cost.
Decoder (LSTM):
- Description: Consists of only the decoder. Input is the character to be diacritized, output is the diacritics.
- Best Accuracy: 92%
(ACCIO) Multi-Layer Bidirectional LSTM:
- Description: Consists of 3 bidirectional LSTM layers. Takes CBOW-generated embeddings as initialization.
- Best Accuracy: 98.6%
(ACCIO) Multi-Layer Bidirectional LSTM with Word2Vec:
- Description: Similar to the previous model but with Word2Vec embeddings.
- Accuracy: Comparable to the previous model.
(ACCIO) Multi-Layer with Deeper Layers:
- Description: Similar to the normal ACCIO model but with more and deeper fully connected layers at the end.
- Accuracy: Comparable to the previous model but weaker during training.

Feature Extraction:

CBOW Word2Vec:
- Implementation of CBOW from lectures.
- Used different window sizes to capture longer contexts.
- Generated sentence embedding by averaging word embeddings.
One Hot Encoding
Trainable Embedding Layer:
- Embedding layer used for word representations.

DER for all Trials:

Sequence to Sequence (Word-Letter): 24%
Sequence to Sequence (Letter-Letter): 25%
Decoder: 8%
ACCIO: 1.4%

The model used in the submission is ACCIO due to its lowest DER.

Demos:

Input:

ذهب محمد وفادي إلى المدرسة وهناك يلتقيان بأصدقائهما ، ويتعلمان الكثير من الأشياء الجديدة والمفيدة.

Output:

ذَهَبَ مُحَمَّدٌ وَفَادِي إلَى الْمَدْرَسَةِ وَهُنَاكَ يَلْتَقِيَانِ بِأَصْدِقَائِهِمَا ، وَيَتَعَلَّمَانِ الْكَثِيرَ مِنْ الْأَشْيَاءِ الْجَدِيدَةِ وَالْمُفِيدَةِ

Input:

صعد أدهم وحمزة الشجرة حتى يقطعا الطريق على الأشرار وينقذا الأميرة ولكن لم ينجحا في ذلك ولكنهما تمكنا من الهرب والعودة إلى القرية وتقرر الأميرة الزواج من أدهم وتعيشا سعيدين معا، ولكن قال حمزة: لا يوجد أشرار على الشجرة

Output:

صُعِدْ أَدْهَمَ وَحَمْزَةَ الشَّجَرَةِ حَتَّى يَقْطَعَا الطَّرِيقَ عَلَى الْأَشْرَارِ وَيُنْقِّذَا الْأَمِيرَةَ وَلَكِنْ لَمْ يُنَجِّحَا فِي ذَلِكَ وَلَكِنَّهُمَا تَمَكَّنَا مِنْ الْهَرَبِ وَالْعَوْدَةِ إِلَى الْقَرْيَةِ وَتَقَرَّرُ الْأَمِيرَةُ الزَّوَاجُ مِنْ أَدْهِمَ وَتَعَيُّشًا سَعِيدَيْنِ مَعًا، وَلَكِنْ قَالَ حَمْزَةُ لَا يُوجَدُ أَشْرَارٌ عَلَى الشَّجَرَةِ

Name		Name	Last commit message	Last commit date
Latest commit History 78 Commits
Delivery		Delivery
constants		constants
dataset		dataset
embedding		embedding
images		images
models		models
tokenization		tokenization
.gitignore		.gitignore
CBOW.ipynb		CBOW.ipynb
CBOW.py		CBOW.py
NLP-Project-F23.pdf		NLP-Project-F23.pdf
README.md		README.md
accio.py		accio.py
bilstm.py		bilstm.py
byte_pair_encoding.py		byte_pair_encoding.py
decoder_training.ipynb		decoder_training.ipynb
decoder_training.py		decoder_training.py
embedding_weights.npy		embedding_weights.npy
generate_labels.ipynb		generate_labels.ipynb
generate_labels.py		generate_labels.py
letters_dataset.py		letters_dataset.py
out_encoder.py		out_encoder.py
testing.ipynb		testing.ipynb
text_clean.ipynb		text_clean.ipynb
text_clean.py		text_clean.py
text_encoder.py		text_encoder.py
tokenize.py		tokenize.py
train_collections.py		train_collections.py
training.ipynb		training.ipynb
training.py		training.py
words_dataset.py		words_dataset.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tashkila

Project Pipeline:

Data Pre-Processing:

Training our Model:

Evaluation:

Data Pre-Processing:

Samples:

Models:

Feature Extraction:

DER for all Trials:

Demos:

Input:

Output:

Input:

Output:

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Tashkila

Project Pipeline:

Data Pre-Processing:

Training our Model:

Evaluation:

Data Pre-Processing:

Samples:

Models:

Feature Extraction:

DER for all Trials:

Demos:

Input:

Output:

Input:

Output:

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages