This repository contains a Neural Machine Translation (NMT) project implemented in Python. The project includes a Jupyter Notebook (NMTranslation.ipynb) that demonstrates the end-to-end process of building a French-to-English translation model using sequence-to-sequence modeling techniques. The model includes Luong attention and scaled dot-product attention mechanisms.
DATASET INFO
- Dataset: Tatoeba French-to-English sentence pairs
- Number of sentence pairs: 400,000+
config.py - Centralized configuration for hyperparameters, file paths, and device settings
train.py - Main script that loads data, builds tokenizers, trains both models, and saves them
inference.py - Provides translation functions and loads saved models for inference
eval.py - Evaluates model performance with BLEU scores for the seq2seq and attention models
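As a rough illustration of what config.py centralizes, the sketch below shows the kind of hyperparameters, paths, and device selection a file like this typically holds. All names and values here are hypothetical, not the repository's actual settings.

```python
# Hypothetical config sketch; the real config.py may use different names/values.
EMBEDDING_DIM = 256
HIDDEN_DIM = 512
BATCH_SIZE = 64
NUM_EPOCHS = 10
LEARNING_RATE = 1e-3

# Illustrative file paths for the dataset and saved checkpoints
DATA_PATH = "data/fra-eng.txt"
MODEL_DIR = "saved_models/"

# Prefer GPU when PyTorch is installed and CUDA is available; otherwise use CPU
try:
    import torch
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    DEVICE = "cpu"
```

Keeping these in one module lets train.py, inference.py, and eval.py import the same values instead of duplicating them.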
Models Directory
attention.py - Implements the Luong attention mechanism for the attention-based decoder
encoder.py - Contains both EncoderNoAttention and EncoderWithAttention classes
decoder.py - Contains both DecoderNoAttention and DecoderWithAttention classes
seq2seq.py - Wraps encoder-decoder pairs into complete Seq2Seq models
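To make the attention mechanism concrete, here is a minimal NumPy sketch of Luong-style (dot) attention: the decoder state is scored against each encoder state, the scores are softmax-normalized, and the weights form a context vector. The repository's attention.py works on batched PyTorch tensors and may use a different scoring variant; this is only an illustration of the idea.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def luong_dot_attention(decoder_hidden, encoder_outputs):
    """Luong-style (dot) attention sketch.

    decoder_hidden:  (hidden_dim,)         current decoder state
    encoder_outputs: (src_len, hidden_dim) one encoder state per source token
    Returns the context vector and the attention weights.
    """
    # Alignment scores: dot product of decoder state with each encoder state
    scores = encoder_outputs @ decoder_hidden   # (src_len,)
    weights = softmax(scores)                   # (src_len,), sums to 1
    context = weights @ encoder_outputs         # (hidden_dim,)
    return context, weights

# Toy example: 3 source positions, hidden size 4; the decoder state is
# strongly aligned with source position 0, so most weight lands there.
enc = np.eye(3, 4)
dec = np.array([10.0, 0.0, 0.0, 0.0])
context, weights = luong_dot_attention(dec, enc)
```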
Utils Directory
tokenizer.py - Custom tokenizer class for converting text to sequences and vice versa
preprocessing.py - Functions for Unicode normalization, sentence preprocessing, and dataset processing
dataset.py - Masked cross-entropy loss and DataLoader creation utilities
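The masked cross-entropy loss in dataset.py exists because target sentences are padded to a common length, and the padding tokens should not contribute to the loss. A minimal NumPy sketch of the idea (the actual function in dataset.py operates on PyTorch tensors and its signature may differ):

```python
import numpy as np

def masked_cross_entropy(log_probs, targets, pad_id=0):
    """Cross-entropy averaged only over non-padding target positions.

    log_probs: (seq_len, vocab_size) log-probabilities from the decoder
    targets:   (seq_len,)            gold token ids; pad_id marks padding
    """
    mask = targets != pad_id
    # Negative log-likelihood of each gold token
    nll = -log_probs[np.arange(len(targets)), targets]
    # Average over real tokens only, so padding does not dilute the loss
    return (nll * mask).sum() / mask.sum()

# Toy example: vocab of 3, uniform predictions, last position is padding.
log_probs = np.log(np.full((3, 3), 1.0 / 3.0))
targets = np.array([1, 2, 0])  # token id 0 is the pad token here
loss = masked_cross_entropy(log_probs, targets)
```

With uniform predictions the per-token loss is log(3), and masking ensures the padded third position does not change the average.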