Repository for Assignment 2 of the course CS613, Natural Language Processing, on Language Modeling and Smoothing.
In this assignment, we train n-gram models from unigram to quadgram on the provided dataset, implement different smoothing techniques, and compare the perplexities of the resulting models. Details of each task can be found here - documentation
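To illustrate the core idea (this is a minimal sketch, not the repository's `NGramProcessor.py` implementation), the snippet below trains a bigram model with Laplace (add-one) smoothing and computes perplexity on a sentence; the corpus and token names are made up for the example:

```python
# Minimal sketch: bigram model + Laplace (add-one) smoothing + perplexity.
# Not the repository's implementation; for illustration only.
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over tokenized sentences padded with <s>/</s>."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def laplace_prob(w_prev, w, unigrams, bigrams, vocab_size):
    # Add-one smoothing: (count(w_prev, w) + 1) / (count(w_prev) + V),
    # so unseen bigrams still get nonzero probability.
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

def perplexity(sent, unigrams, bigrams, vocab_size):
    tokens = ["<s>"] + sent + ["</s>"]
    log_prob = sum(
        math.log(laplace_prob(p, w, unigrams, bigrams, vocab_size))
        for p, w in zip(tokens, tokens[1:])
    )
    # Perplexity = exp(-average log-probability per predicted token).
    return math.exp(-log_prob / (len(tokens) - 1))

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
unigrams, bigrams = train_bigram(corpus)
V = len(unigrams)  # vocabulary size including <s> and </s>
pp = perplexity(["the", "cat", "sat"], unigrams, bigrams, V)
```

Lower perplexity indicates the model assigns higher probability to the held-out text; the repository repeats this comparison for each n-gram order and smoothing method.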
The repository contains the following folders:
- average_preplexity: This folder contains CSV files with the average perplexities over all models, for both the train and test datasets.
- dataset: This folder contains CSV files with all the data, including the raw data, the processed data, and the train and test subsets.
- perplexities: This folder contains one subfolder per smoothing technique. Each subfolder contains CSV files with the perplexities of each model after smoothing.
- Plots: This folder contains image files of plots showing how perplexity trends under the different smoothing techniques vary across the models.
The repository contains the following Python files:
- NGramProcessor.py: This file contains the class for the n-gram model, including the methods to calculate the model's perplexities.
- ngram_train.py: This file contains the code that creates and trains the n-gram models and computes their perplexities.
- plot_saver.py: This file contains functions to plot and save the average perplexities.
- preprocessing.py: This file contains a function to preprocess the data and save it.
Usage can be found in point 5 of the documentation.