author: Yuqing Qiao
date: June 2024
- Introduction
- Byte Pair Encoding (BPE) Tokenization
- Sentiment Analysis on Movie Reviews
- Emotion Detection with TensorFlow
- Word Embedding Models
- LLM
This project aims to implement several natural language processing (NLP) techniques from scratch, including Byte Pair Encoding (BPE) for tokenization, sentiment analysis with multiple classifiers, emotion detection using TensorFlow, and word embedding models like Skip-gram and CBOW. I also explore the visualization of classifier decisions and word embeddings to understand the underlying patterns in the data.
Byte Pair Encoding (BPE) is a subword tokenization technique that iteratively merges the most frequent pair of bytes or characters in a given text. This approach helps in handling out-of-vocabulary words more effectively.
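The merge loop can be sketched as follows. This is a simplified illustration, not the repository's implementation: words are represented as space-separated symbol sequences with an end-of-word marker, and the most frequent adjacent pair is merged each round.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge the chosen pair into a single symbol in every word."""
    merged, joined = " ".join(pair), "".join(pair)
    return {word.replace(merged, joined): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Iteratively merge the most frequent adjacent pair."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Toy corpus: word -> frequency, symbols separated by spaces.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, vocab = learn_bpe(vocab, 3)
```

The learned `merges` list is what a BPE tokenizer replays, in order, to segment unseen words into known subwords.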
I performed sentiment analysis on movie reviews, aiming to classify them as positive or negative using Naive Bayes (NB), Logistic Regression (LR), and Multi-Layer Perceptron (MLP) classifiers, and compared their performance on accuracy, precision, recall, and F1-score.
- Naive Bayes, Logistic Regression, and Multi-Layer Perceptron (MLP) classifiers are implemented to predict the sentiment of movie reviews.
- Sentiment Analysis Implementation Folder
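The project implements these classifiers from scratch; as an illustrative sketch, the same comparison can be expressed with scikit-learn equivalents (the toy reviews below are placeholder data, not the project's corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

# Placeholder reviews and labels (1 = positive, 0 = negative).
reviews = ["a wonderful moving film", "boring and predictable plot",
           "great acting and story", "terrible waste of time"]
labels = [1, 0, 1, 0]

# Bag-of-words features shared by all three classifiers.
X = CountVectorizer().fit_transform(reviews)

results = {}
for clf in (MultinomialNB(),
            LogisticRegression(max_iter=1000),
            MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)):
    clf.fit(X, labels)
    results[type(clf).__name__] = clf.predict(X)
```

From here, `sklearn.metrics` functions such as `accuracy_score` and `classification_report` compute the metrics listed above for each model.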
- We use t-SNE to visualize the decision boundaries of the NB, LR, and MLP classifiers and to understand how each model separates the different topics across the article collection.
- Classifiers Visualization Folder
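The t-SNE step reduces the high-dimensional document features to two dimensions for plotting. A minimal sketch (random vectors stand in for the real document features, which is an assumption for illustration only):

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder for high-dimensional document features (e.g. bag-of-words vectors).
rng = np.random.default_rng(0)
features = rng.normal(size=(40, 50))

# Project to 2D; perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(features)

# coords[:, 0] and coords[:, 1] can then be scattered and colored by each
# classifier's predicted label to see how the models separate the data.
```

Note that t-SNE preserves local neighborhoods rather than global distances, so cluster sizes and inter-cluster gaps in the plot should not be over-interpreted.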
We implement term frequency (TF) and term frequency-inverse document frequency (TF-IDF) from scratch and use them to train a TensorFlow model that classifies six different emotions in a corpus.
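A from-scratch TF-IDF can be sketched as below. The smoothing choice (`log(N / df) + 1`) is one common variant and an assumption here, not necessarily the formula used in this project:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency: count each term once per doc
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # one common smoothing choice
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors

docs = [["i", "feel", "happy"], ["i", "feel", "sad"], ["so", "happy", "today"]]
vecs = tf_idf(docs)
```

The resulting sparse vectors (densified over the vocabulary) are what the TensorFlow classifier consumes as input features.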
We explore word embeddings through self-implemented Skip-gram and Continuous Bag of Words (CBOW) models, focusing on capturing semantic relationships between words.
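At the core of Skip-gram is predicting context words from a center word within a sliding window. The training-pair generation can be sketched as (a simplified illustration, not the repository's code):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for Skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(tokens, window=1)
```

CBOW inverts the objective: the averaged context-word vectors predict the center word instead.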
- We visualize word embeddings with PCA and cosine similarity to examine the semantic relationships captured by the Skip-gram and CBOW models.
- Word Embeddings Visualization Folder
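Both visualization tools above are small computations. A minimal sketch (the random matrix stands in for a trained embedding matrix, which is an assumption for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder for a trained embedding matrix: 5 words, 16 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 16))

# PCA projects the embeddings to 2D for a scatter plot labeled by word.
coords = PCA(n_components=2).fit_transform(embeddings)

# Cosine similarity between two word vectors, in [-1, 1].
sim = cosine_similarity(embeddings[0], embeddings[1])
```

Unlike t-SNE, PCA is a linear projection, so relative distances along the top principal components are directly comparable across the plot.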