author: Yuqing Qiao
date: June 2024
- Introduction
- Byte Pair Encoding (BPE) Tokenization
- Sentiment Analysis on Movie Reviews
- Emotion Detection with TensorFlow
- Word Embedding Models
- LLM
This project aims to implement several natural language processing (NLP) techniques from scratch, including Byte Pair Encoding (BPE) for tokenization, sentiment analysis with multiple classifiers, emotion detection using TensorFlow, and word embedding models like Skip-gram and CBOW. I also explore the visualization of classifier decisions and word embeddings to understand the underlying patterns in the data.
Byte Pair Encoding (BPE) is a subword tokenization technique that iteratively merges the most frequent pair of bytes or characters in a given text. This approach helps in handling out-of-vocabulary words more effectively.
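The merge loop can be sketched as follows. This is a simplified illustration, not the repository's implementation: words are represented as space-separated symbol sequences with an end-of-word marker, and the most frequent adjacent pair is merged each round.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge the chosen pair into a single symbol in every word."""
    merged, joined = " ".join(pair), "".join(pair)
    return {word.replace(merged, joined): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Iteratively merge the most frequent adjacent pair."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Toy corpus: word -> frequency, symbols separated by spaces.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, vocab = learn_bpe(vocab, 3)
```

The learned `merges` list is what a BPE tokenizer replays, in order, to segment unseen words into known subwords.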
I performed sentiment analysis on movie reviews, aiming to classify them as positive or negative using Naive Bayes (NB), Logistic Regression (LR), and Multi-Layer Perceptron (MLP) classifiers, and compared their performance on accuracy, precision, recall, and F1-score.
- Naive Bayes, Logistic Regression, and Multi-Layer Perceptron (MLP) classifiers are implemented to predict the sentiment of movie reviews.
- Sentiment Analysis Implementation Folder
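The project implements these classifiers from scratch; as an illustrative sketch, the same comparison can be expressed with scikit-learn equivalents (the toy reviews below are placeholder data, not the project's corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

# Placeholder reviews and labels (1 = positive, 0 = negative).
reviews = ["a wonderful moving film", "boring and predictable plot",
           "great acting and story", "terrible waste of time"]
labels = [1, 0, 1, 0]

# Bag-of-words features shared by all three classifiers.
X = CountVectorizer().fit_transform(reviews)

results = {}
for clf in (MultinomialNB(),
            LogisticRegression(max_iter=1000),
            MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)):
    clf.fit(X, labels)
    results[type(clf).__name__] = clf.predict(X)
```

From here, `sklearn.metrics` functions such as `accuracy_score` and `classification_report` compute the metrics listed above for each model.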
- We use t-SNE to visualize the decision boundaries of the NB, LR, and MLP classifiers and to understand how each model separates the different topics across the article collection.
- Classifiers Visualization Folder
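The t-SNE step reduces the high-dimensional document features to two dimensions for plotting. A minimal sketch (random vectors stand in for the real document features, which is an assumption for illustration only):

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder for high-dimensional document features (e.g. bag-of-words vectors).
rng = np.random.default_rng(0)
features = rng.normal(size=(40, 50))

# Project to 2D; perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(features)

# coords[:, 0] and coords[:, 1] can then be scattered and colored by each
# classifier's predicted label to see how the models separate the data.
```

Note that t-SNE preserves local neighborhoods rather than global distances, so cluster sizes and inter-cluster gaps in the plot should not be over-interpreted.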
We implement term frequency (TF) and term frequency-inverse document frequency (TF-IDF) from scratch and use them to train a TensorFlow model that classifies six different emotions in a corpus.
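A from-scratch TF-IDF can be sketched as below. The smoothing choice (`log(N / df) + 1`) is one common variant and an assumption here, not necessarily the formula used in this project:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency: count each term once per doc
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # one common smoothing choice
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors

docs = [["i", "feel", "happy"], ["i", "feel", "sad"], ["so", "happy", "today"]]
vecs = tf_idf(docs)
```

The resulting sparse vectors (densified over the vocabulary) are what the TensorFlow classifier consumes as input features.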
We explore word embeddings through self-implemented Skip-gram and Continuous Bag of Words (CBOW) models, focusing on capturing semantic relationships between words.
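At the core of Skip-gram is predicting context words from a center word within a sliding window. The training-pair generation can be sketched as (a simplified illustration, not the repository's code):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for Skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(tokens, window=1)
```

CBOW inverts the objective: the averaged context-word vectors predict the center word instead.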
- We visualize word embeddings with PCA and cosine similarity to examine the semantic relationships captured by the Skip-gram and CBOW models.
- Word Embeddings Visualization Folder
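Both visualization tools above are small computations. A minimal sketch (the random matrix stands in for a trained embedding matrix, which is an assumption for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder for a trained embedding matrix: 5 words, 16 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 16))

# PCA projects the embeddings to 2D for a scatter plot labeled by word.
coords = PCA(n_components=2).fit_transform(embeddings)

# Cosine similarity between two word vectors, in [-1, 1].
sim = cosine_similarity(embeddings[0], embeddings[1])
```

Unlike t-SNE, PCA is a linear projection, so relative distances along the top principal components are directly comparable across the plot.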