williamQyq/NLP

NLP && LLM

author: Yuqing Qiao
date: June 2024

Introduction

This project implements several natural language processing (NLP) techniques from scratch, including Byte Pair Encoding (BPE) for tokenization, sentiment analysis with multiple classifiers, emotion detection using TensorFlow, and word embedding models such as Skip-gram and CBOW. We also visualize classifier decisions and word embeddings to understand the underlying patterns in the data.

Byte Pair Encoding (BPE) Tokenization

Byte Pair Encoding (BPE) is a subword tokenization technique that iteratively merges the most frequent pair of adjacent bytes or characters in a corpus. Because unseen words can still be split into learned subword units, this approach handles out-of-vocabulary words more effectively than word-level tokenization.
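The merge loop can be sketched in plain Python (a minimal illustration of the algorithm, not the repository's actual implementation; words are pre-split into space-separated symbols with an end-of-word marker `</w>`):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace each occurrence of the adjacent pair with one merged symbol."""
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[" ".join(out)] = freq
    return new_vocab

def learn_bpe(vocab, num_merges):
    """Iteratively merge the most frequent pair, recording each merge."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return vocab, merges

# Toy corpus: words pre-split into characters plus an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
vocab, merges = learn_bpe(vocab, 10)
```

Each learned merge is recorded in order, so new text can later be tokenized by replaying the merges.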

Sentiment Analysis on Movie Reviews

We perform sentiment analysis on movie reviews, classifying each review as positive or negative with Naive Bayes (NB), Logistic Regression (LR), and Multi-Layer Perceptron (MLP) classifiers, and compare their performance on accuracy, precision, recall, and F1-score.

Classifier Implementation
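As one example of the classifiers above, a minimal from-scratch multinomial Naive Bayes with add-one (Laplace) smoothing might look like this (a sketch on toy data; the repository's actual implementation and training corpus may differ):

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        n = len(labels)
        self.classes = sorted(set(labels))
        self.log_priors = {c: math.log(labels.count(c) / n) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, c in zip(docs, labels):
            self.word_counts[c].update(doc.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.totals = {c: sum(counts.values()) for c, counts in self.word_counts.items()}
        return self

    def predict(self, doc):
        def log_posterior(c):
            score = self.log_priors[c]
            for w in doc.split():
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log(
                        (self.word_counts[c][w] + 1) / (self.totals[c] + len(self.vocab))
                    )
            return score
        return max(self.classes, key=log_posterior)

# Toy reviews; the actual project trains on a movie-review corpus.
train_docs = ["great fun loved it", "boring awful plot",
              "loved the acting", "awful waste of time"]
train_labels = ["pos", "neg", "pos", "neg"]
nb = NaiveBayes().fit(train_docs, train_labels)
```

Working in log space avoids numerical underflow when multiplying many small word probabilities.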

Visualization of Classifiers

  • We use t-SNE to visualize the decision boundaries of the NB, LR, and MLP classifiers and to understand how they separate the different topics across the articles.
  • Classifiers Visualization Folder
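A minimal sketch of the t-SNE projection step, assuming scikit-learn is available and substituting random data for the project's document features:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in features: in the project these would be the documents' feature
# vectors, and each classifier's predictions would supply the point colors.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0.0, 1.0, size=(30, 20)),
                      rng.normal(3.0, 1.0, size=(30, 20))])

# Project the 20-dimensional features down to 2-D for plotting.
emb = TSNE(n_components=2, perplexity=10.0, random_state=0).fit_transform(features)
print(emb.shape)  # one 2-D point per document
```

In practice the 2-D embedding `emb` would be scattered with matplotlib, colored by each classifier's predicted label, to inspect how well the classes separate.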

Emotion Detection with TensorFlow

We implement term frequency (TF) and term frequency-inverse document frequency (TF-IDF) from scratch and use them to train a TensorFlow model that classifies six different emotions in a corpus.
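A from-scratch version of these two statistics might look like the following (a sketch using the common `log(N / df)` form of IDF; the project's exact weighting and smoothing choices may differ):

```python
import math
from collections import Counter

def term_frequency(doc):
    """tf(w, d): relative frequency of term w in tokenized document d."""
    counts = Counter(doc)
    return {w: c / len(doc) for w, c in counts.items()}

def inverse_document_frequency(corpus):
    """idf(w) = log(N / df(w)), where df(w) = number of documents containing w."""
    n = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log(n / d) for w, d in df.items()}

def tf_idf(doc, idf):
    """TF-IDF weight of every term in one document."""
    return {w: f * idf.get(w, 0.0) for w, f in term_frequency(doc).items()}

# Toy tokenized corpus: terms shared by all documents get zero weight.
corpus = [["i", "feel", "happy"], ["i", "feel", "sad"], ["i", "am", "tired"]]
idf = inverse_document_frequency(corpus)
weights = tf_idf(corpus[0], idf)
```

The resulting per-document weight vectors are the features fed to the TensorFlow emotion classifier.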

Word Embedding Models

We explore word embeddings through self-implemented Skip-gram and Continuous Bag of Words (CBOW) models, focusing on capturing semantic relationships between words.

Skip-gram and Continuous Bag of Words (CBOW)
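The two models differ only in how training pairs are formed: Skip-gram predicts each context word from the center word, while CBOW predicts the center word from its surrounding context. A sketch of the pair-generation step (function names are illustrative, not the repository's):

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram training pairs: (center word, one context word)."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """CBOW training pairs: (list of context words, center word)."""
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        if context:
            pairs.append((context, center))
    return pairs
```

Each pair then feeds a shallow network with a shared embedding matrix, typically trained with softmax or negative sampling.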

Visualization of Word Embeddings

  • We visualize word embeddings using PCA and cosine similarity to understand the semantic relationships captured by the Skip-gram and CBOW models.
  • Word Embeddings Visualization Folder
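Cosine similarity itself is straightforward to implement from scratch; below is a sketch, with a hypothetical `most_similar` helper for nearest-neighbor queries over an embedding dictionary:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, embeddings):
    """Rank all other words by cosine similarity to `word`."""
    target = embeddings[word]
    scored = [(w, cosine_similarity(target, vec))
              for w, vec in embeddings.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Tiny made-up 2-D embeddings, just to show the API.
embeddings = {"king": [1.0, 1.0], "queen": [0.9, 1.1], "apple": [-1.0, 0.2]}
```

Cosine similarity ignores vector magnitude, so frequently occurring words with large embeddings do not dominate the ranking.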

LLM

BERT

Transformer
