From LDA to BERTopic: Advancing Topic Coherence in Analyzing Podcast Transcripts

Overview

This project explores the application of topic modeling techniques on podcast transcripts as part of a research note for a Natural Language Processing class. The analysis compares Latent Dirichlet Allocation (LDA) and BERTopic models, focusing on their coherence scores and the effectiveness of transformer-based embeddings in improving topic analysis.

Motivation

Podcasts contain rich and diverse content, making them ideal for analyzing nuanced topics in large datasets. This project aims to:

Compare traditional (LDA) and modern (BERTopic) topic modeling techniques.
Evaluate the coherence of topics generated by both models.
Investigate how transformer-based embeddings enhance topic modeling performance.

Citations

This project uses the Structured Podcast Research Corpus (SPoRC) dataset. We gratefully acknowledge the following work, which provides a baseline for analyzing podcast transcripts with modern NLP methods:

@misc{litterer2024mappingpodcastecosystemstructured, title={Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus}, author={Benjamin Litterer and David Jurgens and Dallas Card}, year={2024}, eprint={2411.07892}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2411.07892}, }

For more details about the dataset, please refer to their arXiv paper.

Key Findings

LDA Coherence Score: 0.55 (Moderate)
BERTopic Coherence Score: 0.63 (Good)
BERTopic, leveraging transformer-based embeddings, outperforms LDA in semantic topic coherence, particularly in handling nuanced and varied content.
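
This README does not say which coherence measure produced the scores above. To show what a coherence score captures, here is a minimal sketch of the UMass measure, chosen only because it needs nothing beyond the standard library. The function name and toy corpus are made up for this example, and UMass values are on a different scale from the 0.55 and 0.63 reported above, so they are not comparable.

```python
import math
from itertools import combinations


def umass_coherence(top_words, tokenized_docs):
    """Average of log((D(w_i, w_j) + 1) / D(w_i)) over word pairs,
    where w_i ranks above w_j and D counts documents containing the words."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    # combinations() keeps list order, so w_hi is always the higher-ranked word
    for w_hi, w_lo in combinations(top_words, 2):
        d_hi = doc_freq(w_hi)
        if d_hi == 0:
            continue  # skip words that never occur in the corpus
        scores.append(math.log((doc_freq(w_hi, w_lo) + 1) / d_hi))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy corpus (illustrative only, not taken from SPoRC)
toy_docs = [
    ["election", "vote", "senate"],
    ["election", "vote", "campaign"],
    ["recipe", "oven", "bake"],
    ["recipe", "bake", "flour"],
]

coherent_score = umass_coherence(["election", "vote"], toy_docs)
mixed_score = umass_coherence(["election", "bake"], toy_docs)
print(coherent_score, mixed_score)
```

Words that tend to appear in the same documents score higher than words that never co-occur, which is the intuition behind comparing the two models by coherence.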

Research Question

"How do LDA and BERTopic compare in generating semantically meaningful topics from podcast transcripts?"

Features

Preprocessing of podcast transcripts, including stop word removal and lemmatization.
Implementation of LDA and BERTopic for topic modeling.
Evaluation of model performance using coherence scores.
Visualization of topic distributions and trends over time.
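
The preprocessing step listed above can be sketched roughly as follows. This is a simplified illustration, not the project's actual pipeline: the stop word set and the lemma lookup table are tiny hypothetical stand-ins for what NLTK or spaCy would provide, and the function name is made up for this example.

```python
import re

# Tiny illustrative stop word list; a real run would use NLTK's or spaCy's list
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "was", "were", "on", "about"}

# Hypothetical lookup table standing in for a real lemmatizer
LEMMAS = {
    "podcasts": "podcast",
    "hosts": "host",
    "talked": "talk",
    "talking": "talk",
    "episodes": "episode",
}


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and map tokens to their lemmas."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [LEMMAS.get(token, token) for token in tokens if token not in STOP_WORDS]


cleaned = preprocess("The hosts talked about podcasts.")
print(cleaned)
```

LDA is typically fed the cleaned tokens, since it relies on word counts. BERTopic works on sentence embeddings, so it is often given lightly cleaned text instead.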

Installation

Prerequisites

Ensure you have Python 3.8 or above and the following libraries installed:

pip install pandas numpy matplotlib seaborn nltk spacy scikit-learn bertopic sentence-transformers umap-learn

Clone the Repository

git clone https://github.com/yourusername/SPORC_TopicModelling.git
cd SPORC_TopicModelling

Usage

Preprocess the dataset using the provided scripts for cleaning and tokenization.
Run the topic_modeling.py script to generate and evaluate topics using LDA and BERTopic.
Visualize topic distributions and coherence scores using visualizations.py.

Example Code

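from bertopic import BERTopic
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Preprocess dataset
# preprocess_transcripts is a project helper; it is assumed here to return
# a pandas DataFrame with a "transcripts" column of cleaned text
data = preprocess_transcripts("path/to/dataset.csv")
docs = data["transcripts"].tolist()

# Apply BERTopic (returns one topic id per document, plus probabilities)
bertopic_model = BERTopic()
bertopic_topics, probs = bertopic_model.fit_transform(docs)

# Apply LDA on a bag-of-words matrix built from the same documents
vectorizer = CountVectorizer(stop_words="english")
bow_matrix = vectorizer.fit_transform(docs)
lda_model = LatentDirichletAllocation(n_components=10, random_state=42)
lda_topics = lda_model.fit_transform(bow_matrix)  # document-topic distribution

# Evaluate and visualize (plot_topic_coherence is a project helper)
plot_topic_coherence(bertopic_model, lda_model)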
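
To inspect or score LDA topics, the top words of each topic have to be read off the fitted model. In scikit-learn the topic-word weights are stored in lda_model.components_ and the vocabulary comes from the vectorizer's get_feature_names_out(). The sketch below shows the ranking step on plain lists so it runs without any dependencies; the function name, weights, and vocabulary are made up for illustration.

```python
def top_words_per_topic(topic_word_weights, vocabulary, n_top=3):
    """Return the n_top highest-weighted words for each topic."""
    tops = []
    for weights in topic_word_weights:
        # Indices of the vocabulary sorted by descending weight
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        tops.append([vocabulary[i] for i in ranked[:n_top]])
    return tops


# Illustrative stand-ins for vectorizer.get_feature_names_out() and lda_model.components_
vocabulary = ["bake", "election", "flour", "recipe", "senate", "vote"]
weights = [
    [0.1, 5.0, 0.2, 0.1, 3.0, 4.0],  # topic 0
    [4.0, 0.1, 2.5, 6.0, 0.2, 0.1],  # topic 1
]

top_words = top_words_per_topic(weights, vocabulary)
print(top_words)
```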

Results

BERTopic generated more semantically coherent topics than LDA, as reflected by higher coherence scores.
Transformer-based embeddings proved effective in handling diverse and nuanced podcast content.

Visualizations

Topic Distributions: Bar charts showcasing the most frequent topics.
UMAP Clustering: Dimensionality reduction for visualizing topic embeddings.
Topics Over Time: Trends in topic prevalence over specific periods (e.g., May-June 2020).
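
The idea behind the topics-over-time view is to group topic assignments by period and track each topic's share. Here is a minimal standard-library sketch under that assumption; the function name and sample assignments are invented for illustration and are not project results. BERTopic also provides its own topics_over_time helper, which the notebook may use instead.

```python
from collections import Counter, defaultdict
from datetime import date


def topic_share_by_month(assignments):
    """Map (year, month) to {topic_id: share of that month's documents}."""
    counts = defaultdict(Counter)
    for day, topic_id in assignments:
        counts[(day.year, day.month)][topic_id] += 1

    shares = {}
    for month, counter in sorted(counts.items()):
        total = sum(counter.values())
        shares[month] = {topic: n / total for topic, n in counter.items()}
    return shares


# Invented (episode date, topic id) pairs for illustration
sample = [
    (date(2020, 5, 3), 0),
    (date(2020, 5, 17), 1),
    (date(2020, 5, 30), 0),
    (date(2020, 5, 31), 0),
    (date(2020, 6, 2), 1),
    (date(2020, 6, 20), 1),
]

shares = topic_share_by_month(sample)
print(shares)
```

The resulting per-month shares can be passed to matplotlib or seaborn to draw one line per topic.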

Project Structure

|-- podcast/           # Create this folder and download the sample dataset into it
|-- SPoRC_Analysis/    # Jupyter notebooks for exploration
|-- README.md          # Project overview

Sample dataset: download it from the Hugging Face Hub at https://huggingface.co/datasets/blitt/SPoRC/tree/main

Future Work

Explore additional transformer-based models (e.g., RoBERTa, BERT).
Extend analysis to multilingual podcast datasets.
Incorporate bias analysis and mitigation in topic modeling.

Contributors

Nima Thing: LinkedIn | GitHub

License

This project is licensed under the MIT License - see the LICENSE file for details.
