This project demonstrates text classification with PySpark, covering data preprocessing, feature extraction, and model training with logistic regression. The notebook walks through the following steps:
- Initialization of SparkSession: Setting up a SparkSession to utilize PySpark.
- Data Loading: Reading the dataset into a Spark DataFrame.
- Data Preprocessing: Tokenization, removal of stop words, and feature extraction using TF-IDF.
- Model Training: Training a Logistic Regression model to classify the text data.
- Evaluation: Evaluating the model's performance using accuracy metrics.
The project uses the following tools:
- Python
- PySpark
- Seaborn
- Matplotlib
To run this project locally, you need to have Python and PySpark installed. Follow these steps:
- Clone the repository:
git clone https://github.com/username/repo-name.git
- Navigate to the project directory:
cd repo-name
- Install the required Python packages:
pip install -r requirements.txt
- Run the Jupyter Notebook:
jupyter notebook medium.ipynb
To use this notebook:
- Ensure you have a suitable dataset for text classification.
- Modify the notebook to load your dataset.
- Run through the cells to perform text classification.
- Review the model's performance metrics at the end.
- Text Preprocessing: Includes tokenization, stop-word removal, and TF-IDF vectorization.
- Classification: Logistic Regression model for text classification.
- Evaluation: Accuracy evaluation of the classification model.