This project demonstrates text classification with PySpark, covering data preprocessing, feature extraction, and model training with logistic regression. The notebook walks through the following steps:
- Initialization of SparkSession: Setting up a SparkSession to utilize PySpark.
- Data Loading: Reading the dataset into a Spark DataFrame.
- Data Preprocessing: Tokenization, removal of stop words, and feature extraction using TF-IDF.
- Model Training: Training a Logistic Regression model to classify the text data.
- Evaluation: Evaluating the model's performance using accuracy metrics.
The project uses the following tools:
- Python
- PySpark
- Seaborn
- Matplotlib
To run this project locally, you need to have Python and PySpark installed. Follow these steps:
- Clone the repository:
git clone https://github.com/username/repo-name.git
- Navigate to the project directory:
cd repo-name
- Install the required Python packages:
pip install -r requirements.txt
- Run the Jupyter Notebook:
jupyter notebook medium.ipynb
To use this notebook:
- Ensure you have a suitable dataset for text classification.
- Modify the notebook to load your dataset.
- Run through the cells to perform text classification.
- Review the model's performance metrics at the end.
- Text Preprocessing: Includes tokenization, stop-word removal, and TF-IDF vectorization.
- Classification: Logistic Regression model for text classification.
- Evaluation: Accuracy evaluation of the classification model.