- Follow 'docs/env_setup.md' to install dependencies
- Review 'src/utils/data_ingestion.py' & 'src/utils/data_cli.py'
- Run 'pytest' to ensure initial tests pass
pip install pre-commit
pre-commit install
The data ingestion module (src/utils/data_ingestion.py) provides functionality to:
- Read text data from multiple file formats (JSON, CSV, and plain text)
- Clean and preprocess text (remove URLs, mentions, hashtags, and special characters)
- Tokenize text into words for further analysis (a rough sketch of the cleaning and tokenization steps follows this list)
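For orientation, the cleaning and tokenization steps roughly follow the pattern below. This is a minimal standalone sketch; the actual function names and regular expressions in src/utils/data_ingestion.py may differ.
import re

def clean_text(text):
    # Strip URLs, @mentions, and #hashtags, then drop remaining special characters
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def tokenize(text):
    # Simple whitespace tokenization of the cleaned text
    return text.split()

print(tokenize(clean_text("Check out https://example.com #NLP @alice!")))
# -> ['check', 'out']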
The data ingestion module can be tested using the command-line interface:
# Process a text file and output to stdout (limited to 10 documents by default)
python3 src/utils/data_cli.py -i data/sample/sample.txt
# Process a CSV file and save to output file
python3 src/utils/data_cli.py -i data/sample/sample.csv -o data/processed/output.json
# Process a JSON file with verbose output and custom document limit
python3 src/utils/data_cli.py -i data/sample/sample.json -o data/processed/output.json -v -l 100
The CLI supports the following input formats (sample files are sketched after this list):
- JSON: Expects objects with "id" and "text" fields (or similar)
- CSV: Expects CSV with header row and a "text" column
- TXT: One document per line, generates IDs automatically
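As an illustration, inputs matching these expectations might look like the files created below. The file names and exact field layout (e.g. a JSON array of objects versus one object per line) are assumptions, not the shipped samples.
import csv, json

# JSON: objects with "id" and "text" fields
with open("sample.json", "w") as f:
    json.dump([{"id": "1", "text": "First document"},
               {"id": "2", "text": "Second document"}], f)

# CSV: header row containing a "text" column
with open("sample.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text"])
    writer.writeheader()
    writer.writerow({"id": "1", "text": "First document"})

# TXT: one document per line; IDs are generated automatically at ingestion time
with open("sample.txt", "w") as f:
    f.write("First document\nSecond document\n")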
The output is a JSON Lines file where each line contains a preprocessed document with the following fields (a short snippet for inspecting the output follows the list):
- id: Document ID (original or auto-generated)
- text: Original document text
- clean_text: Cleaned text with special characters, mentions, hashtags, and URLs removed
- tokens: List of tokenized words from the cleaned text
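Because the output is JSON Lines, it can be inspected with a few lines of Python (the path below is just an example):
import json

with open("data/processed/output.json") as f:
    for line in f:
        doc = json.loads(line)  # one preprocessed document per line
        print(doc["id"], doc["clean_text"], len(doc["tokens"]))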
The data preprocessor module (src/utils/data_preprocessor.py) prepares data for MapReduce jobs:
- Formats documents for word count and sentiment analysis jobs
- Splits data into multiple parts for parallel processing (see the sketch after this list)
- Stores intermediate data in the format expected by MapReduce jobs
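The splitting step is conceptually simple. The sketch below partitions documents into N parts and writes each as JSON Lines for a worker to consume; the function and file names are illustrative, not the module's actual API.
import json

def split_into_parts(documents, num_parts):
    # Distribute documents round-robin across num_parts buckets
    parts = [[] for _ in range(num_parts)]
    for i, doc in enumerate(documents):
        parts[i % num_parts].append(doc)
    return parts

docs = [{"id": str(i), "text": f"document {i}"} for i in range(10)]
for n, part in enumerate(split_into_parts(docs, 4)):
    # Each part is written as JSON Lines so a MapReduce worker can read it line by line
    with open(f"part_{n}.jsonl", "w") as f:
        f.writelines(json.dumps(d) + "\n" for d in part)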
You can test the data preprocessor module using the test script:
# Test the preprocessor with a text file
python3 src/utils/test_preprocessor.py -i data/sample/sample.txt -o data/processed
# Custom number of parts and document limit
python3 src/utils/test_preprocessor.py -i data/sample/sample.csv -o data/processed -p 4 -l 20
The project uses a centralized logging system (src/utils/logging_config.py) that provides:
- Consistent logging across all modules
- Log output to both console and file
- Different log levels (INFO, DEBUG, WARNING, ERROR)
To use the logging system in a module:
from utils.logging_config import get_logger
# Create a logger for the module
logger = get_logger(__name__)
# Use the logger
logger.info("Processing started")
logger.debug("Detailed information")
logger.warning("Warning message")
logger.error("Error message")To enable verbose (DEBUG) output, use the -v flag with the CLI tools.
To run the sentiment MapReduce job:
python run_pipeline.py