- Follow 'docs/env_setup.md' to install dependencies
- Review 'src/utils/data_ingestion.py' & 'src/utils/data_cli.py'
- Run 'pytest' to ensure initial tests pass
pip install pre-commit
pre-commit install
The data ingestion module (src/utils/data_ingestion.py) provides functionality to:
- Read text data from multiple file formats (JSON, CSV, and plain text)
- Clean and preprocess text (remove URLs, mentions, hashtags, and special characters)
- Tokenize text into words for further analysis (a rough sketch of the cleaning and tokenization steps follows this list)
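For orientation, the cleaning and tokenization steps roughly follow the pattern below. This is a minimal standalone sketch; the actual function names and regular expressions in src/utils/data_ingestion.py may differ.
import re

def clean_text(text):
    # Strip URLs, @mentions, and #hashtags, then drop remaining special characters
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def tokenize(text):
    # Simple whitespace tokenization of the cleaned text
    return text.split()

print(tokenize(clean_text("Check out https://example.com #NLP @alice!")))
# -> ['check', 'out']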
The data ingestion module can be tested using the command-line interface:
# Process a text file and output to stdout (limited to 10 documents by default)
python3 src/utils/data_cli.py -i data/sample/sample.txt
# Process a CSV file and save to output file
python3 src/utils/data_cli.py -i data/sample/sample.csv -o data/processed/output.json
# Process a JSON file with verbose output and custom document limit
python3 src/utils/data_cli.py -i data/sample/sample.json -o data/processed/output.json -v -l 100
The CLI supports the following input formats (sample files are sketched after this list):
- JSON: Expects objects with "id" and "text" fields (or similar)
- CSV: Expects CSV with header row and a "text" column
- TXT: One document per line, generates IDs automatically
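As an illustration, inputs matching these expectations might look like the files created below. The file names and exact field layout (e.g. a JSON array of objects versus one object per line) are assumptions, not the shipped samples.
import csv, json

# JSON: objects with "id" and "text" fields
with open("sample.json", "w") as f:
    json.dump([{"id": "1", "text": "First document"},
               {"id": "2", "text": "Second document"}], f)

# CSV: header row containing a "text" column
with open("sample.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text"])
    writer.writeheader()
    writer.writerow({"id": "1", "text": "First document"})

# TXT: one document per line; IDs are generated automatically at ingestion time
with open("sample.txt", "w") as f:
    f.write("First document\nSecond document\n")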
The output is a JSON Lines file where each line contains a preprocessed document with the following fields (a short snippet for inspecting the output follows the list):
- id: Document ID (original or auto-generated)
- text: Original document text
- clean_text: Cleaned text with special characters, mentions, hashtags, and URLs removed
- tokens: List of tokenized words from the cleaned text
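Because the output is JSON Lines, it can be inspected with a few lines of Python (the path below is just an example):
import json

with open("data/processed/output.json") as f:
    for line in f:
        doc = json.loads(line)  # one preprocessed document per line
        print(doc["id"], doc["clean_text"], len(doc["tokens"]))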
The data preprocessor module (src/utils/data_preprocessor.py) prepares data for MapReduce jobs:
- Formats documents for word count and sentiment analysis jobs
- Splits data into multiple parts for parallel processing (see the sketch after this list)
- Stores intermediate data in the format expected by MapReduce jobs
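The splitting step is conceptually simple. The sketch below partitions documents into N parts and writes each as JSON Lines for a worker to consume; the function and file names are illustrative, not the module's actual API.
import json

def split_into_parts(documents, num_parts):
    # Distribute documents round-robin across num_parts buckets
    parts = [[] for _ in range(num_parts)]
    for i, doc in enumerate(documents):
        parts[i % num_parts].append(doc)
    return parts

docs = [{"id": str(i), "text": f"document {i}"} for i in range(10)]
for n, part in enumerate(split_into_parts(docs, 4)):
    # Each part is written as JSON Lines so a MapReduce worker can read it line by line
    with open(f"part_{n}.jsonl", "w") as f:
        f.writelines(json.dumps(d) + "\n" for d in part)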
You can test the data preprocessor module using the test script:
# Test the preprocessor with a text file
python3 src/utils/test_preprocessor.py -i data/sample/sample.txt -o data/processed
# Custom number of parts and document limit
python3 src/utils/test_preprocessor.py -i data/sample/sample.csv -o data/processed -p 4 -l 20
The project uses a centralized logging system (src/utils/logging_config.py) that provides:
- Consistent logging across all modules
- Log output to both console and file
- Different log levels (INFO, DEBUG, WARNING, ERROR)
To use the logging system in a module:
from utils.logging_config import get_logger
# Create a logger for the module
logger = get_logger(__name__)
# Use the logger
logger.info("Processing started")
logger.debug("Detailed information")
logger.warning("Warning message")
logger.error("Error message")To enable verbose (DEBUG) output, use the -v flag with the CLI tools.
To run the sentiment MapReduce job:
python run_pipeline.py