A Python-based CSV data analysis pipeline that parses, validates, cleans, and analyzes tabular data. It generates a formatted, human-readable text report of key statistics such as averages and top-N values.
- Practice working with real-world CSV data
- Build a full data pipeline using pure Python
- Understand data validation, cleaning, and analysis
- Generate automated analytical reports
- Source: Student Social Media & Relationships dataset
- Format: CSV
- Each row represents one student's anonymized survey response
- Load CSV data into Python dictionaries
- Validate structure and detect missing values
- Clean and convert data types
- Perform statistical analysis
- Generate a formatted text report
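
The first two steps (loading and validation) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sample data, the column names (`student_id`, `country`, `daily_usage_hours`), and the helper names `load_rows` and `find_problems` are all assumptions made for the example.

```python
import csv
import io

# Hypothetical sample data; the column names are assumptions and may not
# match the real dataset's schema.
SAMPLE_CSV = """student_id,country,daily_usage_hours
1,India,4.5
2,Canada,
3,India,3.0
"""

REQUIRED_COLUMNS = ("student_id", "country", "daily_usage_hours")


def load_rows(file_obj):
    """Read CSV text into a list of dictionaries, one per data row."""
    return list(csv.DictReader(file_obj))


def find_problems(rows, required=REQUIRED_COLUMNS):
    """Return (row_number, column) pairs for absent or blank required values."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for column in required:
            value = row.get(column)
            # DictReader fills short rows with None; blank cells arrive as "".
            if value is None or value.strip() == "":
                problems.append((row_number, column))
    return problems


rows = load_rows(io.StringIO(SAMPLE_CSV))
problems = find_problems(rows)
print(problems)
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by something like `open(path, newline="", encoding="utf-8")`, which is the form the `csv` module documentation recommends.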
- CSV parsing using the Python standard library
- Data validation and missing-value detection
- Safe type conversion with error handling
- Statistical analysis (averages, top-N values)
- Grouping by categorical fields (country)
- Automated report generation
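
As a rough sketch of how safe conversion, grouping, top-N ranking, and report formatting could fit together using only the standard library: the function names (`to_float`, `average_by_group`, `top_n`, `format_report`), the `daily_usage_hours` column, and the sample rows below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def to_float(value, default=None):
    """Convert text to float; return `default` if it is missing or malformed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def average_by_group(rows, group_key, value_key):
    """Average `value_key` per `group_key`, skipping values that fail to convert."""
    grouped = defaultdict(list)
    for row in rows:
        number = to_float(row.get(value_key))
        if number is not None:
            grouped[row[group_key]].append(number)
    return {group: sum(vals) / len(vals) for group, vals in grouped.items()}


def top_n(averages, n=3):
    """Return the n (group, average) pairs with the highest averages."""
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]


def format_report(title, ranked):
    """Render a ranked list as a small plain-text report."""
    lines = [title, "-" * len(title)]
    for rank, (name, value) in enumerate(ranked, start=1):
        lines.append(f"{rank}. {name}: {value:.2f}")
    return "\n".join(lines)


# Hypothetical rows, shaped like the output of csv.DictReader.
rows = [
    {"country": "India", "daily_usage_hours": "4.5"},
    {"country": "India", "daily_usage_hours": "3.5"},
    {"country": "Canada", "daily_usage_hours": "n/a"},
    {"country": "Canada", "daily_usage_hours": "2.0"},
    {"country": "Brazil", "daily_usage_hours": "5.0"},
]
averages = average_by_group(rows, "country", "daily_usage_hours")
report = format_report("Average daily usage by country", top_n(averages, n=2))
print(report)
```

Returning a default from `to_float` instead of raising keeps one malformed cell (such as `"n/a"` above) from aborting the whole run; the bad value is simply left out of its group's average.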
- Python 3.10+
- csv (standard library)
- collections (defaultdict)
- Clone the repository
- Ensure Python 3.10+ is installed
- Run:
python main.py
I built this as a learning mini-project after the first month of my machine learning self-study plan. It went through several iterations, during which I corrected logical and analytical errors.
- Refactor into multiple modules
- Add pandas-based implementation
- Add visualizations
- Extend to machine learning tasks