This repository contains a collection of Python scripts that explore fundamental concepts in statistics using real-world datasets. The exercises cover percentile-based filtering, Z-score and modified Z-score outlier detection, logarithmic plotting, and cosine similarity, with visualizations built in Seaborn and Matplotlib.

| File Name | Description |
|---|---|
| 1_percentile.py | Calculates percentiles and removes outliers based on the 90th percentile of household size. |
| 2_mean_absolute_deviation_standard_deviation_z_value.py | Performs outlier detection using standard deviation and Z-scores on BMI data. |
| 3_log.py | Visualizes highway population data and introduces logarithmic plotting. |
| 4_Normal.py | Plots income vs. credit limit with log-scaled axes using Seaborn. |
| 5_cosine.py | Demonstrates cosine similarity and cosine distance for basic NLP-like document vectors. |
| 6_modified_z_test.py | Implements both standard Z-score and Modified Z-score methods for income-based outlier detection. |
| modified_z_score.xlsx | Example Excel sheet supporting the modified Z-score implementation. |
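
The three outlier techniques named in the table can be compared side by side in a few lines. The following is a minimal sketch, not the code from the scripts themselves: the `income` values are toy numbers invented for illustration, the cutoffs of 3 and 3.5 are common conventions assumed here, and the population standard deviation (`ddof=0`) is likewise an assumption.

```python
import numpy as np

# Toy values invented for illustration; not taken from the repository's CSV files.
income = np.array([42, 45, 47, 50, 52, 55, 58, 60, 63, 400], dtype=float)

# Percentile filter: keep values at or below the 90th percentile.
p90 = np.percentile(income, 90)
kept = income[income <= p90]

# Standard Z-score: distance from the mean in units of standard deviation.
# np.std defaults to the population standard deviation (ddof=0).
z = (income - income.mean()) / income.std()

# Modified Z-score: built on the median and the median absolute deviation (MAD),
# which a single extreme value cannot inflate the way it inflates the std.
median = np.median(income)
mad = np.median(np.abs(income - median))
modified_z = 0.6745 * (income - median) / mad

# Assumed conventional cutoffs: |z| > 3 and |modified z| > 3.5.
z_outliers = income[np.abs(z) > 3]
mz_outliers = income[np.abs(modified_z) > 3.5]

print("90th percentile:", p90)
print("Z-score outliers:", z_outliers)
print("Modified Z-score outliers:", mz_outliers)
```

On this toy sample the percentile filter drops the value 400, and the modified Z-score flags it, while the standard Z-score flags nothing: the extreme value inflates the standard deviation enough to keep its own score under 3. That masking effect is the usual motivation for the median-based variant.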

Requirements:

- Python 3.x
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn

Topics covered:

- Descriptive Statistics
- Percentile Analysis
- Outlier Detection (Z-score, Modified Z-score)
- Data Cleaning & Preprocessing
- Cosine Similarity & Distance
- Data Visualization
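
As an illustration of the cosine similarity and distance topic, here is a minimal sketch using NumPy only. The function names and the three term-count vectors are hypothetical and chosen for this example; the script in the repository may compute these differently, for instance with scikit-learn.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    """Cosine distance, defined here as 1 minus cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical term-count vectors for three short documents.
doc_a = [3, 1, 0]
doc_b = [6, 2, 0]  # same word proportions as doc_a, twice the length
doc_c = [0, 0, 4]  # shares no terms with doc_a

print(cosine_similarity(doc_a, doc_b))
print(cosine_distance(doc_a, doc_c))
```

Because cosine similarity depends only on direction, `doc_a` and `doc_b` score as essentially identical despite their different lengths, while `doc_a` and `doc_c` share no terms and sit at the maximum distance for non-negative vectors.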

How to run:

- Clone the repository: `git clone https://github.com/your-username/statistics_workout.git`, then `cd statistics_workout`
- Install required libraries: `pip install pandas numpy matplotlib seaborn scikit-learn`
- Run any script using: `python filename.py`

⚠️ Ensure that the necessary CSV files are placed in the correct paths or update the paths in the scripts accordingly.
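
One way to make the CSV paths less fragile is to resolve them relative to the script's own folder instead of the current working directory. The helper below is a hypothetical sketch: `resolve_data_path` and the file name `data.csv` are illustrative names, not part of the repository.

```python
from pathlib import Path

def resolve_data_path(filename, base_dir=None):
    """Return the path to a data file, failing early with a clear message.

    If base_dir is omitted, the folder containing this script is used.
    """
    base = Path(base_dir) if base_dir is not None else Path(__file__).resolve().parent
    path = base / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected data file at {path}")
    return path

# Example usage (hypothetical file name):
# csv_path = resolve_data_path("data.csv")
# df = pandas.read_csv(csv_path)
```

Failing with an explicit path in the error message makes it easier to see where a script is looking when a CSV is missing.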
If you have questions or suggestions, feel free to reach out via GitHub Issues.