A machine learning solution for predicting flood probability based on environmental and geographical factors. This project implements multiple regression models and compares their performance on a large-scale dataset.
The project analyzes 20+ environmental features to predict flood probability, including monsoon intensity, deforestation, infrastructure quality, and climate change indicators. The dataset contains over 1.1 million training samples from the Kaggle Playground Series competition.
Three regression models are implemented and evaluated:
- Linear Regression: R² = 0.828
- Gradient Boosting Regressor: R² = 0.619
- LARS: R² = 0.001
Linear Regression performs best of the three, with the highest R² and a mean absolute error (MAE) of 0.329.
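A comparison like the one above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper name `compare_models`, the 80/20 split, the default hyperparameters, and the synthetic demo data are all assumptions, so the printed scores will not match the figures reported here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lars, LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def compare_models(X, y, seed=0):
    """Fit each model on a train split and score it on a held-out split."""
    # Assumption: a simple 80/20 hold-out split; the notebook may differ.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    models = {
        "Linear Regression": LinearRegression(),
        "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=seed),
        "LARS": Lars(),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_val)
        scores[name] = {
            "r2": r2_score(y_val, pred),
            "mae": mean_absolute_error(y_val, pred),
        }
    return scores


# Synthetic stand-in data (not the Kaggle dataset): a noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1, 0.3, 0.0]) + rng.normal(scale=0.1, size=500)

demo_scores = compare_models(X_demo, y_demo)
for name, s in demo_scores.items():
    print(f"{name}: R2={s['r2']:.3f}, MAE={s['mae']:.3f}")
```

Scoring every model on the same held-out split keeps the R² and MAE values directly comparable across models.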
Features include environmental and geographical factors:
- Monsoon Intensity, Topography, River Management
- Deforestation, Urbanization, Climate Change
- Infrastructure and population metrics
- Historical disaster preparedness data
Dataset source: Kaggle Playground Series S4E5
flood_prediction.ipynb # Model training and evaluation
data/flood/
├── train.csv # Training dataset (1,117,957 samples)
├── test.csv # Test dataset (745,305 samples)
└── sample_submission.csv # Submission format
- Python 3.11+
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- Place dataset files in the data/flood/ directory
- Open and run flood_prediction.ipynb in Jupyter or VS Code
- The notebook performs EDA, feature engineering, and model training
- Generated predictions can be formatted for competition submission
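One possible way to format predictions for submission is sketched below. The column names `id` and `FloodProbability` are assumptions about the layout of sample_submission.csv and should be checked against that file; the helper `build_submission` and the demo values are hypothetical.

```python
import numpy as np
import pandas as pd


def build_submission(ids, predictions, target_col="FloodProbability"):
    """Pair each test id with its prediction in a two-column frame."""
    ids = np.asarray(ids)
    predictions = np.asarray(predictions, dtype=float)
    if ids.shape != predictions.shape:
        raise ValueError("ids and predictions must have the same length")
    # Assumption: the id column is named "id", as in sample_submission.csv.
    return pd.DataFrame({"id": ids, target_col: predictions})


# Made-up ids and predictions, for illustration only.
submission = build_submission([0, 1, 2], [0.48, 0.51, 0.55])
print(submission)

# To write the file without the pandas index column:
# submission.to_csv("submission.csv", index=False)
```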
- Outlier detection using IQR method
- Feature standardization using StandardScaler
- No missing values detected in the dataset
- 845,886 samples retained after outlier removal
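The preprocessing steps above can be sketched roughly as follows. This assumes the common 1.5 × IQR rule applied per feature, with a row dropped if any selected column falls outside its bounds; the notebook's exact rule, the helper name `remove_iqr_outliers`, and the column names and values in the demo frame are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def remove_iqr_outliers(df, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] in every column."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # The comparison aligns the per-column bounds with the frame's columns.
    in_range = ((df[columns] >= lower) & (df[columns] <= upper)).all(axis=1)
    return df[in_range]


# Tiny made-up frame; the last row has an extreme MonsoonIntensity value.
demo = pd.DataFrame({
    "MonsoonIntensity": [4, 5, 5, 6, 5, 4, 6, 5, 30],
    "Deforestation": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})
feature_cols = ["MonsoonIntensity", "Deforestation"]

clean = remove_iqr_outliers(demo, feature_cols)

# Standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(clean[feature_cols])

print(f"{len(clean)} of {len(demo)} rows retained")
```

In practice the scaler is usually fit on the training split only and then reused to transform validation and test data, so that no information leaks from held-out rows.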