This project analyzes agricultural production data in India using statistical methods and modeling techniques to identify key environmental and soil-related factors influencing crop yield.
Agricultural productivity depends on complex interactions between climate conditions, soil properties, and farming practices.
This project explores crop production data across Indian states to:
- Identify patterns in agricultural output
- Analyze the impact of rainfall, temperature, and soil nutrients
- Build interpretable statistical models for yield prediction
The dataset contains:
- ~100,000 observations
- Soil nutrients, climate variables, and production metrics
- Removed redundant variables
- Removed outliers (IQR method)
- Engineered features (Year_Index, regions, crop categories)
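
The IQR-based outlier step listed above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the `iqr_mask` helper, the conventional 1.5 multiplier, and the sample values are all assumptions made for the example.

```python
import numpy as np

def iqr_mask(values, k=1.5):
    """Return a boolean mask that is True for values inside the IQR fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values >= lower) & (values <= upper)

# Hypothetical production values; the last one is an obvious outlier.
production = np.array([120.0, 135.0, 128.0, 140.0, 131.0, 125.0, 5000.0])
mask = iqr_mask(production)
cleaned = production[mask]  # rows outside the fences are dropped
```

In a tabular workflow the same mask would typically be computed per numeric column and combined before filtering rows.
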
- West region has the highest production
- Strong regional variation in agricultural output
- Rainfall varies significantly across regions
- Temperature differences influence productivity
- Higher rainfall is associated with lower production
- Temperature trends correlate with yield changes
- Food crops dominate total production
- Spices show the lowest production levels
- Strong correlation: Area ↔ Production
- Weak correlations between climate and nutrients
- 5 components explain ~77% of variance
- Soil nutrients negatively related to pH
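
As a rough sketch of how a figure such as "5 components explain ~77% of variance" can be obtained, the example below computes explained-variance ratios with a plain NumPy PCA. The data is synthetic and the `explained_variance_ratio` helper is hypothetical; it does not reproduce the project's numbers.

```python
import numpy as np

def explained_variance_ratio(X):
    """Return the share of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    # Standardize so variables on different scales contribute equally.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Singular values of the standardized matrix give the component variances.
    singular_values = np.linalg.svd(Z, compute_uv=False)
    variances = singular_values**2 / (Z.shape[0] - 1)
    return variances / variances.sum()

rng = np.random.default_rng(0)
# Synthetic stand-in for soil and climate columns (not the project's data).
X = rng.normal(size=(200, 8))
X[:, 1] += 0.8 * X[:, 0]  # introduce some correlation between two columns

ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)
# Smallest number of components whose cumulative share reaches the threshold.
n_components = int(np.searchsorted(cumulative, 0.77) + 1)
```
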
- Generalized Linear Model (GLM)
- Model selection using AIC
- Best model includes interaction terms
- RMSE ≈ 7285.52
- The model captures general trends, but its predictions still show considerable variance
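
The modeling steps above (GLM fit, AIC-based selection, interaction terms, RMSE) can be illustrated with a small sketch. It assumes a Gaussian family with identity link, which reduces to least squares; the source does not state which family was used. The variable names, the `fit_gaussian_glm` helper, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def fit_gaussian_glm(X, y):
    """Fit a Gaussian GLM with identity link (equivalent to least squares).

    Returns the coefficients, the AIC, and the RMSE on the training data.
    """
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    rss = float(residuals @ residuals)
    # Gaussian log-likelihood at the maximum-likelihood variance rss / n.
    log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    aic = 2 * (k + 1) - 2 * log_lik  # +1 counts the variance parameter
    rmse = float(np.sqrt(rss / n))
    return beta, aic, rmse

rng = np.random.default_rng(42)
n = 500
rainfall = rng.normal(size=n)
temperature = rng.normal(size=n)
# Synthetic response with a true rainfall x temperature interaction.
y = (2.0 + 1.5 * rainfall - 1.0 * temperature
     + 0.8 * rainfall * temperature
     + rng.normal(scale=0.5, size=n))

intercept = np.ones(n)
X_main = np.column_stack([intercept, rainfall, temperature])
X_inter = np.column_stack([X_main, rainfall * temperature])

# Fit each candidate and keep the one with the lowest AIC.
candidates = {"main effects": X_main, "with interaction": X_inter}
results = {name: fit_gaussian_glm(X, y) for name, X in candidates.items()}
best = min(results, key=lambda name: results[name][1])
```

Lower AIC indicates a better trade-off between fit and complexity, so the interaction model is preferred only when the extra term improves the likelihood enough to offset its penalty. RMSE is reported in the units of the response, which is why it should be read alongside the scale of the production values.
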
- Climate variability strongly impacts crop production
- Excess rainfall negatively affects yield
- Soil nutrients interact with pH
- Area is the strongest production driver
- Implicit time dimension (no explicit years)
- High redundancy in dataset
- Crop-specific yield differences not normalized
- Apply ML models (XGBoost, Random Forest)
- Improve feature engineering
- Integrate external climate datasets
The model implementation is available in the notebooks/Crop_Production.ipynb notebook.
Irem Akcan








