brittLiban/foodApp
Folders and files
| Name | Name | Last commit date | ||
|---|---|---|---|---|
Repository files navigation
### How to View This Project
- You can view the notebook directly on GitHub or by downloading it and running it locally in a Jupyter environment.
- For a quick overview, scroll down to the README or open the notebook in GitHub's preview mode.
## Nutrition Analysis and Calorie Prediction Project
This project focuses on exploring, cleaning, and analyzing nutritional data, with the ultimate goal of predicting calorie content using machine learning. The dataset includes a variety of foods and their nutritional information, making it ideal for practicing data preprocessing, exploratory data analysis (EDA), and building a regression model.
---
### Step-by-Step Project Overview
#### Step 1: Loading and Understanding the Data
- **Objective**: Load the nutrition dataset and understand its structure.
- **Actions**:
- Imported the dataset and inspected its structure, including column names and data types.
- Examined sample rows to identify any necessary cleaning tasks, such as mixed units (e.g., "mg," "g") and categorical data like food names.
#### Step 2: Data Cleaning and Preprocessing
- **Objective**: Prepare the data for machine learning by converting all values to numeric where possible and handling missing data.
- **Why This Was Important**:
- Real-world data often contains non-numeric values and missing entries, which must be addressed for accurate analysis and modeling.
- **Steps Taken**:
1. **Removing Units from Numeric Columns**:
- Columns included units like "mg" and "g" alongside values, making them non-numeric.
- Used regular expressions to separate the units and convert values to numeric types, ensuring compatibility with analysis and modeling.
2. **Handling Missing Values**:
- Missing values can impact the analysis and model performance. Filled in missing values in numeric columns with the column mean to maintain consistency.
3. **Label Encoding for Categorical Data**:
- The "name" column (food names) is categorical. To make it usable for machine learning, each food name was assigned a unique numeric label using label encoding.
#### Step 3: Exploratory Data Analysis (EDA)
- **Objective**: Gain insights into the data to guide modeling decisions.
- **Techniques Used**:
1. **Descriptive Statistics**:
- Used `.describe()` to summarize each numeric column, providing insights into central tendencies and variability (mean, min, max, and quartiles).
2. **Distribution Analysis**:
- Plotted histograms and box plots for key nutrients (e.g., calories, fat, sugar) to understand their distribution and spot any outliers.
3. **Correlation Matrix**:
- Generated a correlation heatmap to visualize relationships between nutrients, helping identify any strongly correlated features that could impact modeling.
#### Step 4: Splitting the Data for Model Training and Testing
- **Objective**: Ensure the model's ability to generalize by testing it on unseen data.
- **Actions**:
- Split the data into an 80% training set and a 20% testing set.
- This split allows us to train the model on one portion of the data while evaluating it on another, helping assess model accuracy and robustness.
#### Step 5: Building and Training a Machine Learning Model
- **Objective**: Predict the calorie content based on other nutritional features.
- **Model Selection**:
- We chose a **Linear Regression** model since we’re predicting a continuous variable (calories).
- Linear Regression is suitable for modeling relationships between numeric variables.
- **Training**:
- Trained the model on the training data (features and target calories) to capture patterns between nutrients and calorie values.
#### Step 6: Evaluating the Model
- **Objective**: Assess model performance to ensure reliable predictions.
- **Evaluation Metrics**:
1. **Mean Absolute Error (MAE)**:
- MAE calculates the average absolute difference between predicted and actual values, giving a sense of prediction accuracy.
2. **R-squared (R²)**:
- R² measures how well the model explains the variance in calorie values. An R² close to 1 indicates a strong fit, meaning the model captures most patterns in the data.
- **Results**:
- Achieved a low MAE and a high R² score, indicating that the model accurately predicts calorie values and captures the dataset’s patterns effectively.
---
### Final Project Summary and Next Steps
**Key Takeaways**:
- **Data Cleaning**: Converted columns to numeric values, handled missing data, and applied label encoding to categorical data.
- **EDA**: Gained insights into data distributions, correlations, and outliers through statistical summaries and visualizations.
- **Modeling**: Trained a Linear Regression model to predict calorie values based on other nutritional features.
- **Evaluation**: Validated model performance with MAE and R² metrics, confirming that the model is reliable and accurate.
**Potential Next Steps**:
1. **Cross-Validation**: Use cross-validation to check for overfitting and ensure that the model is robust.
2. **Experiment with Different Models**: Try alternative regression models (e.g., Random Forest, Decision Trees) to see if they provide better accuracy.
3. **Feature Engineering**: Create new features or combinations of existing features to improve model performance.
---
### Additional Information and Environment Setup
- **Environment Setup**:
- Python 3.10 and a virtual environment were used for package management.
- Installed essential libraries including `pandas` (data manipulation), `scikit-learn` (machine learning), and `matplotlib` (visualization).
- **Version Control**:
- Used Git to track project changes and document the workflow, ensuring reproducibility and collaboration readiness.
---
This project showcases my ability to clean and preprocess data, perform EDA, and build and evaluate a machine learning model. It reflects a practical application of my skills in data science, making it a valuable addition to my portfolio.
---