GitHub - brittLiban/foodApp

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
.gitignore		.gitignore
README		README
nutrition.csv		nutrition.csv
nutrition_analysis_and_prediction.ipynb		nutrition_analysis_and_prediction.ipynb
nutrition_cleaned.csv		nutrition_cleaned.csv

Repository files navigation

### How to View This Project
- You can view the notebook directly on GitHub or by downloading it and running it locally in a Jupyter environment.
- For a quick overview, scroll down to the README or open the notebook in GitHub's preview mode.

## Nutrition Analysis and Calorie Prediction Project

This project focuses on exploring, cleaning, and analyzing nutritional data, with the ultimate goal of predicting calorie content using machine learning. The dataset includes a variety of foods and their nutritional information, making it ideal for practicing data preprocessing, exploratory data analysis (EDA), and building a regression model.

---

### Step-by-Step Project Overview

#### Step 1: Loading and Understanding the Data
- **Objective**: Load the nutrition dataset and understand its structure.
- **Actions**:
  - Imported the dataset and inspected its structure, including column names and data types.
  - Examined sample rows to identify any necessary cleaning tasks, such as mixed units (e.g., "mg," "g") and categorical data like food names.

#### Step 2: Data Cleaning and Preprocessing
- **Objective**: Prepare the data for machine learning by converting all values to numeric where possible and handling missing data.
- **Why This Was Important**:
  - Real-world data often contains non-numeric values and missing entries, which must be addressed for accurate analysis and modeling.
- **Steps Taken**:
  1. **Removing Units from Numeric Columns**:
     - Columns included units like "mg" and "g" alongside values, making them non-numeric.
     - Used regular expressions to separate the units and convert values to numeric types, ensuring compatibility with analysis and modeling.
  2. **Handling Missing Values**:
     - Missing values can impact the analysis and model performance. Filled in missing values in numeric columns with the column mean to maintain consistency.
  3. **Label Encoding for Categorical Data**:
     - The "name" column (food names) is categorical. To make it usable for machine learning, each food name was assigned a unique numeric label using label encoding.

#### Step 3: Exploratory Data Analysis (EDA)
- **Objective**: Gain insights into the data to guide modeling decisions.
- **Techniques Used**:
  1. **Descriptive Statistics**:
     - Used `.describe()` to summarize each numeric column, providing insights into central tendencies and variability (mean, min, max, and quartiles).
  2. **Distribution Analysis**:
     - Plotted histograms and box plots for key nutrients (e.g., calories, fat, sugar) to understand their distribution and spot any outliers.
  3. **Correlation Matrix**:
     - Generated a correlation heatmap to visualize relationships between nutrients, helping identify any strongly correlated features that could impact modeling.

#### Step 4: Splitting the Data for Model Training and Testing
- **Objective**: Ensure the model's ability to generalize by testing it on unseen data.
- **Actions**:
  - Split the data into an 80% training set and a 20% testing set.
  - This split allows us to train the model on one portion of the data while evaluating it on another, helping assess model accuracy and robustness.

#### Step 5: Building and Training a Machine Learning Model
- **Objective**: Predict the calorie content based on other nutritional features.
- **Model Selection**:
  - We chose a **Linear Regression** model since we’re predicting a continuous variable (calories).
  - Linear Regression is suitable for modeling relationships between numeric variables.
- **Training**:
  - Trained the model on the training data (features and target calories) to capture patterns between nutrients and calorie values.

#### Step 6: Evaluating the Model
- **Objective**: Assess model performance to ensure reliable predictions.
- **Evaluation Metrics**:
  1. **Mean Absolute Error (MAE)**:
     - MAE calculates the average absolute difference between predicted and actual values, giving a sense of prediction accuracy.
  2. **R-squared (R²)**:
     - R² measures how well the model explains the variance in calorie values. An R² close to 1 indicates a strong fit, meaning the model captures most patterns in the data.
- **Results**:
  - Achieved a low MAE and a high R² score, indicating that the model accurately predicts calorie values and captures the dataset’s patterns effectively.

---

### Final Project Summary and Next Steps

**Key Takeaways**:
- **Data Cleaning**: Converted columns to numeric values, handled missing data, and applied label encoding to categorical data.
- **EDA**: Gained insights into data distributions, correlations, and outliers through statistical summaries and visualizations.
- **Modeling**: Trained a Linear Regression model to predict calorie values based on other nutritional features.
- **Evaluation**: Validated model performance with MAE and R² metrics, confirming that the model is reliable and accurate.

**Potential Next Steps**:
1. **Cross-Validation**: Use cross-validation to check for overfitting and ensure that the model is robust.
2. **Experiment with Different Models**: Try alternative regression models (e.g., Random Forest, Decision Trees) to see if they provide better accuracy.
3. **Feature Engineering**: Create new features or combinations of existing features to improve model performance.

---

### Additional Information and Environment Setup

- **Environment Setup**:
  - Python 3.10 and a virtual environment were used for package management.
  - Installed essential libraries including `pandas` (data manipulation), `scikit-learn` (machine learning), and `matplotlib` (visualization).

- **Version Control**:
  - Used Git to track project changes and document the workflow, ensuring reproducibility and collaboration readiness.

---

This project showcases my ability to clean and preprocess data, perform EDA, and build and evaluate a machine learning model. It reflects a practical application of my skills in data science, making it a valuable addition to my portfolio.

---