This project analyzes a synthetic dataset of AI assistant usage among students. It covers exploratory data analysis (EDA), feature engineering, and preprocessing, and builds a classification model to predict whether a student will use the AI assistant again.
The dataset is publicly available on Kaggle:
AI Assistant Usage in Student Life (Synthetic)
It contains 10,000 records with the following columns:
- `SessionID`: Unique identifier for each AI assistant usage session
- `StudentLevel`: Education level (e.g., Undergraduate, Graduate, High School)
- `Discipline`: Field of study (e.g., Computer Science, Psychology)
- `SessionDate`: Date of the AI assistant session
- `SessionLengthMin`: Duration of the session in minutes
- `TotalPrompts`: Number of prompts sent during the session
- `TaskType`: Type of task performed (e.g., Studying, Coding)
- `AI_AssistanceLevel`: Level of AI assistance (1 to 5)
- `FinalOutcome`: Outcome of the session (e.g., Assignment Completed)
- `UsedAgain`: Target variable (1 = Yes, 0 = No)
- `SatisfactionRating`: Satisfaction rating given by the student
- Checked dataset shape, data types, and missing values
- Summary statistics (`describe()`)
- Class distribution visualization for the target variable
- Distribution plots for numeric variables
- Count plots for categorical variables
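
The checks listed above can be sketched with pandas alone. This is a minimal illustration on a tiny made-up frame that borrows column names from the dataset description; the values and the `summarize` helper are hypothetical, and the actual project would load the Kaggle CSV instead.

```python
import pandas as pd

# Tiny made-up sample; only the column names come from the dataset description.
df = pd.DataFrame({
    "StudentLevel": ["Undergraduate", "Graduate", "High School", "Undergraduate"],
    "TaskType": ["Studying", "Coding", "Studying", "Coding"],
    "SessionLengthMin": [12.5, 30.0, 8.0, None],
    "TotalPrompts": [5, 12, 3, 7],
    "UsedAgain": [1, 1, 0, 1],
})


def summarize(frame: pd.DataFrame, target: str) -> dict:
    """Collect basic EDA facts: shape, dtypes, missing values, class balance."""
    return {
        "shape": frame.shape,
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "missing": frame.isna().sum().to_dict(),
        "class_share": frame[target].value_counts(normalize=True).to_dict(),
    }


summary = summarize(df, target="UsedAgain")
print(summary["shape"])
print(summary["missing"])
print(df.describe())  # summary statistics for the numeric columns

# Value counts are the numbers behind a count plot of a categorical column.
task_counts = df["TaskType"].value_counts()
print(task_counts)
```

The same counts and shares can then be passed to a plotting library of choice for the distribution and count plots.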
Additional features were created to improve model performance:
- Date-based features extracted from `SessionDate`: `year`, `month`, `dayofweek`, `day`, `weekofyear`
- Interaction-based feature: `prompts_per_min` = `TotalPrompts` / `SessionLengthMin`
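
A minimal sketch of these features with pandas follows. The sample rows and the `add_features` helper are hypothetical. Mapping zero-length sessions to `NaN` instead of dividing by zero is an assumption of this sketch, not something the project states.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the dataset description.
df = pd.DataFrame({
    "SessionDate": ["2024-11-03", "2025-01-15", "2025-03-09"],
    "SessionLengthMin": [10.0, 0.0, 25.0],
    "TotalPrompts": [5, 3, 10],
})


def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with date parts and a prompts-per-minute ratio added."""
    out = frame.copy()
    dates = pd.to_datetime(out["SessionDate"])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["day"] = dates.dt.day
    out["dayofweek"] = dates.dt.dayofweek  # Monday = 0, Sunday = 6
    out["weekofyear"] = dates.dt.isocalendar().week.astype(int)  # ISO week number
    # Assumption: a zero-length session yields NaN rather than an infinite ratio.
    minutes = out["SessionLengthMin"].replace(0, np.nan)
    out["prompts_per_min"] = out["TotalPrompts"] / minutes
    return out


features = add_features(df)
print(features[["year", "month", "day", "dayofweek", "weekofyear", "prompts_per_min"]])
```

Any `NaN` ratios produced this way would need to be imputed or dropped before model training.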
- Encoding categorical variables with `LabelEncoder` (for simplicity)
- Scaling numeric features using `StandardScaler`
- SMOTE applied to handle class imbalance
- Train-test split (80/20 ratio)
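
The preprocessing steps can be sketched as below on randomly generated data. Note one assumption: this sketch splits first and then fits the scaler and SMOTE on the training portion only, so that test rows do not influence either step. The README does not say in which order the project applied them. The `SMOTE` import is guarded so the sketch still runs when imbalanced-learn is not installed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

try:  # SMOTE comes from imbalanced-learn; skipped if the package is missing
    from imblearn.over_sampling import SMOTE
except ImportError:
    SMOTE = None

# Randomly generated stand-in data; only the column names mirror the dataset.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "TaskType": rng.choice(["Studying", "Coding", "Writing"], size=n),
    "SessionLengthMin": rng.uniform(1, 60, size=n),
    "TotalPrompts": rng.integers(1, 30, size=n).astype(float),
    "UsedAgain": rng.choice([0, 1], size=n, p=[0.3, 0.7]),
})

# Label-encode the categorical column (integer codes, as in the README).
df["TaskType"] = LabelEncoder().fit_transform(df["TaskType"])

X = df.drop(columns="UsedAgain")
y = df["UsedAgain"]

# 80/20 split, stratified so both parts keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on training rows only, then apply it to both parts.
numeric_cols = ["SessionLengthMin", "TotalPrompts"]
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

# Oversample the minority class in the training set only.
if SMOTE is not None:
    X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

print(y_train.value_counts())
```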
Models tested:
- Logistic Regression
- Random Forest Classifier
- XGBoost Classifier
Final chosen model: RandomForestClassifier
- Achieved accuracy: ~75%
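
A comparison of the listed models might look like the following sketch. It uses data from scikit-learn's `make_classification` rather than the project's dataset, and the hyperparameters are illustrative assumptions, so the scores it prints say nothing about the ~75% figure above. XGBoost is included only if the package is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

try:  # XGBoost is optional in this sketch
    from xgboost import XGBClassifier
except ImportError:
    XGBClassifier = None

# Synthetic stand-in data with a 30/70 class imbalance.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.3, 0.7], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Illustrative settings, not the project's tuned hyperparameters.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
if XGBClassifier is not None:
    models["XGBoost"] = XGBClassifier(n_estimators=100, random_state=42)

# Fit every model on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best_name = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Best on this split:", best_name)
```

Picking a model from a single split is the simplest approach. Cross-validation would give a more stable comparison.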
- Confusion Matrix
- Classification Report (Precision, Recall, F1-score)
- Accuracy Score
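
The three evaluation outputs above correspond to functions in `sklearn.metrics`. The sketch below applies them to hand-written labels, which are made up for illustration and unrelated to the project's results. The class names passed to `target_names` are also assumptions.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hand-written labels for illustration only (1 = used again, 0 = did not).
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
report = classification_report(
    y_true, y_pred, target_names=["Not used again", "Used again"]
)

print(cm)
print(f"Accuracy: {acc:.2f}")
print(report)
```

On an imbalanced target, the per-class precision and recall in the report are more informative than accuracy alone.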
imbalanced-learn xgboost