This repository contains a series of hands-on projects completed as part of the Yandex Practicum Data Science program. Each folder or notebook corresponds to a real-world case study, showcasing skills in data analysis, statistical modeling, machine learning, and more.
Yandex Practicum’s Data Science track provides an immersive curriculum covering Python programming, data preprocessing, exploratory data analysis (EDA), statistical testing, machine learning model development, and deployment. Students work on diverse projects simulating business challenges in industries such as automotive, oil & gas, micromobility, gaming, real estate, agriculture, HR, telecom, retail, transportation, and content moderation.
Folder: auto-ml (gradient boosting)
Notebook: auto-ml.ipynb
Description:
A used-car pricing service is building an app to help owners quickly estimate their vehicle's market value. Using historical data on technical specifications, trim levels, and sale prices, develop and tune a gradient-boosting regression model that balances prediction quality, inference speed, and training time.
Folder: boreholes-ml
Notebook: boreholes-ml.ipynb
Description:
An energy company must decide where to drill new wells. Given samples from three regions with oil quality and reserve volume data, build a model to predict future yields. Then apply bootstrap sampling to estimate total profit and risk for each region, and recommend the region with the highest expected return.
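The bootstrap step can be sketched roughly as follows. This is a minimal standard-library illustration, not the notebook's code: the function name `bootstrap_profit`, the per-well profit values, and the sample size are all hypothetical, and the real project may select wells and compute profit differently.

```python
import random
import statistics


def bootstrap_profit(predicted_profits, n_wells, n_iterations=1000, seed=0):
    """Resample per-well profits and summarise the total-profit distribution."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_iterations):
        # Draw n_wells wells with replacement and sum their profit.
        sample = rng.choices(predicted_profits, k=n_wells)
        totals.append(sum(sample))
    totals.sort()
    # Empirical 95% interval taken from the sorted bootstrap totals.
    lower = totals[int(0.025 * n_iterations)]
    upper = totals[int(0.975 * n_iterations) - 1]
    # Risk of loss: share of bootstrap samples with negative total profit.
    loss_risk = sum(t < 0 for t in totals) / n_iterations
    return {
        "mean": statistics.mean(totals),
        "ci95": (lower, upper),
        "loss_risk": loss_risk,
    }


# Hypothetical per-well profits for one region (arbitrary units).
region_profits = [12.0, -3.5, 8.1, 4.4, -1.2, 15.3, 6.7, -7.9, 9.8, 2.5]
summary = bootstrap_profit(region_profits, n_wells=5)
print(summary)
```

Running the same function for each region and comparing mean profit against loss risk gives a simple basis for the recommendation.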
Folder: data-analysis
Notebook: e-scooters-analysis.ipynb
Description:
Perform statistical and exploratory analysis for a scooter-sharing service. Load user, trip, and subscription data; preprocess and merge; calculate revenue metrics for free vs. subscription plans; and test hypotheses to guide pricing and marketing strategies.
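As an illustration of the hypothesis-testing step, the sketch below compares mean monthly revenue between two plans with a two-sided permutation test. The technique is an assumption chosen because it needs only the standard library; the notebook may use a different test. The revenue figures and the name `permutation_test` are hypothetical.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = statistics.mean(group_a) - statistics.mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical monthly revenue per user for each plan.
subscription = [410, 380, 450, 395, 420, 405]
free = [320, 350, 300, 340, 310, 335]
observed, p_value = permutation_test(subscription, free)
print(f"difference in means: {observed:.2f}, p-value: {p_value:.4f}")
```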
Folder: data-analysis
Notebook: games-analysis.ipynb
Description:
Analyze global video game sales, user and critic ratings, genres, and platforms. Identify patterns that drive success and provide actionable insights for product development and promotional campaigns.
Folder: data-analysis
Notebook: real-estate-analysis.ipynb
Description:
Explore historical housing listings: clean anomalies, engineer features (price per square meter, floor type, distance to center, etc.), visualize distributions, and determine the key factors influencing property prices.
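The engineered features named above could look something like this. It is a plain-Python sketch under assumed field names (`price`, `total_area`, `floor`, `floors_total`, `center_distance_m`), which are hypothetical and may not match the dataset's columns.

```python
def floor_type(floor, floors_total):
    """Classify a floor as first, last, or other."""
    # A single-storey building is treated as "first" in this sketch.
    if floor == 1:
        return "first"
    if floor == floors_total:
        return "last"
    return "other"


def add_features(listing):
    """Return a copy of the listing with derived features added."""
    enriched = dict(listing)
    enriched["price_per_sqm"] = listing["price"] / listing["total_area"]
    enriched["floor_type"] = floor_type(listing["floor"], listing["floors_total"])
    # Distance to the city center, converted from meters to whole kilometers.
    enriched["center_km"] = round(listing["center_distance_m"] / 1000)
    return enriched


# Hypothetical listing.
listing = {
    "price": 6_000_000,
    "total_area": 50.0,
    "floor": 1,
    "floors_total": 9,
    "center_distance_m": 7300,
}
enriched = add_features(listing)
print(enriched)
```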
Folder: farm-ml
Notebook: farm-ml.ipynb
Description:
A dairy farm owner wants to select cows that will produce at least 6,000 kg of milk per year with a desirable taste. Build two models: a regression model to predict annual yield and a classification model to predict milk taste. Finally, recommend which cows to purchase based on both criteria.
Folder: hr-ml
Notebook: hr-ml.ipynb
Description:
HR analytics for a large organization: predict employee satisfaction scores from survey data and model the probability of churn. Provide actionable recommendations to reduce turnover and associated costs.
Folder: market-ml
Notebook: market-ml.ipynb
Description:
An e-commerce retailer wants to personalize offers for loyal customers to boost engagement. Using transaction, web-behavior, and communication data, label customers by activity level, build supervised models with pipelines, evaluate feature importance via SHAP, and perform customer segmentation with business recommendations.
Folder: final-project
Notebook: final-project.ipynb
Description:
For a telecom operator, develop a model to predict contract cancellations. Combine contract, personal, internet, and phone service data to train and evaluate churn-prediction models. Deliver insights to target high-risk subscribers with retention offers.
Folder: neural-networks
Notebook / Script: computer-vision-project.ipynb, computer-vision-project.py
Description:
Implement a computer-vision pipeline to estimate customer age group at supermarket checkouts. Preprocess image data, train a neural network, and build an inference workflow for personalized marketing and age-restricted sales compliance.
Folder: taxi-ml (time-series)
Notebook: taxi-ml.ipynb
Description:
Forecast hour-ahead taxi demand at airports using time-series modeling. The goal is to achieve RMSE ≤ 48 on the test set to ensure reliable driver allocation during peak periods.
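To make the RMSE target concrete, here is a small sketch that scores a naive "previous hour" baseline on a chronological hold-out. The order counts and the split point are made up for illustration, and the baseline is only a sanity check, not the model used in the notebook.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical hourly order counts, in chronological order.
orders = [52, 61, 70, 66, 58, 75, 80, 77, 69, 90, 84, 95]

# Chronological split: the last hours form the test set (no shuffling).
split = 8
actual = orders[split:]
# Naive baseline: predict each hour as the value observed one hour earlier.
predicted = orders[split - 1:-1]

score = rmse(actual, predicted)
meets_target = score <= 48  # RMSE threshold stated for this project
print(f"baseline RMSE: {score:.2f}, meets target: {meets_target}")
```

Keeping the split chronological matters here: shuffling before splitting would leak future demand into training.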
Folder: text-ml
Notebooks: text-ml.ipynb, text-ml-v1 BERT.ipynb, training-task-texts.ipynb
Description:
Train a text-classification model to detect toxic user comments on a product review platform. Using labeled data (toxic_comments.csv), achieve F1 ≥ 0.75 to enable automated moderation of harmful content.
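For reference, the F1 target is the harmonic mean of precision and recall on the toxic class. The sketch below computes it by hand on made-up labels; the helper name `binary_f1` and the label vectors are hypothetical, and a library metric would normally be used in practice.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 score for the positive class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = toxic, 0 = not toxic.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

score = binary_f1(y_true, y_pred)
print(f"F1: {score:.2f}, meets target: {score >= 0.75}")
```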
- Languages: Python
- Libraries: pandas, NumPy, matplotlib, scikit-learn, statsmodels, XGBoost / LightGBM, TensorFlow / PyTorch, SHAP
- Environments: Jupyter Notebook, GitHub
- Techniques:
  - Data cleaning & preprocessing
  - Exploratory Data Analysis (EDA)
  - Statistical hypothesis testing & A/B testing
  - Regression (Linear, Gradient Boosting)
  - Classification (Logistic Regression, Neural Networks, Transformers)
  - Time-series forecasting
  - Bootstrap & risk analysis
  - Computer Vision pipelines
  - Model pipelines & hyperparameter tuning
  - Feature importance & interpretability (SHAP)