Practical workflow for preparing data, training linear and advanced regression models, evaluating performance, and optionally applying cross-validation and hyperparameter tuning

welde-data/immo-eliza-ml


immo-eliza-ml

A streamlined machine-learning workflow for predicting Belgian real-estate prices. It covers data cleaning, feature engineering, linear baselines (Ridge, Lasso, ElasticNet), non-linear models for comparison, and a final tuned log-target XGBoost model for the best stability and accuracy.


📑 Table of Contents

  • 🔎 Project Overview
  • 🧹 Data Cleaning Pipeline
  • 🧬 Feature Engineering
  • 🔧 Preprocessing Pipeline
  • 🤖 Model Development
  • 📊 Model Performance Comparison
  • 🗂️ Project Structure
  • 📦 Requirements
  • How to run it
  • Future Improvements
  • ⚠️ Limitations
  • 👥 Contributors

🔎 Project Overview

This project predicts real-estate prices in Belgium using structured data.
The focus is on:

  • Clean, leak-free preprocessing
  • Reliable location features using an official postal-code reference (postal_code → province)
  • Evaluation of linear and non-linear models
  • A final robust log-target XGBoost model achieving the best stability and accuracy.

🧹 Data Cleaning Pipeline

The unified cleaning step (enhanced_clean) handles:

  • Boolean normalization
  • Numeric parsing
  • Sanity checks on build year, living area, rooms
  • Removal of invalid price entries
  • Dropping noisy fields
  • Removal of locality_name to reduce cardinality and simplify modeling

Output: cleaned_v2.csv
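A minimal sketch of what `enhanced_clean` could look like. The column names (`price`, `build_year`, `living_area`, `number_rooms`, `locality_name`) follow this README, but the numeric bounds are illustrative assumptions, and boolean normalization is omitted for brevity:

```python
import pandas as pd

def enhanced_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative version of the unified cleaning step."""
    df = df.copy()
    # Numeric parsing: coerce price-like strings into numbers
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    # Sanity checks on build year, living area and rooms (bounds are illustrative)
    df = df[df["build_year"].between(1800, 2025) | df["build_year"].isna()]
    df = df[df["living_area"].between(10, 1000) | df["living_area"].isna()]
    df = df[df["number_rooms"].between(0, 20) | df["number_rooms"].isna()]
    # Remove invalid price entries
    df = df[df["price"].notna() & (df["price"] > 0)]
    # Drop the high-cardinality locality field
    df = df.drop(columns=["locality_name"], errors="ignore")
    return df
```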


🧬 Feature Engineering

Located in src/feature_engineering.py.

Adds:

  • region (Flanders/Wallonia/Brussels) mapped from province
  • Build-year features (house_age, build_decade, age flags)
  • Boolean flags (garden_flag, terrace_flag, swimming_pool_flag)
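A sketch of these transformations; the function name `add_features`, the amenity column names (`garden`, `terrace`, `swimming_pool`), and the partial province map are assumptions for illustration — the full mapping lives in src/feature_engineering.py:

```python
import pandas as pd

# Illustrative subset of the province -> region mapping
PROVINCE_TO_REGION = {
    "Antwerp": "Flanders",
    "East Flanders": "Flanders",
    "Liège": "Wallonia",
    "Brussels": "Brussels",
}

def add_features(df: pd.DataFrame, current_year: int = 2025) -> pd.DataFrame:
    df = df.copy()
    # Region mapped from province
    df["region"] = df["province"].map(PROVINCE_TO_REGION)
    # Build-year features
    df["house_age"] = current_year - df["build_year"]
    df["build_decade"] = (df["build_year"] // 10) * 10
    df["is_new_build"] = (df["house_age"] <= 5).astype(int)
    # Boolean amenity flags (missing values treated as absent)
    for col in ("garden", "terrace", "swimming_pool"):
        df[f"{col}_flag"] = df[col].fillna(0).astype(bool).astype(int)
    return df
```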

🔧 Preprocessing Pipeline

The preprocessing pipeline prepares the dataset for all models and ensures consistent, leak-free transformations.

Main steps:

  • Train/Test Split (80/20):
    The dataset is split once to keep evaluation consistent.

  • Outlier Removal (training only):
    IQR filtering is applied only to the training set for price, living_area, and number_rooms to avoid leaking information into the test set.

  • Column Detection:
    Numeric and categorical columns are automatically identified.
    postal_code is always treated as categorical.

  • Numeric Pipeline:

    • Median imputation
    • Standard scaling
  • Categorical Pipeline:

    • Most frequent imputation
    • One-Hot Encoding with unknown-category handling
  • ColumnTransformer:
    The numeric and categorical pipelines are combined into a unified preprocessing block used by all models.

This ensures that every model receives clean, encoded, and scaled input data with no leakage between training and testing.
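The steps above can be sketched with scikit-learn's `Pipeline` and `ColumnTransformer`; the helper names (`iqr_filter`, `build_preprocessor`) are illustrative, not the project's actual function names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def iqr_filter(df: pd.DataFrame, cols) -> pd.DataFrame:
    """IQR outlier removal -- applied to the training split only."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[mask]

def build_preprocessor(X: pd.DataFrame) -> ColumnTransformer:
    # Column detection: postal_code is always treated as categorical
    categorical = [c for c in X.columns if X[c].dtype == "object" or c == "postal_code"]
    numeric = [c for c in X.columns if c not in categorical]
    numeric_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    # Unified preprocessing block shared by all models
    return ColumnTransformer([
        ("num", numeric_pipe, numeric),
        ("cat", categorical_pipe, categorical),
    ])
```

In this setup the 80/20 `train_test_split(..., test_size=0.2)` happens first and `iqr_filter` runs only on the training rows, so test-set statistics never influence the outlier thresholds, imputation values, or scaling parameters.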

🤖 Model Development

Models evaluated:

  • Ridge / Lasso / ElasticNet — weak generalization
  • Random Forest — moderate but unstable
  • XGBoost (raw target) — improved but high variance
  • XGBoost (log-target) — best model, stable and consistent (test R² ≈ 0.65)

📊 Model Performance Comparison: Tuned Linear Regression vs. Tuned XGBoost

| Model | MAE (Train) | RMSE (Train) | R² (Train) | MAE (Test) | RMSE (Test) | R² (Test) |
|---|---|---|---|---|---|---|
| Ridge | 17,134.86 | 23,333.16 | 0.9653 | 86,478.20 | 168,409.36 | 0.5548 |
| Lasso | 45,320.05 | 56,236.46 | 0.7987 | 86,777.13 | 169,305.18 | 0.5500 |
| ElasticNet | 31,640.48 | 42,982.00 | 0.8824 | 86,785.47 | 168,936.84 | 0.5520 |
| XGBoost (Tuned, No Val) | 71,626.79 | 124,144.21 | 0.7837 | 81,431.27 | 157,652.82 | 0.6100 |
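The three metrics in the table can be reproduced with scikit-learn; the helper name here is hypothetical:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """MAE, RMSE and R² as reported in the comparison table."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    return mae, rmse, r2
```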

🗂️ Project Structure

immo-eliza-ml/
│
├── src/
│   ├── __init__.py
│   ├── main.py
│   ├── cleaning.py
│   ├── preprocessing.py
│   ├── train_xgboost_log.py
│   ├── Lin_reg.py
├── data/
│   ├── raw/
│   ├── processed/
│   ├── zipcodes_num_nl_2025.csv
└── models/ 


📦 Requirements

  • pandas
  • numpy
  • scikit-learn
  • xgboost
  • matplotlib
  • seaborn

How to run it

Install dependencies

pip install -r requirements.txt

Future Improvements

  • Hyperparameter tuning for the log-XGBoost model
  • Additional feature engineering
  • Cross-validation for more robust evaluation
  • Testing alternative models (e.g., LightGBM, CatBoost)
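As a starting point for the cross-validation item, a 5-fold evaluation with scikit-learn (shown here with a Ridge baseline on synthetic data for a self-contained example):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# 5-fold CV: every observation is used for both training and validation
# across folds, giving a more robust estimate than a single 80/20 split.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="r2")
print(f"R² per fold: {np.round(scores, 3)}, mean: {scores.mean():.3f}")
```

The same call works with the preprocessing-plus-model pipelines from this project in place of the bare Ridge estimator.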

⚠️ Limitations

The model relies heavily on the quality and comprehensiveness of the input data. It does not account for market trends or economic conditions. The model's predictions are specific to Belgium and may not generalize well to other regions.

👥 Contributors

This project is part of the AI & Data Science Bootcamp training at BeCode and was done by:

  • Welederufeal Tadege (LinkedIn | GitHub), under the supervision of AI & Data Science coach Vanessa Rivera Quinones
