This lab demonstrates a compact, reproducible workflow for tabular regression with scikit-learn using the California Housing dataset. It provides:
- Reusable pipelines with preprocessing and models
- Hyperparameter search via GridSearchCV + KFold
- Evaluation with RMSE, MAE, and R²
- A Jupyter notebook to explore experiments and figures
ml_python/lab2/
├─ src/
│ └─ housing.py # Pipelines, training, evaluation utilities
├─ notebooks/
│ ├─ experiment.ipynb # Main notebook to run experiments
│ └─ figures/ # Generated figures (kept in repo)
├─ requirements.txt # Python dependencies
├─ .gitignore # Ignore caches, data, models, etc.
└─ README.md # You are here
- Python 3.9+ (3.10/3.11 recommended)
- See pinned packages in requirements.txt:
  - scikit-learn, numpy, pandas, matplotlib, jupyterlab
  - dev tools: ruff, nbstripout
Optional but recommended:
- A virtual environment (e.g., venv, conda)
- Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
- Launch JupyterLab and open the notebook
jupyter lab
Open notebooks/experiment.ipynb and run the cells.
You can also run experiments from Python using the utilities in src/housing.py.
Available models (keys in MODELS):
- simple_elastic: ElasticNet with standard scaling
- poly_elastic: PolynomialFeatures (degree=2) + ElasticNet
- knn: KNeighborsRegressor with standard scaling
Each model includes a param_grid for GridSearchCV.
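As an illustration of how a model key can map to a pipeline and a param_grid, here is a minimal sketch. The dict name MODELS_SKETCH, the step names, the grid values, and the scaler inside poly_elastic are all assumptions made for this example; the actual MODELS definition in src/housing.py may be organized differently.

```python
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical registry layout; names and grid values are illustrative only.
MODELS_SKETCH = {
    "simple_elastic": {
        "pipeline": Pipeline([
            ("scale", StandardScaler()),
            ("model", ElasticNet(max_iter=10_000)),
        ]),
        # Grid keys follow scikit-learn's "<step>__<param>" convention.
        "param_grid": {
            "model__alpha": [0.01, 0.1, 1.0],
            "model__l1_ratio": [0.2, 0.5, 0.8],
        },
    },
    "poly_elastic": {
        "pipeline": Pipeline([
            ("poly", PolynomialFeatures(degree=2, include_bias=False)),
            # Assumption: scaling after the polynomial expansion.
            ("scale", StandardScaler()),
            ("model", ElasticNet(max_iter=10_000)),
        ]),
        "param_grid": {
            "model__alpha": [0.01, 0.1, 1.0],
            "model__l1_ratio": [0.2, 0.5, 0.8],
        },
    },
    "knn": {
        "pipeline": Pipeline([
            ("scale", StandardScaler()),
            ("model", KNeighborsRegressor()),
        ]),
        "param_grid": {
            "model__n_neighbors": [3, 5, 10],
            "model__weights": ["uniform", "distance"],
        },
    },
}
```

Keeping the scaler inside the pipeline means it is refit on each training fold during cross-validation, so no statistics leak from the validation fold.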
Example workflow:
from src.housing import set_seed, load_dataset, train, eval
# 1) Reproducibility
set_seed(1)
# 2) Load data
X_train, X_test, y_train, y_test = load_dataset(test_size=0.2, random_state=1)
# 3) Train with cross-validated grid search
search = train(
model="simple_elastic", # or "poly_elastic", "knn"
X_train=X_train,
y_train=y_train,
cv_splits=5,
random_state=1,
)
# 4) Evaluate on the test set
results = eval(search, X_test, y_test)
print("Best estimator:", results["estimator"]) # the fitted model inside the pipeline
print("RMSE:", results["rmse"]) # RMSE = sqrt(MSE); scikit-learn's mean_squared_error returns MSE
print("MAE:", results["mae"])
print("R^2:", results["r2"])
Notes:
- The returned search is a fitted GridSearchCV. The best full pipeline is at search.best_estimator_.
- eval returns predictions and metrics for convenience.
- Randomness: Use set_seed(seed) and pass random_state to load_dataset and train.
- Cross-validation: Configure with cv_splits in train (default 5).
- Hyperparameters: Adjust param_grid per model inside src/housing.py or pass a custom configuration by extending the code.
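The notes above can be sketched as follows. This is not the code in src/housing.py: train_sketch and evaluate_sketch are hypothetical stand-ins, the scoring choice is an assumption, and synthetic data from make_regression replaces the housing data so the example runs offline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def train_sketch(pipeline, param_grid, X_train, y_train, cv_splits=5, random_state=1):
    """Hypothetical stand-in for train: grid search over shuffled K folds."""
    cv = KFold(n_splits=cv_splits, shuffle=True, random_state=random_state)
    search = GridSearchCV(
        pipeline,
        param_grid,
        cv=cv,
        scoring="neg_root_mean_squared_error",  # assumption: select by RMSE
    )
    return search.fit(X_train, y_train)


def evaluate_sketch(search, X_test, y_test):
    """Hypothetical stand-in for eval: test-set predictions and metrics."""
    y_pred = search.predict(X_test)  # uses the refit best pipeline
    mse = mean_squared_error(y_test, y_pred)
    return {
        "estimator": search.best_estimator_[-1],  # final step of the pipeline
        "predictions": y_pred,
        "rmse": float(np.sqrt(mse)),
        "mae": float(mean_absolute_error(y_test, y_pred)),
        "r2": float(r2_score(y_test, y_pred)),
    }


# Synthetic data so the sketch does not need to download anything.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", ElasticNet(max_iter=10_000)),
])
param_grid = {"model__alpha": [0.01, 0.1, 1.0], "model__l1_ratio": [0.2, 0.8]}

search = train_sketch(pipeline, param_grid, X_train, y_train)
results = evaluate_sketch(search, X_test, y_test)
print("Best params:", search.best_params_)
print("RMSE:", results["rmse"], "MAE:", results["mae"], "R^2:", results["r2"])
```

Passing the same random_state to the split and to KFold is what makes repeated runs produce the same folds and therefore the same selected hyperparameters.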
nbstripout is included to help keep notebooks lightweight. To activate in this repo:
nbstripout --install
This installs a Git filter that strips notebook outputs on commit. You can reverse it with nbstripout --uninstall.
- The dataset is fetched programmatically via sklearn.datasets.fetch_california_housing, so no local data is required.
- .gitignore excludes typical data/model artifacts (data/, models/, *.pkl, *.csv, etc.) to keep the repo clean.
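For reference, a loader along these lines might look like the sketch below. The name load_dataset_sketch and its defaults are assumptions that mirror the example workflow; the real load_dataset in src/housing.py may differ.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split


def load_dataset_sketch(test_size=0.2, random_state=1):
    """Hypothetical stand-in for load_dataset.

    The first call downloads the data; later calls read scikit-learn's
    local cache (by default ~/scikit_learn_data).
    """
    X, y = fetch_california_housing(return_X_y=True, as_frame=True)
    # Returns X_train, X_test, y_train, y_test in that order.
    return train_test_split(X, y, test_size=test_size, random_state=random_state)
```

Because the download is cached outside the repository, nothing needs to be committed under data/.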
- If Jupyter cannot find the kernel, ensure your virtual environment is active when installing and launching JupyterLab.
- Version conflicts: Reinstall with a fresh environment and pip install -r requirements.txt.
- Long grid searches: Reduce cv_splits, narrow param_grid, or try a smaller model first.
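To estimate how long a grid search will run before starting it, one option is to count the model fits it implies. The sketch below uses scikit-learn's ParameterGrid; the helper name count_fits and the grid values are illustrative assumptions, not part of src/housing.py.

```python
from sklearn.model_selection import ParameterGrid


def count_fits(param_grid, cv_splits):
    """Number of cross-validation fits (excludes the final refit)."""
    n_candidates = len(ParameterGrid(param_grid))
    return n_candidates * cv_splits


# Illustrative grid: 3 alphas x 3 l1_ratios = 9 candidates.
example_grid = {
    "model__alpha": [0.01, 0.1, 1.0],
    "model__l1_ratio": [0.2, 0.5, 0.8],
}
print(count_fits(example_grid, cv_splits=5))  # 9 candidates * 5 folds = 45
```

The count grows multiplicatively with each added hyperparameter, so trimming one axis of the grid usually saves more time than lowering cv_splits by one.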
No license specified. If you plan to share or publish, consider adding a LICENSE file.