End-to-end ML system for NYC taxi fare prediction with multi-modal capabilities
A professional ML portfolio project featuring:
- 🎯 XGBoost fare prediction with hyperparameter tuning
- 🚀 Distributed training on Google Vertex AI
- 🤖 RAG-powered NYC attractions chatbot
- 🎙️ Multi-modal API (voice, text, chat)
- 📊 Comprehensive data analysis with Plotly
Prerequisites:
- Python 3.12+
- Google Cloud Platform account
- gcloud CLI installed and authenticated
# Clone repository
git clone https://github.com/misran3/nyc-scout.git
cd nyc-scout
# Create virtual environment
python -m venv venv
source venv/bin/activate
# Install package
pip install -e .
# Copy environment template and edit with project details
cp .env.example .env

# Authenticate with GCP
gcloud auth login
gcloud config set project YOUR_PROJECT_ID
# Run setup script to enable required APIs and create buckets
python scripts/setup_gcp_infrastructure.py \
  --project-id YOUR_PROJECT_ID \
  --region us-east1

# Local training (for testing)
python scripts/train.py \
  --max_depth 6 \
  --learning_rate 0.1 \
  --subsample 1.0 \
  --n_estimators 100
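
The four flags above correspond to standard XGBoost hyperparameters, and Vertex AI hyperparameter tuning passes each trial's values to the training code as command-line arguments. As a rough sketch of how a script like `scripts/train.py` might parse them (the `parse_args` helper, the defaults, and the range check are assumptions for illustration, not taken from the repository):

```python
import argparse


def parse_args(argv=None):
    """Parse training hyperparameters (hypothetical sketch of the train.py CLI)."""
    parser = argparse.ArgumentParser(description="Train an XGBoost fare model")
    # Flag names mirror the command above; the defaults are assumed.
    parser.add_argument("--max_depth", type=int, default=6)
    parser.add_argument("--learning_rate", type=float, default=0.1)
    parser.add_argument("--subsample", type=float, default=1.0)
    parser.add_argument("--n_estimators", type=int, default=100)
    args = parser.parse_args(argv)
    # XGBoost expects the row subsampling ratio to lie in (0, 1].
    if not 0.0 < args.subsample <= 1.0:
        parser.error("--subsample must be in (0, 1]")
    return args


# Passing an explicit list keeps the example independent of sys.argv.
params = vars(parse_args(["--max_depth", "8", "--learning_rate", "0.05"]))
print(params)
```

Keeping the flag names identical between local runs and tuning trials means the same entry point can serve both.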
# Upload training package to GCS
python setup.py sdist
gsutil cp dist/*.tar.gz gs://YOUR_BUCKET/nyc-fare-predictor/dist/
# Launch Vertex AI hyperparameter tuning
gcloud ai hp-tuning-jobs create \
  --region=us-east1 \
  --display-name=nyc-fare-tuning-$(date +%m%d_%H%M) \
  --config=config/vertex_ai_training.yaml

# Deploy trained model to Vertex AI endpoint
python scripts/deploy_model.py \
  --model-path gs://YOUR_BUCKET/path/to/model.bst \
  --model-name nyc-fare-xgboost \
  --endpoint-name nyc-fare-endpoint
# Update .env with endpoint ID
# VERTEX_AI_ENDPOINT_ID=<your-endpoint-id>

# Set up RAG corpus and upload knowledge base
python scripts/rag_pipeline.py --setup
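
To make the retrieval step of a RAG pipeline concrete, here is a toy sketch that ranks documents against a query by bag-of-words cosine similarity and builds a prompt from the best match. It is not the repository's pipeline: the `retrieve` and `build_prompt` helpers, the file names, and the two sample texts are all hypothetical stand-ins for the `knowledge_base/` files, and a real corpus would normally use embeddings rather than word counts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=1):
    """Return the names of the k documents most similar to the query."""
    q = Counter(tokenize(query))
    scores = {name: cosine(q, Counter(tokenize(text))) for name, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def build_prompt(query, docs, k=1):
    """Prepend the retrieved text to the question, as a generator model would see it."""
    context = "\n".join(docs[name] for name in retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical stand-ins for knowledge base files.
DOCS = {
    "museums.txt": "The Metropolitan Museum of Art and MoMA are major museums in NYC.",
    "parks.txt": "Central Park and Prospect Park offer green space and walking trails.",
}

print(build_prompt("Tell me about museums in NYC", DOCS))
```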
# Test RAG system
python scripts/rag_pipeline.py --query "Tell me about museums in NYC"

nyc-scout/
├── src/ # Core library code
│ ├── config.py # Configuration management
│ ├── features/ # Feature engineering
│ ├── data/ # Data loading
│ ├── gcp/ # GCP utilities
│ └── rag/ # RAG pipeline
├── scripts/ # Executable scripts
│ ├── train.py # Model training
│ ├── deploy_model.py # Model deployment
│ ├── rag_pipeline.py # RAG setup
│ └── setup_gcp_infrastructure.py # GCP API and bucket setup
├── api/ # Flask API (to be completed)
├── notebooks/ # Data analysis (to be completed)
├── knowledge_base/ # NYC attractions (17 files)
├── config/ # Configuration files
├── infrastructure/ # Deployment scripts (to be completed)
└── data/ # Training data (Source: https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction)
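
The tree describes `src/config.py` only as configuration management. One plausible shape for it, sketched under assumptions, is a small settings object populated from environment variables. `VERTEX_AI_ENDPOINT_ID` comes from the steps above; the `GCP_PROJECT_ID` and `GCP_REGION` names, the `Settings` class, and `load_settings` are hypothetical. The resource name follows Vertex AI's `projects/.../locations/.../endpoints/...` format.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Immutable settings for talking to a deployed endpoint (hypothetical)."""
    project_id: str
    region: str
    endpoint_id: str

    @property
    def endpoint_resource_name(self):
        # Fully qualified Vertex AI endpoint resource name.
        return (
            f"projects/{self.project_id}/locations/{self.region}"
            f"/endpoints/{self.endpoint_id}"
        )


def load_settings(env=None):
    """Build Settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    # GCP_PROJECT_ID is an assumed variable name; only
    # VERTEX_AI_ENDPOINT_ID appears in the setup steps.
    required = ("GCP_PROJECT_ID", "VERTEX_AI_ENDPOINT_ID")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return Settings(
        project_id=env["GCP_PROJECT_ID"],
        region=env.get("GCP_REGION", "us-east1"),  # region used in the commands above
        endpoint_id=env["VERTEX_AI_ENDPOINT_ID"],
    )


settings = load_settings({"GCP_PROJECT_ID": "demo", "VERTEX_AI_ENDPOINT_ID": "123"})
print(settings.endpoint_resource_name)
```

Failing fast on a missing variable surfaces a forgotten `.env` update at startup rather than at the first prediction request.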