A production-ready ML platform for the entire model lifecycle
Semantic-MLES is an end-to-end machine learning platform that enables teams to prototype ML pipelines locally and run the same pipeline on Kubernetes without code changes ("laptop-to-cloud parity"). It covers the full ML lifecycle: data ingestion, validation, feature engineering, training, evaluation, model serving, monitoring, and deployment.
| Feature | Description |
|---|---|
| End-to-End ML Pipeline | Data ingestion → preprocessing → training → evaluation → deployment in a single Sematic DAG |
| Laptop-to-Cloud Parity | Identical DAG definition and container images across dev and prod environments |
| Multi-Strategy Deployment | Blue-green, canary, and rolling Kubernetes deployments with automatic rollback |
| Model Monitoring | Real-time drift detection (KS-test, Chi-squared, PSI), performance tracking, and Prometheus metrics |
| Feature Store | Feast integration for both historical and online feature retrieval |
| Data Validation | Great Expectations and Pandera schema validation at every pipeline stage |
| Explainability & Fairness | SHAP-based feature attribution and Fairlearn fairness auditing |
| Experiment Tracking | MLflow and Weights & Biases (W&B) for parameter, metric, and artifact logging |
| Configurable Resources | Per-stage CPU/GPU/memory allocation via ResourceSpec |
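To make the drift-detection feature concrete, here is a small self-contained sketch of the Population Stability Index (PSI), one of the three drift metrics listed above. This is illustrative pseudologic only, not the platform's DriftDetector implementation; the bin count and epsilon are arbitrary choices.

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.

    PSI = sum((a_i - e_i) * ln(a_i / e_i)) over shared bins, where e_i and
    a_i are the fractions of each sample falling in bin i.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            # clamp out-of-range values into the edge bins
            i = min(max(int((x - lo) / width), 0), bins - 1)
            counts[i] += 1
        # small epsilon avoids log(0) for empty bins
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

random.seed(0)
baseline = [random.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [random.gauss(0.5, 1.0) for _ in range(5000)]

print(f"PSI, same distribution:    {psi(baseline, baseline):.4f}")  # 0.0000
print(f"PSI, shifted distribution: {psi(baseline, shifted):.4f}")   # > 0.1
```

A common rule of thumb reads PSI below 0.1 as no drift, 0.1–0.25 as moderate drift, and above 0.25 as significant drift.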
The fastest way to see the platform in action is the standalone MLflow demo, which trains 10 RandomForest models and logs them as separate runs to the MLflow tracking server.
```bash
python -m venv mles_env
source mles_env/bin/activate
pip install -r requirements.txt
```
`seaborn`, `deepmerge`, and `tf-keras` are now included in `requirements.txt`.
```bash
python simple_mlflow_example.py
```

Expected output:
```
Starting experiment: mles_demo - baseline_model
Completed baseline_model - R2: 0.9732, RMSE: 0.4176
...
Generated 10 runs in MLflow!
```
```bash
mlflow ui --port 5000
```

Open http://localhost:5000 to see the mles_demo experiment with 10 completed runs. Each run contains logged parameters, metrics (R², RMSE, MSE), a trained model artifact, and a feature importance plot.
| Run name | Description |
|---|---|
| baseline_model | Default parameters |
| optimized_model | Tuned hyperparameters |
| deep_model | Deeper architecture |
| wide_model | Wider architecture |
| balanced_model | Balanced performance |
| fast_model | Optimised for inference speed |
| accurate_model | Maximum accuracy |
| efficient_model | Resource efficiency |
| robust_model | Improved outlier robustness |
| final_model | Production-ready best tradeoff |
After `pip install -e .`, the `mles` command is available with six subcommands.
```bash
mles train \
  --data-source s3://my-bucket/data.csv \
  --target-column price \
  --model-type xgboost \
  --tune-hyperparameters \
  --cluster  # omit --cluster to run locally
```

Supported model types: `xgboost`, `random_forest`, `logistic_regression`, `neural_network`.
```bash
mles predict \
  --model-name xgboost_price \
  --model-version latest \
  --data-source s3://my-bucket/inference.csv \
  --output-path s3://my-bucket/predictions.parquet
```

```bash
mles serve \
  --model-name xgboost_price \
  --model-version 3 \
  --host 0.0.0.0 \
  --port 8000
```

This starts a FastAPI server with /predict, /batch_predict, /health, and /metrics endpoints.
```bash
mles deploy \
  --model-name xgboost_price \
  --model-version 3 \
  --environment prod \
  --namespace mles
```

Deployment strategy is set in `config/production.yaml` (`blue_green`, `canary`, or `rolling`). Canary releases automatically shift traffic in steps (10% → 25% → 50% → 75% → 100%) and roll back if the error rate or P99 latency exceeds the configured thresholds.
```bash
mles monitor \
  --model-name xgboost_price \
  --model-version 3 \
  --interval 300  # check every 5 minutes
```

```bash
mles init --config-path ./config
```

`docker-compose.yml` starts all supporting services locally:
```bash
docker-compose up -d
```

| Service | Port | Purpose |
|---|---|---|
| MLflow | 5000 | Experiment tracking and model registry |
| Sematic | 8001 | Pipeline DAG orchestration |
| MLES Model Server | 8000 | FastAPI model serving |
| Prometheus | 9090 | Metrics scraping |
| Grafana | 3000 | Metrics dashboards |
| AlertManager | 9093 | Alerting |
| Redis | 6379 | Feature store backend and caching |
| PostgreSQL | 5432 | Metadata storage |
| Kafka | 9092 | Streaming data ingestion |
| Zookeeper | 2181 | Kafka coordination |
```
├── simple_mlflow_example.py   # Standalone MLflow demo
├── src/                       # Core platform code
│   ├── cli.py                 # mles CLI entry point
│   ├── config.py              # Configuration dataclasses
│   ├── models/                # BaseModel, ModelRegistry, XGBoostModel, etc.
│   ├── pipelines/             # TrainingPipeline, InferencePipeline (Sematic DAGs)
│   ├── preprocessing/         # DataLoader, DataValidator, FeatureEngineer
│   ├── monitoring/            # ModelMonitor, DriftDetector, MetricsCollector
│   └── deployment/            # DeploymentManager, ModelServer, InferenceService
├── production/                # Kubernetes manifests, Terraform, prod Docker config
│   ├── kubernetes/
│   │   ├── base/              # Base deployment and ConfigMap
│   │   ├── overlays/          # prod and staging Kustomize overlays
│   │   └── monitoring/        # Prometheus and Grafana configs
│   ├── docker/                # Dockerfile.model-server
│   └── scripts/               # deploy.sh, health-check.sh
├── development/               # Local dev configs and scripts
│   ├── config/                # local.yaml, development.yaml
│   ├── docker/                # docker-compose.dev.yml
│   └── scripts/               # setup_dev.sh, run_tests.sh
├── ci-cd/                     # GitHub Actions workflows
├── config/                    # Config templates (example_config.yaml)
├── samples/                   # Example scripts and tutorials
├── scripts/                   # Utility scripts
├── tests/                     # Test suite
├── docs/                      # Documentation
├── docker-compose.yml         # Full local service stack
├── requirements.txt
└── setup.py
```
Copy env.example to .env and set the relevant variables:
```bash
cp env.example .env
```

Key environment variables:
| Variable | Default | Description |
|---|---|---|
| `MLFLOW_TRACKING_URI` | `http://localhost:5000` | MLflow server URL |
| `S3_BUCKET` | `mles-data` | Default data bucket |
| `KAFKA_BROKERS` | `localhost:9092` | Kafka broker addresses |
| `FEATURE_STORE_URI` | `redis://localhost:6379` | Feast online store backend |
| `PROMETHEUS_PUSHGATEWAY` | — | Prometheus Pushgateway URL |
| `CLUSTER_MODE` | `false` | Set `true` to submit to Kubernetes |
For full configuration options see Configuration Guide.
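A configuration loader might resolve these variables with their defaults along the following lines. This is a stdlib sketch, not the platform's `config.py`; only the variable names and defaults come from the table above.

```python
import os

# Defaults mirror the environment-variable table; os.environ overrides them.
DEFAULTS = {
    "MLFLOW_TRACKING_URI": "http://localhost:5000",
    "S3_BUCKET": "mles-data",
    "KAFKA_BROKERS": "localhost:9092",
    "FEATURE_STORE_URI": "redis://localhost:6379",
    "PROMETHEUS_PUSHGATEWAY": None,  # no default; must be set explicitly
    "CLUSTER_MODE": "false",
}

def load_settings(env=os.environ):
    settings = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    # Environment values arrive as strings; normalise the boolean flag.
    settings["CLUSTER_MODE"] = str(settings["CLUSTER_MODE"]).lower() == "true"
    return settings

print(load_settings({})["MLFLOW_TRACKING_URI"])          # http://localhost:5000
print(load_settings({"CLUSTER_MODE": "true"})["CLUSTER_MODE"])  # True
```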
Pipeline behaviour, deployment thresholds, and resource allocation are controlled via YAML:
```yaml
# config/example_config.yaml
monitoring:
  drift_threshold: 0.15
  accuracy_threshold: 0.8
  evaluation_interval_minutes: 15

deployment:
  canary_percentage: 10
  rollback_error_threshold: 0.05
  rollback_latency_p99_ms: 1000

training_resources:
  cpu: "4"
  memory: "8Gi"
  gpu: "1"
```

The platform is built around four layers:
```
CLI (src/cli.py)
└── Pipelines (Sematic DAGs)
    ├── TrainingPipeline — 8 stages: load → validate → split → engineer → validate → train → evaluate → register
    ├── InferencePipeline — 6 stages: load model → load data → engineer → predict → monitor → save
    └── Components
        ├── DataLoader — S3, local, Kafka, HTTP
        ├── DataValidator — Great Expectations, Pandera, built-in quality checks
        ├── FeatureEngineer — scaling, encoding, selection, Feast integration
        ├── ModelRegistry — XGBoost, RandomForest, LogisticRegression
        ├── ModelMonitor — Prometheus metrics, drift detection, SHAP, Fairlearn
        └── DeploymentManager — Blue-green, canary, rolling via Kubernetes API
```
For more detail see Architecture Overview.
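The TrainingPipeline's eight-stage sequence can be approximated as plain functions composed in order. In the real platform each stage would be a Sematic function in a DAG; this stdlib sketch only shows the data flow, and every stage body is a stand-in.

```python
# Stand-in stages for the 8-step TrainingPipeline; only the wiring mirrors
# the DAG above, the bodies are placeholders.
def load(source):             return {"rows": list(range(100)), "source": source}
def validate(data):           assert data["rows"], "empty dataset"; return data
def split(data):              n = len(data["rows"]) * 8 // 10; return data["rows"][:n], data["rows"][n:]
def engineer(train, test):    return [r * 2 for r in train], [r * 2 for r in test]
def train_model(features):    return {"weights": sum(features) / len(features)}
def evaluate(model, test):    return {"score": 0.97}  # placeholder metric
def register(model, metrics): return f"model:v1 score={metrics['score']}"

def training_pipeline(source):
    data = validate(load(source))                      # load -> validate
    train_rows, test_rows = split(data)                # split
    train_feats, test_feats = engineer(train_rows, test_rows)
    validate({"rows": train_feats})                    # second validation pass
    model = train_model(train_feats)                   # train
    metrics = evaluate(model, test_feats)              # evaluate
    return register(model, metrics)                    # register

print(training_pipeline("s3://my-bucket/data.csv"))
```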
When running `mles serve`, the following endpoints are available:
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Model and server health check |
| POST | `/predict` | Single prediction |
| POST | `/batch_predict` | Batch predictions |
| GET | `/metrics` | Latency and throughput metrics |
| POST | `/reload` | Hot-reload model from new path |
Single prediction request:
```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": {"feature_1": 1.5, "feature_2": "category_a"}}'
```

See API Documentation for full schema.
```bash
source mles_env/bin/activate
python -m pytest tests/test_pipeline.py -v
```

Expected result: 15 passed, 2 skipped.
The 2 skipped tests are for optional integrations that require specific environment setup:
| Skipped test | Reason | Fix |
|---|---|---|
| `TestBentoMLServing` | `deepmerge` missing from the active environment | `pip install deepmerge` |
| `TestExplainabilityFairness::test_shap_explainability` | Keras 3 installed without `tf-keras` | `pip install tf-keras` |
Both packages are listed in requirements.txt and will be present in a clean install.
| Issue | Solution |
|---|---|
| `ModuleNotFoundError: No module named 'src'` | Run commands from the project root or install with `pip install -e .` |
| MLflow UI not starting | Ensure the virtual environment is active: `source mles_env/bin/activate` |
| Missing dependencies | Run `pip install -r requirements.txt` |
| Port conflict on 5000 | Use `mlflow ui --port 5001` |
| Kubernetes config not found | Ensure `~/.kube/config` exists or run inside a cluster pod |
| W&B init failure | Set `WANDB_MODE=disabled` to skip W&B without affecting other tracking |
For more see Troubleshooting Guide.
- Architecture Overview
- Component Reference
- Data Flow
- Installation Guide
- Configuration Guide
- Deployment Guide
- Troubleshooting
- API Reference
- CLI Reference
- Model Reference
- Best Practices
- Use Cases
- Full Deployment Guide
MIT License — see the LICENSE file for details.