Welcome to the Llama Stack Demo for Red Hat OpenShift AI (RHOAI). This comprehensive demo deploys everything you need to test AI agents with MCP tools on RHOAI using Llama Stack:
- Models: Llama 3.1 8B (quantized w4a16) deployed using vLLM Serving Runtime and automatically registered in Llama Stack
- MCP Servers: Rust-based MCP servers deployed as standard deployments and automatically configured in Llama Stack
- Llama Stack Server: Deployed using the Llama Stack Operator (included in RHOAI) via the LlamaStackDistribution Custom Resource
- Vector Databases: Support for Milvus (inline and remote) and PostgreSQL with pgvector
- RAG Pipelines: Automated document ingestion using Kubeflow Pipelines (DSPA)
- Overview
- Architecture
- Components
- Requirements
- Installation
- Configuration
- Usage
- Business Rules
- Uninstall
- Monitoring
The Eligibility Assessment System is powered by Llama Stack and Model Context Protocol (MCP). It helps assess eligibility for Family Care Unpaid Leave Support based on the Republic of Lysmark's Act No. 2025/47-SA.
The system combines Llama Stack with MCP servers and Retrieval Augmented Generation (RAG) to provide accurate, context-aware assessments for:
- Care for sick/injured family members
- Childcare for multiple children
- Adoption cases
- Single-parent family scenarios
The Helm chart deploys:
- LlamaStackDistribution CR: Processed by the Llama Stack Operator to create a fully configured Llama Stack server with all providers
- Streamlit Application: Interactive UI for eligibility consultations ({app}-app)
- FastAPI Server: REST API for programmatic access ({app}-api)
- MCP Servers: Eligibility Engine, Compatibility Engine, Cluster Insights, Finance Engine
- Vector Databases: Milvus (with Attu UI) and/or PostgreSQL (with CloudBeaver UI)
- Models: Llama 3.1 8B via KServe InferenceService (using vLLM runtime)
- Kubeflow Pipelines: DSPA for automated document ingestion (optional)
- Llama Stack Documentation
- Model Context Protocol (MCP)
- Eligibility Engine MCP Server
- Compatibility Engine MCP Server
- Cluster Insights MCP Server
- Finance Engine MCP Server
The system supports multiple vector database configurations for RAG capabilities:
Llama Stack's built-in Milvus provider stores vectors locally using a file-based database (/tmp/milvus.db). This is always available as the milvus provider.
- Use case: Development, testing, single-instance deployments
- Provider ID: milvus
When milvus.deploy: true, the chart deploys a standalone Milvus instance:
| Component | Description |
|---|---|
| etcd | Distributed key-value store for Milvus metadata |
| milvus-standalone | Milvus vector database server |
| Attu | Web-based Milvus management UI |
When milvus.enableRemote: true, the remote::milvus provider (milvus-remote) is configured in Llama Stack.
- Use case: Production deployments, persistent vector storage
- Provider ID: milvus-remote
When postgres.deploy: true, the chart deploys PostgreSQL with pgvector:
| Component | Description |
|---|---|
| PostgreSQL | Database with pgvector extension (image: pgvector/pgvector:pg16) |
| CloudBeaver | Web-based database management UI |
When postgres.enablePgVector: true, the remote::pgvector provider is configured in Llama Stack.
- Use case: Teams using PostgreSQL, simpler operational model
- Provider ID: pgvector
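Whichever provider is enabled, a vector store is registered with Llama Stack against that provider ID. The sketch below shows the shape of such a registration body; the field names follow Llama Stack's vector-DB registration API as commonly documented, but verify them against your server version, and the store name is a placeholder.

```python
# Hypothetical registration body for a vector DB bound to one of the
# provider IDs above; verify field names against your Llama Stack version.
def vector_db_registration(provider_id: str, name: str) -> dict:
    return {
        "vector_db_id": name,
        "provider_id": provider_id,  # "milvus", "milvus-remote", or "pgvector"
        "embedding_model": "nomic-ai/nomic-embed-text-v1.5",
        "embedding_dimension": 768,  # must match the embedding model's output size
    }

body = vector_db_registration("pgvector", "eligibility-docs")
print(body["provider_id"])
```

Switching providers is then just a matter of changing the provider ID; the rest of the registration stays the same.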
The default configuration includes four MCP servers:
| Server | Description | Repository |
|---|---|---|
| eligibility-engine | Evaluates unpaid leave eligibility | eligibility-engine-mcp-rs |
| compatibility-engine | Checks compatibility requirements | compatibility-engine-mcp-rs |
| cluster-insights | Provides OpenShift cluster information | cluster-insights-mcp-rs |
| finance-engine | Financial calculations and data | finance-engine-mcp-rs |
All MCP servers use the streamable-http transport protocol.
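Over streamable-http, MCP clients and servers exchange JSON-RPC 2.0 messages. A minimal sketch of a tools/call request body follows; the tool name and arguments are illustrative, not the eligibility-engine's actual schema.

```python
import json

def mcp_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 tools/call request, as used by MCP clients."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# Illustrative call; the tool name and parameters are assumptions.
payload = mcp_tool_call(1, "assess_eligibility", {"kinship_degree": 1})
print(payload)
```

Llama Stack performs these calls on the agent's behalf once a server is registered as a tool group.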
The system uses Sentence Transformers with the nomic-ai/nomic-embed-text-v1.5 model:
- Vector Dimension: 768
- Chunk Size: 800 tokens
- Chunk Overlap: 400 tokens
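A chunk size of 800 with an overlap of 400 implies a sliding window that advances 400 tokens at a time. A minimal sketch (tokens are approximated here by list items rather than the embedding model's real tokenizer):

```python
def chunk(tokens: list, size: int = 800, overlap: int = 400) -> list:
    """Split a token list into windows of `size` tokens, each window
    starting `size - overlap` tokens after the previous one."""
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = [f"t{i}" for i in range(1000)]
windows = chunk(tokens)
print(len(windows))  # 2 windows: tokens 0-799 and 400-999
```

The 50% overlap means every token (except at the document edges) appears in two chunks, which improves recall at the cost of roughly doubling the stored vectors.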
When pipelines.enabled: true, the chart deploys:
- DataSciencePipelinesApplication (DSPA): Pipeline orchestration platform
- Pipeline Upserter Hook: Imports RAG pipeline definitions from Git
- Pipeline Runner Hook: Creates pipeline runs for each vector store provider
Recommended:
- GPU: 1x NVIDIA A10G or equivalent
- CPU: 8 cores
- Memory: 24 Gi
- Storage: 20 Gi
Minimum:
- GPU: 1x NVIDIA GPU (shared allocation supported)
- CPU: 6 cores
- Memory: 16 Gi
- Storage: 10 Gi
The default configuration uses the quantized Llama 3.1 8B model (w4a16) for efficient GPU utilization.
- Red Hat OpenShift: 4.20+ (tested on 4.20)
- Red Hat OpenShift AI: 3.2+ (tested on 3.2)
- Includes the Llama Stack Operator, which manages LlamaStackDistribution Custom Resources
- OpenShift Service Mesh: Required for KServe
- OpenShift Serverless: Required for KServe
The Llama Stack Operator is a component of Red Hat OpenShift AI that simplifies the deployment and management of Llama Stack distributions. It introduces the LlamaStackDistribution Custom Resource (CR) which allows you to declaratively configure:
- Llama Stack server configuration
- Model providers and endpoints
- Vector database providers (Milvus, pgvector)
- MCP server integrations
- Telemetry and monitoring settings
The Helm chart creates a LlamaStackDistribution CR that the operator reconciles into a fully configured Llama Stack deployment.
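For orientation, a minimal LlamaStackDistribution CR might look like the fragment below. This is an illustrative shape only: the API version, distribution name, and env vars are assumptions, so check the operator's CRD on your cluster (for example with oc explain llamastackdistribution.spec) for the exact schema.

```yaml
# Illustrative only -- verify field names against the installed CRD.
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: llama-stack-demo
spec:
  replicas: 1
  server:
    distribution:
      name: rh-dev            # placeholder distribution name
    containerSpec:
      port: 8321
      env:
        - name: VLLM_URL      # placeholder: points at the vLLM InferenceService
          value: "https://llama-3-1-8b-w4a16-predictor:8443/v1"
```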
If using Kubeflow Pipelines (pipelines.enabled: true), Minio must be installed in the minio namespace.
Verify Minio is available:
```shell
oc get svc minio -n minio
```

Expected output:

```
NAME    TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
minio   ClusterIP   172.30.x.x   <none>        9000/TCP   1d
```
Default Minio Configuration:
- Service: minio.minio.svc
- Port: 9000
- Access Key: minio
- Secret Key: minio123
- Bucket: pipelines
At least one GPU-enabled worker node must have this label:
```yaml
group: llama-stack-demo
```

Clone the repository and log in to the cluster:

```shell
git clone https://github.com/alpha-hack-program/llama-stack-demo.git
cd llama-stack-demo
oc login --server=<your-cluster-api> --token=<your-token>
```

For simplicity, we are assigning nodes based on the user running these instructions. You would usually extract the user with oc whoami.

```shell
LAB_USER=<YOUR_USER>
```

You can use this script to assign a user to a node. Optional: set the instance type to assign with export INSTANCE_TYPE="g5.2xlarge".

```shell
./scripts/assign-node-to-user.sh ${LAB_USER}
```

Create the new project and label it for OpenShift AI:

```shell
PROJECT="llama-stack-demo-${LAB_USER}"
oc new-project ${PROJECT}
oc label namespace ${PROJECT} modelmesh-enabled=false opendatahub.io/dashboard=true
```

Default deployment (NVIDIA GPU with Llama 3.1 8B):

```shell
helm install llama-stack-demo helm/ --set assigned="${PROJECT}" --namespace ${PROJECT} --timeout 20m
```

With secrets file (for remote models with API keys):

```shell
helm install llama-stack-demo helm/ -f helm/values-secrets.yaml --set assigned="${PROJECT}" --namespace ${PROJECT} --timeout 20m
```

Watch the rollout:

```shell
oc -n ${PROJECT} get pods -w
```

Expected pods (5-10 minutes to start):
```
NAME                                 READY   STATUS    RESTARTS   AGE
llama-stack-demo-0                   1/1     Running   0          8m
llama-stack-demo-app-xxxxx           1/1     Running   0          8m
llama-stack-demo-api-xxxxx           1/1     Running   0          8m
eligibility-engine-xxxxx             1/1     Running   0          7m
compatibility-engine-xxxxx           1/1     Running   0          7m
cluster-insights-xxxxx               1/1     Running   0          7m
finance-engine-xxxxx                 1/1     Running   0          7m
milvus-standalone-xxxxx              1/1     Running   0          6m
etcd-deployment-xxxxx                1/1     Running   0          6m
attu-xxxxx                           1/1     Running   0          6m
pg-lsd-xxxxx                         1/1     Running   0          6m
cloudbeaver-xxxxx                    1/1     Running   0          6m
llama-3-1-8b-w4a16-predictor-xxxxx   2/2     Running   0          10m
```
| Parameter | Description | Default |
|---|---|---|
| milvus.deploy | Deploy Milvus infrastructure | true |
| milvus.enableRemote | Enable milvus-remote provider | true |
| milvus.host | Milvus service hostname | milvus-service |
| milvus.port | Milvus gRPC port | 19530 |
| milvus.storage | PVC storage size | 20Gi |
| milvus.image | Milvus image | milvusdb/milvus:v2.6.0 |
Examples:
```shell
# Deploy Milvus with remote provider
helm install llama-stack-demo helm/ \
  --set milvus.deploy=true \
  --set milvus.enableRemote=true \
  --namespace ${PROJECT}

# Use only inline Milvus (no deployment)
helm install llama-stack-demo helm/ \
  --set milvus.deploy=false \
  --set milvus.enableRemote=false \
  --namespace ${PROJECT}

# Connect to external Milvus
helm install llama-stack-demo helm/ \
  --set milvus.deploy=false \
  --set milvus.enableRemote=true \
  --set milvus.host=external-milvus.example.com \
  --set milvus.token=your-token \
  --namespace ${PROJECT}
```

| Parameter | Description | Default |
|---|---|---|
| postgres.deploy | Deploy PostgreSQL with pgvector | true |
| postgres.enablePgVector | Enable pgvector provider | true |
| postgres.host | PostgreSQL hostname | pg-lsd-service |
| postgres.port | PostgreSQL port | 5432 |
| postgres.user | Database username | llamastack |
| postgres.password | Database password | llamastack |
| postgres.db | Database name | llamastack |
Examples:
```shell
# Deploy PostgreSQL with pgvector
helm install llama-stack-demo helm/ \
  --set postgres.deploy=true \
  --set postgres.enablePgVector=true \
  --namespace ${PROJECT}

# Disable PostgreSQL entirely
helm install llama-stack-demo helm/ \
  --set postgres.deploy=false \
  --set postgres.enablePgVector=false \
  --namespace ${PROJECT}

# Connect to external PostgreSQL
helm install llama-stack-demo helm/ \
  --set postgres.deploy=false \
  --set postgres.enablePgVector=true \
  --set postgres.host=external-postgres.example.com \
  --set postgres.user=myuser \
  --set postgres.password=mypassword \
  --namespace ${PROJECT}
```

| Parameter | Description | Default |
|---|---|---|
| pipelines.enabled | Enable DSPA deployment | true |
| pipelines.connection.host | Minio host | minio.minio.svc |
| pipelines.connection.port | Minio port | 9000 |
| pipelines.connection.awsS3Bucket | S3 bucket | pipelines |
| pipelines.runner.vectorStoreProviderIds | Vector stores to populate | milvus, milvus-remote, pgvector |
| pipelines.healthCheck.maxRetries | Max health check attempts | 30 |
| pipelines.healthCheck.delay | Delay between retries (seconds) | 10 |
Examples:
```shell
# Enable pipelines for all vector stores
helm install llama-stack-demo helm/ \
  --set pipelines.enabled=true \
  --set pipelines.runner.vectorStoreProviderIds="milvus,milvus-remote,pgvector" \
  --namespace ${PROJECT}

# Disable pipelines
helm install llama-stack-demo helm/ \
  --set pipelines.enabled=false \
  --namespace ${PROJECT}

# Run pipelines only for pgvector
helm install llama-stack-demo helm/ \
  --set pipelines.enabled=true \
  --set pipelines.runner.vectorStoreProviderIds="pgvector" \
  --namespace ${PROJECT}
```

When pipelines.enabled: true, these Helm hooks execute during post-install and post-upgrade:
- Pipeline Upserter Hook (weight: 4)
  - Clones pipeline definitions from the Git repository
  - Compiles and uploads pipelines to DSPA
  - Uses the uv package manager for dependencies
- Pipeline Runner Hook (weight: 5)
  - Creates a pipeline run for each vector store provider in vectorStoreProviderIds
  - Ingests documents from the configured Git repository
  - Configurable parameters: gitRepo, gitContext, filenames, vectorStoreName, embeddingModel, chunkSizeInTokens, chunkOverlapInTokens
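The healthCheck settings (maxRetries: 30, delay: 10) describe a simple poll-until-ready loop before the hooks run. A sketch with an injected probe so the retry logic is visible; the probe function is a stand-in for the actual HTTP readiness call:

```python
import time
from typing import Callable

def wait_until_ready(probe: Callable[[], bool],
                     max_retries: int = 30, delay: float = 10.0) -> bool:
    """Poll `probe` up to `max_retries` times, sleeping `delay` seconds
    between attempts; return True as soon as it reports ready."""
    for attempt in range(1, max_retries + 1):
        if probe():
            return True
        if attempt < max_retries:
            time.sleep(delay)
    return False

# Stand-in probe that becomes ready on the third attempt.
state = {"calls": 0}
def fake_probe() -> bool:
    state["calls"] += 1
    return state["calls"] >= 3

print(wait_until_ready(fake_probe, max_retries=5, delay=0))  # True
```

With the chart defaults, the hook therefore waits up to roughly 300 seconds (30 attempts, 10 seconds apart) before giving up.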
```shell
oc get route ${PROJECT}-app -n ${PROJECT} -o jsonpath='{.spec.host}'
```

Open the URL in your browser for the interactive eligibility assessment interface.

```shell
oc get route ${PROJECT}-api -n ${PROJECT} -o jsonpath='{.spec.host}'
```

Use this endpoint for programmatic access.

```shell
oc get route ${PROJECT}-route -n ${PROJECT} -o jsonpath='{.spec.host}'
```

Direct access to the Llama Stack API.

```shell
oc get route rhods-dashboard -n redhat-ods-applications -o jsonpath='{.spec.host}'
```

Navigate to Data Science Projects → llama-stack-demo to see the deployed models and workbenches.
- Attu (Milvus): oc get route attu -n ${PROJECT} -o jsonpath='{.spec.host}'
- CloudBeaver (PostgreSQL): oc get route cloudbeaver -n ${PROJECT} -o jsonpath='{.spec.host}'
- "My mother had an accident and she's at the hospital. I have to take care of her, can I get access to the unpaid leave aid?"
- "I have just adopted two children, at the same time, aged 3 and 5, am I eligible for the unpaid leave aid? How much?"
- "I'm a single mom and I just had a baby, may I get access to the unpaid leave aid?"
- "Enumerate the legal requirements to get the aid for unpaid leave."
You are a helpful AI assistant that uses tools to help citizens of the Republic of Lysmark. Answers should be concise and human readable. AVOID references to tools or function calling nor show any JSON. Infer parameters for function calls or instead use default values or request the needed information from the user. Call the RAG tool first if unsure. Parameter single_parent_family only is necessary if birth/adoption/foster_care otherwise use false.
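This system prompt is wired into the agent together with the MCP tool groups. The sketch below shows one plausible shape of that configuration; the model ID, tool-group names, and key names loosely follow the Llama Stack agents API and should be treated as assumptions to check against your client library version.

```python
# Illustrative agent configuration; key names and IDs are assumptions.
SYSTEM_PROMPT = (
    "You are a helpful AI assistant that uses tools to help citizens of the "
    "Republic of Lysmark. Answers should be concise and human readable. ..."
    # (full prompt as shown above)
)

agent_config = {
    "model": "llama-3-1-8b-w4a16",        # assumed registered model ID
    "instructions": SYSTEM_PROMPT,
    "toolgroups": [
        "mcp::eligibility-engine",
        "mcp::compatibility-engine",
        "builtin::rag",                   # per the prompt, called first when unsure
    ],
    "sampling_params": {"strategy": {"type": "greedy"}},
}
print(sorted(agent_config))
```

Keeping the prompt and tool-group list together in one config object makes it easy to vary the model or tools per deployment without touching the prompt text.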
| Case | Situation | Monthly Benefit | Description |
|---|---|---|---|
| A | Illness/accident | 725€ | First-degree family care for sick or accident victim |
| B | Third child or more | 500€ | Birth with 3+ children (at least 2 under 6 years) |
| C | Adoption/foster care | 500€ | Adoption or foster care (>1 year duration) |
| D | Multiple birth/adoption | 500€ | Multiple delivery, adoption, or foster care |
| E | Single-parent family | 500€ | Single-parent family with newborn |
| NONE | Not eligible | 0€ | Requirements not met |
- Must have first-degree family relationship (father, mother, son, daughter, spouse, partner)
- Situation must match one of the covered cases
- Documentation requirements vary by case
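The benefit table above is effectively a decision table keyed by case code. A minimal lookup sketch follows; the mapping comes straight from the table, but resolving a citizen's situation to a case code is the eligibility-engine's job and is not reproduced here.

```python
# Monthly benefits in euros, from the table above (Act No. 2025/47-SA).
BENEFITS = {"A": 725, "B": 500, "C": 500, "D": 500, "E": 500, "NONE": 0}

def monthly_benefit(case: str) -> int:
    """Return the monthly benefit for a resolved case code; unknown codes pay 0."""
    return BENEFITS.get(case.upper(), 0)

print(monthly_benefit("A"))     # 725
print(monthly_benefit("none"))  # 0
```

Encoding the rules as a flat table like this (rather than free-form legal text) is what makes them cheap to test and easy for a tool-calling model to apply, which matches the decision-table approach described in the development notes below.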
```shell
helm uninstall llama-stack-demo --namespace ${PROJECT}
oc delete jobs -l "app.kubernetes.io/part-of=llama-stack-demo" -n ${PROJECT}
oc delete project ${PROJECT}
```

The system supports OpenTelemetry tracing when otelCollector.enabled: true. Configure your DSCInitialization:
```yaml
apiVersion: dscinitialization.opendatahub.io/v2
kind: DSCInitialization
metadata:
  name: default-dsci
spec:
  monitoring:
    managementState: Managed
    metrics:
      replicas: 1
      storage:
        retention: 90d
        size: 50Gi
    namespace: redhat-ods-monitoring
    traces:
      sampleRatio: '1.0'
      storage:
        backend: pv
        retention: 2160h0m0s
        size: 100Gi
```

After helm upgrade and before relying on Grafana/traces, run once:
```shell
./scripts/setup-monitoring.sh
```

This applies general resources (namespace, Tempo, OpenTelemetry collector) and patches the DSCInitialization so the OpenShift AI operator deploys Prometheus (data-science-monitoringstack-prometheus) and data-science-instrumentation. Without the DSCI patch, the Grafana Prometheus datasource and instrumentation.opentelemetry.io/inject-python (e.g. in the playground) have no backing services. Manifests: scripts/resources/. Use --dry-run to preview.
To verify that monitoring and telemetry are ready, run:
```shell
./scripts/check-monitoring-telemetry.sh
```

Use --lenient to only verify CRs and services (skipping the pod-running and Instrumentation checks; useful right after setup). Use --skip-prometheus or --skip-instrumentation if those components are not required.
If the collector service is missing, create it:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: data-science-collector
  namespace: redhat-ods-monitoring
  labels:
    app.kubernetes.io/component: opentelemetry-collector
    app.kubernetes.io/instance: redhat-ods-monitoring.data-science-collector
    app.kubernetes.io/part-of: opentelemetry
spec:
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
      protocol: TCP
      appProtocol: grpc
    - name: otlp-http
      port: 4318
      targetPort: 4318
      protocol: TCP
      appProtocol: http
    - name: prometheus
      port: 8889
      targetPort: 8889
      protocol: TCP
  selector:
    app.kubernetes.io/component: opentelemetry-collector
    app.kubernetes.io/instance: redhat-ods-monitoring.data-science-collector
    app.kubernetes.io/managed-by: opentelemetry-operator
    app.kubernetes.io/part-of: opentelemetry
  type: ClusterIP
```

The inner-to-outer development loop:
- Decision table design
- Code implementation
- Unit testing
- MCP Inspector validation
- Claude Desktop testing
- Llama Stack Local with RHOAI models
- Llama Stack on RHOAI
Key learnings:
- Started with TypeScript MCP server, switched to Rust for better performance
- Simplified logic from full legal document coverage to decision table approach
- Variable naming (e.g., is_single_parent_family, number_of_children_after) significantly impacts SLM performance
