This repository contains the Kubernetes custom controller for managing long-running inferences on KServe. It decouples request handling from execution by leveraging job-based orchestration.
The KServe Inference Controller is designed to handle complex, long-running inference tasks that would typically time out or block a standard controller loop. Instead of making direct HTTP calls to KServe, the operator orchestrates Kubernetes Jobs to act as "Runners." These runners handle the entire lifecycle of an inference: data retrieval from a source, the inference call itself, and the persistence of results to a storage backend.
This approach ensures:
- Resilience: If a node fails, the Job is automatically rescheduled.
- Scalability: Each inference run has its own dedicated resources.
- Extensibility: Support for new data sources or storage backends only requires updating the runner image, not the core controller.
The system revolves around two primary Custom Resources (CRs) and a specialized Runner image:
- InferenceConfig: A static definition of a model's deployment. It specifies the KServe endpoint, the runner image to use, and the storage backend configuration (Input/Output).
- InferenceRun: A single execution trigger. It references a config and provides specific execution parameters (e.g., table names, query filters).
- Runner Job: A containerized process spawned by the controller that executes the business logic defined in the "Contract."
- The user creates an InferenceConfig for a specific model (e.g., sklearn-iris).
- The user triggers an InferenceRun referencing that config.
- The controller generates a Contract (JSON) and spawns a Kubernetes Job (see the sketch after this list).
- The Runner reads the contract, fetches data, calls KServe, and saves the output.
- The controller monitors the Job and updates the InferenceRun status.
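For orientation, the contract-and-Job step might look roughly like the client-go sketch below. This is a minimal illustration, not the controller's actual code (which lives in `internal`); the ConfigMap name, Job name, and the `/contract` mount path are assumptions for the example:

```go
package main

import (
	"context"
	"log"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()
	ns := "kserve-controller-system"

	// 1. Store the generated contract in a ConfigMap (names are illustrative).
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: "inf-example-contract", Namespace: ns},
		Data:       map[string]string{"contract.json": `{"jobId":"..."}`},
	}
	if _, err := client.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}

	// 2. Spawn a Job that mounts the contract and runs the runner image.
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "inf-example-run", Namespace: ns},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "runner",
						Image: "ghcr.io/krateoplatformops/kserve-krateo-runner:0.1.0",
						VolumeMounts: []corev1.VolumeMount{{
							Name: "contract", MountPath: "/contract", // assumed path
						}},
					}},
					Volumes: []corev1.Volume{{
						Name: "contract",
						VolumeSource: corev1.VolumeSource{
							ConfigMap: &corev1.ConfigMapVolumeSource{
								LocalObjectReference: corev1.LocalObjectReference{Name: cm.Name},
							},
						},
					}},
				},
			},
		},
	}
	if _, err := client.BatchV1().Jobs(ns).Create(ctx, job, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
}
```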
The communication between the controller and the execution Job is governed by a Contract. This contract is passed to the runner as a JSON file, providing a standardized way to handle diverse storage backends. The runner is intended to be as lightweight as possible: it only retrieves and stores data. No computation should be performed by the runner; data transformations should be handled by a notebook in the finops-database-handler.
Example contract passed by the controller to the Runner through a ConfigMap:

```json
{
"jobId":"5be07ada-5fe0-4c9e-a8bc-aaae67d8d344",
"jobName":"inf-example-inference-run-triton-5be07ada",
"kserve":{
"modelName":"sklearn-iris",
"modelUrl":"kserve-krateo-ttm-predictor.kserve-test.svc.cluster.local/v2/models/granite-timeseries-ttm-r2/infer",
"modelVersion":"v2",
"modelInputName":"past_values"
},
"input":{
"krateo":{
"api":{
"endpointRef":{
"name":"finops-database-handler-endpoint",
"namespace":"kserve-controller-system"
},
"headers":[
"Accept: application/json",
"Content-Type: application/json"
],
"path":"/compute/tritoninput",
"verb":"POST"
}
}
},
"output":{
"krateo":{
"api":{
"endpointRef":{
"name":"finops-database-handler-endpoint",
"namespace":"kserve-controller-system"
},
"headers":[
"Accept: application/json",
"Content-Type: application/json"
],
"path":"/compute/tritonoutput",
"verb":"POST"
}
}
},
"parameters":{
"input_data_length":"512",
"input_table_column_name":"average",
"input_table_name":"azuretoolkit",
"key_name":"resourceid",
"key_value":"sample_vm",
"output_table_name":"kserve_controller_output_triton"
}
}
```

To see how this specific contract is used, check runners/krateo/main.go and the counterpart notebooks in charts/chart/templates/notebook-triton.yaml. Note that the runner is injected with the environment variable `pod_uid`, which can be useful when storing data for scheduled inference runs.
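For illustration, the skeleton of a runner that consumes this contract could look like the sketch below. It assumes the contract is mounted at `/contract/contract.json` (a hypothetical path; see runners/krateo/main.go for the real entrypoint) and omits the actual HTTP calls:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// Contract mirrors the JSON document shown above. The input/output sections
// are kept as raw JSON so any provider-specific layout passes through unmodified.
type Contract struct {
	JobID      string            `json:"jobId"`
	JobName    string            `json:"jobName"`
	KServe     KServeSpec        `json:"kserve"`
	Input      json.RawMessage   `json:"input"`
	Output     json.RawMessage   `json:"output"`
	Parameters map[string]string `json:"parameters"`
}

type KServeSpec struct {
	ModelName      string `json:"modelName"`
	ModelURL       string `json:"modelUrl"`
	ModelVersion   string `json:"modelVersion"`
	ModelInputName string `json:"modelInputName"`
}

func main() {
	// Hypothetical mount path; the controller defines the real one.
	data, err := os.ReadFile("/contract/contract.json")
	if err != nil {
		log.Fatalf("reading contract: %v", err)
	}

	var c Contract
	if err := json.Unmarshal(data, &c); err != nil {
		log.Fatalf("parsing contract: %v", err)
	}

	// pod_uid is injected by the controller; useful for tagging stored rows.
	podUID := os.Getenv("pod_uid")
	fmt.Printf("job %s (%s) -> model %s at %s, pod %s\n",
		c.JobID, c.JobName, c.KServe.ModelName, c.KServe.ModelURL, podUID)

	// A real runner would now: fetch input data via c.Input, call the KServe
	// endpoint, and persist results via c.Output. No transformation happens here.
}
```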
The `storage.input` and `storage.output` keys in the CRD have no schema. This allows the InferenceConfig to support any storage provider (e.g., Krateo FinOps, S3, GCS, etc.) without changing the controller. The runner receives the contract with the data unmodified. Therefore, by providing a specialized runner image, you can implement custom logic to parse these raw configurations and interact with any proprietary or cloud-native data store.
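A hedged sketch of how a custom runner might dispatch on the schemaless storage section: the `krateo` layout below mirrors the example contract, while the `s3` branch is purely hypothetical, standing in for whatever provider a custom runner adds:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// KrateoAPI matches the input/output layout in the example contract.
type KrateoAPI struct {
	API struct {
		EndpointRef struct {
			Name      string `json:"name"`
			Namespace string `json:"namespace"`
		} `json:"endpointRef"`
		Headers []string `json:"headers"`
		Path    string   `json:"path"`
		Verb    string   `json:"verb"`
	} `json:"api"`
}

// resolveStorage probes the raw storage section and dispatches on whichever
// provider key is present; unknown providers are an error.
func resolveStorage(raw json.RawMessage) (string, error) {
	var probe map[string]json.RawMessage
	if err := json.Unmarshal(raw, &probe); err != nil {
		return "", err
	}
	switch {
	case probe["krateo"] != nil:
		var k KrateoAPI
		if err := json.Unmarshal(probe["krateo"], &k); err != nil {
			return "", err
		}
		return fmt.Sprintf("krateo endpoint %s, path %s", k.API.EndpointRef.Name, k.API.Path), nil
	case probe["s3"] != nil: // hypothetical provider a custom runner might support
		return "s3 bucket (custom runner logic goes here)", nil
	default:
		return "", fmt.Errorf("no supported storage provider in contract")
	}
}

func main() {
	raw := json.RawMessage(`{"krateo":{"api":{"endpointRef":{"name":"finops-database-handler-endpoint","namespace":"kserve-controller-system"},"path":"/compute/tritoninput","verb":"POST"}}}`)
	target, err := resolveStorage(raw)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(target)
}
```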
Defines the "How" and "Where" of the model and storage.
apiVersion: ai.krateo.io/v1
kind: InferenceConfig
metadata:
name: iris-inference-config
namespace: kserve-controller-system
spec:
autoDeletePolicy: None # None | DeleteOnCompletion | DeleteOnSuccess
image: ghcr.io/krateoplatformops/kserve-krateo-runner:0.1.0
modelName: sklearn-iris
modelUrl: sklearn-iris-predictor.kserve-test.svc.cluster.local/v2/models/sklearn-iris/infer
modelVersion: v2
storage:
input:
krateo:
api:
endpointRef:
name: finops-database-handler-endpoint
namespace: kserve-controller-system
path: /compute/kserveinput
verb: POST
headers:
- 'Accept: application/json'
- 'Content-Type: application/json'
output:
krateo:
api:
endpointRef:
name: finops-database-handler-endpoint
namespace: kserve-controller-system
path: /compute/kserveoutput
verb: POST
headers:
- 'Accept: application/json'
- 'Content-Type: application/json'Defines the "When" and "What" of a specific execution.
```yaml
apiVersion: ai.krateo.io/v1
kind: InferenceRun
metadata:
  name: iris-run-january
  namespace: kserve-controller-system
spec:
  configRef:
    name: iris-inference-config
    namespace: kserve-controller-system
  timeoutSeconds: 3600
  schedule: "* * * * *"
  parameters:
    input_table_name: kserve_controller_input
    output_table_name: kserve_controller_output
```
If the `schedule` field is populated, the controller creates a CronJob instead of a Job. The schedule string is passed as-is to the CronJob, and the `autoDeletePolicy` does not apply to CronJobs. `parameters` are passed as-is into the JSON contract.
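A sketch of how that branch could look inside a reconciler. The types and names here are illustrative stand-ins for the real ones in `api` and `internal`:

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Minimal stand-ins for the real CRD types in the api package.
type InferenceRunSpec struct {
	Schedule       string
	TimeoutSeconds int64
}

type InferenceRun struct {
	metav1.ObjectMeta
	Spec InferenceRunSpec
}

// reconcileWorkload sketches the Job-vs-CronJob branch: if the run carries a
// schedule, a CronJob wraps the same pod template (schedule passed through
// verbatim, autoDeletePolicy not applied); otherwise a one-shot Job is built.
func reconcileWorkload(run *InferenceRun, podTemplate corev1.PodTemplateSpec) client.Object {
	if run.Spec.Schedule != "" {
		return &batchv1.CronJob{
			ObjectMeta: metav1.ObjectMeta{Name: run.Name, Namespace: run.Namespace},
			Spec: batchv1.CronJobSpec{
				Schedule: run.Spec.Schedule, // passed as-is to the CronJob
				JobTemplate: batchv1.JobTemplateSpec{
					Spec: batchv1.JobSpec{Template: podTemplate},
				},
			},
		}
	}
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: run.Name, Namespace: run.Namespace},
		Spec:       batchv1.JobSpec{Template: podTemplate},
	}
}

func main() {
	run := &InferenceRun{Spec: InferenceRunSpec{Schedule: "* * * * *"}}
	obj := reconcileWorkload(run, corev1.PodTemplateSpec{})
	fmt.Printf("%T\n", obj) // *v1.CronJob, because a schedule is set
}
```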
The controller can be configured via environment variables to tune its reconciliation behavior:
- `POLLING_INTERVAL`: duration (e.g., `1m`) between status checks.
- `MAX_RECONCILE_RATE`: number of concurrent workers.
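A minimal sketch of how such variables are typically parsed, assuming Go-style duration strings; the fallback defaults shown here are illustrative, not the controller's documented defaults:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// envDuration reads a Go-style duration (e.g. "1m") with a fallback default.
func envDuration(key string, def time.Duration) time.Duration {
	if v := os.Getenv(key); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}

// envInt reads an integer with a fallback default.
func envInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

func main() {
	pollingInterval := envDuration("POLLING_INTERVAL", time.Minute) // illustrative default
	maxReconcileRate := envInt("MAX_RECONCILE_RATE", 1)             // illustrative default
	fmt.Println(pollingInterval, maxReconcileRate)
}
```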
The operator can be installed through its Helm chart:
```sh
helm repo add krateo https://charts.krateo.io
helm repo update krateo
helm template kserve-controller krateo/kserve-controller | kubectl apply -f -
```

The repository contains several objects needed by the kserve-controller:
- the controller code in `internal` and `api`
- the Helm chart for the controller, including the CRDs, in `chart`
- the model for TTM adapted for the Triton KServe engine in `models`
- the Krateo runners for sklearn-iris and triton-ttm, targeting the finops-database-handler storage, in `runners/krateo-iris` and `runners/krateo-ttm`
- example CRs in `testdata`
- e2e testing code in `test`