Inference Model

Runtime includes an inference layer so higher platform services can run model workloads in production without rebuilding the serving substrate themselves.

Main Inference Components

Component	Main role
KubeAI	orchestrates and manages inference-serving workloads on Kubernetes
vLLM	serves large language model workloads efficiently
FasterWhisper	supports speech-to-text inference workloads

Execution Pattern

The Runtime inference model separates operational concerns from product concerns:

Runtime provides the serving and orchestration substrate
higher layers decide which models, services, or user experiences to expose

This means a CoreAI capability may be the user-facing service, while Runtime is still the layer that performs the actual execution work beneath it.

Why This Matters

By keeping inference in Runtime:

model execution becomes part of the platform foundation
scaling and operational controls can be handled consistently
higher layers can focus on product logic rather than rebuilding serving infrastructure

This is one of the clearest examples of Runtime acting as the execution base for the rest of the platform.

Inference Model

Main Inference Components

Execution Pattern

Why This Matters

On this page