Inference Model
How Runtime provides the execution layer for AI inference workloads.
Runtime includes an inference layer so higher platform services can run model workloads in production without rebuilding the serving substrate themselves.
Main Inference Components
| Component | Main role |
|---|---|
| KubeAI | orchestrates and manages inference-serving workloads on Kubernetes |
| vLLM | serves large language model workloads efficiently |
| FasterWhisper | supports speech-to-text inference workloads |
Execution Pattern
The Runtime inference model separates operational concerns from product concerns:
- Runtime provides the serving and orchestration substrate
- higher layers decide which models, services, or user experiences to expose
This means a CoreAI capability may be the user-facing service, while Runtime is still the layer that performs the actual execution work beneath it.
Why This Matters
By keeping inference in Runtime:
- model execution becomes part of the platform foundation
- scaling and operational controls can be handled consistently
- higher layers can focus on product logic rather than rebuilding serving infrastructure
This is one of the clearest examples of Runtime acting as the execution base for the rest of the platform.