Inference Model

How Runtime provides the execution layer for AI inference workloads.

Agentic Friendly

Runtime includes an inference layer so higher platform services can run model workloads in production without rebuilding the serving substrate themselves.

Main Inference Components

ComponentMain role
KubeAIorchestrates and manages inference-serving workloads on Kubernetes
vLLMserves large language model workloads efficiently
FasterWhispersupports speech-to-text inference workloads

Execution Pattern

The Runtime inference model separates operational concerns from product concerns:

  • Runtime provides the serving and orchestration substrate
  • higher layers decide which models, services, or user experiences to expose

This means a CoreAI capability may be the user-facing service, while Runtime is still the layer that performs the actual execution work beneath it.

Why This Matters

By keeping inference in Runtime:

  • model execution becomes part of the platform foundation
  • scaling and operational controls can be handled consistently
  • higher layers can focus on product logic rather than rebuilding serving infrastructure

This is one of the clearest examples of Runtime acting as the execution base for the rest of the platform.

On this page