Model Installer API
How CoreAI exposes the Model Installer API.
The Model Installer API is the CoreAI service used to turn model locations and model artifacts into actual serving deployments.
It sits between model sources such as Hugging Face, S3-compatible storage, and MLflow on one side, and the Runtime serving path on the other side.
In practice, this is the service that lets teams do two kinds of work:
- import models into the local MLflow repository
- register a model for inference so it becomes available through the platform serving layer
What The Service Actually Does
When a model is registered for inference, the service does more than just store metadata.
It:
- creates a `kubeai.org/v1` `Model` resource in Kubernetes
- registers the same model in the internal proxy layer used for OpenAI-compatible access
- keeps long-running import flows asynchronous when artifact download or upload takes time
- protects the API with Keycloak token validation
For direct inference registration, the request goes through `POST /register_model`.
For artifact import and repository management, the service also exposes endpoints for:
- `POST /download_hf_model`
- `POST /download_s3_model`
- `POST /register_s3_model`
- `DELETE /delete_model`
- `DELETE /unregister_model/{name}/{namespace}`
Choose The Right Path
Use these rules of thumb:
- If you already have a model URL that the serving layer can use, register it directly with `register_model`.
- If you want the model stored in the platform's local repository first, import it into MLflow, then deploy it from the repository.
- If weights already exist in S3 or MinIO and you do not want to copy them again, use `register_s3_model`.
Direct Deployment For Inference
The main inference deployment endpoint is:
`POST /register_model`
The backend schema requires these core fields:
- `name`
- `engine`
- `features`
- `url`
Common deployment fields include:
- `namespace`, defaulting to `kubeai`
- `resourceProfile`, in the format `<profile>:<count>` such as `nvidia-gpu-l4:1`
- `replicas`, `minReplicas`, and `maxReplicas`
- `timeout`, `stream_timeout`, and `max_retries`
- `model_mode` and model metadata such as token limits or embedding dimensions
The service supports these serving engines at API level:
- `VLLM`
- `OLlama`
- `FasterWhisper`
- `Infinity`
The supported URL patterns depend on the engine and storage path:
- `hf://<org>/<model>`
- `pvc://<pvcName>` or `pvc://<pvcName>/<subpath>`
- `s3://<bucket>/<path>`
- `gs://<bucket>/<path>`
- `oss://<bucket>/<path>`
- `ollama://<model>` for `OLlama`
For object storage URLs such as `s3://`, `gs://`, and `oss://`, the API supports cache-based serving with `cacheProfile`.
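As a rough client-side illustration of the URL patterns above, the sketch below checks a model URL's scheme before it is sent to the service. The function name and the returned dictionary shape are hypothetical, and the service's own validation rules (including which engine accepts which scheme) may be stricter than this.

```python
from urllib.parse import urlparse

# Schemes listed in this document; the backend may accept a different set.
SUPPORTED_SCHEMES = {"hf", "pvc", "s3", "gs", "oss", "ollama"}
# Object storage schemes for which the document mentions cacheProfile.
OBJECT_STORAGE_SCHEMES = {"s3", "gs", "oss"}


def classify_model_url(url: str) -> dict:
    """Hypothetical helper: report the scheme and whether cacheProfile applies."""
    scheme = urlparse(url).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported model URL scheme: {scheme!r}")
    return {
        "scheme": scheme,
        "cache_profile_applicable": scheme in OBJECT_STORAGE_SCHEMES,
    }
```

A pre-flight check like this only catches obvious mistakes early; the service remains the authority on what it accepts.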
Example: Deploy A Model Directly
```json
{
  "name": "mistral-small",
  "namespace": "kubeai",
  "engine": "VLLM",
  "features": ["TextGeneration"],
  "url": "hf://mistralai/Mistral-Small-3.2-24B-Instruct-2506",
  "resourceProfile": "nvidia-gpu-l4:1",
  "replicas": 1,
  "minReplicas": 0,
  "maxReplicas": 2,
  "timeout": 30,
  "stream_timeout": 1,
  "max_retries": 5,
  "model_mode": "chat",
  "max_input_tokens": 8192,
  "max_output_tokens": 2048
}
```

After a successful registration, the service creates the Kubernetes model resource and also registers the model in the proxy layer so it can be surfaced through the platform's OpenAI-compatible path.
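A minimal sketch of how a client might wrap a payload like the one above in an HTTP request. The base URL is a placeholder and the helper name is hypothetical; only the `POST /register_model` path, the four required fields, and the use of a bearer token come from this document. The request is built but not sent.

```python
import json
import urllib.request

# Placeholder host: substitute the real Model Installer address.
BASE_URL = "https://model-installer.example.internal"
REQUIRED_FIELDS = ("name", "engine", "features", "url")


def build_register_request(payload: dict, token: str,
                           base_url: str = BASE_URL) -> urllib.request.Request:
    """Hypothetical helper: validate core fields and build the POST request."""
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return urllib.request.Request(
        url=f"{base_url}/register_model",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; response handling is left out because the response schema is not described here.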
Importing Models Into The Local Repository
The local repository path is built around MLflow.
This is useful when you want a platform-managed artifact location and a reusable repository entry before deployment.
Hugging Face Import
`POST /download_hf_model`
This flow:
- downloads the model from the Hugging Face Hub
- uploads the artifacts to MLflow
- registers a model version in MLflow
- optionally registers the imported model for inference if `register_inference` is set to `true`
These jobs run in the background.
S3 Import With Copy
`POST /download_s3_model`
This flow downloads model artifacts from S3-compatible storage and then uploads them into MLflow.
The API accepts direct credentials or an OpenBao secret reference for S3 access.
S3 Registration Without Copy
`POST /register_s3_model`
This is the lighter-weight path when the model is already present in object storage.
Instead of downloading and re-uploading the weights, the service registers the S3 URI directly as the model source in MLflow.
That makes it the better choice when:
- artifacts are already in a stable bucket
- duplicate storage is not wanted
- you still want an MLflow model entry and version history
How The Portal Uses It
The CoreAI Portal uses the service in two main ways: model deployment and model repository import.
Portal Deployment Flow
For deployment, the portal ultimately sends a `POST /register_model` request.
The user-facing flow looks like this:
- Open the model deployment flow from the Models area.
- Choose either an easy preset-driven path or the advanced deployment form.
- Fill in the model source, deployment name, engine, resource profile, scaling, and optional advanced settings.
- Submit the form so the portal sends the deployment payload to the Model Installer service.
Easy Setup In The Portal
The easy setup path is preset-based.
It:
- lets users browse curated model presets by category such as `llm`, `embedding`, `audio`, and `multimodal`
- pre-fills deployment defaults such as engine, resource profile, token limits, tags, and features
- keeps most preset fields read-only in simple mode
- allows switching to Customize Deployment when teams need to tune the deployment
In easy mode, the portal also pre-fills:
- `namespace` as `kubeai`
- `owner` from the signed-in user email
Advanced Setup In The Portal
The advanced form is closer to the raw API.
It has separate steps for:
- model information
- deployment configuration
- advanced configuration
The portal exposes a Model URL selector with two sources:
- Manual URL
- From Repository
When From Repository is used, the portal:
- fetches registered MLflow models
- allows selection of only `READY` versions
- resolves the artifact URI for the latest usable version
- pre-fills the deployment form with that artifact-backed URL
The portal also auto-generates the deployment name from the chosen URL when possible.
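The version selection described above can be sketched as follows. This is an assumption about the logic, not the portal's actual code: it treats each version as a dictionary with `version`, `status`, and `source` keys (the field names MLflow uses for model versions) and picks the highest-numbered `READY` one.

```python
from typing import List, Optional


def latest_ready_artifact_uri(versions: List[dict]) -> Optional[str]:
    """Hypothetical helper: return the source URI of the newest READY version."""
    ready = [v for v in versions if v.get("status") == "READY"]
    if not ready:
        return None  # nothing deployable yet
    # MLflow reports version numbers as strings, so compare them numerically.
    newest = max(ready, key=lambda v: int(v["version"]))
    return newest["source"]
```

Returning `None` when no version is `READY` lets the caller decide whether to disable the selection or show a pending state.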
Portal Field Mapping
The portal does a small amount of transformation before calling the backend.
- It combines `resourceProfile` and `instances` into the API's `resourceProfile` format, such as `nvidia-gpu-l4:2`.
- It maps model mode choices into serving features.
Current portal mappings include:
- `completion` -> `TextGeneration`
- `embedding` -> `TextEmbedding`
- `audio-transcription` -> `SpeechToText`
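The two transformations above are small enough to sketch directly. The function name and argument names are hypothetical; the `<profile>:<count>` format and the three mode-to-feature mappings are taken from this section, and `features` is emitted as a list to match the deployment example earlier in the document.

```python
# Mappings listed in this section; other modes are not documented here.
MODE_TO_FEATURE = {
    "completion": "TextGeneration",
    "embedding": "TextEmbedding",
    "audio-transcription": "SpeechToText",
}


def to_api_fields(resource_profile: str, instances: int, model_mode: str) -> dict:
    """Hypothetical helper: convert portal form values into API payload fields."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    if model_mode not in MODE_TO_FEATURE:
        raise ValueError(f"no feature mapping for model mode: {model_mode!r}")
    return {
        "resourceProfile": f"{resource_profile}:{instances}",
        "features": [MODE_TO_FEATURE[model_mode]],
    }
```

For example, `to_api_fields("nvidia-gpu-l4", 2, "completion")` yields a `resourceProfile` of `nvidia-gpu-l4:2` and a `features` list containing `TextGeneration`.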
The portal deployment UI currently exposes `VLLM` and `FasterWhisper` as engine choices, even though the backend API supports additional engines.
Portal Repository Flow
The portal also exposes an MLflow repository view.
From there, users can:
- browse registered models and versions
- inspect the latest version status
- open a model detail page
- click Deploy Model to jump into the advanced deployment page with the artifact URI pre-filled
This is the cleanest user path when a model is already present in the local repository and the next step is only inference deployment.
Portal Downloader Flow
The portal has a separate Model Downloader page for importing Hugging Face models into the local repository.
That page:
- collects the Hugging Face model name and revision
- auto-generates an MLflow experiment name from the model name unless the user overrides it
- defaults the artifact path to `model`
- starts a background `download_hf_model` request
- polls MLflow until the imported model becomes `READY`
- checks Model Installer pod logs to detect download failures
In the current portal flow, the downloader imports into MLflow but does not automatically deploy the model for inference.
The portal sends `register_inference: false` for that path, so deployment still happens as a separate step afterward.
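The polling step in the downloader flow can be illustrated with a generic wait loop. This is a sketch, not the portal's implementation: `get_status` is a hypothetical callable standing in for whatever lookup returns the model version status from MLflow, and the timeout and interval defaults are arbitrary.

```python
import time
from typing import Callable


def wait_until_ready(get_status: Callable[[], str],
                     timeout_s: float = 600.0,
                     interval_s: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll get_status() until it returns "READY" or the timeout elapses."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if get_status() == "READY":
            return True
        sleep(interval_s)  # wait before asking again
    return False  # timed out; the caller can then inspect logs for failures
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays. A `False` result does not say why the import stalled, which is why the portal pairs polling with a log check.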
Endpoint Summary
| Endpoint | Purpose | Used in portal today |
|---|---|---|
| `POST /register_model` | Register a serving deployment in Kubernetes and proxy | Yes |
| `DELETE /unregister_model/{name}/{namespace}` | Remove a serving deployment | Yes |
| `POST /download_hf_model` | Import a Hugging Face model into MLflow | Yes |
| `POST /download_s3_model` | Import a model from S3 into MLflow | Not in the current portal flow |
| `POST /register_s3_model` | Register an existing S3 URI directly in MLflow | Not in the current portal flow |
| `DELETE /delete_model` | Delete model records and optionally artifacts from MLflow and storage | Yes |
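Of the endpoints in the table, `DELETE /unregister_model/{name}/{namespace}` is the only one that carries its arguments in the path. A small sketch of building that path safely follows; the helper name is hypothetical, and defaulting `namespace` to `kubeai` is a convenience assumed here, mirroring the registration default rather than anything the endpoint itself guarantees.

```python
from urllib.parse import quote


def unregister_path(name: str, namespace: str = "kubeai") -> str:
    """Hypothetical helper: build the path for DELETE /unregister_model."""
    # safe="" percent-encodes "/" too, so a value cannot add path segments.
    return f"/unregister_model/{quote(name, safe='')}/{quote(namespace, safe='')}"
```

The resulting path would be appended to the service base URL and sent with the `DELETE` method and the same bearer token used for other calls.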
Authentication And Access
The REST API is protected with OAuth2 bearer tokens and validates them through Keycloak token introspection.
In the portal, model management actions are gated by the `can_manage_models` permission. The portal then acquires a backend client-credentials token before calling the Model Installer service.
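The client-credentials step can be sketched as a standard OAuth2 token request against Keycloak's OpenID Connect token endpoint. The Keycloak base URL, realm, client ID, and secret below are all placeholders, and older Keycloak deployments may serve the endpoint under an additional `/auth` prefix. The request is built but not sent.

```python
import urllib.request
from urllib.parse import urlencode


def build_token_request(keycloak_base: str, realm: str,
                        client_id: str, client_secret: str) -> urllib.request.Request:
    """Hypothetical helper: build an OAuth2 client-credentials token request."""
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(
        url=f"{keycloak_base}/realms/{realm}/protocol/openid-connect/token",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

A successful response is JSON containing an `access_token` field, which the caller would then send to the Model Installer service as `Authorization: Bearer <token>`.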
Direct Deploy Vs Repository Import
Use direct deploy when:
- you already know the model URL to serve
- the main goal is to expose the model quickly in the cluster
- you do not need a separate local repository onboarding step first
Use repository import first when:
- you want a managed MLflow entry and version history
- you want a reusable local artifact source for later deployments
- you want operators to deploy from repository entries instead of raw model URLs