Model Installer API

How CoreAI exposes the Model Installer API.

The Model Installer API is the CoreAI service used to turn model locations and model artifacts into actual serving deployments.

It sits between model sources such as Hugging Face, S3-compatible storage, and MLflow on one side, and the Runtime serving path on the other side.

In practice, this is the service that lets teams do two kinds of work:

  • import models into the local MLflow repository
  • register a model for inference so it becomes available through the platform serving layer

What The Service Actually Does

When a model is registered for inference, the service does more than just store metadata.

It:

  • creates a kubeai.org/v1 Model resource in Kubernetes
  • registers the same model in the internal proxy layer used for OpenAI-compatible access
  • keeps long-running import flows asynchronous when artifact download or upload takes time
  • protects the API with Keycloak token validation

For direct inference registration, the request goes through POST /register_model.

For artifact import, repository management, and cleanup, the service also exposes these endpoints:

  • POST /download_hf_model
  • POST /download_s3_model
  • POST /register_s3_model
  • DELETE /delete_model
  • DELETE /unregister_model/{name}/{namespace}
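
The unregister endpoint carries both identifiers in the URL path. The sketch below shows one way to build that path safely. The base URL is a hypothetical placeholder and unregister_url is an illustrative helper, not part of the service.

```python
from urllib.parse import quote


def unregister_url(base_url: str, name: str, namespace: str) -> str:
    """Build the URL for DELETE /unregister_model/{name}/{namespace}."""
    # safe="" percent-encodes "/" so a value cannot introduce extra path segments.
    return (
        f"{base_url.rstrip('/')}/unregister_model/"
        f"{quote(name, safe='')}/{quote(namespace, safe='')}"
    )


# Hypothetical host, for illustration only.
url = unregister_url("https://model-installer.example.internal", "mistral-small", "kubeai")
```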

Choose The Right Path

Use these rules of thumb:

  • If you already have a model URL that the serving layer can use, register it directly with register_model.
  • If you want the model stored in the platform's local repository first, import it into MLflow, then deploy it from the repository.
  • If weights already exist in S3 or MinIO and you do not want to copy them again, use register_s3_model.

Direct Deployment For Inference

The main inference deployment endpoint is:

POST /register_model

The backend schema requires these core fields:

  • name
  • engine
  • features
  • url

Common deployment fields include:

  • namespace, defaulting to kubeai
  • resourceProfile, in the format <profile>:<count> such as nvidia-gpu-l4:1
  • replicas, minReplicas, and maxReplicas
  • timeout, stream_timeout, and max_retries
  • model_mode and model metadata such as token limits or embedding dimensions
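
As a rough illustration of the field rules above, the sketch below checks a payload for the four required fields and splits a resourceProfile string into its profile and count. Both helper names are hypothetical, and the backend's own validation may be stricter or differ in detail.

```python
from __future__ import annotations

REQUIRED_FIELDS = ("name", "engine", "features", "url")


def missing_required_fields(payload: dict) -> list[str]:
    """Return the required register_model fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not payload.get(field)]


def parse_resource_profile(value: str) -> tuple[str, int]:
    """Split '<profile>:<count>' into its parts, e.g. 'nvidia-gpu-l4:1'."""
    # rpartition splits on the last ":" so the count is always the final segment.
    profile, sep, count = value.rpartition(":")
    if not sep or not profile or not count.isdigit() or int(count) < 1:
        raise ValueError(f"expected <profile>:<count>, got {value!r}")
    return profile, int(count)
```

Running a check like this on the client side can surface a malformed payload before the request reaches the service.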

The service supports these serving engines at the API level:

  • VLLM
  • OLlama
  • FasterWhisper
  • Infinity

The supported URL patterns depend on the engine and storage path:

  • hf://<org>/<model>
  • pvc://<pvcName> or pvc://<pvcName>/<subpath>
  • s3://<bucket>/<path>
  • gs://<bucket>/<path>
  • oss://<bucket>/<path>
  • ollama://<model> for OLlama

For object storage URLs such as s3://, gs://, and oss://, the API supports cache-based serving with cacheProfile.
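
The sketch below shows one way to classify a model URL by scheme before submitting it. It encodes only what this page states: the six listed schemes, the three object storage schemes, and ollama:// being tied to the OLlama engine. The helper names are hypothetical, and the backend may apply further engine-specific rules.

```python
from __future__ import annotations

SUPPORTED_SCHEMES = {"hf", "pvc", "s3", "gs", "oss", "ollama"}
OBJECT_STORAGE_SCHEMES = {"s3", "gs", "oss"}


def split_model_url(model_url: str) -> tuple[str, str]:
    """Return (scheme, remainder) for a URL such as 's3://bucket/path'."""
    scheme, sep, remainder = model_url.partition("://")
    if not sep or not remainder or scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported model URL: {model_url!r}")
    return scheme, remainder


def is_object_storage_url(model_url: str) -> bool:
    """True for the s3://, gs://, and oss:// URLs that the cacheProfile note refers to."""
    return split_model_url(model_url)[0] in OBJECT_STORAGE_SCHEMES


def check_engine_url(engine: str, model_url: str) -> None:
    """Reject ollama:// URLs for any engine other than OLlama."""
    scheme, _ = split_model_url(model_url)
    if scheme == "ollama" and engine != "OLlama":
        raise ValueError(f"ollama:// URLs require the OLlama engine, got {engine!r}")
```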

Example: Deploy A Model Directly

{
  "name": "mistral-small",
  "namespace": "kubeai",
  "engine": "VLLM",
  "features": ["TextGeneration"],
  "url": "hf://mistralai/Mistral-Small-3.2-24B-Instruct-2506",
  "resourceProfile": "nvidia-gpu-l4:1",
  "replicas": 1,
  "minReplicas": 0,
  "maxReplicas": 2,
  "timeout": 30,
  "stream_timeout": 1,
  "max_retries": 5,
  "model_mode": "chat",
  "max_input_tokens": 8192,
  "max_output_tokens": 2048
}

After a successful registration, the service creates the Kubernetes model resource and also registers the model in the proxy layer so it can be surfaced through the platform's OpenAI-compatible path.
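
A minimal sketch of preparing that request from Python, using only the standard library. The host name and token are placeholders, build_register_request is a hypothetical helper, and the response format is not shown because this page does not describe it.

```python
import json
import urllib.request


def build_register_request(base_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST /register_model request."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/register_model",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # the API expects an OAuth2 bearer token
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "name": "mistral-small",
    "engine": "VLLM",
    "features": ["TextGeneration"],
    "url": "hf://mistralai/Mistral-Small-3.2-24B-Instruct-2506",
    "resourceProfile": "nvidia-gpu-l4:1",
}
# Hypothetical host and placeholder token, for illustration only.
request = build_register_request(
    "https://model-installer.example.internal", "<access-token>", payload
)
# Sending it would be: urllib.request.urlopen(request, timeout=30)
```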

Importing Models Into The Local Repository

The local repository path is built around MLflow.

This is useful when you want a platform-managed artifact location and a reusable repository entry before deployment.

Hugging Face Import

POST /download_hf_model

This flow:

  • downloads the model from the Hugging Face Hub
  • uploads the artifacts to MLflow
  • registers a model version in MLflow
  • optionally registers the imported model for inference if register_inference is set to true

These jobs run in the background.

S3 Import With Copy

POST /download_s3_model

This flow downloads model artifacts from S3-compatible storage and then uploads them into MLflow.

The API accepts direct credentials or an OpenBao secret reference for S3 access.

S3 Registration Without Copy

POST /register_s3_model

This is the lighter-weight path when the model is already present in object storage.

Instead of downloading and re-uploading the weights, the service registers the S3 URI directly as the model source in MLflow.

That makes it the better choice when:

  • artifacts are already in a stable bucket
  • duplicate storage is not wanted
  • you still want an MLflow model entry and version history

How The Portal Uses It

The CoreAI Portal uses the service in two main ways: model deployment and model repository import.

Portal Deployment Flow

For deployment, the portal ultimately sends a POST /register_model request.

The user-facing flow looks like this:

  1. Open the model deployment flow from the Models area.
  2. Choose either an easy preset-driven path or the advanced deployment form.
  3. Fill in the model source, deployment name, engine, resource profile, scaling, and optional advanced settings.
  4. Submit the form so the portal sends the deployment payload to the Model Installer service.

Easy Setup In The Portal

The easy setup path is preset-based.

It:

  • lets users browse curated model presets by category such as llm, embedding, audio, and multimodal
  • pre-fills deployment defaults such as engine, resource profile, token limits, tags, and features
  • keeps most preset fields read-only in simple mode
  • allows switching to Customize Deployment when teams need to tune the deployment

In easy mode, the portal also pre-fills:

  • namespace as kubeai
  • owner from the signed-in user email

Advanced Setup In The Portal

The advanced form is closer to the raw API.

It has separate steps for:

  • model information
  • deployment configuration
  • advanced configuration

The portal exposes a Model URL selector with two sources:

  • Manual URL
  • From Repository

When From Repository is used, the portal:

  • fetches registered MLflow models
  • allows selection of only READY versions
  • resolves the artifact URI for the latest usable version
  • pre-fills the deployment form with that artifact-backed URL
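
The selection step can be sketched as follows. This assumes each version is a dictionary with version, status, and source keys, which mirrors MLflow's model version fields. The portal's actual selection logic is not documented here, so treat this as an approximation.

```python
from typing import Optional


def latest_ready_version(versions: list) -> Optional[dict]:
    """Pick the highest-numbered version whose status is READY, if any."""
    ready = [v for v in versions if v.get("status") == "READY"]
    # Version numbers may arrive as strings, so compare them as integers.
    return max(ready, key=lambda v: int(v["version"]), default=None)


# Illustrative data only; the bucket and paths are made up.
versions = [
    {"version": "1", "status": "READY", "source": "s3://models/demo/1"},
    {"version": "2", "status": "READY", "source": "s3://models/demo/2"},
    {"version": "10", "status": "PENDING_REGISTRATION", "source": "s3://models/demo/10"},
]
chosen = latest_ready_version(versions)
```

Here version 10 is skipped because it is not READY, so version 2 is chosen and its source would pre-fill the deployment URL.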

The portal also auto-generates the deployment name from the chosen URL when possible.

Portal Field Mapping

The portal does a small amount of transformation before calling the backend.

  • It combines resourceProfile and instances into the API's resourceProfile format, such as nvidia-gpu-l4:2.
  • It maps model mode choices into serving features.

Current portal mappings include:

  • completion -> TextGeneration
  • embedding -> TextEmbedding
  • audio-transcription -> SpeechToText
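
A minimal sketch of that transformation, assuming the form supplies a profile name, an instance count, and a mode string. The function name and argument names are hypothetical. The mapping covers only the three modes listed above, and the portal may support others.

```python
MODE_TO_FEATURE = {
    "completion": "TextGeneration",
    "embedding": "TextEmbedding",
    "audio-transcription": "SpeechToText",
}


def to_api_fields(resource_profile: str, instances: int, mode: str) -> dict:
    """Translate portal form values into register_model fields."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    if mode not in MODE_TO_FEATURE:
        raise ValueError(f"no known feature mapping for mode {mode!r}")
    return {
        # The API expects <profile>:<count>, for example nvidia-gpu-l4:2.
        "resourceProfile": f"{resource_profile}:{instances}",
        "features": [MODE_TO_FEATURE[mode]],
    }
```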

The portal deployment UI currently exposes VLLM and FasterWhisper as engine choices, even though the backend API supports additional engines.

Portal Repository Flow

The portal also exposes an MLflow repository view.

From there, users can:

  • browse registered models and versions
  • inspect the latest version status
  • open a model detail page
  • click Deploy Model to jump into the advanced deployment page with the artifact URI pre-filled

This is the cleanest user path when a model is already present in the local repository and the next step is only inference deployment.

Portal Downloader Flow

The portal has a separate Model Downloader page for importing Hugging Face models into the local repository.

That page:

  • collects the Hugging Face model name and revision
  • auto-generates an MLflow experiment name from the model name unless the user overrides it
  • defaults the artifact path to model
  • starts a background download_hf_model request
  • polls MLflow until the imported model becomes READY
  • checks Model Installer pod logs to detect download failures
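
The polling behavior can be approximated with a loop like the one below. The two callables stand in for the MLflow status lookup and the pod log check, whose real implementations are not described on this page. The interval and timeout defaults are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_until_ready(
    get_status: Callable[[], str],
    download_failed: Callable[[], bool],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the model version is READY; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "READY":
            return True
        # Stand-in for the portal's check of Model Installer pod logs.
        if download_failed():
            raise RuntimeError("model download failed")
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Passing sleep as a parameter keeps the loop easy to exercise in tests without real delays.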

Note that in the current portal flow, the downloader imports the model into MLflow but does not deploy it for inference automatically.

The portal sends register_inference: false for that path, so deployment still happens as a separate step afterward.

Endpoint Summary

Endpoint | Purpose | Used in portal today
POST /register_model | Register a serving deployment in Kubernetes and proxy | Yes
DELETE /unregister_model/{name}/{namespace} | Remove a serving deployment | Yes
POST /download_hf_model | Import a Hugging Face model into MLflow | Yes
POST /download_s3_model | Import a model from S3 into MLflow | Not in the current portal flow
POST /register_s3_model | Register an existing S3 URI directly in MLflow | Not in the current portal flow
DELETE /delete_model | Delete model records and optionally artifacts from MLflow and storage | Yes

Authentication And Access

The REST API is protected with OAuth2 bearer tokens and validates them through Keycloak token introspection.

In the portal, model management actions are gated by the can_manage_models permission. The portal then acquires a backend client-credentials token before calling the Model Installer service.
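
A client-credentials token request against Keycloak can be sketched as follows. The URL, realm, and client values are placeholders, and the token path shown is the standard Keycloak OpenID Connect endpoint. Older Keycloak releases prefix it with /auth, so confirm the path for your deployment.

```python
import urllib.request
from urllib.parse import quote, urlencode


def build_token_request(
    keycloak_url: str, realm: str, client_id: str, client_secret: str
) -> urllib.request.Request:
    """Prepare (but do not send) an OAuth2 client-credentials token request."""
    token_url = (
        f"{keycloak_url.rstrip('/')}/realms/{quote(realm, safe='')}"
        "/protocol/openid-connect/token"
    )
    form = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        }
    )
    return urllib.request.Request(
        token_url,
        data=form.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder values, for illustration only.
token_request = build_token_request(
    "https://keycloak.example.internal", "coreai", "portal-backend", "<client-secret>"
)
# The JSON response to this request carries the token in its access_token field.
```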

Direct Deploy Vs Repository Import

Use direct deploy when:

  • you already know the model URL to serve
  • the main goal is to expose the model quickly in the cluster
  • you do not need a separate local repository onboarding step first

Use repository import first when:

  • you want a managed MLflow entry and version history
  • you want a reusable local artifact source for later deployments
  • you want operators to deploy from repository entries instead of raw model URLs
