Components

vLLM

High-throughput LLM inference engine in the Runtime layer.

Agentic Friendly

Component Category

Inference / LLM serving engine

Component Description

vLLM is a high-throughput and memory-efficient inference engine designed for serving large language models.

Why It Is Used

In BullSequana AI Runtime, vLLM powers efficient LLM inference with strong performance characteristics for production workloads, especially where throughput and GPU utilization matter.

Learn More

Interacts With

  • KubeAI, which uses vLLM as one of its inference backends for model serving.
  • Model Installer and other serving workflows, which deploy or operate models on top of vLLM-backed runtimes.

On this page