vLLM

Component Category

Inference / LLM serving engine

Component Description

vLLM is a high-throughput and memory-efficient inference engine designed for serving large language models.

Why It Is Used

In BullSequana AI Runtime, vLLM powers efficient LLM inference with strong performance characteristics for production workloads, especially where throughput and GPU utilization matter.

Learn More

vLLM documentation
vllm-project/vllm on GitHub

Interacts With

KubeAI, which uses vLLM as one of its inference backends for model serving.
Model Installer and other serving workflows, which deploy or operate models on top of vLLM-backed runtimes.