vLLM
High-throughput LLM inference engine in the Runtime layer.
Component Category
Inference / LLM serving engine
Component Description
vLLM is a high-throughput and memory-efficient inference engine designed for serving large language models.
Why It Is Used
In BullSequana AI Runtime, vLLM powers efficient LLM inference with strong performance characteristics for production workloads, especially where throughput and GPU utilization matter.
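As an illustration of how a client might talk to a vLLM-backed endpoint, the sketch below builds a request for the OpenAI-compatible completions route that vLLM's HTTP server exposes. The base URL, port, and model name are placeholders, not values taken from BullSequana AI Runtime, and the helper names are hypothetical. In a given deployment the endpoint may sit behind KubeAI or another gateway, so treat this as a minimal sketch rather than the platform's documented access method.

```python
import json
import urllib.request

# Assumed values for illustration only; replace with the real deployment's settings.
BASE_URL = "http://localhost:8000"  # hypothetical address of a vLLM OpenAI-compatible server
MODEL_NAME = "my-model"             # placeholder model identifier


def build_completion_request(prompt, max_tokens=64, base_url=BASE_URL, model=MODEL_NAME):
    """Build (but do not send) a POST request for the /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first generated choice."""
    req = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


if __name__ == "__main__":
    # Requires a reachable server at BASE_URL serving MODEL_NAME.
    print(complete("Hello, my name is"))
```

Building the request separately from sending it keeps the payload easy to inspect and test without a running server.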
Interacts With
KubeAI, which uses vLLM as one of its inference backends for model serving.
Model Installer and other serving workflows, which deploy or operate models on top of vLLM-backed runtimes.