

vLLM
#1 in LLM Inference & Serving · vllm · since June 2023 (first official release) · 40× · most recently June 30, 2026
vLLM is an open-source inference and serving engine for large language models (LLMs), originally developed at UC Berkeley's Sky Computing Lab and maintained as a community project since 2023. Its core architecture combines PagedAttention, which manages the KV cache in fixed-size blocks much like virtual-memory paging, with continuous batching, which admits new requests into the running batch as earlier ones finish. Together these deliver significantly higher throughput than naive serving approaches. vLLM supports 200+ model architectures from Hugging Face and runs on a broad range of hardware accelerators. The project is free to use under the Apache 2.0 license and is backed by over 2,000 contributors and by sponsors including NVIDIA, AMD, Google, AWS, and Intel.
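To make the paging idea concrete, here is a minimal sketch of block-based KV-cache bookkeeping. It is a toy model, not vLLM's implementation: the class `ToyPagedKVCache`, its methods, and the block sizes are all hypothetical names and values chosen for illustration. It shows only the core idea that each sequence holds a table of physical block IDs and receives a new block when its last one fills up, so memory is reserved on demand rather than for the maximum possible length.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPagedKVCache:
    """Toy paged KV-cache bookkeeping (hypothetical; not vLLM's code)."""

    num_blocks: int        # total physical blocks in the pool
    block_size: int = 16   # tokens per block (illustrative value)
    free_blocks: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)  # seq_id -> block ids
    seq_lens: dict = field(default_factory=dict)      # seq_id -> token count

    def __post_init__(self):
        # Initially every physical block is free.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token of a sequence."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:
            # The sequence is new or its last block is full: map a fresh block.
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = ToyPagedKVCache(num_blocks=4, block_size=4)
for _ in range(5):
    cache.append_token("a")  # 5 tokens need 2 blocks of size 4
for _ in range(3):
    cache.append_token("b")  # 3 tokens fit in 1 block
blocks_a = len(cache.block_tables["a"])
blocks_b = len(cache.block_tables["b"])
cache.release("a")           # finished sequences free their blocks at once
free_after_release = len(cache.free_blocks)
```

Because blocks return to the pool as soon as a sequence finishes, a scheduler can admit a waiting request immediately, which is the property continuous batching relies on.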
Features
| Attribute | Details |
| --- | --- |
| License | Apache License 2.0 |
| Price | Free / Open Source (no license fees; donations via GitHub & OpenCollective) |
| Release Date | June 2023 (first official release); currently v0.24.0 on PyPI (as of July 2026) |