The vLLM real-world lab models mixed production traffic instead of a single throughput number. The Agent Foundry generated the request mixes, ran the scheduler and routing profiles, captured build outcomes, and emitted the evidence used for the charts below. The lab covers six request classes: interactive chat, RAG with repeated prefixes, long-prefill requests, agent tool loops, batch summarization, and slow streaming clients.
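A mixed-traffic generator can be sketched in a few lines. The six class names mirror the lab's taxonomy, but the weights and token shapes below are illustrative placeholders, not the Agent Foundry's actual parameters.

```python
import random

# Hypothetical request-class table; weights and token counts are
# illustrative, not the lab's measured distribution.
REQUEST_CLASSES = {
    "interactive_chat":    {"weight": 0.35, "prompt_tokens": 200,   "output_tokens": 150},
    "rag_shared_prefix":   {"weight": 0.20, "prompt_tokens": 3000,  "output_tokens": 300},
    "long_prefill":        {"weight": 0.10, "prompt_tokens": 16000, "output_tokens": 200},
    "agent_tool_loop":     {"weight": 0.15, "prompt_tokens": 1200,  "output_tokens": 400},
    "batch_summarization": {"weight": 0.10, "prompt_tokens": 8000,  "output_tokens": 600},
    "slow_streaming":      {"weight": 0.10, "prompt_tokens": 300,   "output_tokens": 800},
}

def sample_mix(n, seed=0):
    """Draw n request-class labels according to the class weights."""
    rng = random.Random(seed)
    names = list(REQUEST_CLASSES)
    weights = [REQUEST_CLASSES[c]["weight"] for c in names]
    return [rng.choices(names, weights=weights)[0] for _ in range(n)]
```

A seeded generator like this keeps runs reproducible across scheduler and routing profiles, so differences in the charts come from the profile, not the workload draw.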
The adapter matrix compares vLLM V1, SGLang, llama.cpp, and TGI through the same OpenAI-compatible contract. The serving sweep covers the profiles operators actually tune: one balanced pool, a large-token pool, a small interactive pool, prefix-cache routing, slow-client isolation, and class-aware routing.
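The shared contract means one request payload works against every backend; only the base URL changes. A minimal sketch, assuming placeholder ports and a generic served-model name (both are assumptions, not the lab's configuration):

```python
import json

# Hypothetical per-engine base URLs; ports are placeholders.
BACKENDS = {
    "vllm-v1":   "http://localhost:8000",
    "sglang":    "http://localhost:8001",
    "llama.cpp": "http://localhost:8002",
    "tgi":       "http://localhost:8003",
}

def build_request(backend, prompt, max_tokens=128):
    """Return (url, body) for one backend under the shared OpenAI-style contract."""
    url = BACKENDS[backend] + "/v1/chat/completions"
    body = json.dumps({
        "model": "default",  # placeholder served-model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,      # streaming is needed to measure first-token latency and token cadence
    })
    return url, body
```

Holding the payload constant is what makes the matrix a fair comparison: any latency or throughput difference is attributable to the engine, not the client.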
Main result: one global vLLM pool is a poor default for mixed traffic. The best profile was vllm-v1/class-aware-router, which gave the strongest combined tradeoff across first-token latency, token cadence, slow-reader isolation, and useful throughput.
The practical advice is to split lanes before changing kernels. Keep interactive traffic on a smaller max_num_batched_tokens budget with moderate max_num_seqs, then move long-context and batch work to a separate pool with a larger token budget.
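The two-pool split can be expressed directly with vLLM's serve flags. The model name, ports, and budget values below are placeholders for illustration, not the lab's measured optima.

```shell
# Interactive pool: small token budget, moderate concurrency.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 64 &

# Long-context / batch pool: larger token budget, higher concurrency.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8001 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 256 &
```

The router then steers interactive classes to port 8000 and long-prefill or batch classes to port 8001, which is the lane split the results above favor.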
Chunked prefill needs the same treatment. Keep max_long_partial_prefills below max_num_partial_prefills so short prompts can still enter the scheduler while long prompts are being processed. Treat token budget, sequence count, partial-prefill limits, and stream interval as workload controls.
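The chunked-prefill knobs map to serve flags as well. A sketch with placeholder thresholds; the point is the invariant, not the specific numbers:

```shell
# Keeping --max-long-partial-prefills below --max-num-partial-prefills
# reserves partial-prefill slots for short prompts while long prompts
# are chunked through. Threshold values here are illustrative.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill \
  --max-num-partial-prefills 4 \
  --max-long-partial-prefills 2 \
  --long-prefill-token-threshold 4096
```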
The scale run made the routing result clearer. The single shared pool failed the time-to-first-token / inter-token-latency (TTFT/ITL) gate under the larger workload, while class-aware routing accepted nearly all requests. Prefix-cache-only and slow-client-only profiles helped their own lanes, but neither worked as a whole-system strategy.
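A latency gate of this kind reduces to percentile checks over per-request samples. A minimal sketch; the p99 targets below are placeholders, not the lab's actual SLOs:

```python
def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples (seconds)."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
    return ordered[idx]

def passes_gate(ttft_samples, itl_samples, ttft_p99=2.0, itl_p99=0.25):
    """True if both p99 TTFT and p99 ITL are under their targets."""
    return (percentile(ttft_samples, 99) <= ttft_p99 and
            percentile(itl_samples, 99) <= itl_p99)
```

Running the same gate over each profile's samples is what turns "class-aware routing accepted nearly all requests" into a pass/fail comparison rather than a judgment call.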
The source-build run added useful deployment checks. Debian needed a larger rootfs and then hit a GCC version gate. Fedora 44 with GCC 16, Python 3.14, VLLM_TARGET_DEVICE=cpu, MAX_JOBS=4, and numactl-devel built the extension modules successfully. The build also surfaced a deployment note worth keeping: install tcmalloc for better runtime performance.
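The working Fedora recipe can be sketched as follows. Package names, the requirements file path, and the tcmalloc library path are assumptions that may differ by vLLM version and distro; treat this as an outline of the reported build, not a verified script.

```shell
# Build deps; gperftools-libs supplies tcmalloc (package name assumed).
sudo dnf install -y gcc gcc-c++ numactl-devel gperftools-libs
git clone https://github.com/vllm-project/vllm.git && cd vllm

# CPU-target build, capped at 4 parallel compile jobs.
VLLM_TARGET_DEVICE=cpu MAX_JOBS=4 pip install -e . --no-build-isolation

# Deployment note surfaced by the build: preload tcmalloc at runtime.
# The .so path is distro-dependent; this one is an assumption.
export LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so.4
```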
The hybrid KV lab covers the PagedAttention rewrite path. It keeps vLLM's block ownership, prefix sharing, refcounts, partial blocks, eviction, and copy-on-write semantics, then exposes compact logical spans to future kernels instead of a long per-token block table.
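The refcount and copy-on-write semantics the rewrite preserves can be shown in a toy block table. This is a deliberately minimal sketch; vLLM's real block manager adds prefix hashing, eviction, and partial-block handling on top of these rules.

```python
class Block:
    """One physical KV block with a reference count."""
    def __init__(self, block_id):
        self.block_id = block_id
        self.refcount = 1

class BlockTable:
    """Toy per-sequence table of physical blocks in logical order."""
    _next_id = 0  # global physical-block allocator

    def __init__(self):
        self.blocks = []

    @classmethod
    def _alloc(cls):
        blk = Block(cls._next_id)
        cls._next_id += 1
        return blk

    def append_block(self):
        self.blocks.append(self._alloc())
        return self.blocks[-1]

    def fork(self):
        """Prefix sharing: a fork bumps refcounts and copies nothing."""
        child = BlockTable()
        child.blocks = list(self.blocks)
        for blk in child.blocks:
            blk.refcount += 1
        return child

    def write_last(self):
        """Copy-on-write: clone the last block only if it is shared."""
        blk = self.blocks[-1]
        if blk.refcount > 1:
            blk.refcount -= 1
            self.blocks[-1] = self._alloc()
        return self.blocks[-1]
```

Exposing compact logical spans to a kernel then means handing it runs of consecutive `block_id`s rather than one table entry per token, which is the layout change the hybrid lab evaluates.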
The correctness oracle compares each layout against a plain reference implementation across odd lengths, MQA/GQA, ALiBi, sliding windows, prefix reuse, copy-on-write, partial final blocks, and FP8-style scaling. The first run passed 30/30 cases and ranked virtual-contiguous and hybrid-prefix-shared as the first layouts to profile on hardware.
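The oracle's core loop is a tolerance comparison of each layout's output against the reference on every case. A sketch under assumed names; `layouts` and `reference` stand in for the lab's harness and are not its actual API:

```python
def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two output vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

def run_oracle(cases, layouts, reference, tol=1e-3):
    """Map each layout name to True iff it matches the reference on all cases."""
    results = {}
    for name, layout_fn in layouts.items():
        results[name] = all(
            max_abs_diff(layout_fn(case), reference(case)) <= tol
            for case in cases
        )
    return results
```

An all-pass result like the 30/30 above says the layouts are numerically interchangeable, which is what licenses ranking them purely on expected kernel performance.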