The vLLM real-world lab models mixed production traffic instead of a single throughput number. The Agent Foundry generated the request mixes, ran the scheduler and routing profiles, captured build outcomes, and emitted the evidence used for the charts below. The lab covers six request classes: interactive chat, RAG with repeated prefixes, long-prefill requests, agent tool loops, batch summarization, and slow streaming clients.
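A mixed-traffic generator can be sketched in a few lines. The six class names mirror the lab's taxonomy, but the weights and token shapes below are illustrative placeholders, not the Agent Foundry's actual parameters.

```python
import random

# Hypothetical request-class table; weights and token counts are
# illustrative, not the lab's measured distribution.
REQUEST_CLASSES = {
    "interactive_chat":    {"weight": 0.35, "prompt_tokens": 200,   "output_tokens": 150},
    "rag_shared_prefix":   {"weight": 0.20, "prompt_tokens": 3000,  "output_tokens": 300},
    "long_prefill":        {"weight": 0.10, "prompt_tokens": 16000, "output_tokens": 200},
    "agent_tool_loop":     {"weight": 0.15, "prompt_tokens": 1200,  "output_tokens": 400},
    "batch_summarization": {"weight": 0.10, "prompt_tokens": 8000,  "output_tokens": 600},
    "slow_streaming":      {"weight": 0.10, "prompt_tokens": 300,   "output_tokens": 800},
}

def sample_mix(n, seed=0):
    """Draw n request-class labels according to the class weights."""
    rng = random.Random(seed)
    names = list(REQUEST_CLASSES)
    weights = [REQUEST_CLASSES[c]["weight"] for c in names]
    return [rng.choices(names, weights=weights)[0] for _ in range(n)]
```

A seeded generator like this keeps runs reproducible across scheduler and routing profiles, so differences in the charts come from the profile, not the workload draw.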
The adapter matrix compares vLLM V1, SGLang, llama.cpp, and TGI through the same OpenAI-compatible contract. The serving sweep covers the profiles operators actually tune: one balanced pool, a large-token pool, a small interactive pool, prefix-cache routing, slow-client isolation, and class-aware routing.
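The shared contract means one request payload works against every backend; only the base URL changes. A minimal sketch, assuming placeholder ports and a generic served-model name (both are assumptions, not the lab's configuration):

```python
import json

# Hypothetical per-engine base URLs; ports are placeholders.
BACKENDS = {
    "vllm-v1":   "http://localhost:8000",
    "sglang":    "http://localhost:8001",
    "llama.cpp": "http://localhost:8002",
    "tgi":       "http://localhost:8003",
}

def build_request(backend, prompt, max_tokens=128):
    """Return (url, body) for one backend under the shared OpenAI-style contract."""
    url = BACKENDS[backend] + "/v1/chat/completions"
    body = json.dumps({
        "model": "default",  # placeholder served-model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,      # streaming is needed to measure first-token latency and token cadence
    })
    return url, body
```

Holding the payload constant is what makes the matrix a fair comparison: any latency or throughput difference is attributable to the engine, not the client.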
Main result: one global vLLM pool is a poor default for mixed traffic. The best profile was vllm-v1/class-aware-router, which gave the strongest combined tradeoff across first-token latency, token cadence, slow-reader isolation, and useful throughput.
The practical advice is to split lanes before changing kernels. Keep interactive traffic on a smaller max_num_batched_tokens budget with moderate max_num_seqs, then move long-context and batch work to a separate pool with a larger token budget.
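The two-pool split can be expressed directly with vLLM's serve flags. The model name, ports, and budget values below are placeholders for illustration, not the lab's measured optima.

```shell
# Interactive pool: small token budget, moderate concurrency.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 64 &

# Long-context / batch pool: larger token budget, higher concurrency.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8001 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 256 &
```

The router then steers interactive classes to port 8000 and long-prefill or batch classes to port 8001, which is the lane split the results above favor.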
Chunked prefill needs the same treatment. Keep max_long_partial_prefills below max_num_partial_prefills so short prompts can still enter the scheduler while long prompts are being processed. Treat token budget, sequence count, partial-prefill limits, and stream interval as workload controls.
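The chunked-prefill knobs map to serve flags as well. A sketch with placeholder thresholds; the point is the invariant, not the specific numbers:

```shell
# Keeping --max-long-partial-prefills below --max-num-partial-prefills
# reserves partial-prefill slots for short prompts while long prompts
# are chunked through. Threshold values here are illustrative.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill \
  --max-num-partial-prefills 4 \
  --max-long-partial-prefills 2 \
  --long-prefill-token-threshold 4096
```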
The scale run made the routing result clearer. The single shared pool failed the time-to-first-token / inter-token-latency (TTFT/ITL) gate under the larger workload, while class-aware routing accepted nearly all requests. Prefix-cache-only and slow-client-only profiles helped their own lanes, but neither worked as a whole-system strategy.
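A latency gate of this kind reduces to percentile checks over per-request samples. A minimal sketch; the p99 targets below are placeholders, not the lab's actual SLOs:

```python
def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples (seconds)."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
    return ordered[idx]

def passes_gate(ttft_samples, itl_samples, ttft_p99=2.0, itl_p99=0.25):
    """True if both p99 TTFT and p99 ITL are under their targets."""
    return (percentile(ttft_samples, 99) <= ttft_p99 and
            percentile(itl_samples, 99) <= itl_p99)
```

Running the same gate over each profile's samples is what turns "class-aware routing accepted nearly all requests" into a pass/fail comparison rather than a judgment call.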
The source-build run added useful deployment checks. Debian needed a larger rootfs and then hit a GCC version gate. Fedora 44 with GCC 16, Python 3.14, VLLM_TARGET_DEVICE=cpu, MAX_JOBS=4, and numactl-devel built the extension modules successfully. The build also surfaced a deployment note worth keeping: install tcmalloc for better runtime performance.
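The working Fedora recipe can be sketched as follows. Package names, the requirements file path, and the tcmalloc library path are assumptions that may differ by vLLM version and distro; treat this as an outline of the reported build, not a verified script.

```shell
# Build deps; gperftools-libs supplies tcmalloc (package name assumed).
sudo dnf install -y gcc gcc-c++ numactl-devel gperftools-libs
git clone https://github.com/vllm-project/vllm.git && cd vllm

# CPU-target build, capped at 4 parallel compile jobs.
VLLM_TARGET_DEVICE=cpu MAX_JOBS=4 pip install -e . --no-build-isolation

# Deployment note surfaced by the build: preload tcmalloc at runtime.
# The .so path is distro-dependent; this one is an assumption.
export LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so.4
```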
The hybrid KV lab covers the PagedAttention rewrite path. It keeps vLLM's block ownership, prefix sharing, refcounts, partial blocks, eviction, and copy-on-write semantics, then exposes compact logical spans to future kernels instead of a long per-token block table.
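The refcount and copy-on-write semantics the rewrite preserves can be shown in a toy block table. This is a deliberately minimal sketch; vLLM's real block manager adds prefix hashing, eviction, and partial-block handling on top of these rules.

```python
class Block:
    """One physical KV block with a reference count."""
    def __init__(self, block_id):
        self.block_id = block_id
        self.refcount = 1

class BlockTable:
    """Toy per-sequence table of physical blocks in logical order."""
    _next_id = 0  # global physical-block allocator

    def __init__(self):
        self.blocks = []

    @classmethod
    def _alloc(cls):
        blk = Block(cls._next_id)
        cls._next_id += 1
        return blk

    def append_block(self):
        self.blocks.append(self._alloc())
        return self.blocks[-1]

    def fork(self):
        """Prefix sharing: a fork bumps refcounts and copies nothing."""
        child = BlockTable()
        child.blocks = list(self.blocks)
        for blk in child.blocks:
            blk.refcount += 1
        return child

    def write_last(self):
        """Copy-on-write: clone the last block only if it is shared."""
        blk = self.blocks[-1]
        if blk.refcount > 1:
            blk.refcount -= 1
            self.blocks[-1] = self._alloc()
        return self.blocks[-1]
```

Exposing compact logical spans to a kernel then means handing it runs of consecutive `block_id`s rather than one table entry per token, which is the layout change the hybrid lab evaluates.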
The correctness oracle compares each layout against a plain reference implementation across odd lengths, MQA/GQA, ALiBi, sliding windows, prefix reuse, copy-on-write, partial final blocks, and FP8-style scaling. The first run passed 30/30 cases and ranked virtual-contiguous and hybrid-prefix-shared as the first layouts to profile on hardware.
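The oracle's core loop is a tolerance comparison of each layout's output against the reference on every case. A sketch under assumed names; `layouts` and `reference` stand in for the lab's harness and are not its actual API:

```python
def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two output vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

def run_oracle(cases, layouts, reference, tol=1e-3):
    """Map each layout name to True iff it matches the reference on all cases."""
    results = {}
    for name, layout_fn in layouts.items():
        results[name] = all(
            max_abs_diff(layout_fn(case), reference(case)) <= tol
            for case in cases
        )
    return results
```

An all-pass result like the 30/30 above says the layouts are numerically interchangeable, which is what licenses ranking them purely on expected kernel performance.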