
Mini vLLM Inference Server

A from-scratch LLM serving engine with continuous batching and a KV cache.

Python · FastAPI · CUDA

Overview

A from-scratch large language model serving engine built to understand the internals of high-throughput inference: continuous batching, a paged KV cache, and a simple scheduler.
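To make the paged KV cache idea concrete, here is a minimal sketch of the bookkeeping side only: cache memory is split into fixed-size blocks, and each sequence keeps a block table that maps its token positions to physical blocks. The names `BlockAllocator` and `SequenceCache`, and the block sizes used, are hypothetical and chosen for illustration; this is not the project's actual code, and no tensors are stored here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class BlockAllocator:
    """Hands out fixed-size cache blocks from a free list (hypothetical helper)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache is out of free blocks")
        return self.free_blocks.pop()

    def release(self, block_ids: list[int]) -> None:
        self.free_blocks.extend(block_ids)


@dataclass
class SequenceCache:
    """Per-sequence block table: logical token positions -> physical blocks."""

    allocator: BlockAllocator
    block_table: list[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical block id, offset)."""
        offset = self.num_tokens % self.allocator.block_size
        if offset == 0:
            # The last block is full (or none exists yet), so take a new one.
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1
        return self.block_table[-1], offset

    def free(self) -> None:
        """Return all of this sequence's blocks to the allocator."""
        self.allocator.release(self.block_table)
        self.block_table = []
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=4, block_size=2)
seq = SequenceCache(allocator)
slots = [seq.append_token() for _ in range(3)]  # 3 tokens need 2 blocks
```

Because blocks are allocated only when a sequence crosses a block boundary, memory grows with the tokens actually generated rather than with a worst-case length reserved up front, and a finished sequence's blocks can be reused immediately.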

Highlights

  • Continuous batching to keep the GPU saturated across requests.
  • KV cache management that stores the attention keys and values of earlier tokens, so each decoding step avoids recomputing them.
  • A small FastAPI surface for submitting and streaming completions.
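The continuous batching highlight can be illustrated with a small scheduling loop. The sketch below assumes a stand-in `fake_decode_step` in place of a real model forward pass, and the `Request` class and `run_continuous_batching` function are hypothetical names, not the project's API. It shows only the core idea: finished requests leave the batch after every decode step, and waiting requests are admitted into the freed slots before the next one.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: str
    max_new_tokens: int  # assumed to be at least 1
    generated: list = field(default_factory=list)

    @property
    def finished(self) -> bool:
        return len(self.generated) >= self.max_new_tokens


def fake_decode_step(batch):
    """Stand-in for one forward pass: one placeholder token per request."""
    return [f"tok{len(req.generated)}" for req in batch]


def run_continuous_batching(requests, max_batch_size):
    waiting = deque(requests)
    running = []
    completion_order = []
    steps = 0
    while waiting or running:
        # Fill any free batch slots from the waiting queue before each step.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        for req, token in zip(running, fake_decode_step(running)):
            req.generated.append(token)
        steps += 1
        # Retire finished requests right away so their slots free up.
        for req in running:
            if req.finished:
                completion_order.append(req.request_id)
        running = [req for req in running if not req.finished]
    return completion_order, steps


requests = [Request("a", 1), Request("b", 3), Request("c", 2)]
order, steps = run_continuous_batching(requests, max_batch_size=2)
```

With these inputs, request `c` takes over the slot freed by `a` after the first step, so all three requests complete in three decode steps. A static batcher with the same batch size would run `a` and `b` together for three steps and then `c` alone for two.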