Mini vLLM Inference Server

A from-scratch LLM serving engine with continuous batching and a KV cache.

Python, FastAPI, CUDA

Overview

A from-scratch large language model serving engine built to understand the internals of high-throughput inference: continuous batching, a paged KV cache, and a simple scheduler.
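To make the "paged" part of the KV cache concrete, here is a minimal sketch of block-table bookkeeping in plain Python. It is not taken from this project: the class name `PagedKVAllocator`, its methods, and the block sizes are hypothetical, and it tracks only which physical block and offset each token would occupy, not the tensors themselves.

```python
from __future__ import annotations


class PagedKVAllocator:
    """Toy bookkeeping for a paged KV cache (hypothetical, illustration only)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size                 # tokens per physical block
        self.free_blocks = list(range(num_blocks))   # ids of unused blocks
        self.block_tables: dict[str, list[int]] = {} # seq id -> physical blocks
        self.seq_lens: dict[str, int] = {}           # seq id -> tokens stored

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:
            # The last block is full (or the sequence has none yet).
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are handed out on demand and returned as soon as a sequence finishes, memory is not reserved up front for each request's maximum possible length.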

Highlights

Continuous batching to keep the GPU saturated across requests.
KV cache management so that attention keys and values computed for earlier tokens are reused at each decoding step instead of being recomputed.
A small FastAPI surface for submitting and streaming completions.
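The continuous batching idea in the first highlight can be sketched as a scheduling loop that refills free batch slots after every decode step, rather than waiting for the whole batch to finish. This is an illustrative sketch, not the project's scheduler: `Request`, `fake_decode_step`, and `run_continuous_batching` are hypothetical names, and the model forward pass is replaced by a stand-in that returns a dummy token per sequence.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: str
    prompt: list[int]
    max_new_tokens: int  # assumed to be >= 1
    output: list[int] = field(default_factory=list)


def fake_decode_step(batch: list[Request]) -> list[int]:
    # Stand-in for one batched forward pass: one dummy token per sequence.
    return [len(r.prompt) + len(r.output) for r in batch]


def run_continuous_batching(
    requests: list[Request], max_batch_size: int
) -> tuple[list[Request], int]:
    """Run all requests to completion; return (finished requests, decode steps)."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    waiting = deque(requests)
    running: list[Request] = []
    finished: list[Request] = []
    steps = 0
    while waiting or running:
        # Admit waiting requests as soon as a batch slot is free.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        tokens = fake_decode_step(running)
        steps += 1
        still_running = []
        for req, tok in zip(running, tokens):
            req.output.append(tok)
            if len(req.output) >= req.max_new_tokens:
                finished.append(req)  # slot is released for the next step
            else:
                still_running.append(req)
        running = still_running
    return finished, steps
```

A request that finishes early frees its slot immediately, so a waiting request joins on the very next step. A real engine would also check KV cache capacity before admitting a request; that check is omitted here.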