Mini vLLM Inference Server
A from-scratch LLM serving engine with continuous batching and a KV cache.
Python · FastAPI · CUDA
Overview
A from-scratch large language model serving engine built to understand the internals of high-throughput inference: continuous batching, a paged KV cache, and a simple scheduler.
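To make the scheduling idea concrete, here is a minimal sketch of continuous batching in plain Python. It is not this project's actual scheduler: the names `Request`, `ContinuousBatcher`, and `run_to_completion` are hypothetical, and the model forward pass is replaced by a counter that emits one token per running request per step.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    req_id: str
    max_new_tokens: int
    generated: int = 0

    @property
    def finished(self) -> bool:
        return self.generated >= self.max_new_tokens


class ContinuousBatcher:
    """Toy scheduler: admits and evicts requests at every decoding step."""

    def __init__(self, max_batch_size: int) -> None:
        self.max_batch_size = max_batch_size
        self.waiting = deque()  # requests not yet admitted
        self.running = []       # requests in the current batch

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> list:
        """Run one decoding iteration; return ids of requests that finished."""
        # Fill any free batch slots from the waiting queue before decoding.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        # Stand-in for one batched forward pass: one new token per request.
        for req in self.running:
            req.generated += 1
        done = [r.req_id for r in self.running if r.finished]
        # Drop finished requests now so their slots are free on the next step.
        self.running = [r for r in self.running if not r.finished]
        return done


def run_to_completion(batcher: ContinuousBatcher) -> list:
    """Step until idle; record (step_number, finished_ids) when something finishes."""
    timeline = []
    step_no = 0
    while batcher.waiting or batcher.running:
        step_no += 1
        done = batcher.step()
        if done:
            timeline.append((step_no, done))
    return timeline


batcher = ContinuousBatcher(max_batch_size=2)
batcher.submit(Request("A", max_new_tokens=3))
batcher.submit(Request("B", max_new_tokens=1))
batcher.submit(Request("C", max_new_tokens=2))
timeline = run_to_completion(batcher)
print(timeline)  # [(1, ['B']), (3, ['A', 'C'])]
```

Request C takes over B's slot at step 2 and finishes alongside A at step 3. A static batcher that waited for the whole batch to drain would not start C until step 4.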
Highlights
- Continuous batching: requests join and leave the running batch at each decoding step, which keeps the GPU busy as requests arrive and finish.
- Paged KV cache management, so each decoding step reuses the attention keys and values of earlier tokens instead of recomputing them.
- A small FastAPI surface for submitting and streaming completions.
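The paged KV cache mentioned above can be sketched as a block allocator. This is bookkeeping only, under assumed names (`PagedKVCache`, `append_token`, `free` are hypothetical, not this project's API): it tracks which fixed-size physical block each token's keys and values would occupy, without holding any tensors.

```python
class PagedKVCache:
    """Toy block table: maps each sequence to fixed-size blocks of KV slots."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        # Reversed so that pop() hands out block 0 first.
        self.free_blocks = list(range(num_blocks - 1, -1, -1))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> tuple:
        """Reserve a slot for one more token; return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        # A new block is needed only when the last one is full (or none exists).
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real scheduler might preempt")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        table = self.block_tables[seq_id]
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id: str) -> int:
        """Return a finished sequence's blocks to the pool; report how many."""
        blocks = self.block_tables.pop(seq_id, [])
        self.seq_lens.pop(seq_id, None)
        self.free_blocks.extend(blocks)
        return len(blocks)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots_a = [cache.append_token("a") for _ in range(3)]
slot_b = cache.append_token("b")
freed = cache.free("a")
print(slots_a)  # [(0, 0), (0, 1), (1, 0)]
print(slot_b)   # (2, 0)
print(freed)    # 2
```

Because blocks are allocated on demand, a sequence wastes at most one partly filled block, and blocks freed by a finished request are immediately available to any other sequence.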