Overview

We develop efficient inference techniques for large language models, focusing on reducing memory footprint and latency while preserving output quality. Directions include KV cache pruning and merging, prompt compression, streaming state retention, and adaptive decoding; a toy sketch of the first direction follows below.
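
As a purely illustrative sketch of the KV cache pruning direction (not our actual method), the snippet below evicts cached tokens under a fixed budget by keeping those with the highest cumulative attention mass plus a recent window. The arrays `keys`, `values`, and `attn_history`, the function `prune_kv_cache`, and the scoring heuristic are all hypothetical placeholders introduced for this example.

```python
# Toy sketch: attention-score-based KV cache pruning (single head, illustrative only).
# `keys`/`values` are (seq_len, head_dim) arrays; `attn_history` holds the
# cumulative attention mass each cached token has received so far (assumed given).
import numpy as np

def prune_kv_cache(keys, values, attn_history, budget, keep_recent=8):
    """Keep the `budget` cached tokens with the highest cumulative attention,
    always retaining the most recent `keep_recent` tokens."""
    assert budget > keep_recent, "budget must leave room beyond the recent window"
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values, attn_history

    recent = np.arange(seq_len - keep_recent, seq_len)   # always kept
    older = np.arange(seq_len - keep_recent)              # eviction candidates
    n_keep_old = budget - keep_recent
    # Rank older tokens by accumulated attention and keep the top scorers.
    top_old = older[np.argsort(attn_history[older])[-n_keep_old:]]
    keep = np.sort(np.concatenate([top_old, recent]))

    return keys[keep], values[keep], attn_history[keep]

# Example: prune a 32-token cache down to a 16-entry budget.
rng = np.random.default_rng(0)
k = rng.normal(size=(32, 64))
v = rng.normal(size=(32, 64))
scores = rng.random(32)
k2, v2, s2 = prune_kv_cache(k, v, scores, budget=16)
print(k2.shape)  # (16, 64)
```

The design choice shown here, scoring cached tokens by how much attention they have attracted while protecting a recency window, is one common heuristic family; merging, compression, and adaptive variants trade off differently between memory savings and quality.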

Outcomes