Figure: Illustration of the low-cost deployment pipeline (quantization, distillation, KV cache management, runtime co-design).
Overview
This project targets cost-efficient deployment of large language models on edge and consumer devices. We study quantization, distillation, structured sparsity, KV cache strategies, prompt compression, and runtime co-design to reduce memory footprint and inference latency under real-world resource constraints.
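As a concrete instance of the quantization lever mentioned above, here is a minimal sketch of symmetric per-tensor int8 weight quantization in plain NumPy. The function names and the mock weight matrix are illustrative only, not part of any specific toolkit in this project:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated as scale * q."""
    scale = np.abs(w).max() / 127.0                       # map largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an fp32 approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)       # mock fp32 weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

print(f"memory: {w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB")  # 4x smaller
print(f"max abs error: {np.abs(w - w_hat).max():.4f}")
```

Real deployments typically refine this with per-channel or per-group scales and lower bit widths (e.g., int4), but the memory arithmetic (fp32 to int8 is a 4x reduction) is the same.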
Technical Tracks
- Quantization: low-bit weight and activation formats (e.g., int8/int4) to shrink model size.
- Distillation: transferring capability from large teacher models into compact student models.
- Structured sparsity: pruning patterns that map efficiently onto target hardware.
- KV cache strategies: bounding attention-cache memory during long-context decoding (toy sketch after this list).
- Prompt compression: shortening inputs to cut prefill cost.
- Runtime co-design: matching kernels, scheduling, and memory layout to the target device.
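To make the KV cache track concrete, below is a toy sketch of one common strategy, sliding-window eviction, assuming a single head and layer with fp16 storage. The class and parameter names are our own illustration, not any particular runtime's API:

```python
import numpy as np

class SlidingWindowKVCache:
    """Toy KV cache with sliding-window eviction (one head, one layer).

    Caps cache memory at `window` tokens; a production runtime would also
    consider quantized KV storage and paged allocation."""

    def __init__(self, window: int, d: int):
        self.window, self.d = window, d
        self.k = np.empty((0, d), dtype=np.float16)
        self.v = np.empty((0, d), dtype=np.float16)

    def append(self, k_t: np.ndarray, v_t: np.ndarray):
        """Add this step's key/value and evict the oldest entries past the window."""
        self.k = np.vstack([self.k, k_t])[-self.window:]
        self.v = np.vstack([self.v, v_t])[-self.window:]

    def attend(self, q_t: np.ndarray) -> np.ndarray:
        """Attention over the cached keys/values for a single query vector."""
        scores = (self.k @ q_t) / np.sqrt(self.d)         # (cache_len,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.v                           # weighted sum of values

d = 64
cache = SlidingWindowKVCache(window=256, d=d)
for _ in range(1000):                                     # decode 1000 steps
    k_t, v_t, q_t = (np.random.randn(d).astype(np.float16) for _ in range(3))
    cache.append(k_t, v_t)
    out = cache.attend(q_t)
print(cache.k.shape)                                      # (256, 64): bounded by the window
```

The design point here is that cache memory stays constant regardless of sequence length, trading away attention over evicted tokens; other strategies in this track (e.g., KV quantization or selective retention) make different accuracy/memory trade-offs.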
Outputs
- Planned outputs: technical reports, open-source toolkits, and deployment case studies.