Overview
We develop techniques to deploy foundation models on edge devices under tight memory and latency budgets, combining model compression (quantization, distillation, pruning/sparsity) with runtime and system optimizations.
Techniques
- Quantization-aware training and post-training quantization for LLMs (PTQ sketch after this list)
- Task- and domain-aware distillation from LLMs to compact students (loss sketch after this list)
- Structured sparsity and low-rank adaptation for fast inference (LoRA sketch after this list)
- Runtime co-design: caching, scheduling, and memory-aware batching (batching sketch after this list)
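As a concrete starting point for post-training quantization, the sketch below shows symmetric per-tensor int8 quantization of a single weight matrix. It assumes PyTorch; the function names (`quantize_int8`, `dequantize`) and the toy weight shape are illustrative, not our production code.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor int8 quantization: one scale maps the largest
    # absolute weight onto the int8 range [-127, 127].
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximate float32 weight for accuracy checks.
    return q.to(torch.float32) * scale

# Toy example: measure the reconstruction error on a random weight matrix.
w = torch.randn(1024, 1024)
q, scale = quantize_int8(w)
err = (dequantize(q, scale) - w).abs().mean()
print(f"mean abs quantization error: {err.item():.6f}")
```

Real deployments typically use finer granularity (per-channel or per-group scales) and calibration data, but the scale/round/clamp structure is the same.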
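For distillation, a common objective mixes a temperature-softened KL term against the teacher with the usual hard-label loss. A minimal sketch, assuming PyTorch; `distillation_loss` and its default temperature and mixing weight are illustrative choices, not a prescription.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL between temperature-softened teacher and student
    # distributions, scaled by T^2 so its gradient magnitude matches the
    # hard-label cross-entropy term.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits standing in for teacher and student outputs.
student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```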
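For low-rank adaptation, the idea is to freeze a base linear layer and train only a small rank-r update added to its output. The sketch below assumes PyTorch; `LoRALinear` and its hyperparameters (rank, alpha) are hypothetical names and values for illustration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update x @ A @ B."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapter parameters are trained
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scaling = alpha / rank  # conventional LoRA scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scaling

# Toy usage: wrap an existing projection and run a forward pass.
layer = LoRALinear(nn.Linear(512, 512), rank=8)
y = layer(torch.randn(4, 512))
```

Because `lora_b` starts at zero, the wrapped layer initially behaves exactly like the frozen base layer, and the adapter can later be folded back into the base weights for inference.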
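For memory-aware batching, one simple policy is to greedily pack requests under a token budget that stands in for KV-cache memory on the device. The sketch below is plain Python; `memory_aware_batches`, the request tuples, and the budget are assumptions for illustration, not our scheduler.

```python
from typing import Iterable, List, Tuple

def memory_aware_batches(requests: Iterable[Tuple[str, int]],
                         max_tokens_per_batch: int) -> List[List[str]]:
    """Greedily pack (request_id, token_count) pairs into batches whose
    total token count stays under a fixed budget. A request larger than
    the budget gets a batch of its own."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for req_id, n_tokens in requests:
        if current and used + n_tokens > max_tokens_per_batch:
            batches.append(current)
            current, used = [], 0
        current.append(req_id)
        used += n_tokens
    if current:
        batches.append(current)
    return batches

# Toy usage with a 1024-token budget.
print(memory_aware_batches([("a", 900), ("b", 300), ("c", 500)], 1024))
# -> [['a'], ['b', 'c']]
```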