Overview
We develop techniques to deploy foundation models on edge devices under tight memory and latency budgets, combining model compression (quantization, distillation, pruning/sparsity) with runtime and system optimizations.
Techniques
- Quantization-aware training and post-training quantization for LLMs (PTQ sketch after this list)
- Task- and domain-aware distillation from LLMs to compact students (loss sketch after this list)
- Structured sparsity and low-rank adaptation for fast inference (LoRA sketch after this list)
- Runtime co-design: caching, scheduling, and memory-aware batching (batching sketch after this list)
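As a concrete starting point for post-training quantization, the sketch below shows symmetric per-tensor int8 quantization of a single weight matrix. It assumes PyTorch; the function names (`quantize_int8`, `dequantize`) and the toy weight shape are illustrative, not our production code.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor int8 quantization: one scale maps the largest
    # absolute weight onto the int8 range [-127, 127].
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximate float32 weight for accuracy checks.
    return q.to(torch.float32) * scale

# Toy example: measure the reconstruction error on a random weight matrix.
w = torch.randn(1024, 1024)
q, scale = quantize_int8(w)
err = (dequantize(q, scale) - w).abs().mean()
print(f"mean abs quantization error: {err.item():.6f}")
```

Real deployments typically use finer granularity (per-channel or per-group scales) and calibration data, but the scale/round/clamp structure is the same.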
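For distillation, a common objective mixes a temperature-softened KL term against the teacher with the usual hard-label loss. A minimal sketch, assuming PyTorch; `distillation_loss` and its default temperature and mixing weight are illustrative choices, not a prescription.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL between temperature-softened teacher and student
    # distributions, scaled by T^2 so its gradient magnitude matches the
    # hard-label cross-entropy term.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits standing in for teacher and student outputs.
student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```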
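For low-rank adaptation, the idea is to freeze a base linear layer and train only a small rank-r update added to its output. The sketch below assumes PyTorch; `LoRALinear` and its hyperparameters (rank, alpha) are hypothetical names and values for illustration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update x @ A @ B."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapter parameters are trained
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scaling = alpha / rank  # conventional LoRA scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scaling

# Toy usage: wrap an existing projection and run a forward pass.
layer = LoRALinear(nn.Linear(512, 512), rank=8)
y = layer(torch.randn(4, 512))
```

Because `lora_b` starts at zero, the wrapped layer initially behaves exactly like the frozen base layer, and the adapter can later be folded back into the base weights for inference.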
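For memory-aware batching, one simple policy is to greedily pack requests under a token budget that stands in for KV-cache memory on the device. The sketch below is plain Python; `memory_aware_batches`, the request tuples, and the budget are assumptions for illustration, not our scheduler.

```python
from typing import Iterable, List, Tuple

def memory_aware_batches(requests: Iterable[Tuple[str, int]],
                         max_tokens_per_batch: int) -> List[List[str]]:
    """Greedily pack (request_id, token_count) pairs into batches whose
    total token count stays under a fixed budget. A request larger than
    the budget gets a batch of its own."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for req_id, n_tokens in requests:
        if current and used + n_tokens > max_tokens_per_batch:
            batches.append(current)
            current, used = [], 0
        current.append(req_id)
        used += n_tokens
    if current:
        batches.append(current)
    return batches

# Toy usage with a 1024-token budget.
print(memory_aware_batches([("a", 900), ("b", 300), ("c", 500)], 1024))
# -> [['a'], ['b', 'c']]
```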