Figure: Illustration of the low-cost deployment pipeline (quantization, distillation, KV cache management, runtime co-design).
Overview
This project targets cost-efficient deployment of large language models on edge and consumer devices. We study quantization, distillation, structured sparsity, KV cache strategies, prompt compression, and runtime co-design to reduce memory footprint and inference latency under real-world resource constraints.
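As a concrete instance of the quantization lever mentioned above, here is a minimal sketch of symmetric per-tensor int8 weight quantization in plain NumPy. The function names and the mock weight matrix are illustrative only, not part of any specific toolkit in this project:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated as scale * q."""
    scale = np.abs(w).max() / 127.0                       # map largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an fp32 approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)       # mock fp32 weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

print(f"memory: {w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB")  # 4x smaller
print(f"max abs error: {np.abs(w - w_hat).max():.4f}")
```

Real deployments typically refine this with per-channel or per-group scales and lower bit widths (e.g., int4), but the memory arithmetic (fp32 to int8 is a 4x reduction) is the same.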
Technical Tracks
- Quantization: low-bit weight and activation formats (e.g., int8/int4) to shrink model size.
- Distillation: transferring capability from large teacher models into compact student models.
- Structured sparsity: pruning patterns that map efficiently onto target hardware.
- KV cache strategies: bounding attention-cache memory during long-context decoding (toy sketch after this list).
- Prompt compression: shortening inputs to cut prefill cost.
- Runtime co-design: matching kernels, scheduling, and memory layout to the target device.
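To make the KV cache track concrete, below is a toy sketch of one common strategy, sliding-window eviction, assuming a single head and layer with fp16 storage. The class and parameter names are our own illustration, not any particular runtime's API:

```python
import numpy as np

class SlidingWindowKVCache:
    """Toy KV cache with sliding-window eviction (one head, one layer).

    Caps cache memory at `window` tokens; a production runtime would also
    consider quantized KV storage and paged allocation."""

    def __init__(self, window: int, d: int):
        self.window, self.d = window, d
        self.k = np.empty((0, d), dtype=np.float16)
        self.v = np.empty((0, d), dtype=np.float16)

    def append(self, k_t: np.ndarray, v_t: np.ndarray):
        """Add this step's key/value and evict the oldest entries past the window."""
        self.k = np.vstack([self.k, k_t])[-self.window:]
        self.v = np.vstack([self.v, v_t])[-self.window:]

    def attend(self, q_t: np.ndarray) -> np.ndarray:
        """Attention over the cached keys/values for a single query vector."""
        scores = (self.k @ q_t) / np.sqrt(self.d)         # (cache_len,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.v                           # weighted sum of values

d = 64
cache = SlidingWindowKVCache(window=256, d=d)
for _ in range(1000):                                     # decode 1000 steps
    k_t, v_t, q_t = (np.random.randn(d).astype(np.float16) for _ in range(3))
    cache.append(k_t, v_t)
    out = cache.attend(q_t)
print(cache.k.shape)                                      # (256, 64): bounded by the window
```

The design point here is that cache memory stays constant regardless of sequence length, trading away attention over evicted tokens; other strategies in this track (e.g., KV quantization or selective retention) make different accuracy/memory trade-offs.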
Outputs
- Planned outputs: technical reports, open-source toolkits, and deployment case studies.