
NanoFlow: Towards Optimal Large Language Model Serving Throughput

Identified that existing serving systems underutilize the resources of a single device because they sequentially execute operations with different resource requirements.
Overlapped the use of different resources within a single device by co-scheduling operations, achieving higher resource utilization.

Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Identified sparsity in the attention mechanism of long-context LLMs.
Dynamically selected critical tokens based on the query.

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han

Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Used low-bit quantization of both weights and activations to improve serving throughput.

Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci

From Optimal to Practical: Efficient Micro-op Cache Replacement Policies for Data Center Applications

Characterized the limitations of state-of-the-art replacement policies and the unique properties of the micro-op cache.
Proposed and evaluated a counter-based, profile-guided replacement policy.

Kan Zhu, Yilong Zhao, Yufei Gao, Peter Braun, Tanvir Ahmed Khan, Heiner Litz, Baris Kasikci, Shuwen Deng

Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models

Proposed Fiddler, a resource-efficient inference engine that uses CPU-GPU orchestration to serve MoE models.

Keisuke Kamahori, Yile Gu, Kan Zhu, Baris Kasikci
