Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU
Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, John D. Owens
arXiv:2301.03598 · 9 January 2023

Papers citing "Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU" (13 papers)

Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving
Yaoyao Ding, Bohan Hou, X. Zhang, Allan Lin, Tianqi Chen, Cody Hao Yu, Yida Wang, Gennady Pekhimenko
17 Apr 2025

Fast Matrix Multiplications for Lookup Table-Quantized LLMs
Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim
MQ · 20 Jan 2025

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, ..., Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze
02 Jan 2025

GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference
Chao Zeng, Songwei Liu, Shu Yang, Fangmin Chen, Xing Mei, Lean Fu
MQ · 23 Dec 2024

SSSD: Simply-Scalable Speculative Decoding
Michele Marzollo, Jiawei Zhuang, Niklas Roemer, Lorenz K. Müller, Lukas Cavigelli
LRM · 08 Nov 2024

POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
Aditya K. Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, R. Ramjee, Ashish Panwar
23 Oct 2024

MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
Elias Frantar, Roberto L. Castro, Jiale Chen, Torsten Hoefler, Dan Alistarh
MQ · 21 Aug 2024

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao
11 Jul 2024

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion
Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, ..., Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu
11 Jun 2024

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers
Rya Sanovar, Srikant Bharadwaj, Renée St. Amant, Victor Rühle, Saravan Rajmohan
17 May 2024

Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK work decomposition
Adnan Hoque, Less Wright, Chih-Chieh Yang, M. Srivatsa, R. Ganti
05 Jan 2024

Relax: Composable Abstractions for End-to-End Dynamic Machine Learning
Ruihang Lai, Junru Shao, Siyuan Feng, Steven Lyubomirsky, Bohan Hou, ..., Sunghyun Park, Prakalp Srivastava, Jared Roesch, T. Mowry, Tianqi Chen
01 Nov 2023

A Framework for Fine-Grained Synchronization of Dependent GPU Kernels
Abhinav Jangda, Saeed Maleki, M. Dehnavi, Madan Musuvathi, Olli Saarikivi
22 May 2023