arXiv: 2205.14135
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
27 May 2022
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré · VLM
Papers citing "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (50 of 1,418 papers shown)
Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, ..., Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin · 18 Feb 2025

MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation
Sihyun Yu, Meera Hahn, Dan Kondratyuk, Jinwoo Shin, Agrim Gupta, José Lezama, Irfan Essa, David A. Ross, Jonathan Huang · DiffM, VGen · 18 Feb 2025

Bitnet.cpp: Efficient Edge Inference for Ternary LLMs
J. Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei · 17 Feb 2025

Understanding Silent Data Corruption in LLM Training
Jeffrey Ma, Hengzhi Pei, Leonard Lausen, George Karypis · 17 Feb 2025

Low-Rank Thinning
Annabelle Michael Carrell, Albert Gong, Abhishek Shetty, Raaz Dwivedi, Lester W. Mackey · 17 Feb 2025

Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?
Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, Linfeng Zhang · LRM · 17 Feb 2025

MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction
Jingcheng Ni, Yuxin Guo, Yichen Liu, Rui Chen, Lewei Lu, Z. Wu · DiffM, VGen · 17 Feb 2025

Model Cascading for Code: A Cascaded Black-Box Multi-Model Framework for Cost-Efficient Code Completion with Self-Testing
Boyuan Chen, Mingzhi Zhu, Brendan Dolan-Gavitt, Muhammad Shafique, Siddharth Garg · 17 Feb 2025

APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun · 17 Feb 2025

System Message Generation for User Preferences using Open-Source Models
Minbyul Jeong, Jungho Cho, Minsoo Khang, Dawoon Jung, Teakgyu Hong · 17 Feb 2025

Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity
Junhao Hu, Wenrui Huang, Weidong Wang, Zhenwen Li, Tiancheng Hu, Zhixia Liu, Xusheng Chen, Tao Xie, Yizhou Shan · LRM · 16 Feb 2025

KernelBench: Can LLMs Write Efficient GPU Kernels?
Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, Azalia Mirhoseini · ALM · 14 Feb 2025

Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization
Bowen Pang, Kai Li, Ruifeng She, Feifan Wang · OffRL · 14 Feb 2025

InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU
Heejun Lee, G. Park, Jaduk Suh, Sung Ju Hwang · 13 Feb 2025

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli · 12 Feb 2025

Steel-LLM: From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM
Qingshui Gu, Shu Li, Tianyu Zheng, Zhaoxiang Zhang · 10 Feb 2025

GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?
Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen · LRM · 07 Feb 2025

Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
Feng Wang, Yaodong Yu, Guoyizhe Wei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie · ViT · 06 Feb 2025

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, K. K., Amir Gholami · MQ · 05 Feb 2025

Rethinking Homogeneity of Vision and Text Tokens in Large Vision-and-Language Models
Chia-Wen Kuo, Sijie Zhu, Fan Chen, Xiaohui Shen, Longyin Wen · VLM · 04 Feb 2025

Can LLMs Maintain Fundamental Abilities under KV Cache Compression?
Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu · 04 Feb 2025

LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models
Tzu-Tao Chang, Shivaram Venkataraman · VLM · 04 Feb 2025

Unified Spatial-Temporal Edge-Enhanced Graph Networks for Pedestrian Trajectory Prediction
Ruochen Li, Tanqiu Qiao, Stamos Katsigiannis, Zhanxing Zhu, Hubert P. H. Shum · AI4TS · 04 Feb 2025

Adaptive Self-improvement LLM Agentic System for ML Library Development
Genghan Zhang, Weixin Liang, Olivia Hsu, K. Olukotun · 04 Feb 2025

Transformers trained on proteins can learn to attend to Euclidean distance
Isaac Ellmen, Constantin Schneider, Matthew I.J. Raybould, Charlotte M. Deane · 03 Feb 2025

Scalable Language Models with Posterior Inference of Latent Thought Vectors
Deqian Kong, Minglu Zhao, Dehong Xu, Bo Pang, Shu Wang, ..., Zhangzhang Si, Chuan Li, Jianwen Xie, Sirui Xie, Ying Nian Wu · VLM, LRM, BDL · 03 Feb 2025

Vision-centric Token Compression in Large Language Model
Ling Xing, Alex Jinpeng Wang, Rui Yan, J. Tang · VLM · 02 Feb 2025

Position: AI Scaling: From Up to Down and Out
Yunke Wang, Yanxi Li, Chang Xu · HAI · 02 Feb 2025

PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
Songhao Wu, Ang Lv, Xiao Feng, Y. Zhang, Xun Zhang, Guojun Yin, Wei Lin, Rui Yan · MQ · 01 Feb 2025

Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques
Nathaniel Tomczak, Sanmukh Kuppannagari · 31 Jan 2025

Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference
Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji · 28 Jan 2025

GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments
Yanyu Chen, Ganhong Huang · 28 Jan 2025

Qwen2.5-1M Technical Report
A. Yang, Bowen Yu, C. Li, Dayiheng Liu, Fei Huang, ..., Xingzhang Ren, Xinlong Yang, Y. Li, Zhiying Xu, Z. Zhang · 28 Jan 2025

SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs
Mohammad Mozaffari, Amir Yazdanbakhsh, Zhao Zhang, M. Dehnavi · 28 Jan 2025

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. K. Zhou · VLM · 28 Jan 2025

LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion
Zhan Ling, Kang Liu, Kai Yan, Y. Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen · ReLM, ELM, LRM · 25 Jan 2025

Wormhole Memory: A Rubik's Cube for Cross-Dialogue Retrieval
Libo Wang · 24 Jan 2025

ZETA: Leveraging Z-order Curves for Efficient Top-k Attention
Qiuhao Zeng, Jerry Huang, Peng Lu, Gezheng Xu, Boxing Chen, Charles X. Ling, Boyu Wang · 24 Jan 2025

Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, J. Chai, Franziska Meier, Matt Feiszli · 3DGS · 23 Jan 2025

A Survey on Memory-Efficient Large-Scale Model Training in AI for Science
Kaiyuan Tian, Linbo Qiao, Baihui Liu, Gongqingjian Jiang, Dongsheng Li · 21 Jan 2025

HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
Ting Sun, Penghan Wang, Fan Lai · 15 Jan 2025

Estimating Musical Surprisal in Audio
Mathias Rose Bjare, Giorgia Cantisani, Stefan Lattner, Gerhard Widmer · 13 Jan 2025

MPCache: MPC-Friendly KV Cache Eviction for Efficient Private Large Language Model Inference
Wenxuan Zeng, Ye Dong, Jinjin Zhou, Junming Ma, Jin Tan, Runsheng Wang, Meng Li · 12 Jan 2025

Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation
Shunfan Zheng, Xiechi Zhang, Gerard de Melo, Xiaoling Wang, Linlin Wang · LM&MA, ELM · 12 Jan 2025

Multi-subject Open-set Personalization in Video Generation
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming Yang, Sergey Tulyakov · DiffM, VGen · 10 Jan 2025

Clinical Insights: A Comprehensive Review of Language Models in Medicine
Nikita Neveditsin, Pawan Lingras, V. Mago · LM&MA · 08 Jan 2025

Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
Yueqin Yin, Shentao Yang, Yujia Xie, Ziyi Yang, Yuting Sun, Hany Awadalla, Weizhu Chen, Mingyuan Zhou · 07 Jan 2025

TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms
Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Esha Choukse, Haoran Qiu, Rodrigo Fonseca, Josep Torrellas, Ricardo Bianchini · 05 Jan 2025

FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance
Haicheng Wang, Zhemeng Yu, Gabriele Spadaro, Chen Ju, Victor Quétu, Enzo Tartaglione · VLM · 05 Jan 2025

FAST: Fast Audio Spectrogram Transformer
Anugunj Naman, Gaibo Zhang · 03 Jan 2025