2407.08608
Cited By
v1
v2 (latest)
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
11 July 2024
Jay Shah
Ganesh Bikshandi
Ying Zhang
Vijay Thakkar
Pradeep Ramani
Tri Dao
ArXiv (abs)
PDF
HTML
HuggingFace (1 upvote)
Github (23064★)
Papers citing
"FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision"
50 / 136 papers shown
Fast LLM Post-training via Decoupled and Fastest-of-N Speculation
Rongxin Cheng
Kai Zhou
Xingda Wei
Siyuan Liu
Mingcong Han
...
Yeju Zhou
Baoquan Zhong
W. L. Xiao
Rong Chen
Haibo Chen
OffRL
LRM
525
0
0
24 Dec 2025
RELIC: Interactive Video World Model with Long-Horizon Memory
Yicong Hong
Yiqun Mei
Chongjian Ge
Yiran Xu
Yang Zhou
...
Eli Shechtman
Kalyan Sunkavalli
Feng Liu
Z. Li
Hao Tan
VGen
VLM
421
24
0
03 Dec 2025
PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
Xiaolong Li
Youping Gu
Xi Lin
Weijie Wang
Bohan Zhuang
206
2
0
03 Dec 2025
Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
X. S. Hu
Zhanchao Zhou
Ruiqi Liang
Zehuan Li
Wei Wu
Jianguo Li
333
1
0
28 Nov 2025
IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference
Wanli Zhong
Haibo Feng
Zirui Zhou
Hanyang Peng
Shiqi Yu
MQ
374
1
0
26 Nov 2025
QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation
Xinguo Zhu
Shaohui Peng
Jiaming Guo
Yunji Chen
Qi Guo
...
Qirui Zhou
Ke Gao
Yanjun Wu
Chen Zhao
Ling Li
137
9
0
25 Nov 2025
Block Cascading: Training Free Acceleration of Block-Causal Video Models
Hmrishav Bandyopadhyay
Nikhil Pinnaparaju
Rahim Entezari
Jim Scott
Yi-Zhe Song
Varun Jampani
VGen
202
2
0
25 Nov 2025
HunyuanVideo 1.5 Technical Report
Bing Wu
Chang Zou
Changlin Li
Duojun Huang
Fang Yang
...
Zhihe Yang
Zilin Yang
Z. Lu
Zixiang Zhou
Zhao Zhong
DiffM
VGen
465
44
0
24 Nov 2025
NeAR: Coupled Neural Asset-Renderer Stack
Hong Li
Chongjie Ye
Houyuan Chen
Weiqing Xiao
Ziyang Yan
...
Yikai Wang
Baochang Zhang
Xiaoguang Han
Jiaolong Yang
Hao Zhao
MoE
228
0
0
23 Nov 2025
AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
Genghan Zhang
Shaowei Zhu
Anjiang Wei
Zhenyu Song
Allen Nie
Zhen Jia
Nandita Vijaykumar
Yida Wang
K. Olukotun
163
3
0
19 Nov 2025
Global Cross-Time Attention Fusion for Enhanced Solar Flare Prediction from Multivariate Time Series
Onur Vural
S. M. Hamdi
S. F. Boubrahimi
AI4TS
194
0
0
17 Nov 2025
MACKO: Sparse Matrix-Vector Multiplication for Low Sparsity
Vladimír Macko
Vladimír Boža
167
3
0
17 Nov 2025
LEMUR: Large scale End-to-end MUltimodal Recommendation
Computers & graphics (CG), 2024
Xintian Han
Honggang Chen
Quan Lin
Jingyue Gao
X. Ren
...
Zhe Wang
Yuchao Zheng
Jingjian Lin
Di Wu
Junfeng Ge
OffRL
286
6
0
14 Nov 2025
Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models
Tianyu Fu
Yichen You
Z. Chen
Guohao Dai
Huazhong Yang
Yu Wang
LRM
250
7
0
11 Nov 2025
TurkEmbed4Retrieval: Turkish Embedding Model for Retrieval Task
Özay Ezerceli
Gizem Gümüşçekiçci
Tuğba Erkoç
Berke Özenç
RALM
135
0
0
10 Nov 2025
PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization
Kelun Lei
Hailong Yang
H. Zhang
Xin You
Kaige Zhang
Zhongzhi Luan
Yi Liu
Depei Qian
213
7
0
09 Nov 2025
Hilbert-Guided Sparse Local Attention
Yunge Li
Lanyu Xu
175
0
0
08 Nov 2025
Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving
Hui Zeng
Daming Zhao
Pengfei Yang
WenXuan Hou
Tianyang Zheng
Hui Li
Weiye Ji
Jidong Zhai
298
2
0
08 Nov 2025
Rethinking Metrics and Diffusion Architecture for 3D Point Cloud Generation
Matteo Bastico
David Ryckelynck
Laurent Corté
Yannick Tillier
Etienne Decencière
425
2
0
07 Nov 2025
DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing
Lei Gao
Chaoyi Jiang
Hossein Entezari Zarch
Daniel Wong
M. Annavaram
126
1
0
06 Nov 2025
Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
Bozhi You
Irene Wang
Zelal Su "Lain" Mustafaoglu
Abhinav Jangda
Angélica Moreira
Roshan Dathathri
Divya Mahajan
Keshav Pingali
243
0
0
03 Nov 2025
MotionStream: Real-Time Video Generation with Interactive Motion Controls
Joonghyuk Shin
Zhengqi Li
Richard Zhang
Jun-Yan Zhu
Jaesik Park
Eli Shechtman
Xun Huang
DiffM
VGen
489
33
0
03 Nov 2025
Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects
Mansi Choudhary
Karthik Sangaiah
Sonali Singh
Muhammad Osama
Lisa Wu Wills
Ganesh Dasika
103
0
0
03 Nov 2025
Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability
Hen-Hsen Huang
128
0
0
03 Nov 2025
Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse
Shaojie Wang
Jinghui Wang
Yinghan Cui
Xuxing Chen
Chao Wang
...
Xiaojiang Zhang
J. Peng
Li Wan
Haotian Zhang
Bin Chen
MoMe
218
2
0
01 Nov 2025
SpecAttn: Speculating Sparse Attention
Harsh Shah
164
0
0
31 Oct 2025
Loquetier: A Virtualized Multi-LoRA Framework for Unified LLM Fine-tuning and Serving
Y. Zhang
Hanyue Du
Chun Cao
Jingwei Xu
158
0
0
30 Oct 2025
Decoupled MeanFlow: Turning Flow Models into Flow Maps for Accelerated Sampling
Kyungmin Lee
Sihyun Yu
Jinwoo Shin
AI4CE
302
7
0
28 Oct 2025
CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
Zijian Zhang
Rong Wang
Shiyang Li
Yuebo Luo
Mingyi Hong
Caiwen Ding
211
16
0
23 Oct 2025
Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References
Hongzheng Chen
Bin Fan
Alexander Collins
Bastian Hagedorn
Evghenii Gaburov
...
M. Brookhart
Chris Sullivan
Jason Knight
Zhiru Zhang
Vinod Grover
175
3
0
16 Oct 2025
video-SALMONN S: Memory-Enhanced Streaming Audio-Visual LLM
Guangzhi Sun
Yixuan Li
Xiaodong Wu
Yudong Yang
Wei Li
Zejun Ma
Chao Zhang
127
1
0
13 Oct 2025
MIRAGE: Runtime Scheduling for Multi-Vector Image Retrieval with Hierarchical Decomposition
Maoliang Li
K. Li
Yaoyang Liu
Jiayu Chen
Zihao Zheng
Yinjun Wu
Xiang Chen
179
2
0
10 Oct 2025
From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill
Gunjun Lee
Jiwon Kim
Jaiyoung Park
Y. Lee
Jung Ho Ahn
MoE
164
1
0
09 Oct 2025
Vectorized FlashAttention with Low-cost Exponential Computation in RISC-V Vector Processors
Vasileios Titopoulos
K. Alexandridis
G. Dimitrakopoulos
156
0
0
08 Oct 2025
The Anatomy of a Triton Attention Kernel
Burkhard Ringlein
Jan van Lunteren
Radu Stoica
Thomas Parnell
118
2
0
07 Oct 2025
SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning LLMs
Dachuan Shi
Abedelkadir Asi
Keying Li
Xiangchi Yuan
Leyan Pan
Wenke Lee
Wen Xiao
LRM
282
6
0
06 Oct 2025
The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures
Alexander Fichtl
Jeremias Bohn
Josefin Kelber
Edoardo Mosca
Georg Groh
175
0
0
06 Oct 2025
Emergent Coordination in Multi-Agent Language Models
Christoph Riedl
LLMAG
181
1
0
05 Oct 2025
Accelerating Attention with Basis Decomposition
Jialin Zhao
198
0
0
02 Oct 2025
Litespark Technical Report: High-Throughput, Energy-Efficient LLM Training Framework
Nii Osae Osae Dade
Moinul Hossain Rahat
171
0
0
02 Oct 2025
A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration
Rohit Jena
Vedant Zope
Pratik Chaudhari
James C. Gee
FedML
159
0
0
29 Sep 2025
UniLat3D: Geometry-Appearance Unified Latents for Single-Stage 3D Generation
Guanjun Wu
Jiemin Fang
Chen Yang
Sikuang Li
Taoran Yi
...
Xiaopeng Zhang
Wei Wei
Wenyu Liu
Xinggang Wang
Qi Tian
227
8
0
29 Sep 2025
Pretraining Large Language Models with NVFP4
Nvidia
Felix Abecassis
Anjulie Agrusa
Dong Ahn
Jonah Alben
...
Yujia Zhai
Ruoxi Zhang
Jingyang Zhu
Zhongbo Zhu
394
27
0
29 Sep 2025
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
Jintao Zhang
Haoxu Wang
Kai Jiang
Shuo Yang
Kaiwen Zheng
...
Min Zhao
Ion Stoica
Joseph E. Gonzalez
Jun Zhu
Jianfei Chen
233
22
0
28 Sep 2025
Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment
Deokjae Lee
Hyun Oh Song
MQ
275
0
0
24 Sep 2025
Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute
Felipe Oviedo
Fiodar Kazhamiaka
Esha Choukse
Allen Kim
Amy Luers
Melanie Nakagawa
Ricardo Bianchini
J. L. Ferres
200
5
0
24 Sep 2025
Mamba Modulation: On the Length Generalization of Mamba
Peng Lu
Jerry Huang
Qiuhao Zeng
X. Wang
Boxing Wang
Philippe Langlais
Yufei Cui
Mamba
374
0
0
23 Sep 2025
CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure
Boao Kong
Junzhu Liang
Yuxi Liu
Renjia Deng
Kun Yuan
222
2
0
23 Sep 2025
Patent Language Model Pretraining with ModernBERT
Amirhossein Yousefiramandi
Ciaran Cooney
AILaw
VLM
400
2
0
18 Sep 2025
When Inverse Data Outperforms: Exploring the Pitfalls of Mixed Data in Multi-Stage Fine-Tuning
Mengyi Deng
Xin Li
T. Zhu
Zhicheng YANG
Zhijiang Guo
Wei Wang
185
0
0
16 Sep 2025