Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2205.14135
Cited By
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
27 May 2022
Tri Dao
Daniel Y. Fu
Stefano Ermon
Atri Rudra
Christopher Ré
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness"
50 / 1,418 papers shown
Title
QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration
HamidReza Imani
Jiaxin Peng
Peiman Mohseni
Abdolah Amirany
Tarek A. El-Ghazawi
MoE
19
0
0
10 May 2025
Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models
Isaac Gerber
17
0
0
10 May 2025
Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference
Haolin Zhang
Jeff Huang
19
0
0
09 May 2025
LightNobel: Improving Sequence Length Limitation in Protein Structure Prediction Model via Adaptive Activation Quantization
Seunghee Han
S. Choi
J. Kim
14
0
0
09 May 2025
CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation
Jiahao Li
Weijian Ma
Xueyang Li
Yunzhong Lou
G. Zhou
Xiangdong Zhou
32
0
0
07 May 2025
Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment
Xueyao Zhang
Y. Wang
Chaoren Wang
Z. Li
Zhuo Chen
Zhizheng Wu
37
0
0
07 May 2025
MFSeg: Efficient Multi-frame 3D Semantic Segmentation
Chengjie Huang
Krzysztof Czarnecki
3DPC
37
0
0
07 May 2025
Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality
Xueguang Ma
Luyu Gao
Shengyao Zhuang
Jiaqi Samantha Zhan
Jamie Callan
Jimmy Lin
36
0
0
05 May 2025
SCFormer: Structured Channel-wise Transformer with Cumulative Historical State for Multivariate Time Series Forecasting
Shiwei Guo
Z. Chen
Yupeng Ma
Yunfei Han
Yi Wang
AI4TS
41
0
0
05 May 2025
Small Clips, Big Gains: Learning Long-Range Refocused Temporal Information for Video Super-Resolution
Xingyu Zhou
Wei Long
Jingbo Lu
Shiyin Jiang
Weiyi You
Haifeng Wu
Shuhang Gu
28
0
0
04 May 2025
PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding
Bradley McDanel
S. Zhang
Y. Hu
Zining Liu
MoE
27
0
0
02 May 2025
Phantora: Live GPU Cluster Simulation for Machine Learning System Performance Estimation
Jianxing Qin
Jingrong Chen
Xinhao Kong
Yongji Wu
Liang Luo
Z. Wang
Ying Zhang
Tingjun Chen
Alvin R. Lebeck
Danyang Zhuo
27
0
0
02 May 2025
BiGSCoder: State Space Model for Code Understanding
Shweta Verma
Abhinav Anand
Mira Mezini
Mamba
31
0
0
02 May 2025
Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook
Muyi Bao
Shuchang Lyu
Zhaoyang Xu
Huiyu Zhou
Jinchang Ren
Shiming Xiang
X. Li
Guangliang Cheng
Mamba
72
0
0
01 May 2025
Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
Piotr Piekos
Róbert Csordás
Jürgen Schmidhuber
MoE
VLM
88
0
0
01 May 2025
Scaling On-Device GPU Inference for Large Generative Models
Jiuqiang Tang
Raman Sarokin
Ekaterina Ignasheva
Grant Jensen
Lin Chen
Juhyun Lee
Andrei Kulik
Matthias Grundmann
31
0
0
01 May 2025
GPU Performance Portability needs Autotuning
Burkhard Ringlein
Thomas Parnell
Radu Stoica
39
0
0
30 Apr 2025
Ascendra: Dynamic Request Prioritization for Efficient LLM Serving
Azam Ikram
Xiang Li
Sameh Elnikety
S. Bagchi
47
0
0
29 Apr 2025
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
Zayd Muhammad Kawakibi Zuhri
Erland Hilman Fuadi
Alham Fikri Aji
19
0
0
29 Apr 2025
Combatting Dimensional Collapse in LLM Pre-Training Data via Diversified File Selection
Ziqing Fan
Siyuan Du
Shengchao Hu
Pingjie Wang
Li Shen
Y. Zhang
Dacheng Tao
Y. Wang
41
1
0
29 Apr 2025
A Comparative Study on Positional Encoding for Time-frequency Domain Dual-path Transformer-based Source Separation Models
Kohei Saijo
Tetsuji Ogawa
39
0
0
28 Apr 2025
R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference
Zhenyu (Allen) Zhang
Zechun Liu
Yuandong Tian
Harshit Khaitan
Z. Wang
Steven Li
54
0
0
28 Apr 2025
Small Models, Big Tasks: An Exploratory Empirical Study on Small Language Models for Function Calling
Ishan Kavathekar
Raghav Donakanti
Ponnurangam Kumaraguru
Karthik Vaidhyanathan
48
0
0
27 Apr 2025
A Method for the Architecture of a Medical Vertical Large Language Model Based on Deepseek R1
Mingda Zhang
Jianglong Qin
MedIm
37
0
0
25 Apr 2025
Embedding Empirical Distributions for Computing Optimal Transport Maps
Mingchen Jiang
Peng Xu
Xichen Ye
Xiaohui Chen
Yun Yang
Yifan Chen
OT
39
0
0
24 Apr 2025
Tempo: Application-aware LLM Serving with Mixed SLO Requirements
Wei Zhang
Zhiyu Wu
Yi Mu
Banruo Liu
Myungjin Lee
Fan Lai
51
0
0
24 Apr 2025
An Empirical Study on Prompt Compression for Large Language Models
Z. Zhang
Jinyi Li
Yihuai Lan
X. Wang
Hao Wang
MQ
39
0
0
24 Apr 2025
TileLang: A Composable Tiled Programming Model for AI Systems
Lei Wang
Yu Cheng
Yining Shi
Zhengju Tang
Zhiwen Mo
...
Lingxiao Ma
Yuqing Xia
Jilong Xue
Fan Yang
Z. Yang
49
1
0
24 Apr 2025
L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference
Qingyuan Liu
Liyan Chen
Yanning Yang
H. Wang
Dong Du
Zhigang Mao
Naifeng Jing
Yubin Xia
Haibo Chen
24
0
0
24 Apr 2025
PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation
Zihao An
Huajun Bai
Z. Liu
Dong Li
E. Barsoum
51
0
0
23 Apr 2025
Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light
Ali Hassani
Fengzhe Zhou
Aditya Kane
Jiannan Huang
Chieh-Yun Chen
...
Bing Xu
Haicheng Wu
Wen-mei W. Hwu
Ming-Yu Liu
Humphrey Shi
24
0
0
23 Apr 2025
Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism
Aviv Bick
Eric P. Xing
Albert Gu
RALM
81
0
0
22 Apr 2025
Hexcute: A Tile-based Programming Language with Automatic Layout and Task-Mapping Synthesis
X. Zhang
Yaoyao Ding
Yang Hu
Gennady Pekhimenko
36
0
0
22 Apr 2025
llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length
Issa Sugiura
Kouta Nakayama
Yusuke Oda
24
0
0
22 Apr 2025
Bidirectional Mamba for Single-Cell Data: Efficient Context Learning with Biological Fidelity
Cong Qi
Hanzhang Fang
Tianxing Hu
Siqi Jiang
Wei Zhi
Mamba
53
0
0
22 Apr 2025
COBRA: Algorithm-Architecture Co-optimized Binary Transformer Accelerator for Edge Inference
Ye Qiao
Zhiheng Cheng
Yian Wang
Yifan Zhang
Yunzhe Deng
Sitao Huang
75
0
0
22 Apr 2025
KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments
Junyoung Park
Dalton Jones
Matt Morse
Raghavv Goel
Mingu Lee
Chris Lott
19
0
0
21 Apr 2025
Efficient Pretraining Length Scaling
Bohong Wu
Shen Yan
Sijun Zhang
Jianqiao Lu
Yutao Zeng
Ya Wang
Xun Zhou
37
0
0
21 Apr 2025
SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training
Z. Li
Y. Liu
W. Zhang
Tailing Yuan
Bin Chen
Chengru Song
Di Zhang
27
0
0
20 Apr 2025
Learning to Attribute with Attention
Benjamin Cohen-Wang
Yung-Sung Chuang
Aleksander Madry
20
0
0
18 Apr 2025
Antidistillation Sampling
Yash Savani
Asher Trockman
Zhili Feng
Avi Schwarzschild
Alexander Robey
Marc Finzi
J. Zico Kolter
41
0
0
17 Apr 2025
Collaborative Learning of On-Device Small Model and Cloud-Based Large Model: Advances and Future Directions
Chaoyue Niu
Yucheng Ding
Junhui Lu
Zhengxiang Huang
Hang Zeng
Yutong Dai
Xuezhen Tu
Chengfei Lv
Fan Wu
Guihai Chen
27
0
0
17 Apr 2025
A Review of YOLOv12: Attention-Based Enhancements vs. Previous Versions
Rahima Khanam
Muhammad Hussain
29
0
0
16 Apr 2025
MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models
Junyang Zhang
Tianyi Zhu
Cheng Luo
A. Anandkumar
RALM
42
0
0
16 Apr 2025
Shared Disk KV Cache Management for Efficient Multi-Instance Inference in RAG-Powered LLMs
Hyungwoo Lee
Kihyun Kim
Jinwoo Kim
Jungmin So
Myung-Hoon Cha
H. Kim
James J. Kim
Youngjae Kim
27
0
0
16 Apr 2025
TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion
Y. Wang
J. Li
Chaoyi Hong
Ruibo Li
Liusheng Sun
Xiao-yang Song
Zhe Wang
Zhiguo Cao
Guosheng Lin
MDE
24
0
0
16 Apr 2025
VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers
Run Wang
Gamze Islamoglu
Andrea Belano
Viviane Potocnik
Francesco Conti
Angelo Garofalo
Luca Benini
21
0
0
15 Apr 2025
AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference
Yangshen Deng
Zhengxin You
Long Xiang
Qilong Li
Peiqi Yuan
...
Man Lung Yiu
Huan Li
Qiaomu Shen
Rui Mao
Bo Tang
31
0
0
14 Apr 2025
Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure
Théo Gigant
Camille Guinaudeau
Frédéric Dufaux
19
0
0
14 Apr 2025
Understanding and Optimizing Multi-Stage AI Inference Pipelines
A. Bambhaniya
Hanjiang Wu
Suvinay Subramanian
S. Srinivasan
Souvik Kundu
Amir Yazdanbakhsh
Midhilesh Elavazhagan
Madhu Kumar
Tushar Krishna
37
0
0
14 Apr 2025
1
2
3
4
...
27
28
29
Next