ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2002.05202
  4. Cited By
GLU Variants Improve Transformer

GLU Variants Improve Transformer

12 February 2020
Noam M. Shazeer
ArXiv (abs)PDFHTMLHuggingFace (4 upvotes)

Papers citing "GLU Variants Improve Transformer"

50 / 904 papers shown
FloE: On-the-Fly MoE Inference on Memory-constrained GPU
FloE: On-the-Fly MoE Inference on Memory-constrained GPU
Yuxin Zhou
Zheng Li
Junxuan Zhang
Jue Wang
Yanjie Wang
Zhongle Xie
Ke Chen
Lidan Shou
MoE
443
3
0
09 May 2025
Faster MoE LLM Inference for Extremely Large Models
Faster MoE LLM Inference for Extremely Large Models
Haoqi Yang
Luohe Shi
Qiwei Li
Zuchao Li
Ping Wang
Bo Du
Mengjia Shen
Hai Zhao
MoE
246
3
0
06 May 2025
SPAP: Structured Pruning via Alternating Optimization and Penalty Methods
SPAP: Structured Pruning via Alternating Optimization and Penalty Methods
Hanyu Hu
Xiaoming Yuan
218
1
0
06 May 2025
Bielik 11B v2 Technical Report
Bielik 11B v2 Technical Report
Krzysztof Ociepa
Łukasz Flis
Krzysztof Wróbel
Adrian Gwoździej
Remigiusz Kinas
403
0
0
05 May 2025
Bielik v3 Small: Technical Report
Bielik v3 Small: Technical Report
Krzysztof Ociepa
Łukasz Flis
Remigiusz Kinas
Krzysztof Wróbel
Adrian Gwoździej
382
1
0
05 May 2025
Parameter-Efficient Transformer Embeddings
Parameter-Efficient Transformer Embeddings
Henry Ndubuaku
Mouad Talhi
270
0
0
04 May 2025
Blockbuster, Part 1: Block-level AI Operator Fusion
Blockbuster, Part 1: Block-level AI Operator Fusion
Ofer Dekel
158
1
0
29 Apr 2025
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
Zayd Muhammad Kawakibi Zuhri
Erland Hilman Fuadi
Alham Fikri Aji
272
1
0
29 Apr 2025
CasaGPT: Cuboid Arrangement and Scene Assembly for Interior Design
CasaGPT: Cuboid Arrangement and Scene Assembly for Interior DesignComputer Vision and Pattern Recognition (CVPR), 2025
Weitao Feng
Hang Zhou
Jing Liao
Li Cheng
Wenbo Zhou
3DV
269
3
0
28 Apr 2025
Towards Robust Multimodal Physiological Foundation Models: Handling Arbitrary Missing Modalities
Towards Robust Multimodal Physiological Foundation Models: Handling Arbitrary Missing Modalities
Xi Fu
Wei-Bang Jiang
Yi Ding
Cuntai Guan
416
2
0
28 Apr 2025
A Comparative Study on Positional Encoding for Time-frequency Domain Dual-path Transformer-based Source Separation Models
A Comparative Study on Positional Encoding for Time-frequency Domain Dual-path Transformer-based Source Separation Models
Kohei Saijo
Tetsuji Ogawa
295
3
0
28 Apr 2025
Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements
Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements
Sandipan Dhar
N. D. Jana
Swagatam Das
271
4
0
27 Apr 2025
BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
Hongyu Wang
Shuming Ma
Furu Wei
MQ
259
9
0
25 Apr 2025
SSD-Poser: Avatar Pose Estimation with State Space Duality from Sparse Observations
SSD-Poser: Avatar Pose Estimation with State Space Duality from Sparse ObservationsInternational Conference on Multimedia Retrieval (ICMR), 2025
Shuting Zhao
Linxin Bai
Liangjing Shao
Ye Zhang
Xinrong Chen
269
0
0
25 Apr 2025
Lightweight Latent Verifiers for Efficient Meta-Generation Strategies
Lightweight Latent Verifiers for Efficient Meta-Generation Strategies
Bartosz Piotrowski
Witold Drzewakowski
Konrad Staniszewski
Piotr Miłoś
LRM
274
0
0
23 Apr 2025
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
Fengze Liu
Weidong Zhou
Binbin Liu
Zhimiao Yu
Yifan Zhang
...
Yifeng Yu
Bingni Zhang
Xiaohuan Zhou
Taifeng Wang
Yong Cao
399
5
0
23 Apr 2025
GreenMind: A Next-Generation Vietnamese Large Language Model for Structured and Logical Reasoning
GreenMind: A Next-Generation Vietnamese Large Language Model for Structured and Logical Reasoning
Luu Quy Tung
Hoang Quoc Viet
Pham Bao Loc
Vo Trong Thu
LRM
271
0
0
23 Apr 2025
Trillion 7B Technical Report
Trillion 7B Technical Report
Sungjun Han
Juyoung Suk
Suyeong An
Hyungguk Kim
Kyuseok Kim
Wonsuk Yang
Seungtaek Choi
Jamin Shin
877
4
0
21 Apr 2025
Natural Fingerprints of Large Language Models
Natural Fingerprints of Large Language Models
Teppei Suzuki
Ryokan Ri
Sho Takase
306
2
0
21 Apr 2025
Kuwain 1.5B: An Arabic SLM via Language Injection
Kuwain 1.5B: An Arabic SLM via Language Injection
Khalil Hennara
Sara Chrouf
Mohamed Motaism Hamed
Zeina Aldallal
Omar Hadid
Safwan AlModhayan
282
3
0
21 Apr 2025
KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments
KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments
Junyoung Park
Dalton Jones
Matthew J Morse
Raghavv Goel
Mingu Lee
Chris Lott
357
11
0
21 Apr 2025
Approximation Rates in Besov Norms and Sample-Complexity of Kolmogorov-Arnold Networks with Residual Connections
Approximation Rates in Besov Norms and Sample-Complexity of Kolmogorov-Arnold Networks with Residual Connections
Anastasis Kratsios
Bum Jun Kim
Takashi Furuya
322
1
0
21 Apr 2025
The Geometry of Self-Verification in a Task-Specific Reasoning Model
The Geometry of Self-Verification in a Task-Specific Reasoning Model
Andrew Lee
Lihao Sun
Chris Wendler
Fernanda Viégas
Martin Wattenberg
LRM
423
3
0
19 Apr 2025
Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
Ashwinee Panda
Vatsal Baherwani
Zain Sarwar
Benjamin Thérien
Supriyo Chakraborty
Tom Goldstein
Supriyo Chakraborty
MoE
407
2
0
16 Apr 2025
Hypergraph Vision Transformers: Images are More than Nodes, More than Edges
Hypergraph Vision Transformers: Images are More than Nodes, More than EdgesComputer Vision and Pattern Recognition (CVPR), 2025
Joshua Fixelle
ViT
256
8
0
11 Apr 2025
ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance
ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance
Wissam Antoun
B. Sagot
Djamé Seddah
MQ
567
5
0
11 Apr 2025
On Model and Data Scaling for Skeleton-based Self-Supervised Gait Recognition
On Model and Data Scaling for Skeleton-based Self-Supervised Gait Recognition
Adrian Cosma
Andy Catruna
Emilian Radoi
316
1
0
10 Apr 2025
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
Rosie Zhao
Alexandru Meterez
Sham Kakade
Cengiz Pehlevan
Samy Jelassi
Eran Malach
ReLMLRM
860
69
0
10 Apr 2025
A Novel Mamba-based Sequential Recommendation Method
A Novel Mamba-based Sequential Recommendation Method
Jun Yuan
Mamba
858
0
0
10 Apr 2025
Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Alex Warstadt
Aaron Mueller
Leshem Choshen
E. Wilcox
Chengxu Zhuang
...
Rafael Mosquera
Bhargavi Paranjape
Adina Williams
Tal Linzen
Robert Bamler
603
168
0
10 Apr 2025
Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding
Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene UnderstandingComputer Vision and Pattern Recognition (CVPR), 2025
Pedro Hermosilla
Christian Stippel
Leon Sick
SSL3DPC
400
0
0
09 Apr 2025
Foundation Models for Time Series: A Survey
Foundation Models for Time Series: A Survey
Siva Rama Krishna Kottapalli
Karthik Hubli
Sandeep Chandrashekhara
Garima Jain
Sunayana Hubli
Gayathri Botla
Ramesh Doddaiah
AI4TSAI4CE
411
6
0
05 Apr 2025
Clinical ModernBERT: An efficient and long context encoder for biomedical text
Clinical ModernBERT: An efficient and long context encoder for biomedical text
Simon A. Lee
Anthony Wu
Jeffrey N. Chiang
MedIm
207
19
0
04 Apr 2025
Compositionality Unlocks Deep Interpretable Models
Compositionality Unlocks Deep Interpretable Models
Thomas Dooms
Ward Gauderis
Geraint A. Wiggins
José Oramas
FAttCoGeAI4CE
222
2
0
03 Apr 2025
Multi-Token Attention
Multi-Token Attention
O. Yu. Golovneva
Tianlu Wang
Jason Weston
Sainbayar Sukhbaatar
341
3
0
01 Apr 2025
TRA: Better Length Generalisation with Threshold Relative Attention
TRA: Better Length Generalisation with Threshold Relative Attention
Mattia Opper
Roland Fernandez
P. Smolensky
Jianfeng Gao
547
1
0
29 Mar 2025
GmNet: Revisiting Gating Mechanisms From A Frequency View
GmNet: Revisiting Gating Mechanisms From A Frequency View
Yifan Wang
Xu Ma
Yitian Zhang
Zhongruo Wang
Sung-Cheol Kim
Vahid Mirjalili
Vidya Renganathan
Y. Fu
344
0
0
28 Mar 2025
Named Entity Recognition in Context
Named Entity Recognition in Context
Colin Brisson
Ayoub Kahfy
Marc Bui
Frédéric Constant
314
0
0
26 Mar 2025
Ab-initio simulation of excited-state potential energy surfaces with transferable deep quantum Monte Carlo
Ab-initio simulation of excited-state potential energy surfaces with transferable deep quantum Monte Carlo
Zeno Schätzle
P. Szabó
Alice Cuzzocrea
Frank Noé
258
3
0
25 Mar 2025
IgCraft: A versatile sequence generation framework for antibody discovery and engineering
IgCraft: A versatile sequence generation framework for antibody discovery and engineering
Matthew Greenig
Haowen Zhao
Vladimir Radenkovic
Aubin Ramon
Pietro Sormanni
355
4
0
25 Mar 2025
Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization
Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-OptimizationEuropean Conference on Computer Systems (EuroSys), 2025
Zhanda Zhu
Christina Giannoula
Muralidhar Andoorveedu
Qidong Su
Karttikeya Mangalam
Bojian Zheng
Gennady Pekhimenko
VLMMoE
225
6
0
24 Mar 2025
Adaptive Rank Allocation: Speeding Up Modern Transformers with RaNA Adapters
Adaptive Rank Allocation: Speeding Up Modern Transformers with RaNA AdaptersInternational Conference on Learning Representations (ICLR), 2025
Roberto Garcia
Jerry Liu
Daniel Sorvisto
Sabri Eyuboglu
471
0
0
23 Mar 2025
Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning
Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning
Xiang Fang
Shanghang Zhang
Hao Zhang
Tao Lu
Huabing Zhou
Jiayi Ma
Mamba
311
0
0
23 Mar 2025
D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens
D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens
Panpan Wang
Liqiang Niu
Fandong Meng
Jinan Xu
Yufeng Chen
Jie Zhou
DiffM
312
0
0
21 Mar 2025
Variance Control via Weight Rescaling in LLM Pre-training
Variance Control via Weight Rescaling in LLM Pre-training
Louis Owen
Abhay Kumar
Nilabhra Roy Chowdhury
Fabian Güra
235
0
0
21 Mar 2025
TRACE: Time SeRies PArameter EffiCient FinE-tuning
TRACE: Time SeRies PArameter EffiCient FinE-tuningNeurocomputing (Neurocomputing), 2025
Yuze Li
Wei Zhu
AI4TS
460
2
0
21 Mar 2025
Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation
Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-DistillationComputer Vision and Pattern Recognition (CVPR), 2025
Andrea Maracani
Savas Ozkan
Sijun Cho
Hyowon Kim
Eunchung Noh
Jeongwon Min
Cho Jung Min
Dookun Park
Mete Ozay
407
1
0
20 Mar 2025
Accelerating Transformer Inference and Training with 2:4 Activation Sparsity
Accelerating Transformer Inference and Training with 2:4 Activation Sparsity
Daniel Haziza
Timothy Chou
Dhruv Choudhary
Luca Wehrstedt
Francisco Massa
Jiecao Yu
Geonhwa Jeong
Supriya Rao
Patrick Labatut
Jesse Cai
375
6
0
20 Mar 2025
Gene42: Long-Range Genomic Foundation Model With Dense Attention
Gene42: Long-Range Genomic Foundation Model With Dense Attention
Kirill Vishniakov
Boulbaba Ben Amor
Engin Tekin
Nancy A. ElNaker
Karthik Viswanathan
...
Tiago Magalhaes
Natalia Vassilieva
Dwarikanath Mahapatra
Marco Pimentel
and Shadab Khan
3DV
246
1
0
20 Mar 2025
Unified Cross-Scale 3D Generation and Understanding via Autoregressive Modeling
Unified Cross-Scale 3D Generation and Understanding via Autoregressive Modeling
Shuqi Lu
Haowei Lin
Lin Yao
Zhifeng Gao
Xiaohong Ji
Weinan E
Weinan E
Linfeng Zhang
Guolin Ke
354
2
0
20 Mar 2025
Previous
123...567...171819
Next