2002.05202

GLU Variants Improve Transformer

12 February 2020

Noam M. Shazeer

Papers citing "GLU Variants Improve Transformer"

50 / 904 papers shown

FloE: On-the-Fly MoE Inference on Memory-constrained GPU

443

3

0

09 May 2025

Faster MoE LLM Inference for Extremely Large Models

246

3

0

06 May 2025

SPAP: Structured Pruning via Alternating Optimization and Penalty Methods

218

1

0

06 May 2025

Bielik 11B v2 Technical Report

Krzysztof Ociepa

Krzysztof Wróbel

Adrian Gwoździej

Remigiusz Kinas

403

0

0

05 May 2025

Bielik v3 Small: Technical Report

Krzysztof Ociepa

Remigiusz Kinas

Krzysztof Wróbel

Adrian Gwoździej

382

1

0

05 May 2025

Parameter-Efficient Transformer Embeddings

270

0

0

04 May 2025

Blockbuster, Part 1: Block-level AI Operator Fusion

158

1

0

29 Apr 2025

Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

Zayd Muhammad Kawakibi Zuhri

Erland Hilman Fuadi

Alham Fikri Aji

272

1

0

29 Apr 2025

CasaGPT: Cuboid Arrangement and Scene Assembly for Interior Design

Computer Vision and Pattern Recognition (CVPR), 2025

269

3

0

28 Apr 2025

Towards Robust Multimodal Physiological Foundation Models: Handling Arbitrary Missing Modalities

416

2

0

28 Apr 2025

A Comparative Study on Positional Encoding for Time-frequency Domain Dual-path Transformer-based Source Separation Models

295

3

0

28 Apr 2025

Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements

271

4

0

27 Apr 2025

BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs

259

9

0

25 Apr 2025

SSD-Poser: Avatar Pose Estimation with State Space Duality from Sparse Observations

International Conference on Multimedia Retrieval (ICMR), 2025

269

0

0

25 Apr 2025

Lightweight Latent Verifiers for Efficient Meta-Generation Strategies

Bartosz Piotrowski

Witold Drzewakowski

Konrad Staniszewski

274

0

0

23 Apr 2025

QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining

...

399

5

0

23 Apr 2025

GreenMind: A Next-Generation Vietnamese Large Language Model for Structured and Logical Reasoning

Hoang Quoc Viet

271

0

0

23 Apr 2025

Trillion 7B Technical Report

877

4

0

21 Apr 2025

Natural Fingerprints of Large Language Models

306

2

0

21 Apr 2025

Kuwain 1.5B: An Arabic SLM via Language Injection

Mohamed Motaism Hamed

Safwan AlModhayan

282

3

0

21 Apr 2025

KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments

Matthew J Morse

357

11

0

21 Apr 2025

Approximation Rates in Besov Norms and Sample-Complexity of Kolmogorov-Arnold Networks with Residual Connections

Anastasis Kratsios

322

1

0

21 Apr 2025

The Geometry of Self-Verification in a Task-Specific Reasoning Model

Fernanda Viégas

Martin Wattenberg

423

3

0

19 Apr 2025

Dense Backpropagation Improves Training for Sparse Mixture-of-Experts

Vatsal Baherwani

Benjamin Thérien

Supriyo Chakraborty

407

2

0

16 Apr 2025

Hypergraph Vision Transformers: Images are More than Nodes, More than Edges

Computer Vision and Pattern Recognition (CVPR), 2025

256

8

0

11 Apr 2025

ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance

567

5

0

11 Apr 2025

On Model and Data Scaling for Skeleton-based Self-Supervised Gait Recognition

316

1

0

10 Apr 2025

Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining

Alexandru Meterez

Cengiz Pehlevan

860

69

0

10 Apr 2025

A Novel Mamba-based Sequential Recommendation Method

858

0

0

10 Apr 2025

Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

...

Rafael Mosquera

Bhargavi Paranjape

603

168

0

10 Apr 2025

Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding

Computer Vision and Pattern Recognition (CVPR), 2025

Pedro Hermosilla

Christian Stippel

400

0

0

09 Apr 2025

Foundation Models for Time Series: A Survey

Siva Rama Krishna Kottapalli

Sandeep Chandrashekhara

Ramesh Doddaiah

411

6

0

05 Apr 2025

Clinical ModernBERT: An efficient and long context encoder for biomedical text

Jeffrey N. Chiang

207

19

0

04 Apr 2025

Compositionality Unlocks Deep Interpretable Models

Geraint A. Wiggins

FAtt CoGe AI4CE

222

2

0

03 Apr 2025

Multi-Token Attention

O. Yu. Golovneva

Sainbayar Sukhbaatar

341

3

0

01 Apr 2025

TRA: Better Length Generalisation with Threshold Relative Attention

Roland Fernandez

547

1

0

29 Mar 2025

GmNet: Revisiting Gating Mechanisms From A Frequency View

Vahid Mirjalili

Vidya Renganathan

344

0

0

28 Mar 2025

Named Entity Recognition in Context

Frédéric Constant

314

0

0

26 Mar 2025

Ab-initio simulation of excited-state potential energy surfaces with transferable deep quantum Monte Carlo

Alice Cuzzocrea

258

3

0

25 Mar 2025

IgCraft: A versatile sequence generation framework for antibody discovery and engineering

Matthew Greenig

Vladimir Radenkovic

Pietro Sormanni

355

4

0

25 Mar 2025

Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization

European Conference on Computer Systems (EuroSys), 2025

Christina Giannoula

Muralidhar Andoorveedu

Karttikeya Mangalam

Gennady Pekhimenko

225

6

0

24 Mar 2025

Adaptive Rank Allocation: Speeding Up Modern Transformers with RaNA Adapters

International Conference on Learning Representations (ICLR), 2025

Daniel Sorvisto

471

0

0

23 Mar 2025

Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning

Shanghang Zhang

311

0

0

23 Mar 2025

D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens

312

0

0

21 Mar 2025

Variance Control via Weight Rescaling in LLM Pre-training

Nilabhra Roy Chowdhury

235

0

0

21 Mar 2025

TRACE: Time SeRies PArameter EffiCient FinE-tuning

Neurocomputing, 2025

460

2

0

21 Mar 2025

Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation

Computer Vision and Pattern Recognition (CVPR), 2025

Andrea Maracani

407

1

0

20 Mar 2025

Accelerating Transformer Inference and Training with 2:4 Activation Sparsity

Dhruv Choudhary

Francisco Massa

Patrick Labatut

375

6

0

20 Mar 2025

Gene42: Long-Range Genomic Foundation Model With Dense Attention

Kirill Vishniakov

Boulbaba Ben Amor

Nancy A. ElNaker

Karthik Viswanathan

...

Tiago Magalhaes

Natalia Vassilieva

Dwarikanath Mahapatra

Shadab Khan

246

1

0

20 Mar 2025

Unified Cross-Scale 3D Generation and Understanding via Autoregressive Modeling

354

2

0

20 Mar 2025
