Measuring Massive Multitask Language Understanding

International Conference on Learning Representations (ICLR), 2020

7 September 2020

Papers citing "Measuring Massive Multitask Language Understanding"

50 / 4,486 papers shown

Investigating LLM Capabilities on Long Context Comprehension for Medical Question Answering

202

21 Oct 2025

How Do LLMs Use Their Depth?

Akshat Gupta

Jay Yeung

Gopala Anumanchipalli

Anna Ivanova

21 Oct 2025

Some Attention is All You Need for Retrieval

Felix Michalak

Steven Abreu

21 Oct 2025

ECG-LLM-- training and evaluation of domain-specific large language models for electrocardiography

Lara Ahrens

Wilhelm Haverkamp

Nils Strodthoff

139

21 Oct 2025

The Free Transformer

François Fleuret

20 Oct 2025

MARS-M: When Variance Reduction Meets Matrices

Yifeng Liu

Angela Yuan

Q. Gu

230

20 Oct 2025

SimBA: Simplifying Benchmark Analysis Using Performance Matrices Alone

Nishant Subramani

Alfredo Gomez

Mona T. Diab

129

20 Oct 2025

DynaKV: Enabling Accurate and Efficient Long-Sequence LLM Decoding on Smartphones

202

20 Oct 2025

SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

356

20 Oct 2025

Annotation-Efficient Universal Honesty Alignment

158

20 Oct 2025

The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives

Henry Lim

Kwan Hui Lim

LRM

100

20 Oct 2025

AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI

Manik Rana

Calissa Man

Anotida Expected Msiiwa

20 Oct 2025

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

158

20 Oct 2025

JT-Safe: Intrinsically Enhancing the Safety and Trustworthiness of LLMs

...

189

20 Oct 2025

Measuring Reasoning in LLMs: a New Dialectical Angle

Soheil Abbasloo

LRM

141

20 Oct 2025

Evaluating Medical LLMs by Levels of Autonomy: A Survey Moving from Benchmarks to Applications

...

197

20 Oct 2025

Mapping Post-Training Forgetting in Language Models at Scale

160

20 Oct 2025

Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic

173

19 Oct 2025

Are LLMs Court-Ready? Evaluating Frontier Models on Indian Legal Reasoning

196

19 Oct 2025

SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models

...

184

19 Oct 2025

DiscoTrack: A Multilingual LLM Benchmark for Discourse Tracking

Lanni Bu

Lauren Levin

Amir Zeldes

174

19 Oct 2025

Hierarchical Federated Unlearning for Large Language Models

Yisheng Zhong

Zhengbang Yang

Zhuangdi Zhu

202

19 Oct 2025

Enhancing Language Agent Strategic Reasoning through Self-Play in Adversarial Games

113

19 Oct 2025

Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning

183

19 Oct 2025

ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models

Emily Chang

Niyati Bafna

ELM

150

19 Oct 2025

A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications

578

19 Oct 2025

EditMark: Watermarking Large Language Models based on Model Editing

235

18 Oct 2025

MIN-Merging: Merge the Important Neurons for Model Merging

Yunfei Liang

MoMe

558

18 Oct 2025

When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs

165

18 Oct 2025

From Characters to Tokens: Dynamic Grouping with Hierarchical BPE

117

17 Oct 2025

Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation

171

17 Oct 2025

Expert Merging in Sparse Mixture of Experts with Nash Bargaining

193

17 Oct 2025

MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs

...

139

17 Oct 2025

Rethinking Cross-lingual Gaps from a Statistical Viewpoint

112

17 Oct 2025

Demo: Guide-RAG: Evidence-Driven Corpus Curation for Retrieval-Augmented Generation in Long COVID

117

17 Oct 2025

When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling

152

17 Oct 2025

HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination

17 Oct 2025

SentinelNet: Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection

Yang Feng

Xudong Pan

AAML

100

17 Oct 2025

Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning

Lina Berrayana

Ahmed Heakl

Muhammad Abdullah Sohail

Thomas Hofmann

Salman Khan

Wei Chen

185

17 Oct 2025

KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models

154

17 Oct 2025

LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation

135

17 Oct 2025

Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction

...

Hossein Nourkhiz Mahjoub

Ehsan Moradi-Pari

Kwonjoon Lee

Tianlong Chen

233

16 Oct 2025

Model-agnostic Selective Labeling with Provable Statistical Guarantees

142

16 Oct 2025

MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning

...

171

16 Oct 2025

Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning

Joosung Lee

123

16 Oct 2025

Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models

...

334

16 Oct 2025

MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering

270

16 Oct 2025

Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior

152

16 Oct 2025

AMS-QUANT: Adaptive Mantissa Sharing for Floating-point Quantization

158

16 Oct 2025

Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models

227

16 Oct 2025