
Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark
Annual Meeting of the Association for Computational Linguistics (ACL), 2019
arXiv: 1905.10425, 24 May 2019
Nikita Nangia, Samuel R. Bowman

Papers citing "Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark" (42 papers)
When Does Meaning Backfire? Investigating the Role of AMRs in NLI
Junghyun Min, Xiulin Yang, Shira Wein
17 Jun 2025

TLoRA: Tri-Matrix Low-Rank Adaptation of Large Language Models
Tanvir Islam
25 Apr 2025

Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Alex Warstadt, Aaron Mueller, Leshem Choshen, E. Wilcox, Chengxu Zhuang, ..., Rafael Mosquera, Bhargavi Paranjape, Adina Williams, Tal Linzen, Robert Bamler
10 Apr 2025

Neuro-Symbolic Contrastive Learning for Cross-domain Inference
International Conference on Logic Programming (ICLP), 2025
Mingyue Liu, Ryo Ueda, Zhen Wan, Katsumi Inoue, Chris G. Willcocks
13 Feb 2025

Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Michael Y. Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Robert Bamler, Leshem Choshen, Alex Warstadt, Ethan Gotlieb Wilcox
06 Dec 2024

RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs
Ekaterina Taktasheva, Maxim Bazhukov, Kirill Koncha, Alena Fenogenova, Ekaterina Artemova, Vladislav Mikhailov
27 Jun 2024

What Makes Language Models Good-enough?
Daiki Asami, Saku Sugawara
06 Jun 2024

A synthetic data approach for domain generalization of NLI models
Mohammad Javad Hosseini, Andrey Petrov, Alex Fabrikant, Annie Louis
19 Feb 2024

The Case for Scalable, Data-Driven Theory: A Paradigm for Scientific Progress in NLP
Julian Michael
01 Dec 2023

Junk DNA Hypothesis: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs
International Conference on Machine Learning (ICML), 2023
Lu Yin, Ajay Jaiswal, Shiwei Liu, Souvik Kundu, Zinan Lin
29 Sep 2023

The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Don Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa
31 Aug 2023

Foundation Model-oriented Robustness: Robust Image Model Evaluation with Pretrained Models
International Conference on Learning Representations (ICLR), 2023
Peiyan Zhang, Hao Liu, Chaozhuo Li, Xing Xie, Sunghun Kim, Haohan Wang
21 Aug 2023

What's the Meaning of Superhuman Performance in Today's NLU?
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Simone Tedeschi, Johan Bos, T. Declerck, Jan Hajic, Daniel Hershcovich, ..., Simon Krek, Steven Schockaert, Rico Sennrich, Ekaterina Shutova, Roberto Navigli
15 May 2023

Are Machine Rationales (Not) Useful to Humans? Measuring and Improving Human Utility of Free-Text Rationales
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Brihi Joshi, Ziyi Liu, Sahana Ramnath, Aaron Chan, Zhewei Tong, Shaoliang Nie, Qifan Wang, Yejin Choi, Xiang Ren
11 May 2023

A Human Subject Study of Named Entity Recognition (NER) in Conversational Music Recommendation Queries
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023
Elena V. Epure, Romain Hennequin
13 Mar 2023

A Challenging Benchmark for Low-Resource Learning
Yudong Wang, Chang Ma, Qingxiu Dong, Lingpeng Kong, Jingjing Xu
07 Mar 2023

RuCoLA: Russian Corpus of Linguistic Acceptability
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Vladislav Mikhailov, T. Shamardina, Max Ryabinin, A. Pestova, I. Smurov, Ekaterina Artemova
23 Oct 2022

State-of-the-art generalisation research in NLP: A taxonomy and review
Nature Machine Intelligence (Nat. Mach. Intell.), 2022
Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Mikel Artetxe, Yanai Elazar, ..., Leila Khalatbari, Maria Ryskina, Rita Frieske, Robert Bamler, Zhijing Jin
06 Oct 2022

HumanAL: Calibrating Human Matching Beyond a Single Task
Roee Shraga
06 May 2022

Testing the limits of natural language models for predicting human language judgments
Nature Machine Intelligence (Nat. Mach. Intell.), 2022
Tal Golan, Matthew Siegelman, N. Kriegeskorte, Christopher A. Baldassano
07 Apr 2022

NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis
International Conference on Language Resources and Evaluation (LREC), 2022
Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Sebastian Ruder, Ibrahim Said Ahmad, Idris Abdulmumin, ..., Chris C. Emezue, Saheed Abdul, Anuoluwapo Aremu, Alipio Jeorge, P. Brazdil
20 Jan 2022

The Defeat of the Winograd Schema Challenge
Artificial Intelligence (AIJ), 2022
Vid Kocijan, E. Davis, Thomas Lukasiewicz, G. Marcus, L. Morgenstern
07 Jan 2022

How not to Lie with a Benchmark: Rearranging NLP Leaderboards
Tatiana Shavrina, Valentin Malykh
02 Dec 2021

Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models
Wei Ping, Chejian Xu, Shuohang Wang, Zhe Gan, Yu Cheng, Jianfeng Gao, Ahmed Hassan Awadallah, Yangqiu Song
04 Nov 2021

CLUES: Few-Shot Learning Evaluation in Natural Language Understanding
Subhabrata Mukherjee, Xiaodong Liu, Guoqing Zheng, Saghar Hosseini, Hao Cheng, Greg Yang, Christopher Meek, Ahmed Hassan Awadallah, Jianfeng Gao
04 Nov 2021

IndoNLI: A Natural Language Inference Dataset for Indonesian
Rahmad Mahendra, Alham Fikri Aji, Samuel Louvan, Fahrurrozi Rahman, Clara Vania
27 Oct 2021

Investigating Transfer Learning in Multilingual Pre-trained Language Models through Chinese Natural Language Inference
Findings (Findings), 2021
Hai Hu, He Zhou, Zuoyu Tian, Yiwen Zhang, Yina Ma, Yanting Li, Yixin Nie, Kyle Richardson
07 Jun 2021

Comparing Test Sets with Item Response Theory
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Clara Vania, Phu Mon Htut, William Huang, Dhara Mungra, Richard Yuanzhe Pang, Jason Phang, Haokun Liu, Kyunghyun Cho, Sam Bowman
01 Jun 2021

KLUE: Korean Language Understanding Evaluation
Sungjoon Park, Jihyung Moon, Sungdong Kim, Won Ik Cho, Jiyoon Han, ..., Seonghyun Kim, Lucy Park, Alice Oh, Jung-Woo Ha, Kyunghyun Cho
20 May 2021

Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian SuperGLUE Tasks
Tatiana Iazykova, Denis Kapelyushnik, Olga Bystrova, Andrey Kutuzov
03 May 2021

Sensitivity as a Complexity Measure for Sequence Classification Tasks
Transactions of the Association for Computational Linguistics (TACL), 2021
Michael Hahn, Dan Jurafsky, Richard Futrell
21 Apr 2021

What Will it Take to Fix Benchmarking in Natural Language Understanding?
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Samuel R. Bowman, George E. Dahl
05 Apr 2021

OCNLI: Original Chinese Natural Language Inference
Hai Hu, Kyle Richardson, Liang Xu, Lu Li, Sandra Kübler, L. Moss
12 Oct 2020

What Can We Learn from Collective Human Opinions on Natural Language Inference Data?
Yixin Nie, Xiang Zhou, Joey Tianyi Zhou
07 Oct 2020

How Can We Accelerate Progress Towards Human-like Linguistic Generalization?
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Tal Linzen
03 May 2020

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization
International Conference on Machine Learning (ICML), 2020
Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, Melvin Johnson
24 Mar 2020

What Does My QA Model Know? Devising Controlled Probes using Expert Knowledge
Transactions of the Association for Computational Linguistics (TACL), 2019
Kyle Richardson, Ashish Sabharwal
31 Dec 2019

Learning to Learn Words from Visual Scenes
Dídac Surís, Dave Epstein, Heng Ji, Shih-Fu Chang, Carl Vondrick
25 Nov 2019

BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackBoxNLP), 2019
R. Thomas McCoy, Junghyun Min, Tal Linzen
07 Nov 2019

A Pragmatics-Centered Evaluation Framework for Natural Language Understanding
International Conference on Language Resources and Evaluation (LREC), 2019
Damien Sileo, Tim Van de Cruys, Camille Pradel, Philippe Muller
19 Jul 2019

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
Neural Information Processing Systems (NeurIPS), 2019
Alex Jinpeng Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman
02 May 2019

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
R. Thomas McCoy, Ellie Pavlick, Tal Linzen
04 Feb 2019