Our Evaluation Metric Needs an Update to Encourage Generalization

14 July 2020

Papers citing "Our Evaluation Metric Needs an Update to Encourage Generalization"


LINGO : Visually Debiasing Natural Language Instructions to Support Task Diversity

Anjana Arunkumar

Sanjay Kariyappa

Rakhi Agrawal

Sriramakrishnan Chandrasekaran

Chris Bryan

224

12 Apr 2023

Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow

Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023

Anjana Arunkumar

Swaroop Mishra

Bhavdeep Singh Sachdeva

Chitta Baral

Chris Bryan

228

09 Feb 2023

Pretrained Transformers Do not Always Improve Robustness

Swaroop Mishra

Bhavdeep Singh Sachdeva

Chitta Baral

VLM

175

14 Oct 2022

A Survey of Parameters Associated with the Quality of Benchmarks in NLP

231

14 Oct 2022

Investigating the Failure Modes of the AUC metric and Exploring Alternatives for Evaluating Systems in Safety Critical Applications

Swaroop Mishra

Anjana Arunkumar

Chitta Baral

159

10 Oct 2022

Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions

Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2022

516

01 May 2022

NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks

Annual Meeting of the Association for Computational Linguistics (ACL), 2022

Swaroop Mishra

Arindam Mitra

Neeraj Varshney

Bhavdeep Singh Sachdeva

366

137

12 Apr 2022

Generalized but not Robust? Comparing the Effects of Data Modification Methods on Out-of-Domain Generalization and Adversarial Robustness

Findings (Findings), 2022

Tejas Gokhale

Swaroop Mishra

Man Luo

Bhavdeep Singh Sachdeva

Chitta Baral

273

15 Mar 2022

Choose Your QA Model Wisely: A Systematic Study of Generative and Extractive Readers for Question Answering

Yingbo Zhou

240

14 Mar 2022

A Proposal to Study "Is High Quality Data All We Need?"

Swaroop Mishra

Anjana Arunkumar

178

12 Mar 2022

Investigating Selective Prediction Approaches Across Several Tasks in IID, OOD, and Adversarial Settings

Findings (Findings), 2022

Neeraj Varshney

Swaroop Mishra

Chitta Baral

334

01 Mar 2022

How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation

AAAI Conference on Artificial Intelligence (AAAI), 2021

Swaroop Mishra

Anjana Arunkumar

237

10 Jun 2021

DQI: A Guide to Benchmark Evaluation

Swaroop Mishra

Anjana Arunkumar

Bhavdeep Singh Sachdeva

Chris Bryan

Chitta Baral

185

10 Aug 2020