ResearchTrend.AI
Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation

15 February 2022
Jannis Bulian
Christian Buck
Wojciech Gajewski
Benjamin Boerschinger
Tal Schuster

Papers citing "Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation"

36 / 36 papers shown
A Study into Investigating Temporal Robustness of LLMs
Jonas Wallat
Abdelrahman Abdallah
Adam Jatowt
Avishek Anand
42
0
0
21 Mar 2025
Neptune: The Long Orbit to Benchmarking Long Video Understanding
Arsha Nagrani
Mingda Zhang
Ramin Mehran
Rachel Hornung
N. B. Gundavarapu
...
Boqing Gong
Cordelia Schmid
Mikhail Sirotenko
Yukun Zhu
Tobias Weyand
100
4
0
12 Dec 2024
A Survey of Hallucination in Large Visual Language Models
Wei Lan
Wenyi Chen
Qingfeng Chen
Shirui Pan
Huiyu Zhou
Yi-Lun Pan
LRM
28
4
0
20 Oct 2024
A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations
Md Tahmid Rahman Laskar
Sawsan Alqahtani
M Saiful Bari
Mizanur Rahman
Mohammad Abdullah Matin Khan
...
Chee Wei Tan
Md. Rizwan Parvez
Enamul Hoque
Shafiq R. Joty
Jimmy Huang
ELM
ALM
25
25
0
04 Jul 2024
Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation
Adam Fisch
Joshua Maynez
R. A. Hofer
Bhuwan Dhingra
Amir Globerson
William W. Cohen
34
7
0
06 Jun 2024
Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation
Gauthier Guinet
Behrooz Omidvar-Tehrani
Anoop Deoras
Laurent Callot
RALM
62
16
0
22 May 2024
Bayesian Prediction-Powered Inference
R. A. Hofer
Joshua Maynez
Bhuwan Dhingra
Adam Fisch
Amir Globerson
William W. Cohen
16
2
0
09 May 2024
Studying Large Language Model Behaviors Under Realistic Knowledge Conflicts
Evgenii Kortukov
Alexander Rubinstein
Elisa Nguyen
Seong Joon Oh
RALM
426
5
2
24 Apr 2024
Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy
Simon Ging
M. A. Bravo
Thomas Brox
VLM
38
11
0
11 Feb 2024
Temporal Blind Spots in Large Language Models
Jonas Wallat
Adam Jatowt
Avishek Anand
31
3
0
22 Jan 2024
Cross-modal Retrieval for Knowledge-based Visual Question Answering
Paul Lerner
Olivier Ferret
C. Guinaudeau
28
7
0
11 Jan 2024
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
Qifan Yu
Juncheng Li
Longhui Wei
Liang Pang
Wentao Ye
Bosheng Qin
Siliang Tang
Qi Tian
Yueting Zhuang
MLLM
VLM
25
67
0
22 Nov 2023
Carpe Diem: On the Evaluation of World Knowledge in Lifelong Language Models
Yujin Kim
Jaehong Yoon
Seonghyeon Ye
Sangmin Bae
Namgyu Ho
Sung Ju Hwang
Se-Young Yun
KELM
17
9
0
14 Nov 2023
How (not) to ensemble LVLMs for VQA
Lisa Alazraki
Lluis Castrejon
Mostafa Dehghani
Fantine Huot
J. Uijlings
Thomas Mensink
22
3
0
10 Oct 2023
Improving Automatic VQA Evaluation Using Large Language Models
Oscar Manas
Benno Krojer
Aishwarya Agrawal
14
21
0
04 Oct 2023
SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References
Matteo Gabburo
Siddhant Garg
Rik Koncel-Kedziorski
Alessandro Moschitti
17
1
0
21 Sep 2023
Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering
Vaibhav Adlakha
Parishad BehnamGhader
Xing Han Lù
Nicholas Meade
Siva Reddy
25
118
0
31 Jul 2023
An Overview Of Temporal Commonsense Reasoning and Acquisition
Georg Wenzel
Adam Jatowt
ReLM
LRM
16
8
0
28 Jul 2023
Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories
Thomas Mensink
J. Uijlings
Lluis Castrejon
A. Goel
Felipe Cadar
Howard Zhou
Fei Sha
A. Araújo
V. Ferrari
31
36
0
15 Jun 2023
Mapping the Challenges of HCI: An Application and Evaluation of ChatGPT and GPT-4 for Mining Insights at Scale
Jonas Oppenlaender
Joonas Hamalainen
10
6
0
08 Jun 2023
LAIT: Efficient Multi-Segment Encoding in Transformers with Layer-Adjustable Interaction
Jeremiah Milbauer
Annie Louis
Mohammad Javad Hosseini
Alex Fabrikant
Donald Metzler
Tal Schuster
16
9
0
31 May 2023
Learning Answer Generation using Supervision from Automatic Question Answering Evaluators
Matteo Gabburo
Siddhant Garg
Rik Koncel-Kedziorski
Alessandro Moschitti
11
6
0
24 May 2023
Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models
Natalie Shapira
Mosh Levy
S. Alavi
Xuhui Zhou
Yejin Choi
Yoav Goldberg
Maarten Sap
Vered Shwartz
LLMAG
ELM
13
113
0
24 May 2023
Evaluating and Modeling Attribution for Cross-Lingual Question Answering
Benjamin Muller
John Wieting
J. Clark
Tom Kwiatkowski
Sebastian Ruder
Livio Baldini Soares
Roee Aharoni
Jonathan Herzig
Xinyi Wang
HILM
19
16
0
23 May 2023
Evaluating Open-Domain Question Answering in the Era of Large Language Models
Ehsan Kamalloo
Nouha Dziri
C. Clarke
Davood Rafiei
ELM
14
98
0
11 May 2023
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images
Nitzan Bitton-Guetta
Yonatan Bitton
Jack Hessel
Ludwig Schmidt
Yuval Elovici
Gabriel Stanovsky
Roy Schwartz
VLM
121
65
0
13 Mar 2023
DIFFQG: Generating Questions to Summarize Factual Changes
Jeremy R. Cole
Palak Jain
Julian Martin Eisenschlos
Michael J.Q. Zhang
Eunsol Choi
Bhuwan Dhingra
KELM
19
3
0
01 Mar 2023
Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models
Bernd Bohnet
Vinh Q. Tran
Pat Verga
Roee Aharoni
D. Andor
...
Michael Collins
Dipanjan Das
Donald Metzler
Slav Petrov
Kellie Webster
36
59
0
15 Dec 2022
Discord Questions: A Computational Approach To Diversity Analysis in News Coverage
Philippe Laban
Chien-Sheng Wu
Lidiya Murakhovs'ka
Xiang 'Anthony' Chen
Caiming Xiong
8
12
0
09 Nov 2022
Evaluation of Semantic Answer Similarity Metrics
Farida Mustafazade
Peter F. Ebbinghaus
6
2
0
25 Jun 2022
The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning
Xi Ye
Greg Durrett
ReLM
LRM
20
167
0
06 May 2022
Stretching Sentence-pair NLI Models to Reason over Long Documents and Clusters
Tal Schuster
Sihao Chen
S. Buthpitiya
Alex Fabrikant
Donald Metzler
13
41
0
15 Apr 2022
What's in a Name? Answer Equivalence For Open-Domain Question Answering
Chenglei Si
Chen Zhao
Jordan L. Boyd-Graber
151
35
0
11 Sep 2021
Consistent Accelerated Inference via Confident Adaptive Transformers
Tal Schuster
Adam Fisch
Tommi Jaakkola
Regina Barzilay
AI4TS
179
69
0
18 Apr 2021
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Sebastian Gehrmann
Tosin P. Adewumi
Karmanya Aggarwal
Pawan Sasanka Ammanamanchi
Aremu Anuoluwapo
...
Nishant Subramani
Wei-ping Xu
Diyi Yang
Akhila Yerukola
Jiawei Zhou
VLM
243
284
0
02 Feb 2021
Distribution-Free, Risk-Controlling Prediction Sets
Stephen Bates
Anastasios Nikolas Angelopoulos
Lihua Lei
Jitendra Malik
Michael I. Jordan
OOD
176
184
0
07 Jan 2021