2403.17752
Cited By
Can multiple-choice questions really be useful in detecting the abilities of LLMs?
26 March 2024
Wangyue Li
Liangzhi Li
Tong Xiang
Xiao Liu
Wei Deng
Noa Garcia
ELM
Papers citing
"Can multiple-choice questions really be useful in detecting the abilities of LLMs?"
24 / 24 papers shown
Title
How to Detect and Defeat Molecular Mirage: A Metric-Driven Benchmark for Hallucination in LLM-based Molecular Comprehension
Hao Li
Liuzhenghao Lv
He Cao
Zijing Liu
Zhiyuan Yan
Yu Wang
Yonghong Tian
Y. Li
Li Yuan
27
0
0
10 Apr 2025
Patience is all you need! An agentic system for performing scientific literature review
David Brett
Anniek Myatt
23
0
0
28 Mar 2025
It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education
Shrutika Singh
Anton Alyakin
Daniel Alber
Jaden Stryker
Ai Phuong S Tong
...
Mathew de la Paz
Miguel Hernandez-Rovira
Ki Yun Park
Eric Leuthardt
E. Oermann
AI4MH
AI4Ed
ELM
58
0
0
13 Mar 2025
Large Language Models Often Say One Thing and Do Another
Ruoxi Xu
Hongyu Lin
Xianpei Han
Jia Zheng
Weixiang Zhou
Le Sun
Yingfei Sun
42
1
0
10 Mar 2025
Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations
Jiho Jin
Woosung Kang
Junho Myung
Alice H. Oh
41
0
0
10 Mar 2025
Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions
Yizhe Zhang
Richard He Bai
Zijin Gu
Ruixiang Zhang
Jiatao Gu
Emmanuel Abbe
Samy Bengio
Navdeep Jaitly
LRM
BDL
53
1
0
25 Feb 2025
Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations
Yong Cao
Haijiang Liu
Arnav Arora
Isabelle Augenstein
Paul Röttger
Daniel Hershcovich
49
1
0
20 Feb 2025
Stress Testing Generalization: How Minor Modifications Undermine Large Language Model Performance
Guangxiang Zhao
Saier Hu
Xiaoqi Jian
Jinzhu Wu
Yuhan Wu
Change Jia
Lin Sun
Xiangzheng Zhang
77
0
0
18 Feb 2025
Automatic Evaluation of Healthcare LLMs Beyond Question-Answering
Anna Arias-Duart
Pablo A. Martin-Torres
Daniel Hinjos
Pablo Bernabeu Perez
Lucia Urcelay-Ganzabal
Marta Gonzalez-Mallo
Ashwin Kumar Gururajan
Enrique Lopez-Cuena
Sergio Álvarez Napagao
Dario Garcia-Gasulla
LM&MA
ELM
100
1
0
10 Feb 2025
OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology
Chengfeng Zhou
Ji Wang
Juanjuan Qin
Yining Wang
Ling Sun
Weiwei Dai
LM&MA
ELM
86
0
0
03 Feb 2025
Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models
Olga Loginova
Oleksandr Bezrukov
Alexey Kravets
18
0
0
18 Oct 2024
In-Context Learning May Not Elicit Trustworthy Reasoning: A-Not-B Errors in Pretrained Language Models
Pengrui Han
Peiyang Song
Haofei Yu
Jiaxuan You
ReLM
LRM
21
1
0
23 Sep 2024
Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino
Jann Railey Montalan
Jian Gang Ngui
Wei Qi Leong
Yosephine Susanto
Hamsawardhini Rengarajan
William-Chandra Tjhi
Alham Fikri Aji
33
3
0
20 Sep 2024
Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates
Yusuke Sakai
Adam Nohejl
Jiangnan Hang
Hidetaka Kamigaito
Taro Watanabe
ELM
36
2
0
22 Aug 2024
The Better Angels of Machine Personality: How Personality Relates to LLM Safety
Jie M. Zhang
Dongrui Liu
Chao Qian
Ziyue Gan
Yong-jin Liu
Yu Qiao
Jing Shao
LLMAG
PILM
40
12
0
17 Jul 2024
Can Model Uncertainty Function as a Proxy for Multiple-Choice Question Item Difficulty?
Leonidas Zotos
H. Rijn
Malvina Nissim
ELM
31
2
0
07 Jul 2024
Social Bias Evaluation for Large Language Models Requires Prompt Variations
Rem Hida
Masahiro Kaneko
Naoaki Okazaki
38
13
0
03 Jul 2024
Autonomous Prompt Engineering in Large Language Models
Daan Kepel
Konstantina Valogianni
LLMAG
35
6
0
25 Jun 2024
OLMES: A Standard for Language Model Evaluations
Yuling Gu
Oyvind Tafjord
Bailey Kuehl
Dany Haddad
Jesse Dodge
Hannaneh Hajishirzi
ELM
32
13
0
12 Jun 2024
PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations
Jiatong Li
Renjun Hu
Kunzhe Huang
Zhuang Yan
Qi Liu
Mengxiao Zhu
Xing Shi
Wei Lin
KELM
36
4
0
30 May 2024
Leveraging Large Language Models for Multiple Choice Question Answering
Joshua Robinson
Christopher Rytting
David Wingate
ELM
138
181
0
22 Oct 2022
GLM-130B: An Open Bilingual Pre-trained Model
Aohan Zeng
Xiao Liu
Zhengxiao Du
Zihan Wang
Hanyu Lai
...
Jidong Zhai
Wenguang Chen
Peng-Zhen Zhang
Yuxiao Dong
Jie Tang
BDL
LRM
242
1,070
0
05 Oct 2022
PubMedQA: A Dataset for Biomedical Research Question Answering
Qiao Jin
Bhuwan Dhingra
Zhengping Liu
William W. Cohen
Xinghua Lu
202
791
0
13 Sep 2019
Language Models as Knowledge Bases?
Fabio Petroni
Tim Rocktäschel
Patrick Lewis
A. Bakhtin
Yuxiang Wu
Alexander H. Miller
Sebastian Riedel
KELM
AI4MH
404
2,576
0
03 Sep 2019