Benchmarks as Microscopes: A Call for Model Metrology
arXiv:2407.16711, 22 July 2024
Michael Stephen Saxon, Ari Holtzman, Peter West, William Yang Wang, Naomi Saphra
Papers citing "Benchmarks as Microscopes: A Call for Model Metrology" (18 papers shown)
"Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation" (22 Apr 2025)
Ziqiao Ma, Jing Ding, Xuejun Zhang, Dezhi Luo, Jiahe Ding, Sihan Xu, Yuchen Huang, Run Peng, Joyce Chai

"Societal Impacts Research Requires Benchmarks for Creative Composition Tasks" (09 Apr 2025)
Judy Hanwen Shen, Carlos Guestrin

"Ethical AI on the Waitlist: Group Fairness Evaluation of LLM-Aided Organ Allocation" (29 Mar 2025)
Hannah Murray, Brian Hyeongseok Kim, Isabelle G. Lee, Jason Byun, Dani Yogatama, Evi Micha

"EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees" (11 Mar 2025)
Zhiyuan Zeng, Yizhong Wang, Hannaneh Hajishirzi, Pang Wei Koh

"Toward an Evaluation Science for Generative AI Systems" (07 Mar 2025)
Laura Weidinger, Deb Raji, Hanna M. Wallach, Margaret Mitchell, Angelina Wang, Olawale Salaudeen, Rishi Bommasani, Sayash Kapoor, Deep Ganguli, Sanmi Koyejo

"Improving Model Evaluation using SMART Filtering of Benchmark Datasets" (26 Oct 2024)
Vipul Gupta, Candace Ross, David Pantoja, R. Passonneau, Megan Ung, Adina Williams

"Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities" (22 Oct 2024)
Zheyuan Zhang, Fengyuan Hu, Jayjun Lee, Freda Shi, Parisa Kordjamshidi, Joyce Chai, Ziqiao Ma

"Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts" (24 Jun 2024)
Aditya Sharma, Michael Saxon, William Yang Wang

"Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts" (17 Mar 2024)
Michael Stephen Saxon, Yiran Luo, Sharon Levy, Chitta Baral, Yezhou Yang, William Yang Wang

"Unsupervised Evaluation of Code LLMs with Round-Trip Correctness" (13 Feb 2024)
Miltiadis Allamanis, Sheena Panthaplackel, Pengcheng Yin

"The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate" (09 Feb 2024)
Juhyun Oh, Eunsu Kim, Inha Cha, Alice H. Oh

"PromptBench: A Unified Library for Evaluation of Large Language Models" (13 Dec 2023)
Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, Xing Xie

"Sparks of Artificial General Intelligence: Early experiments with GPT-4" (22 Mar 2023)
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, J. Gehrke, Eric Horvitz, ..., Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang

"The Debate Over Understanding in AI's Large Language Models" (14 Oct 2022)
Melanie Mitchell, D. Krakauer

"BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models" (17 Apr 2021)
Nandan Thakur, Nils Reimers, Andreas Rucklé, Abhishek Srivastava, Iryna Gurevych

"Towards Ecologically Valid Research on Language User Interfaces" (28 Jul 2020)
H. D. Vries, Dzmitry Bahdanau, Christopher D. Manning

"Hypothesis Only Baselines in Natural Language Inference" (02 May 2018)
Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, Benjamin Van Durme

"ImageNet Large Scale Visual Recognition Challenge" (01 Sep 2014)
Olga Russakovsky, Jia Deng, Hao Su, J. Krause, S. Satheesh, ..., A. Karpathy, A. Khosla, Michael S. Bernstein, Alexander C. Berg, Li Fei-Fei