Rethinking Model Evaluation as Narrowing the Socio-Technical Gap

1 June 2023

Papers citing "Rethinking Model Evaluation as Narrowing the Socio-Technical Gap"

DICE: A Framework for Dimensional and Contextual Evaluation of Language Models Aryan Shrivastava Paula Akemi Aoyagui 26 0 0 14 Apr 2025
Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users Antonia Karamolegkou Malvina Nikandrou Georgios Pantazopoulos Danae Sanchez Villegas Phillip Rust Ruchira Dhar Daniel Hershcovich Anders Søgaard 34 0 0 28 Mar 2025
Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning Sky CH-Wang Darshan Deshpande Smaranda Muresan Anand Kannappan Rebecca Qian 51 0 0 24 Mar 2025
SPHERE: An Evaluation Card for Human-AI Systems Qianou Ma Dora Zhao Xinran Zhao Chenglei Si Chenyang Yang Ryan Louie Ehud Reiter Diyi Yang Tongshuang Wu ALM 50 0 0 24 Mar 2025
VeriLA: A Human-Centered Evaluation Framework for Interpretable Verification of LLM Agent Failures Yoo Yeon Sung H. Kim Dan Zhang 58 1 0 16 Mar 2025
Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs Zhaowei Zhang Fengshuo Bai Qizhi Chen Chengdong Ma Mingzhi Wang Haoran Sun Zilong Zheng Yaodong Yang 54 3 0 26 Feb 2025
Understanding the LLM-ification of CHI: Unpacking the Impact of LLMs at CHI through a Systematic Literature Review Rock Yuren Pang Hope Schroeder Kynnedy Simone Smith Solon Barocas Ziang Xiao Emily Tseng Danielle Bragg 73 3 0 22 Jan 2025
Can LLM "Self-report"?: Evaluating the Validity of Self-report Scales in Measuring Personality Design in LLM-based Chatbots Huiqi Zou Pengda Wang Zihan Yan Tianjun Sun Ziang Xiao 88 1 0 29 Nov 2024
BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices Anka Reuel Amelia F. Hardy Chandler Smith Max Lamparth Malcolm Hardy Mykel J. Kochenderfer ELM 62 16 0 20 Nov 2024
What the Harm? Quantifying the Tangible Impact of Gender Bias in Machine Translation with a Human-centered Study Beatrice Savoldi Sara Papi Matteo Negri Ana Guerberof L. Bentivogli 35 6 0 01 Oct 2024
Benchmarks as Microscopes: A Call for Model Metrology Michael Stephen Saxon Ari Holtzman Peter West William Yang Wang Naomi Saphra 21 4 0 22 Jul 2024
Faux Polyglot: A Study on Information Disparity in Multilingual Large Language Models Nikhil Sharma Kenton Murray Ziang Xiao 45 1 0 07 Jul 2024
ECBD: Evidence-Centered Benchmark Design for NLP Yu Lu Liu Su Lin Blodgett Jackie Chi Kit Cheung Q. Vera Liao Alexandra Olteanu Ziang Xiao 26 9 0 13 Jun 2024
(Beyond) Reasonable Doubt: Challenges that Public Defenders Face in Scrutinizing AI in Court Angela Jin Niloufar Salehi ELM 22 2 0 13 Mar 2024
Generative Echo Chamber? Effects of LLM-Powered Search Systems on Diverse Information Seeking Nikhil Sharma Q. V. Liao Ziang Xiao 22 17 0 08 Feb 2024
Does Writing with Language Models Reduce Content Diversity? Vishakh Padmakumar He He 8 79 0 11 Sep 2023
Identifying and Mitigating the Security Risks of Generative AI Clark W. Barrett Bradley L Boyd Elie Bursztein Nicholas Carlini Brad Chen ... Zulfikar Ramzan Khawaja Shams D. Song Ankur Taly Diyi Yang SILM 16 88 0 28 Aug 2023
Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications Kaitlyn Zhou Su Lin Blodgett Adam Trischler Hal Daumé Kaheer Suleman Alexandra Olteanu ELM 94 25 0 13 May 2022
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 301 11,730 0 04 Mar 2022
Towards A Rigorous Science of Interpretable Machine Learning Finale Doshi-Velez Been Kim XAI FaML 225 3,658 0 28 Feb 2017
Teaching Machines to Read and Comprehend Karl Moritz Hermann Tomáš Kočiský Edward Grefenstette L. Espeholt W. Kay Mustafa Suleyman Phil Blunsom 170 3,504 0 10 Jun 2015