The Mismeasure of Man and Models: Evaluating Allocational Harms in Large Language Models

2 August 2024

Hannah Chen

Yangfeng Ji

David Evans

ArXiv (abs)PDF HTML

Main:8 Pages

13 Figures

Bibliography:4 Pages

2 Tables

Appendix:3 Pages

Abstract

Large language models (LLMs) are now being considered and even deployed for applications that support high-stakes decision-making, such as recruitment and clinical decisions. While several methods have been proposed for measuring bias, there remains a gap between predictions, which are what the proposed methods consider, and how they are used to make decisions. In this work, we introduce Rank-Allocational-Based Bias Index (RABBI), a model-agnostic bias measure that assesses potential allocational harms arising from biases in LLM predictions. We compare RABBI and current bias metrics on two allocation decision tasks. We evaluate their predictive validity across ten LLMs and utility for model selection. Our results reveal that commonly-used bias metrics based on average performance gap and distribution distance fail to reliably capture group disparities in allocation outcomes, whereas RABBI exhibits a strong correlation with allocation disparities. Our work highlights the need to account for how models are used in contexts with limited resource constraints.

View on arXiv

Comments on this paper