Ethical AI on the Waitlist: Group Fairness Evaluation of LLM-Aided Organ Allocation

Large Language Models (LLMs) are becoming ubiquitous, promising automation even in high-stakes scenarios. However, existing evaluation methods often fall short: benchmarks saturate, accuracy-based metrics are overly simplistic, and many inherently ambiguous problems lack a clear ground truth. These limitations make fairness particularly difficult to evaluate. To address this, we reframe fairness evaluation using Borda scores, a method from voting theory, as a nuanced yet interpretable fairness metric. Using organ allocation as a case study, we introduce two tasks: (1) Choose-One and (2) Rank-All. In Choose-One, LLMs select a single candidate to receive a kidney, and we assess fairness across demographics using proportional parity. In Rank-All, LLMs rank all candidates for a kidney, mirroring real-world allocation processes. Since traditional fairness metrics do not account for rankings, we propose a novel application of Borda scoring to capture biases. Our findings highlight the potential of voting-based metrics to provide a richer, more multifaceted evaluation of LLM fairness.
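To make the Choose-One setup concrete: proportional parity asks whether each demographic group's share of the model's selections matches its share of the candidate pool. The sketch below is a minimal illustration of that comparison, not the paper's implementation; the `proportional_parity` helper and its data layout are assumptions for exposition.

```python
from collections import Counter

def proportional_parity(pool_groups, selected_groups):
    """Compare each group's selection share to its share of the candidate pool.

    pool_groups: list of group labels, one per candidate in the pool.
    selected_groups: list of group labels for the candidates the model chose.
    Returns a dict mapping group -> selection_share / pool_share;
    a ratio of 1.0 indicates proportional representation.
    """
    pool_counts = Counter(pool_groups)
    sel_counts = Counter(selected_groups)
    n_pool, n_sel = len(pool_groups), len(selected_groups)
    return {
        g: (sel_counts[g] / n_sel) / (pool_counts[g] / n_pool)
        for g in pool_counts
    }

# Hypothetical example: a pool that is 60% group A and 40% group B,
# while the model's picks are 80% A and 20% B.
pool = ["A"] * 60 + ["B"] * 40
picks = ["A"] * 8 + ["B"] * 2
print(proportional_parity(pool, picks))  # {'A': ~1.33, 'B': 0.5}
```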
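For the Rank-All setup, the Borda-based idea can be illustrated with the standard scoring convention: a candidate ranked i-th out of n (0-indexed) earns n - 1 - i points, and groups are compared by their average score across the model's rankings. The sketch below is a hedged illustration under those conventions; the function name and data structures are assumptions, not the authors' code.

```python
from collections import defaultdict

def group_borda_scores(rankings, groups):
    """Average Borda score per demographic group.

    rankings: list of rankings; each ranking is a list of candidate ids
              ordered best-to-worst (one ranking per allocation instance).
    groups:   dict mapping candidate id -> group label.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for i, cand in enumerate(ranking):
            # Position i among n candidates earns n - 1 - i Borda points.
            totals[groups[cand]] += n - 1 - i
            counts[groups[cand]] += 1
    return {g: totals[g] / counts[g] for g in totals}

# Hypothetical example: two model-produced rankings over four candidates.
groups = {"c1": "A", "c2": "A", "c3": "B", "c4": "B"}
rankings = [["c1", "c2", "c3", "c4"], ["c2", "c3", "c1", "c4"]]
print(group_borda_scores(rankings, groups))  # {'A': 2.25, 'B': 0.75}
```

A gap between groups' average Borda scores, as in the toy output above, is the kind of ranking-level disparity that a single-winner accuracy metric would miss.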
@article{murray2025_2504.03716,
  title   = {Ethical AI on the Waitlist: Group Fairness Evaluation of LLM-Aided Organ Allocation},
  author  = {Hannah Murray and Brian Hyeongseok Kim and Isabelle Lee and Jason Byun and Dani Yogatama and Evi Micha},
  journal = {arXiv preprint arXiv:2504.03716},
  year    = {2025}
}