Ethical AI on the Waitlist: Group Fairness Evaluation of LLM-Aided Organ Allocation

Large Language Models (LLMs) are becoming ubiquitous, promising automation even in high-stakes scenarios. However, existing evaluation methods often fall short: benchmarks saturate, accuracy-based metrics are overly simplistic, and many inherently ambiguous problems lack a clear ground truth. These limitations make fairness particularly difficult to evaluate. To address this, we reframe fairness evaluation using Borda scores, a method from voting theory, as a nuanced yet interpretable fairness metric. Using organ allocation as a case study, we introduce two tasks: (1) Choose-One and (2) Rank-All. In Choose-One, LLMs select a single candidate to receive a kidney, and we assess fairness across demographics using proportional parity. In Rank-All, LLMs rank all candidates for a kidney, mirroring real-world allocation processes. Since traditional fairness metrics do not account for rankings, we propose a novel application of Borda scoring to capture biases. Our findings highlight the potential of voting-based metrics to provide a richer, more multifaceted evaluation of LLM fairness.
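To make the Choose-One setup concrete: proportional parity asks whether each demographic group's share of the model's selections matches its share of the candidate pool. The sketch below is a minimal illustration of that comparison, not the paper's implementation; the `proportional_parity` helper and its data layout are assumptions for exposition.

```python
from collections import Counter

def proportional_parity(pool_groups, selected_groups):
    """Compare each group's selection share to its share of the candidate pool.

    pool_groups: list of group labels, one per candidate in the pool.
    selected_groups: list of group labels for the candidates the model chose.
    Returns a dict mapping group -> selection_share / pool_share;
    a ratio of 1.0 indicates proportional representation.
    """
    pool_counts = Counter(pool_groups)
    sel_counts = Counter(selected_groups)
    n_pool, n_sel = len(pool_groups), len(selected_groups)
    return {
        g: (sel_counts[g] / n_sel) / (pool_counts[g] / n_pool)
        for g in pool_counts
    }

# Hypothetical example: a pool that is 60% group A and 40% group B,
# while the model's picks are 80% A and 20% B.
pool = ["A"] * 60 + ["B"] * 40
picks = ["A"] * 8 + ["B"] * 2
print(proportional_parity(pool, picks))  # {'A': ~1.33, 'B': 0.5}
```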
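For the Rank-All setup, the Borda-based idea can be illustrated with the standard scoring convention: a candidate ranked i-th out of n (0-indexed) earns n - 1 - i points, and groups are compared by their average score across the model's rankings. The sketch below is a hedged illustration under those conventions; the function name and data structures are assumptions, not the authors' code.

```python
from collections import defaultdict

def group_borda_scores(rankings, groups):
    """Average Borda score per demographic group.

    rankings: list of rankings; each ranking is a list of candidate ids
              ordered best-to-worst (one ranking per allocation instance).
    groups:   dict mapping candidate id -> group label.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for i, cand in enumerate(ranking):
            # Position i among n candidates earns n - 1 - i Borda points.
            totals[groups[cand]] += n - 1 - i
            counts[groups[cand]] += 1
    return {g: totals[g] / counts[g] for g in totals}

# Hypothetical example: two model-produced rankings over four candidates.
groups = {"c1": "A", "c2": "A", "c3": "B", "c4": "B"}
rankings = [["c1", "c2", "c3", "c4"], ["c2", "c3", "c1", "c4"]]
print(group_borda_scores(rankings, groups))  # {'A': 2.25, 'B': 0.75}
```

A gap between groups' average Borda scores, as in the toy output above, is the kind of ranking-level disparity that a single-winner accuracy metric would miss.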
@article{murray2025_2504.03716,
  title   = {Ethical AI on the Waitlist: Group Fairness Evaluation of LLM-Aided Organ Allocation},
  author  = {Hannah Murray and Brian Hyeongseok Kim and Isabelle Lee and Jason Byun and Dani Yogatama and Evi Micha},
  journal = {arXiv preprint arXiv:2504.03716},
  year    = {2025}
}