
Bias and Volatility: A Statistical Framework for Evaluating Large Language Model's Stereotypes and the Associated Generation Inconsistency

Main: 10 pages · Appendix: 10 pages · Bibliography: 4 pages · 16 figures · 10 tables
Abstract

We present a novel statistical framework for analyzing stereotypes in large language models (LLMs) by systematically estimating the bias and variation in their generations. Current alignment evaluation metrics often overlook the randomness of stereotypes caused by LLMs' inconsistent generative behavior. For instance, LLMs may display contradictory stereotypes, such as those related to gender or race, for identical professions in different contexts. Ignoring this inconsistency risks misleading conclusions in alignment assessments and undermines efforts to evaluate the potential of LLMs to perpetuate or amplify social biases and unfairness. To address this, we propose the Bias-Volatility Framework (BVF), which estimates the probability distribution of stereotypes in LLM outputs. By capturing the variation in generative behavior, BVF assesses both the likelihood and degree to which LLM outputs negatively impact vulnerable groups, enabling quantification of aggregated discrimination risk. Additionally, we introduce a mathematical framework to decompose this risk into bias risk (from the mean of the stereotype distribution) and volatility risk (from its variation). Applying BVF to 12 widely used LLMs, we find: i) Bias risk is the dominant contributor to discrimination; ii) Most LLMs exhibit substantial pro-male stereotypes across nearly all professions; iii) Reinforcement learning from human feedback reduces bias but increases volatility; iv) Discrimination risk correlates with socio-economic factors, such as professional salaries. Finally, we highlight BVF's broader applicability for assessing how generation inconsistencies in LLMs affect behavior beyond stereotypes.
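The decomposition into a mean (bias) and a spread (volatility) component can be illustrated with a toy estimator. The sketch below is an assumption-laden illustration, not the paper's BVF estimator: the pronoun-counting stereotype scorer, the example generations, and the use of a simple sample mean and variance are all placeholders introduced here for exposition.

```python
# Illustrative sketch only. The scorer, example texts, and mean/variance decomposition
# are assumptions for exposition, not the paper's actual BVF estimators or risk terms.
import numpy as np

def stereotype_scores(generations):
    """Hypothetical scorer: map each generated text to a signed stereotype score
    (positive = pro-male, negative = pro-female). Placeholder pronoun counting."""
    scores = []
    for text in generations:
        tokens = text.lower().split()
        scores.append(tokens.count("he") - tokens.count("she"))
    return np.array(scores, dtype=float)

def bias_volatility(scores):
    """Estimate the two components BVF separates: bias as the mean of the
    stereotype-score distribution, volatility as its spread (sample variance here)."""
    bias = scores.mean()
    volatility = scores.var(ddof=1)
    return bias, volatility

# Example: repeated generations for one profession prompt, scored and decomposed.
samples = [
    "he is the engineer who reviewed the design",
    "she fixed the pipeline before the deadline",
    "he led the team through the migration",
]
scores = stereotype_scores(samples)
bias, volatility = bias_volatility(scores)
print(f"bias={bias:.3f}, volatility={volatility:.3f}")
```

In this reading, a model with near-zero bias but high volatility can still pose discrimination risk, since individual generations may swing strongly toward a stereotype even when the average appears neutral.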
