Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective

20 February 2025

Abstract

As large language models (LLMs) become an important means of accessing information, concerns are growing that they may intensify the spread of unethical content, including implicit bias that harms certain populations without using explicitly harmful words. In this paper, we rigorously evaluate LLMs' implicit bias towards certain demographics by attacking them from a psychometric perspective to elicit agreement with biased viewpoints. Inspired by psychometric principles in cognitive and social psychology, we propose three attack approaches: Disguise, Deception, and Teaching. Incorporating the corresponding attack instructions, we build two benchmarks: (1) a bilingual dataset of biased statements covering four bias types (2.7K instances) for extensive comparative analysis, and (2) BUMBLE, a larger benchmark spanning nine common bias types (12.7K instances) for comprehensive evaluation. Extensive evaluation of popular commercial and open-source LLMs shows that our methods elicit LLMs' inner bias more effectively than competitive baselines. Our attack methodology and benchmarks offer an effective means of assessing the ethical risks of LLMs, driving progress toward greater accountability in their development.

@article{wen2025_2406.14023,
  title={Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective},
  author={Yuchen Wen and Keping Bi and Wei Chen and Jiafeng Guo and Xueqi Cheng},
  journal={arXiv preprint arXiv:2406.14023},
  year={2025}
}
