Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations

20 May 2025
Somnath Banerjee
Pratyush Chatterjee
Shanu Kumar
Sayan Layek
Parag Agrawal
Rima Hazra
Animesh Mukherjee
    AAML
Abstract

Recent advancements in LLMs have raised significant safety concerns, particularly when dealing with code-mixed inputs and outputs. Our study systematically investigates the increased susceptibility of LLMs to produce unsafe outputs from code-mixed prompts compared to monolingual English prompts. Using explainability methods, we dissect the internal attribution shifts that cause the models' harmful behaviors. In addition, we explore cultural dimensions by distinguishing between universally unsafe and culturally specific unsafe queries. This paper presents novel experimental insights that clarify the mechanisms driving this phenomenon.

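As a rough illustration of the kind of attribution analysis the abstract describes, the sketch below computes gradient-times-input token saliencies for a model's next-token prediction, once for an English prompt and once for a code-mixed (Hindi-English) paraphrase. This is a minimal sketch, not the authors' method: the model name, the prompt pair, and the saliency measure are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's implementation):
# gradient-times-input token attribution for an English prompt vs.
# a hypothetical code-mixed paraphrase.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any small causal LM suffices for the sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def token_saliency(prompt: str):
    """Return (tokens, saliency scores) attributing the top next-token logit
    to each input token via |grad * embedding|."""
    enc = tokenizer(prompt, return_tensors="pt")
    # Embed the input ourselves so we can take gradients w.r.t. the embeddings.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach()
    embeds.requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"])
    # Attribute the most likely next token's logit back to the input tokens.
    top_logit = out.logits[0, -1].max()
    top_logit.backward()
    scores = (embeds.grad * embeds).detach().norm(dim=-1).squeeze(0)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return tokens, scores

# Hypothetical prompt pair: monolingual English vs. a code-mixed rewrite.
for prompt in ["How do I pick a strong password?",
               "Strong password kaise choose karein?"]:
    tokens, scores = token_saliency(prompt)
    print(prompt)
    for tok, s in zip(tokens, scores.tolist()):
        print(f"  {tok:>15s}  {s:.4f}")
```

In a study like the one described, one would compare how attribution mass redistributes between the two prompt variants (for example, away from safety-relevant tokens) rather than inspect a single prompt in isolation.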
@article{banerjee2025_2505.14469,
  title={Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations},
  author={Somnath Banerjee and Pratyush Chatterjee and Shanu Kumar and Sayan Layek and Parag Agrawal and Rima Hazra and Animesh Mukherjee},
  journal={arXiv preprint arXiv:2505.14469},
  year={2025}
}