STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Discovering Language Model Behaviors with Model-Written Evaluations. Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Discovering Latent Knowledge in Language Models Without Supervision. International Conference on Learning Representations (ICLR), 2022