Model evaluation for extreme risks
24 May 2023
Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, S. Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, Allan Dafoe
ELM

Papers citing "Model evaluation for extreme risks"

50 / 101 papers shown
What Is AI Safety? What Do We Want It to Be?
Jacqueline Harding, Cameron Domenico Kirk-Giannini
05 May 2025

Assessing LLM code generation quality through path planning tasks
Wanyi Chen, Meng-Wen Su, Mary L. Cummings
ELM
30 Apr 2025

A Framework to Assess the Persuasion Risks Large Language Model Chatbots Pose to Democratic Societies
Zhongren Chen, Joshua Kalla, Quan Le, Shinpei Nakamura-Sakai, Jasjeet Sekhon, Ruixiao Wang
29 Apr 2025

What Makes an Evaluation Useful? Common Pitfalls and Best Practices
Gil Gekker, Meirav Segal, Dan Lahav, Omer Nevo
ELM
30 Mar 2025

A Framework for Evaluating Emerging Cyberattack Capabilities of AI
Mikel Rodriguez, Raluca Ada Popa, Four Flynn, Lihao Liang, Allan Dafoe, Anna Wang
ELM
14 Mar 2025
Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity
HyunJin Kim, Xiaoyuan Yi, Jing Yao, Muhua Huang, Jinyeong Bak, James Evans, Xing Xie
08 Mar 2025

This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs
Lorenz Wolf, Sangwoong Yoon, Ilija Bogunovic
07 Mar 2025

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Safe Reinforcement Learning
Borong Zhang, Yuhao Zhang, Jiaming Ji, Yingshan Lei, Josef Dai, Yuanpei Chen, Yaodong Yang
05 Mar 2025

Adaptively evaluating models with task elicitation
Davis Brown, Prithvi Balehannina, Helen Jin, Shreya Havaldar, Hamed Hassani, Eric Wong
ALM, ELM
03 Mar 2025
Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning
Anh Tong, Thanh Nguyen-Tang, Dongeun Lee, Duc Nguyen, Toan M. Tran, David Hall, Cheongwoong Kang, Jaesik Choi
03 Mar 2025

Practical Principles for AI Cost and Compute Accounting
Stephen Casper, Luke Bailey, Tim Schreier
21 Feb 2025

Paradigms of AI Evaluation: Mapping Goals, Methodologies and Culture
John Burden, Marko Tesic, Lorenzo Pacchiardi, José Hernández Orallo
21 Feb 2025

IPAD: Inverse Prompt for AI Detection -- A Robust and Explainable LLM-Generated Text Detector
Zheng Chen, Yushi Feng, Changyang He, Yue Deng, Hongxi Pu, Bo-wen Li
DeLMO
21 Feb 2025

C3AI: Crafting and Evaluating Constitutions for Constitutional AI
Yara Kyrychenko, Ke Zhou, Edyta Bogucka, Daniele Quercia
ELM
21 Feb 2025

Enabling External Scrutiny of AI Systems with Privacy-Enhancing Technologies
Kendrea Beers, Helen Toner
05 Feb 2025
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, ..., Zikui Cai, Bilal Chughtai, Y. Gal, Furong Huang, Dylan Hadfield-Menell
MU, AAML, ELM
03 Feb 2025

Episodic memory in AI agents poses risks that should be studied and mitigated
Chad DeChant
20 Jan 2025

Two Types of AI Existential Risk: Decisive and Accumulative
Atoosa Kasirzadeh
20 Jan 2025

Principles for Responsible AI Consciousness Research
Patrick Butlin, Theodoros Lappas
13 Jan 2025

OpenAI o1 System Card
OpenAI: Aaron Jaech, Adam Tauman Kalai, Adam Lerer, ..., Yuchen He, Yuchen Zhang, Yunyun Wang, Zheng Shao, Zhuohan Li
ELM, LRM, AI4CE
21 Dec 2024
Measuring Goal-Directedness
Matt MacDermott, James Fox, Francesco Belardinelli, Tom Everitt
06 Dec 2024

Predicting Emergent Capabilities by Finetuning
Charlie Snell, Eric Wallace, Dan Klein, Sergey Levine
ELM, LRM
25 Nov 2024

The Dark Patterns of Personalized Persuasion in Large Language Models: Exposing Persuasive Linguistic Features for Big Five Personality Traits in LLMs Responses
Wiktoria Mieleszczenko-Kowszewicz, Dawid Płudowski, Filip Kołodziejczyk, Jakub Świstak, Julian Sienkiewicz, P. Biecek
08 Nov 2024

From Imitation to Introspection: Probing Self-Consciousness in Language Models
Sirui Chen, Shu Yu, Shengjie Zhao, Chaochao Lu
MILM, LRM
24 Oct 2024

Game Theory with Simulation in the Presence of Unpredictable Randomisation
Vojtěch Kovařík, Nathaniel Sauerberg, Lewis Hammond, Vincent Conitzer
18 Oct 2024

TracrBench: Generating Interpretability Testbeds with Large Language Models
Hannes Thurnherr, Jérémy Scheurer
07 Sep 2024
Verification methods for international AI agreements
Akash R. Wasil, Tom Reed, Jack William Miller, Peter Barnett
28 Aug 2024

Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience
Zhonghao He, Jascha Achterberg, Katie Collins, Kevin K. Nejad, Danyal Akarca, ..., Chole Li, Kai J. Sandbrink, Stephen Casper, Anna Ivanova, Grace W. Lindsay
AI4CE
22 Aug 2024

On the Generalization of Preference Learning with DPO
Shawn Im, Yixuan Li
06 Aug 2024

Machine Unlearning in Generative AI: A Survey
Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, Meng-Long Jiang
MU
30 Jul 2024

Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, Nhathai Phan
20 Jul 2024
Large Language Models as Misleading Assistants in Conversation
Betty Li Hou, Kejian Shi, Jason Phang, James Aung, Steven Adler, Rosie Campbell
16 Jul 2024

The Oscars of AI Theater: A Survey on Role-Playing with Language Models
Nuo Chen, Yan Wang, Yang Deng, Jia Li
16 Jul 2024

Thorns and Algorithms: Navigating Generative AI Challenges Inspired by Giraffes and Acacias
Waqar Hussain
16 Jul 2024

Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)
K. Kenthapadi, M. Sameki, Ankur Taly
HILM, ELM, AILaw
10 Jul 2024

Adversaries Can Misuse Combinations of Safe Models
Erik Jones, Anca Dragan, Jacob Steinhardt
20 Jun 2024

Generative AI Misuse: A Taxonomy of Tactics and Insights from Real-World Data
Nahema Marchal, Rachel Xu, Rasmi Elasmar, Iason Gabriel, Beth Goldberg, William S. Isaac
LLMAG
19 Jun 2024

BeHonest: Benchmarking Honesty in Large Language Models
Steffi Chern, Zhulin Hu, Yuqing Yang, Ethan Chern, Yuan Guo, Jiahe Jin, Binjie Wang, Pengfei Liu
HILM, ALM
19 Jun 2024
IDs for AI Systems
Alan Chan, Noam Kolt, Peter Wills, Usman Anwar, Christian Schroeder de Witt, Nitarshan Rajkumar, Lewis Hammond, David M. Krueger, Lennart Heim, Markus Anderljung
17 Jun 2024

Effective Generative AI: The Human-Algorithm Centaur
S. Saghafian, Lihi Idan
16 Jun 2024

AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
ELM
11 Jun 2024

CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models
Ling Shi, Deyi Xiong
ELM
07 Jun 2024

What is it for a Machine Learning Model to Have a Capability?
Jacqueline Harding, Nathaniel Sharadin
ELM
14 May 2024

Risks and Opportunities of Open-Source Generative AI
Francisco Eiras, Aleksander Petrov, Bertie Vidgen, Christian Schroeder, Fabio Pizzati, ..., Matthew Jackson, Phillip H. S. Torr, Trevor Darrell, Y. Lee, Jakob N. Foerster
14 May 2024
Generative AI in Cybersecurity
Shivani Metta, Isaac Chang, Jack Parker, Michael P. Roman, Arturo F. Ehuan
02 May 2024

Near to Mid-term Risks and Opportunities of Open-Source Generative AI
Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder de Witt, Fabio Pizzati, ..., Paul Röttger, Philip H. S. Torr, Trevor Darrell, Y. Lee, Jakob N. Foerster
25 Apr 2024

Resistance Against Manipulative AI: key factors and possible actions
Piotr Wilczyñski, Wiktoria Mieleszczenko-Kowszewicz, P. Biecek
22 Apr 2024

Holistic Safety and Responsibility Evaluations of Advanced AI Models
Laura Weidinger, Joslyn Barnhart, Jenny Brennan, Christina Butterfield, Susie Young, ..., Sebastian Farquhar, Lewis Ho, Iason Gabriel, Allan Dafoe, William S. Isaac
ELM
22 Apr 2024

Responsible Reporting for Frontier AI Development
Noam Kolt, Markus Anderljung, Joslyn Barnhart, Asher Brass, K. Esvelt, Gillian K. Hadfield, Lennart Heim, Mikel Rodriguez, Jonas B. Sandbrink, Thomas Woodside
03 Apr 2024

Understanding the Learning Dynamics of Alignment with Human Feedback
Shawn Im, Yixuan Li
ALM
27 Mar 2024