
Concrete Problems in AI Safety

21 June 2016

Papers citing "Concrete Problems in AI Safety"

50 / 1,374 papers shown

Title
Lies, Damned Lies, and Distributional Language Statistics: Persuasion and Deception with Large Language Models Cameron R. Jones Benjamin Bergen 439 12 0 22 Dec 2024
Predictive Monitoring of Black-Box Dynamical Systems T. Henzinger Fabian Kresse Kaushik Mallik Emily Yu Đorđe Žikelić 152 1 0 21 Dec 2024
Neural Control and Certificate Repair via Runtime Monitoring AAAI Conference on Artificial Intelligence (AAAI), 2024 Emily Yu Đorđe Žikelić T. Henzinger AAML 186 1 0 17 Dec 2024
Neural Interactive Proofs International Conference on Learning Representations (ICLR), 2024 Lewis Hammond Sam Adam-Day AAML 248 5 0 12 Dec 2024
ProcessBench: Identifying Process Errors in Mathematical Reasoning Annual Meeting of the Association for Computational Linguistics (ACL), 2024 Chujie Zheng Zizhuo Zhang Beichen Zhang Runji Lin Keming Lu Bowen Yu Dayiheng Liu Jingren Zhou Junyang Lin LRM 600 153 0 09 Dec 2024
Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy and Research A. Feder Cooper Christopher A. Choquette-Choo Miranda Bogen Matthew Jagielski Katja Filippova ... Hanna M. Wallach Amy Cyphert Nicolas Papernot Katherine Lee MU AILaw 335 29 0 09 Dec 2024
Reinforcement Learning Enhanced LLMs: A Survey Shuhe Wang Shengyu Zhang Jing Zhang Runyi Hu Xiaoya Li Minlie Huang Jiwei Li Leilei Gan G. Wang Eduard H. Hovy OffRL 662 48 0 05 Dec 2024
Enhancing Trust in Large Language Models with Uncertainty-Aware Fine-Tuning R. Krishnan Piyush Khanna Omesh Tickoo HILM 272 5 0 03 Dec 2024
The Evolution and Future Perspectives of Artificial Intelligence Generated Content Chengzhang Zhu Luobin Cui Ying Tang Jiacun Wang 367 2 0 02 Dec 2024
Challenges in Human-Agent Communication Gagan Bansal J. W. Vaughan Saleema Amershi Eric Horvitz Adam Fourney Hussein Mozannar Victor C. Dibia Daniel S. Weld LLMAG AAML AI4CE 249 10 0 28 Nov 2024
Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers Benedikt Stroebl Sayash Kapoor Arvind Narayanan LRM 489 42 0 26 Nov 2024
Trustworthy artificial intelligence in the energy sector: Landscape analysis and evaluation framework International Conference on Engineering, Technology and Innovation (ICE/IT), 2024 Sotiris Pelekis Evangelos Karakolis G. Lampropoulos S. Mouzakitis Ourania Markaki Christos Ntanos D. Askounis 320 2 0 25 Nov 2024
Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI Computer Vision and Pattern Recognition (CVPR), 2024 Won Jun Kim Hyungjin Chung Jaemin Kim Sangmin Lee Byeongsu Sim Jong Chul Ye DiffM 349 2 0 22 Nov 2024
Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies Neural Information Processing Systems (NeurIPS), 2024 Frédéric Berdoz Roger Wattenhofer 200 1 0 21 Nov 2024
The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models Xikang Yang Xuehai Tang Jizhong Han Songlin Hu 239 4 0 18 Nov 2024
SoK: The Security-Safety Continuum of Multimodal Foundation Models through Information Flow and Global Game-Theoretic Analysis of Asymmetric Threats Ruoxi Sun Jiamin Chang Hammond Pearce Chaowei Xiao B. Li Qi Wu Surya Nepal Minhui Xue 613 0 0 17 Nov 2024
Multi-agent Path Finding for Timed Tasks using Evolutionary Games Sheryl Paul Anand Balakrishnan Xin Qin Jyotirmoy V. Deshmukh 162 2 0 15 Nov 2024
Noisy Zero-Shot Coordination: Breaking The Common Knowledge Assumption In Zero-Shot Coordination Games Usman Anwar Ashish Pandian Jia Wan David M. Krueger Jakob N. Foerster 287 0 0 07 Nov 2024
Improving self-training under distribution shifts via anchored confidence with theoretical guarantees Neural Information Processing Systems (NeurIPS), 2024 Taejong Joo Diego Klabjan UQCV 282 0 0 01 Nov 2024
Progressive Safeguards for Safe and Model-Agnostic Reinforcement Learning Nabil Omi Hosein Hasanbeig Hiteshi Sharma Sriram K. Rajamani S. Sen 195 0 0 31 Oct 2024
Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment Neural Information Processing Systems (NeurIPS), 2024 Weichao Zhou Wenchao Li 219 2 0 31 Oct 2024
Adaptive Alignment: Dynamic Preference Adjustments via Multi-Objective Reinforcement Learning for Pluralistic AI Hadassah Harland Richard Dazeley Peter Vamplew Hashini Senaratne Bahareh Nakisa Francisco Cruz 321 3 0 31 Oct 2024
Democratizing Reward Design for Personal and Representative Value-Alignment Carter Blair Kate Larson Edith Law 167 0 0 29 Oct 2024
Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks Dario Pasquini Evgenios M. Kornaropoulos G. Ateniese AAML 197 10 0 28 Oct 2024
Combining Theory of Mind and Kindness for Self-Supervised Human-AI Alignment Joshua T. S. Hewson 147 1 0 21 Oct 2024
We Urgently Need Intrinsically Kind Machines Joshua T. S. Hewson SyDa 124 0 0 21 Oct 2024
Balancing Label Quantity and Quality for Scalable Elicitation Alex Troy Mallen Nora Belrose 141 3 0 17 Oct 2024
Potential-Based Intrinsic Motivation: Preserving Optimality With Complex, Non-Markovian Shaping Rewards Grant C. Forbes Leonardo Villalobos-Arias Jianxun Wang Arnav Jhala David L. Roberts 231 2 0 16 Oct 2024
Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse Reinforcement Learning Jared Joselowitz Ritam Majumdar Arjun Jagota Matthieu Bou Nyal Patel Satyapriya Krishna Sonali Parbhoo 195 0 0 16 Oct 2024
Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning Bokai Hu Sai Ashish Somayajula Xin Pan Zihan Huang OffRL 391 5 0 14 Oct 2024
On Goodhart's law, with an application to value alignment El-Mahdi El-Mhamdi Lê-Nguyên Hoang 115 4 0 12 Oct 2024
Fragile Giants: Understanding the Susceptibility of Models to Subpopulation Attacks Isha Gupta Hidde Lycklama Emanuel Opel Evan Rose Anwar Hithnawi AAML 215 1 0 11 Oct 2024
Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both Abhijnan Nath Changsoo Jung Ethan Seefried Nikhil Krishnaswamy 999 5 0 11 Oct 2024
TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees International Conference on Learning Representations (ICLR), 2024 Weibin Liao Xu Chu Yasha Wang LRM 413 13 0 10 Oct 2024
Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering Joris Postmus Steven Abreu LLMSV 712 8 0 09 Oct 2024
Diversity-Rewarded CFG Distillation International Conference on Learning Representations (ICLR), 2024 Geoffrey Cideron A. Agostinelli Johan Ferret Sertan Girgin Romuald Elie Olivier Bachem Sarah Perrin Alexandre Ramé 223 5 0 08 Oct 2024
Reinforcement Learning From Imperfect Corrective Actions And Proxy Rewards Zhaohui Jiang Xuening Feng Paul Weng Yifei Zhu Yan Song Tianze Zhou Yujing Hu Tangjie Lv Changjie Fan 286 3 0 08 Oct 2024
Self-rationalization improves LLM as a fine-grained judge Prapti Trivedi Aditya Gulati Oliver Molenschot Meghana Arakkal Rajeev Rajkumar Ramamurthy Keith Stevens Tanveesh Singh Chaudhery Jahnavi Jambholkar James Zou Nazneen Rajani LRM 255 15 0 07 Oct 2024
OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions Yu-Shin Huang Peter Just Krishna Narayanan Chao Tian 262 15 0 06 Oct 2024
Moral Alignment for LLM Agents International Conference on Learning Representations (ICLR), 2024 Elizaveta Tennant Stephen Hailes Mirco Musolesi 455 21 0 02 Oct 2024
Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models International Conference on Learning Representations (ICLR), 2024 Angela Lopez-Cardona Carlos Segura Alexandros Karatzoglou Sergi Abadal Ioannis Arapakis ALM 487 8 0 02 Oct 2024
Constraint-Aware Refinement for Safety Verification of Neural Feedback Loops IEEE Control Systems Letters (L-CSS), 2024 Nicholas Rober Jonathan P. How 231 4 0 30 Sep 2024
From homeostasis to resource sharing: Biologically and economically aligned multi-objective multi-agent gridworld-based AI safety benchmarks Roland Pihlakas 284 0 0 30 Sep 2024
Training Language Models to Win Debates with Self-Play Improves Judge Accuracy Samuel Arnesen David Rein Julian Michael ELM 207 9 0 25 Sep 2024
Reward-Robust RLHF in LLMs Yuzi Yan Xingzhou Lou Jialian Li Yiping Zhang Jian Xie Chao Yu Yu Wang Dong Yan Yuan Shen 344 17 0 18 Sep 2024
Adaptive Language-Guided Abstraction from Contrastive Explanations Conference on Robot Learning (CoRL), 2024 Andi Peng Belinda Z. Li Ilia Sucholutsky Nishanth Kumar Julie A. Shah Jacob Andreas Andreea Bobu OffRL 188 5 0 12 Sep 2024
Prompt Baking Aman Bhargava Cameron Witkowski Alexander Detkov Matt W. Thomson AI4CE 317 3 0 04 Sep 2024
Revisiting Safe Exploration in Safe Reinforcement learning David Eckel Baohe Zhang Joschka Bödecker 197 0 0 02 Sep 2024
DNN-GDITD: Out-of-distribution detection via Deep Neural Network based Gaussian Descriptor for Imbalanced Tabular Data Priyanka Chudasama Anil Surisetty Aakarsh Malhotra Alok Singh 212 0 0 02 Sep 2024
Logit Scaling for Out-of-Distribution Detection Machine Vision and Applications (MVA), 2024 Andrija Djurisic Rosanne Liu Mladen Nikolic OODD 212 2 0 02 Sep 2024