1606.06565
v2 (latest)
Concrete Problems in AI Safety
21 June 2016
Dario Amodei
C. Olah
Jacob Steinhardt
Paul Christiano
John Schulman
Dandelion Mané
Papers citing "Concrete Problems in AI Safety"
50 / 1,374 papers shown
Title
Lies, Damned Lies, and Distributional Language Statistics: Persuasion and Deception with Large Language Models
Cameron R. Jones
Benjamin Bergen
439
12
0
22 Dec 2024
Predictive Monitoring of Black-Box Dynamical Systems
T. Henzinger
Fabian Kresse
Kaushik Mallik
Emily Yu
Đorđe Žikelić
152
1
0
21 Dec 2024
Neural Control and Certificate Repair via Runtime Monitoring
AAAI Conference on Artificial Intelligence (AAAI), 2024
Emily Yu
Đorđe Žikelić
T. Henzinger
AAML
186
1
0
17 Dec 2024
Neural Interactive Proofs
International Conference on Learning Representations (ICLR), 2024
Lewis Hammond
Sam Adam-Day
AAML
248
5
0
12 Dec 2024
ProcessBench: Identifying Process Errors in Mathematical Reasoning
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Chujie Zheng
Zizhuo Zhang
Beichen Zhang
Runji Lin
Keming Lu
Bowen Yu
Dayiheng Liu
Jingren Zhou
Junyang Lin
LRM
600
153
0
09 Dec 2024
Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy and Research
A. Feder Cooper
Christopher A. Choquette-Choo
Miranda Bogen
Matthew Jagielski
Katja Filippova
...
Hanna M. Wallach
Amy Cyphert
Katherine Lee
Nicolas Papernot
MU
AILaw
335
29
0
09 Dec 2024
Reinforcement Learning Enhanced LLMs: A Survey
Shuhe Wang
Shengyu Zhang
Jing Zhang
Runyi Hu
Xiaoya Li
Minlie Huang
Jiwei Li
Leilei Gan
G. Wang
Eduard H. Hovy
OffRL
662
48
0
05 Dec 2024
Enhancing Trust in Large Language Models with Uncertainty-Aware Fine-Tuning
R. Krishnan
Piyush Khanna
Omesh Tickoo
HILM
272
5
0
03 Dec 2024
The Evolution and Future Perspectives of Artificial Intelligence Generated Content
Chengzhang Zhu
Luobin Cui
Ying Tang
Jiacun Wang
367
2
0
02 Dec 2024
Challenges in Human-Agent Communication
Gagan Bansal
J. W. Vaughan
Saleema Amershi
Eric Horvitz
Adam Fourney
Hussein Mozannar
Victor C. Dibia
Daniel S. Weld
LLMAG
AAML
AI4CE
249
10
0
28 Nov 2024
Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers
Benedikt Stroebl
Sayash Kapoor
Arvind Narayanan
LRM
489
42
0
26 Nov 2024
Trustworthy artificial intelligence in the energy sector: Landscape analysis and evaluation framework
International Conference on Engineering, Technology and Innovation (ICE/IT), 2024
Sotiris Pelekis
Evangelos Karakolis
G. Lampropoulos
S. Mouzakitis
Ourania Markaki
Christos Ntanos
D. Askounis
320
2
0
25 Nov 2024
Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI
Computer Vision and Pattern Recognition (CVPR), 2024
Won Jun Kim
Hyungjin Chung
Jaemin Kim
Sangmin Lee
Byeongsu Sim
Jong Chul Ye
DiffM
349
2
0
22 Nov 2024
Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies
Neural Information Processing Systems (NeurIPS), 2024
Frédéric Berdoz
Roger Wattenhofer
200
1
0
21 Nov 2024
The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models
Xikang Yang
Xuehai Tang
Jizhong Han
Songlin Hu
239
4
0
18 Nov 2024
SoK: The Security-Safety Continuum of Multimodal Foundation Models through Information Flow and Global Game-Theoretic Analysis of Asymmetric Threats
Ruoxi Sun
Jiamin Chang
Hammond Pearce
Chaowei Xiao
B. Li
Qi Wu
Surya Nepal
Minhui Xue
613
0
0
17 Nov 2024
Multi-agent Path Finding for Timed Tasks using Evolutionary Games
Sheryl Paul
Anand Balakrishnan
Xin Qin
Jyotirmoy V. Deshmukh
162
2
0
15 Nov 2024
Noisy Zero-Shot Coordination: Breaking The Common Knowledge Assumption In Zero-Shot Coordination Games
Usman Anwar
Ashish Pandian
Jia Wan
David M. Krueger
Jakob N. Foerster
287
0
0
07 Nov 2024
Improving self-training under distribution shifts via anchored confidence with theoretical guarantees
Neural Information Processing Systems (NeurIPS), 2024
Taejong Joo
Diego Klabjan
UQCV
282
0
0
01 Nov 2024
Progressive Safeguards for Safe and Model-Agnostic Reinforcement Learning
Nabil Omi
Hosein Hasanbeig
Hiteshi Sharma
Sriram K. Rajamani
S. Sen
195
0
0
31 Oct 2024
Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment
Neural Information Processing Systems (NeurIPS), 2024
Weichao Zhou
Wenchao Li
219
2
0
31 Oct 2024
Adaptive Alignment: Dynamic Preference Adjustments via Multi-Objective Reinforcement Learning for Pluralistic AI
Hadassah Harland
Richard Dazeley
Peter Vamplew
Hashini Senaratne
Bahareh Nakisa
Francisco Cruz
321
3
0
31 Oct 2024
Democratizing Reward Design for Personal and Representative Value-Alignment
Carter Blair
Kate Larson
Edith Law
167
0
0
29 Oct 2024
Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks
Dario Pasquini
Evgenios M. Kornaropoulos
G. Ateniese
AAML
197
10
0
28 Oct 2024
Combining Theory of Mind and Kindness for Self-Supervised Human-AI Alignment
Joshua T. S. Hewson
147
1
0
21 Oct 2024
We Urgently Need Intrinsically Kind Machines
Joshua T. S. Hewson
SyDa
124
0
0
21 Oct 2024
Balancing Label Quantity and Quality for Scalable Elicitation
Alex Troy Mallen
Nora Belrose
141
3
0
17 Oct 2024
Potential-Based Intrinsic Motivation: Preserving Optimality With Complex, Non-Markovian Shaping Rewards
Grant C. Forbes
Leonardo Villalobos-Arias
Jianxun Wang
Arnav Jhala
David L. Roberts
231
2
0
16 Oct 2024
Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse Reinforcement Learning
Jared Joselowitz
Ritam Majumdar
Arjun Jagota
Matthieu Bou
Nyal Patel
Satyapriya Krishna
Sonali Parbhoo
195
0
0
16 Oct 2024
Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning
Bokai Hu
Sai Ashish Somayajula
Xin Pan
Zihan Huang
OffRL
391
5
0
14 Oct 2024
On Goodhart's law, with an application to value alignment
El-Mahdi El-Mhamdi
Lê-Nguyên Hoang
115
4
0
12 Oct 2024
Fragile Giants: Understanding the Susceptibility of Models to Subpopulation Attacks
Isha Gupta
Hidde Lycklama
Emanuel Opel
Evan Rose
Anwar Hithnawi
AAML
215
1
0
11 Oct 2024
Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both
Abhijnan Nath
Changsoo Jung
Ethan Seefried
Nikhil Krishnaswamy
999
5
0
11 Oct 2024
TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees
International Conference on Learning Representations (ICLR), 2024
Weibin Liao
Xu Chu
Yasha Wang
LRM
413
13
0
10 Oct 2024
Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering
Joris Postmus
Steven Abreu
LLMSV
712
8
0
09 Oct 2024
Diversity-Rewarded CFG Distillation
International Conference on Learning Representations (ICLR), 2024
Geoffrey Cideron
A. Agostinelli
Johan Ferret
Sertan Girgin
Romuald Elie
Olivier Bachem
Sarah Perrin
Alexandre Ramé
223
5
0
08 Oct 2024
Reinforcement Learning From Imperfect Corrective Actions And Proxy Rewards
Zhaohui Jiang
Xuening Feng
Paul Weng
Yifei Zhu
Yan Song
Tianze Zhou
Yujing Hu
Tangjie Lv
Changjie Fan
286
3
0
08 Oct 2024
Self-rationalization improves LLM as a fine-grained judge
Prapti Trivedi
Aditya Gulati
Oliver Molenschot
Meghana Arakkal Rajeev
Rajkumar Ramamurthy
Keith Stevens
Tanveesh Singh Chaudhery
Jahnavi Jambholkar
James Zou
Nazneen Rajani
LRM
255
15
0
07 Oct 2024
OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions
Yu-Shin Huang
Peter Just
Krishna Narayanan
Chao Tian
262
15
0
06 Oct 2024
Moral Alignment for LLM Agents
International Conference on Learning Representations (ICLR), 2024
Elizaveta Tennant
Stephen Hailes
Mirco Musolesi
455
21
0
02 Oct 2024
Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models
International Conference on Learning Representations (ICLR), 2024
Angela Lopez-Cardona
Carlos Segura
Alexandros Karatzoglou
Sergi Abadal
Ioannis Arapakis
ALM
487
8
0
02 Oct 2024
Constraint-Aware Refinement for Safety Verification of Neural Feedback Loops
IEEE Control Systems Letters (L-CSS), 2024
Nicholas Rober
Jonathan P. How
231
4
0
30 Sep 2024
From homeostasis to resource sharing: Biologically and economically aligned multi-objective multi-agent gridworld-based AI safety benchmarks
Roland Pihlakas
284
0
0
30 Sep 2024
Training Language Models to Win Debates with Self-Play Improves Judge Accuracy
Samuel Arnesen
David Rein
Julian Michael
ELM
207
9
0
25 Sep 2024
Reward-Robust RLHF in LLMs
Yuzi Yan
Xingzhou Lou
Jialian Li
Yiping Zhang
Jian Xie
Chao Yu
Yu Wang
Dong Yan
Yuan Shen
344
17
0
18 Sep 2024
Adaptive Language-Guided Abstraction from Contrastive Explanations
Conference on Robot Learning (CoRL), 2024
Andi Peng
Belinda Z. Li
Ilia Sucholutsky
Nishanth Kumar
Julie A. Shah
Jacob Andreas
Andreea Bobu
OffRL
188
5
0
12 Sep 2024
Prompt Baking
Aman Bhargava
Cameron Witkowski
Alexander Detkov
Matt W. Thomson
AI4CE
317
3
0
04 Sep 2024
Revisiting Safe Exploration in Safe Reinforcement learning
David Eckel
Baohe Zhang
Joschka Bödecker
197
0
0
02 Sep 2024
DNN-GDITD: Out-of-distribution detection via Deep Neural Network based Gaussian Descriptor for Imbalanced Tabular Data
Priyanka Chudasama
Anil Surisetty
Aakarsh Malhotra
Alok Singh
212
0
0
02 Sep 2024
Logit Scaling for Out-of-Distribution Detection
Machine Vision and Applications (MVA), 2024
Andrija Djurisic
Rosanne Liu
Mladen Nikolic
OODD
212
2
0
02 Sep 2024