1606.06565
v1
v2 (latest)
Concrete Problems in AI Safety
21 June 2016
Dario Amodei
C. Olah
Jacob Steinhardt
Paul Christiano
John Schulman
Dandelion Mané
Papers citing "Concrete Problems in AI Safety"
50 / 1,379 papers shown
Adaptive Language-Guided Abstraction from Contrastive Explanations
Conference on Robot Learning (CoRL), 2024
Andi Peng
Belinda Z. Li
Ilia Sucholutsky
Nishanth Kumar
Julie A. Shah
Jacob Andreas
Andreea Bobu
OffRL
229
6
0
12 Sep 2024
Prompt Baking
Aman Bhargava
Cameron Witkowski
Alexander Detkov
Matt W. Thomson
AI4CE
368
3
0
04 Sep 2024
Revisiting Safe Exploration in Safe Reinforcement learning
David Eckel
Baohe Zhang
Joschka Bödecker
238
0
0
02 Sep 2024
DNN-GDITD: Out-of-distribution detection via Deep Neural Network based Gaussian Descriptor for Imbalanced Tabular Data
Priyanka Chudasama
Anil Surisetty
Aakarsh Malhotra
Alok Singh
228
0
0
02 Sep 2024
Logit Scaling for Out-of-Distribution Detection
Machine Vision and Applications (MVA), 2024
Andrija Djurisic
Rosanne Liu
Mladen Nikolic
OODD
251
2
0
02 Sep 2024
Identifying and Clustering Counter Relationships of Team Compositions in PvP Games for Efficient Balance Analysis
Chiu-Chou Lin
Yu-Wei Shih
Kuei-Ting Kuo
Yu-Cheng Chen
Chien-Hua Chen
Wei-Chen Chiu
I-Chen Wu
164
0
0
30 Aug 2024
Explainable Artificial Intelligence: A Survey of Needs, Techniques, Applications, and Future Direction
Melkamu Mersha
Khang Lam
Joseph Wood
Ali AlShami
Jugal Kalita
XAI
AI4TS
711
94
0
30 Aug 2024
SpecGuard: Specification Aware Recovery for Robotic Autonomous Vehicles from Physical Attacks
Conference on Computer and Communications Security (CCS), 2024
Pritam Dash
Ethan Chan
Karthik Pattabiraman
AAML
180
10
0
27 Aug 2024
Advances in Preference-based Reinforcement Learning: A Review
IEEE International Conference on Systems, Man and Cybernetics (SMC), 2022
Youssef Abdelkareem
Shady Shehata
Fakhri Karray
OffRL
247
16
0
21 Aug 2024
Representation Alignment from Human Feedback for Cross-Embodiment Reward Learning from Mixed-Quality Demonstrations
Connor Mattson
Anurag Aribandi
Daniel S. Brown
279
0
0
10 Aug 2024
Your Classifier Can Be Secretly a Likelihood-Based OOD Detector
Jirayu Burapacheep
Yixuan Li
OODD
216
5
0
09 Aug 2024
Non-maximizing policies that fulfill multi-criterion aspirations in expectation
Algorithmic Decision Theory (ADT), 2024
Simon Dima
Simon Fischer
J. Heitzig
Joss Oliver
295
1
0
08 Aug 2024
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
Richard Ren
Steven Basart
Adam Khoja
Alice Gatti
Long Phan
...
Alexander Pan
Gabriel Mukobi
Ryan H. Kim
Stephen Fitz
Dan Hendrycks
ELM
270
47
0
31 Jul 2024
Black box meta-learning intrinsic rewards for sparse-reward environments
Octavio Pappalardo
Rodrigo Ramele
Juan Miguel Santos
OffRL
288
1
0
31 Jul 2024
Need of AI in Modern Education: in the Eyes of Explainable AI (xAI)
Supriya Manna
Dionis Barcari
581
3
0
31 Jul 2024
Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey
Atsuyuki Miyai
Jingkang Yang
Jingyang Zhang
Yifei Ming
Sisir Dhakal
...
Yixuan Li
Hai "Helen" Li
Ziwei Liu
Toshihiko Yamasaki
Kiyoharu Aizawa
371
30
0
31 Jul 2024
A Differential Dynamic Programming Framework for Inverse Reinforcement Learning
Kun Cao
Xinhang Xu
Wanxin Jin
Karl H. Johansson
Lihua Xie
148
3
0
29 Jul 2024
Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
Seongho Son
William Bankes
Sayak Ray Chowdhury
Brooks Paige
Ilija Bogunovic
437
8
0
26 Jul 2024
CP-Prompt: Composition-Based Cross-modal Prompting for Domain-Incremental Continual Learning
Yu Feng
Zhen Tian
Zhonghong Ou
Zongfu Han
Haoran Luo
Guangwei Zhang
Meina Song
CLL
VLM
210
18
0
22 Jul 2024
Building Machines that Learn and Think with People
Katherine M. Collins
Ilia Sucholutsky
Umang Bhatt
Kartik Chandra
Lionel Wong
...
Mark K. Ho
Vikash K. Mansinghka
Adrian Weller
Joshua B. Tenenbaum
Thomas Griffiths
311
84
0
22 Jul 2024
Data-Centric Human Preference with Rationales for Direct Preference Alignment
H. Just
Ming Jin
Anit Kumar Sahu
Huy Phan
Ruoxi Jia
524
3
0
19 Jul 2024
This Probably Looks Exactly Like That: An Invertible Prototypical Network
Zachariah Carmichael
Timothy Redgrave
Daniel Gonzalez Cedre
Walter J. Scheirer
BDL
322
6
0
16 Jul 2024
BadRobot: Jailbreaking Embodied LLMs in the Physical World
Hangtao Zhang
Chenyu Zhu
Xianlong Wang
Ziqi Zhou
Yichen Wang
...
Shengshan Hu
Aishan Liu
Peijin Guo
Leo Yu Zhang
LM&Ro
438
2
0
16 Jul 2024
Evaluating AI Evaluation: Perils and Prospects
John Burden
ELM
223
13
0
12 Jul 2024
The Misclassification Likelihood Matrix: Some Classes Are More Likely To Be Misclassified Than Others
Daniel Sikar
Artur Garcez
Robin Bloomfield
Tillman Weyde
Kaleem Peeroo
Naman Singh
Maeve Hutchinson
Dany Laksono
Mirela Reljan-Delaney
350
2
0
10 Jul 2024
BiGym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark
Nikita Chernyadev
Nicholas Backshall
Xiao Ma
Yunfan Lu
Younggyo Seo
Stephen James
281
27
0
10 Jul 2024
AI Safety in Generative AI Large Language Models: A Survey
Jaymari Chua
Yun Yvonna Li
Shiyi Yang
Chen Wang
Lina Yao
LM&MA
358
37
0
06 Jul 2024
On scalable oversight with weak LLMs judging strong LLMs
Zachary Kenton
Noah Y. Siegel
János Kramár
Jonah Brown-Cohen
Samuel Albanie
...
Rishabh Agarwal
David Lindner
Yunhao Tang
Noah D. Goodman
Rohin Shah
ELM
309
63
0
05 Jul 2024
Spontaneous Reward Hacking in Iterative Self-Refinement
Jane Pan
He He
Samuel R. Bowman
Shi Feng
260
17
0
05 Jul 2024
FlowCon: Out-of-Distribution Detection using Flow-Based Contrastive Learning
Saandeep Aathreya
Shaun J. Canavan
OODD
283
3
0
03 Jul 2024
Reporting Risks in AI-based Assistive Technology Research: A Systematic Review
Zahra Ahmadi
Peter R. Lewis
Mahadeo Sukhai
140
0
0
01 Jul 2024
Revisiting Sparse Rewards for Goal-Reaching Reinforcement Learning
Gautham Vasan
Yan Wang
Fahim Shahriar
James Bergstra
Martin Jägersand
A. R. Mahmood
269
11
0
29 Jun 2024
ProgressGym: Alignment with a Millennium of Moral Progress
Tianyi Qiu
Yang Zhang
Xuchuan Huang
Jasmine Xinze Li
Yalan Qin
Yaodong Yang
AI4TS
278
9
0
28 Jun 2024
Multimodal foundation world models for generalist embodied agents
Pietro Mazzaglia
Tim Verbelen
Bart Dhoedt
Rameswar Panda
Sai Rajeswar
OffRL
LM&Ro
272
1
0
26 Jun 2024
From Distributional to Overton Pluralism: Investigating Large Language Model Alignment
Thom Lake
Eunsol Choi
Greg Durrett
420
27
0
25 Jun 2024
WARP: On the Benefits of Weight Averaged Rewarded Policies
Alexandre Ramé
Johan Ferret
Nino Vieillard
Robert Dadashi
Léonard Hussenot
Pierre-Louis Cedoz
Pier Giuseppe Sessa
Sertan Girgin
Arthur Douillard
Olivier Bachem
312
32
0
24 Jun 2024
OCALM: Object-Centric Assessment with Language Models
Timo Kaufmann
Johannes Czech
Antonia Wüst
Quentin Delfosse
Kristian Kersting
Eyke Hüllermeier
LM&Ro
LRM
284
1
0
24 Jun 2024
Improving robustness to corruptions with multiplicative weight perturbations
Trung Trinh
Markus Heinonen
Luigi Acerbi
Samuel Kaski
216
2
0
24 Jun 2024
Confidence Regulation Neurons in Language Models
Alessandro Stolfo
Ben Wu
Wes Gurnee
Yonatan Belinkov
Xingyi Song
Mrinmaya Sachan
Neel Nanda
242
39
0
24 Jun 2024
Learning Run-time Safety Monitors for Machine Learning Components
Ozan Vardal
Richard Hawkins
Colin Paterson
Chiara Picardi
Daniel Omeiza
Lars Kunze
Ibrahim Habli
191
0
0
23 Jun 2024
Combine and Conquer: A Meta-Analysis on Data Shift and Out-of-Distribution Detection
Eduardo Dadalto
F. Alberge
Pierre Duhamel
Pablo Piantanida
OODD
266
0
0
23 Jun 2024
Combining Neural Networks and Symbolic Regression for Analytical Lyapunov Function Discovery
Jie Feng
Haohan Zou
Yuanyuan Shi
351
2
0
21 Jun 2024
Input Conditioned Graph Generation for Language Agents
Lukas Vierling
Jie Fu
Kai Chen
LLMAG
133
2
0
17 Jun 2024
Exploring Parent-Child Perceptions on Safety in Generative AI: Concerns, Mitigation Strategies, and Design Implications
IEEE Symposium on Security and Privacy (S&P), 2024
Yaman Yu
Tanusree Sharma
Melinda Hu
Justin Wang
Yang Wang
153
18
0
15 Jun 2024
Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs
Neural Information Processing Systems (NeurIPS), 2024
Rui Yang
Ruomeng Ding
Yong Lin
Huan Zhang
Tong Zhang
291
98
0
14 Jun 2024
Beyond the Norms: Detecting Prediction Errors in Regression Models
A. Altieri
Marco Romanelli
Georg Pichler
F. Alberge
Pablo Piantanida
329
1
0
11 Jun 2024
Confidence-aware Contrastive Learning for Selective Classification
International Conference on Machine Learning (ICML), 2024
Yu-Chang Wu
Shen-Huan Lyu
Haopu Shang
Xiangyu Wang
Chao Qian
161
5
0
07 Jun 2024
The Reasonable Person Standard for AI
International Conference on Machine Learning (ICML), 2024
Sunayana Rane
47
2
0
07 Jun 2024
Learning Task Decomposition to Assist Humans in Competitive Programming
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Jiaxin Wen
Ruiqi Zhong
Pei Ke
Zhihong Shao
Hongning Wang
Shiyu Huang
ReLM
338
12
0
07 Jun 2024
A Survey of Language-Based Communication in Robotics
William Hunt
Sarvapali D. Ramchurn
Mohammad D. Soorati
LM&Ro
711
17
0
06 Jun 2024