Concrete Problems in AI Safety

21 June 2016

Papers citing "Concrete Problems in AI Safety"

50 / 1,379 papers shown

Adaptive Language-Guided Abstraction from Contrastive Explanations

Conference on Robot Learning (CoRL), 2024

Andi Peng

Belinda Z. Li

Ilia Sucholutsky

Nishanth Kumar

Julie A. Shah

Jacob Andreas

Andreea Bobu

OffRL

229

12 Sep 2024

368

04 Sep 2024

Revisiting Safe Exploration in Safe Reinforcement learning

David Eckel

Baohe Zhang

Joschka Bödecker

238

02 Sep 2024

DNN-GDITD: Out-of-distribution detection via Deep Neural Network based Gaussian Descriptor for Imbalanced Tabular Data

228

02 Sep 2024

Logit Scaling for Out-of-Distribution Detection

Machine Vision and Applications (MVA), 2024

251

02 Sep 2024

Identifying and Clustering Counter Relationships of Team Compositions in PvP Games for Efficient Balance Analysis

Wei-Chen Chiu

I-Chen Wu

164

30 Aug 2024

Explainable Artificial Intelligence: A Survey of Needs, Techniques, Applications, and Future Direction

Khang Lam

711

30 Aug 2024

SpecGuard: Specification Aware Recovery for Robotic Autonomous Vehicles from Physical Attacks

Conference on Computer and Communications Security (CCS), 2024

180

27 Aug 2024

Advances in Preference-based Reinforcement Learning: A Review

IEEE International Conference on Systems, Man and Cybernetics (SMC), 2022

247

21 Aug 2024

Representation Alignment from Human Feedback for Cross-Embodiment Reward Learning from Mixed-Quality Demonstrations

Connor Mattson

Anurag Aribandi

Daniel S. Brown

279

10 Aug 2024

Your Classifier Can Be Secretly a Likelihood-Based OOD Detector

Jirayu Burapacheep

Yixuan Li

OODD

216

09 Aug 2024

Non-maximizing policies that fulfill multi-criterion aspirations in expectation

Algorithmic Decision Theory (ADT), 2024

295

08 Aug 2024

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

...

270

31 Jul 2024

Black box meta-learning intrinsic rewards for sparse-reward environments

288

31 Jul 2024

Need of AI in Modern Education: in the Eyes of Explainable AI (xAI)

Supriya Manna

Dionis Barcari

581

31 Jul 2024

Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey

...

371

31 Jul 2024

A Differential Dynamic Programming Framework for Inverse Reinforcement Learning

Kun Cao

Xinhang Xu

Wanxin Jin

Karl H. Johansson

Lihua Xie

148

29 Jul 2024

Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift

437

26 Jul 2024

CP-Prompt: Composition-Based Cross-modal Prompting for Domain-Incremental Continual Learning

210

22 Jul 2024

Building Machines that Learn and Think with People

...

Joshua B. Tenenbaum

311

22 Jul 2024

Data-Centric Human Preference with Rationales for Direct Preference Alignment

524

19 Jul 2024

This Probably Looks Exactly Like That: An Invertible Prototypical Network

Zachariah Carmichael

Timothy Redgrave

Daniel Gonzalez Cedre

Walter J. Scheirer

BDL

322

16 Jul 2024

BadRobot: Jailbreaking Embodied LLMs in the Physical World

...

Aishan Liu

Peijin Guo

Leo Yu Zhang

LM&Ro

438

16 Jul 2024

Evaluating AI Evaluation: Perils and Prospects

John Burden

ELM

223

12 Jul 2024

The Misclassification Likelihood Matrix: Some Classes Are More Likely To Be Misclassified Than Others

Mirela Reljan-Delaney

350

10 Jul 2024

BiGym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark

Xiao Ma

281

10 Jul 2024

AI Safety in Generative AI Large Language Models: A Survey

Lina Yao

358

06 Jul 2024

On scalable oversight with weak LLMs judging strong LLMs

...

Rohin Shah

309

05 Jul 2024

Spontaneous Reward Hacking in Iterative Self-Refinement

Jane Pan

He He

Samuel R. Bowman

Shi Feng

260

05 Jul 2024

FlowCon: Out-of-Distribution Detection using Flow-Based Contrastive Learning

Saandeep Aathreya

Shaun J. Canavan

OODD

283

03 Jul 2024

Reporting Risks in AI-based Assistive Technology Research: A Systematic Review

Zahra Ahmadi

Peter R. Lewis

Mahadeo Sukhai

140

01 Jul 2024

Revisiting Sparse Rewards for Goal-Reaching Reinforcement Learning

Yan Wang

269

29 Jun 2024

ProgressGym: Alignment with a Millennium of Moral Progress

Yaodong Yang

278

28 Jun 2024

Multimodal foundation world models for generalist embodied agents

272

26 Jun 2024

From Distributional to Overton Pluralism: Investigating Large Language Model Alignment

Thom Lake

Eunsol Choi

Greg Durrett

420

25 Jun 2024

WARP: On the Benefits of Weight Averaged Rewarded Policies

312

24 Jun 2024

OCALM: Object-Centric Assessment with Language Models

Kristian Kersting

284

24 Jun 2024

Improving robustness to corruptions with multiplicative weight perturbations

216

24 Jun 2024

Confidence Regulation Neurons in Language Models

242

24 Jun 2024

Learning Run-time Safety Monitors for Machine Learning Components

191

23 Jun 2024

Combine and Conquer: A Meta-Analysis on Data Shift and Out-of-Distribution Detection

Eduardo Dadalto

F. Alberge

Pierre Duhamel

Pablo Piantanida

OODD

266

23 Jun 2024

Combining Neural Networks and Symbolic Regression for Analytical Lyapunov Function Discovery

Jie Feng

Haohan Zou

Yuanyuan Shi

351

21 Jun 2024

Input Conditioned Graph Generation for Language Agents

133

17 Jun 2024

Exploring Parent-Child Perceptions on Safety in Generative AI: Concerns, Mitigation Strategies, and Design Implications

IEEE Symposium on Security and Privacy (S&P), 2024

153

15 Jun 2024

Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs

Neural Information Processing Systems (NeurIPS), 2024

Rui Yang

Ruomeng Ding

Yong Lin

Huan Zhang

Tong Zhang

291

14 Jun 2024

Beyond the Norms: Detecting Prediction Errors in Regression Models

Pablo Piantanida

329

11 Jun 2024

Confidence-aware Contrastive Learning for Selective Classification

International Conference on Machine Learning (ICML), 2024

Chao Qian

161

07 Jun 2024

The Reasonable Person Standard for AI

International Conference on Machine Learning (ICML), 2024

Sunayana Rane

07 Jun 2024

Learning Task Decomposition to Assist Humans in Competitive Programming

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Ruiqi Zhong

Hongning Wang

338

07 Jun 2024

A Survey of Language-Based Communication in Robotics

William Hunt

Sarvapali D. Ramchurn

Mohammad D. Soorati

LM&Ro

711

06 Jun 2024