arXiv:1606.06565

Concrete Problems in AI Safety

21 June 2016

Jacob Steinhardt

Paul Christiano

Dandelion Mané

Papers citing "Concrete Problems in AI Safety"

50 / 1,379 papers shown

Inductive Generalization in Reinforcement Learning from Specifications

Vignesh Subramanian

326

1

0

05 Jun 2024

Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

Rafael Rafailov

Yaswanth Chittepu

Harshit S. Sikchi

Chelsea Finn

362

98

0

05 Jun 2024

Feature contamination: Neural networks learn uncorrelated features and fail to generalize

Feng Chen

434

9

0

05 Jun 2024

Relaxed Quantile Regression: Prediction Intervals for Asymmetric Noise

Mihaela van der Schaar

751

7

0

05 Jun 2024

A Generalized Apprenticeship Learning Framework for Modeling Heterogeneous Student Pedagogical Strategies

Md Mirajul Islam

Adittya Soukarjya Saha

235

2

0

04 Jun 2024

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Philip Anastassiou

Yuanzhe Chen

Zhuo Chen

...

310

250

0

04 Jun 2024

SaVeR: Optimal Data Collection Strategy for Safe Policy Evaluation in Tabular MDP

Subhojyoti Mukherjee

Josiah P. Hanna

Robert Nowak

233

0

0

04 Jun 2024

Zero-Shot Out-of-Distribution Detection with Outlier Label Exposure

244

6

0

03 Jun 2024

Policy Verification in Stochastic Dynamical Systems Using Logarithmic Neural Certificates

Thom S. Badings

Sebastian Junges

416

0

0

02 Jun 2024

Investigating Calibration and Corruption Robustness of Post-hoc Pruned Perception CNNs: An Image Classification Benchmark Study

Gesina Schwalbe

210

4

0

31 May 2024

AI Safety: A Climb To Armageddon?

101

2

0

30 May 2024

AI Risk Management Should Incorporate Both Safety and Security

Yi Zeng

Edoardo Debenedetti

...

Peter Henderson

271

20

0

29 May 2024

Offline Regularised Reinforcement Learning for Large Language Models Alignment

Pierre Harvey Richemond

Daniele Calandriello

...

Rishabh Joshi

Bilal Piot

238

41

0

29 May 2024

Efficient Model-agnostic Alignment via Bayesian Persuasion

282

9

0

29 May 2024

Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding

Simos Gerasimou

469

1

0

28 May 2024

Exploring and steering the moral compass of Large Language Models

Alejandro Tlaie

223

6

0

27 May 2024

WeiPer: OOD Detection using Weight Perturbations of Class Projections

Maximilian Granz

317

3

0

27 May 2024

Crafting Interpretable Embeddings by Asking LLMs Questions

Richard Antonello

Alexander G. Huth

239

11

0

26 May 2024

Pragmatic Feature Preferences: Learning Reward-Relevant Preferences from Human Input

International Conference on Machine Learning (ICML), 2024

223

4

0

23 May 2024

Similarity-Navigated Conformal Prediction for Graph Neural Networks

Neural Information Processing Systems (NeurIPS), 2024

317

7

0

23 May 2024

Online Self-Preferring Language Models

Kele Xu

Cheng Yang

179

0

0

23 May 2024

Human-AI Safety: A Descendant of Generative AI and Control Systems Safety

Andrea V. Bajcsy

248

8

0

16 May 2024

Understanding the performance gap between online and offline alignment algorithms

Daniele Calandriello

...

Bernardo Avila-Pires

294

93

0

14 May 2024

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

David Dalrymple

Stuart J. Russell

...

Ding Zhao

Joshua Tenenbaum

350

92

0

10 May 2024

One vs. Many: Comprehending Accurate Information from Multiple Erroneous and Inconsistent AI Generations

Conference on Fairness, Accountability and Transparency (FAccT), 2024

John Joon Young Chung

230

28

0

09 May 2024

Interpretable Cross-Examination Technique (ICE-T): Using highly informative features to boost LLM performance

140

1

0

08 May 2024

Hybrid Convolutional Neural Networks with Reliability Guarantee

Hans Dermot Doran

Suzana Veljanovska

333

2

0

08 May 2024

Compressed Latent Replays for Lightweight Continual Learning on Spiking Neural Networks

Alberto Dequino

Alessio Carpegna

Alessandro Savino

Luca Benini

Francesco Conti

264

3

0

08 May 2024

The Elephant in the Room -- Why AI Safety Demands Diverse Teams

David Rostcheck

Lara Scheibling

167

1

0

07 May 2024

Reverse Forward Curriculum Learning for Extreme Sample and Demonstration Efficiency in Reinforcement Learning

222

9

0

06 May 2024

Semantic Objective Functions: A distribution-aware method for adding logical constraints in deep learning

International Conference on Agents and Artificial Intelligence (ICAART), 2024

Miguel Ángel Méndez Lucero

Enrique Bojorquez Gallardo

183

2

0

03 May 2024

Generative AI in Cybersecurity

Michael P. Roman

Arturo F. Ehuan

148

10

0

02 May 2024

Explainable AI (XAI) in Image Segmentation in Medicine, Industry, and Beyond: A Survey

ICT Express (IE), 2024

Rokas Gipiškis

333

30

0

02 May 2024

Efficient Exploration of Image Classifier Failures with Bayesian Optimization and Text-to-Image Models

Houssem Ouertatani

Stéphane Herbin

179

0

0

26 Apr 2024

Taming False Positives in Out-of-Distribution Detection with Human Feedback

Harit Vishwakarma

Ramya Korlakai Vinayak

266

9

0

25 Apr 2024

Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents

Max Kleiman-Weiner

Bernhard Schölkopf

Mrinmaya Sachan

Amélie Reymond

396

53

0

25 Apr 2024

FedSI: Federated Subnetwork Inference for Efficient Uncertainty Quantification

209

2

0

24 Apr 2024

Stepwise Alignment for Constrained Language Model Policy Optimization

255

17

0

17 Apr 2024

Toward a Realistic Benchmark for Out-of-Distribution Detection

Pietro Recalcati

Fabrizio Lamberti

286

1

0

16 Apr 2024

Best Practices and Lessons Learned on Synthetic Data for Language Models

Ruibo Liu

...

Diyi Yang

304

112

0

11 Apr 2024

Reducing Human-Robot Goal State Divergence with Environment Design

210

2

0

10 Apr 2024

Automatic Authorities: Power and AI

122

3

0

09 Apr 2024

Deep Learning-Based Out-of-distribution Source Code Data Identification: How Far Have We Gone?

Carsten Rudolph

229

2

0

09 Apr 2024

Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences

Michael Santacroce

Ahmed Hassan Awadallah

472

155

0

04 Apr 2024

Laser Learning Environment: A new environment for coordination-critical multi-agent tasks

Yannick Molinghen

Mark Van Achter

213

1

0

04 Apr 2024

Regularized Best-of-N Sampling with Minimum Bayes Risk Objective for Language Model Alignment

Tetsuro Morimura

443

3

0

01 Apr 2024

Coverage-Guaranteed Prediction Sets for Out-of-Distribution Data

201

3

0

29 Mar 2024

Open-Set Recognition in the Age of Vision-Language Models

Niko Sünderhauf

250

10

0

25 Mar 2024

Scaling Learning based Policy Optimization for Temporal Tasks via Dropout

Danil Prokhorov

Georgios Fainekos

Jyotirmoy Deshmukh

178

2

0

23 Mar 2024

On the Detection of Anomalous or Out-Of-Distribution Data in Vision Models Using Statistical Techniques

David JP O'Sullivan

Nikola S. Nikolov

188

1

0

21 Mar 2024
