ResearchTrend.AI
Concrete Problems in AI Safety

21 June 2016
Dario Amodei
C. Olah
Jacob Steinhardt
Paul Christiano
John Schulman
Dandelion Mané
arXiv: 1606.06565

Papers citing "Concrete Problems in AI Safety"

50 / 475 papers shown
Topology of Out-of-Distribution Examples in Deep Neural Networks
Esha Datta
Johanna Hennig
Eva Domschot
Connor Mattes
Michael R. Smith
72
0
0
21 Jan 2025
Episodic memory in AI agents poses risks that should be studied and mitigated
Chad DeChant
67
2
0
20 Jan 2025
Two Types of AI Existential Risk: Decisive and Accumulative
Atoosa Kasirzadeh
65
14
0
20 Jan 2025
Learning to Assist Humans without Inferring Rewards
Vivek Myers
Evan Ellis
Sergey Levine
Benjamin Eysenbach
Anca Dragan
43
3
0
17 Jan 2025
Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment
Chaoqi Wang
Zhuokai Zhao
Yibo Jiang
Zhaorun Chen
Chen Zhu
...
Jiayi Liu
Lizhu Zhang
Xiangjun Fan
Hao Ma
Sinong Wang
80
4
0
17 Jan 2025
Predictable Artificial Intelligence
Lexin Zhou
Pablo Antonio Moreno Casares
Fernando Martínez-Plumed
John Burden
Ryan Burnell
...
Seán Ó hÉigeartaigh
Danaja Rutar
Wout Schellaert
Konstantinos Voudouris
José Hernández-Orallo
51
2
0
08 Jan 2025
FairSense: Long-Term Fairness Analysis of ML-Enabled Systems
Yining She
Sumon Biswas
Christian Kastner
Eunsuk Kang
45
0
0
03 Jan 2025
Comprehensive Overview of Reward Engineering and Shaping in Advancing Reinforcement Learning Applications
Sinan Ibrahim
Mostafa Mostafa
Ali Jnadi
Hadi Salloum
Pavel Osinenko
OffRL
52
14
0
31 Dec 2024
Uncertainty quantification for improving radiomic-based models in radiation pneumonitis prediction
Chanon Puttanawarut
Romen Samuel Wabina
Nat Sirirutbunkajorn
AI4CE
41
0
0
27 Dec 2024
Neural Interactive Proofs
Lewis Hammond
Sam Adam-Day
AAML
92
2
0
12 Dec 2024
ProcessBench: Identifying Process Errors in Mathematical Reasoning
Chujie Zheng
Zizhuo Zhang
Beichen Zhang
Runji Lin
Keming Lu
Bowen Yu
Dayiheng Liu
Jingren Zhou
Junyang Lin
LRM
131
48
0
09 Dec 2024
Reinforcement Learning Enhanced LLMs: A Survey
Shuhe Wang
Shengyu Zhang
Jingyang Zhang
Runyi Hu
Xiaoya Li
Tianwei Zhang
Jiwei Li
Fei Wu
G. Wang
Eduard H. Hovy
OffRL
134
7
0
05 Dec 2024
Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both
Abhijnan Nath
Changsoo Jung
Ethan Seefried
Nikhil Krishnaswamy
197
1
0
11 Oct 2024
TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees
Weibin Liao
Xu Chu
Yasha Wang
LRM
52
6
0
10 Oct 2024
Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering
Joris Postmus
Steven Abreu
LLMSV
148
1
0
09 Oct 2024
Moral Alignment for LLM Agents
Elizaveta Tennant
Stephen Hailes
Mirco Musolesi
45
1
0
02 Oct 2024
Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models
Angela Lopez-Cardona
Carlos Segura
Alexandros Karatzoglou
Sergi Abadal
Ioannis Arapakis
ALM
62
2
0
02 Oct 2024
Prompt Baking
Aman Bhargava
Cameron Witkowski
Alexander Detkov
Matt W. Thomson
AI4CE
38
0
0
04 Sep 2024
Explainable Artificial Intelligence: A Survey of Needs, Techniques, Applications, and Future Direction
Melkamu Mersha
Khang Lam
Joseph Wood
Ali AlShami
Jugal Kalita
XAI
AI4TS
80
28
0
30 Aug 2024
Advances in Preference-based Reinforcement Learning: A Review
Youssef Abdelkareem
Shady Shehata
Fakhri Karray
OffRL
51
9
0
21 Aug 2024
Non-maximizing policies that fulfill multi-criterion aspirations in expectation
Simon Dima
Simon Fischer
J. Heitzig
Joss Oliver
28
1
0
08 Aug 2024
Need of AI in Modern Education: in the Eyes of Explainable AI (xAI)
Supriya Manna
Dionis Barcari
60
3
0
31 Jul 2024
BiGym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark
Nikita Chernyadev
Nicholas Backshall
Xiao Ma
Yunfan Lu
Younggyo Seo
Stephen James
29
11
0
10 Jul 2024
From Distributional to Overton Pluralism: Investigating Large Language Model Alignment
Thom Lake
Eunsol Choi
Greg Durrett
46
9
0
25 Jun 2024
WARP: On the Benefits of Weight Averaged Rewarded Policies
Alexandre Ramé
Johan Ferret
Nino Vieillard
Robert Dadashi
Léonard Hussenot
Pierre-Louis Cedoz
Pier Giuseppe Sessa
Sertan Girgin
Arthur Douillard
Olivier Bachem
62
14
0
24 Jun 2024
Improving robustness to corruptions with multiplicative weight perturbations
Trung Trinh
Markus Heinonen
Luigi Acerbi
Samuel Kaski
44
0
0
24 Jun 2024
Learning Run-time Safety Monitors for Machine Learning Components
Ozan Vardal
Richard Hawkins
Colin Paterson
Chiara Picardi
Daniel Omeiza
Lars Kunze
Ibrahim Habli
33
0
0
23 Jun 2024
Combine and Conquer: A Meta-Analysis on Data Shift and Out-of-Distribution Detection
Eduardo Dadalto
F. Alberge
Pierre Duhamel
Pablo Piantanida
OODD
57
0
0
23 Jun 2024
Input Conditioned Graph Generation for Language Agents
Lukas Vierling
Jie Fu
Kai Chen
LLMAG
63
2
0
17 Jun 2024
Exploring Parent-Child Perceptions on Safety in Generative AI: Concerns, Mitigation Strategies, and Design Implications
Yaman Yu
Tanusree Sharma
Melinda Hu
Justin Wang
Yang Wang
21
5
0
15 Jun 2024
Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs
Rui Yang
Ruomeng Ding
Yong Lin
Huan Zhang
Tong Zhang
44
43
0
14 Jun 2024
Learning Task Decomposition to Assist Humans in Competitive Programming
Jiaxin Wen
Ruiqi Zhong
Pei Ke
Zhihong Shao
Hongning Wang
Minlie Huang
ReLM
42
8
0
07 Jun 2024
A Survey of Language-Based Communication in Robotics
William Hunt
Sarvapali D. Ramchurn
Mohammad D. Soorati
LM&Ro
65
12
0
06 Jun 2024
Inductive Generalization in Reinforcement Learning from Specifications
Vignesh Subramanian
Rohit Kushwah
Subhajit Roy
Suguman Bansal
OffRL
41
0
0
05 Jun 2024
Relaxed Quantile Regression: Prediction Intervals for Asymmetric Noise
T. Pouplin
Alan Jeffares
Nabeel Seedat
Mihaela van der Schaar
58
3
0
05 Jun 2024
Feature contamination: Neural networks learn uncorrelated features and fail to generalize
Tianren Zhang
Chujie Zhao
Guanyu Chen
Yizhou Jiang
Feng Chen
OOD
MLT
OODD
77
3
0
05 Jun 2024
A Generalized Apprenticeship Learning Framework for Modeling Heterogeneous Student Pedagogical Strategies
Md Mirajul Islam
Xi Yang
J. Hostetter
Adittya Soukarjya Saha
Min Chi
29
1
0
04 Jun 2024
Offline Regularised Reinforcement Learning for Large Language Models Alignment
Pierre Harvey Richemond
Yunhao Tang
Daniel Guo
Daniele Calandriello
M. G. Azar
...
Gil Shamir
Rishabh Joshi
Tianqi Liu
Rémi Munos
Bilal Piot
OffRL
46
24
0
29 May 2024
Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding
Daniel Bethell
Simos Gerasimou
R. Calinescu
Calum Imrie
OffRL
OnRL
39
0
0
28 May 2024
WeiPer: OOD Detection using Weight Perturbations of Class Projections
Maximilian Granz
Manuel Heurich
Tim Landgraf
OODD
46
1
0
27 May 2024
Similarity-Navigated Conformal Prediction for Graph Neural Networks
Jianqing Song
Jianguo Huang
Wenyu Jiang
Baoming Zhang
Shuangjie Li
Chongjun Wang
43
2
0
23 May 2024
Understanding the performance gap between online and offline alignment algorithms
Yunhao Tang
Daniel Guo
Zeyu Zheng
Daniele Calandriello
Yuan Cao
...
Rémi Munos
Bernardo Avila-Pires
Michal Valko
Yong Cheng
Will Dabney
OffRL
OnRL
27
61
0
14 May 2024
Hybrid Convolutional Neural Networks with Reliability Guarantee
Hans Dermot Doran
Suzana Veljanovska
29
2
0
08 May 2024
The Elephant in the Room -- Why AI Safety Demands Diverse Teams
David Rostcheck
Lara Scheibling
33
0
0
07 May 2024
Explainable AI (XAI) in Image Segmentation in Medicine, Industry, and Beyond: A Survey
Rokas Gipiškis
Chun-Wei Tsai
Olga Kurasova
63
5
0
02 May 2024
Best Practices and Lessons Learned on Synthetic Data for Language Models
Ruibo Liu
Jerry W. Wei
Fangyu Liu
Chenglei Si
Yanzhe Zhang
...
Steven Zheng
Daiyi Peng
Diyi Yang
Denny Zhou
Andrew M. Dai
SyDa
EgoV
43
87
0
11 Apr 2024
Monitoring Fidelity of Online Reinforcement Learning Algorithms in Clinical Trials
Anna L. Trella
Kelly W. Zhang
Inbal Nahum-Shani
Vivek Shetty
Iris Yan
Finale Doshi-Velez
Susan A. Murphy
OffRL
OnRL
26
3
0
26 Feb 2024
LLM-based NLG Evaluation: Current Status and Challenges
Mingqi Gao
Xinyu Hu
Jie Ruan
Xiao Pu
Xiaojun Wan
ELM
LM&MA
71
29
0
02 Feb 2024
A Heterogeneous RISC-V based SoC for Secure Nano-UAV Navigation
Luca Valente
Alessandro Nadalini
Asif Veeran
Mattia Sinigaglia
Bruno Sá
...
Baker Mohammad
Sandro Pinto
Daniele Palossi
Luca Benini
Davide Rossi
35
5
0
07 Jan 2024
LLM-SAP: Large Language Models Situational Awareness Based Planning
Liman Wang
Hanyang Zhong
LLMAG
35
2
0
26 Dec 2023