ResearchTrend.AI
Concrete Problems in AI Safety

21 June 2016
Dario Amodei
C. Olah
Jacob Steinhardt
Paul Christiano
John Schulman
Dandelion Mané

Papers citing "Concrete Problems in AI Safety"

50 / 1,371 papers shown
Dataset Poisoning Attacks on Behavioral Cloning Policies
Akansha Kalra
Soumil Datta
Ethan Gilmore
Duc La
Guanhong Tao
Daniel S. Brown
AAML, OffRL
159
0
0
26 Nov 2025
The Horcrux: Mechanistically Interpretable Task Decomposition for Detecting and Mitigating Reward Hacking in Embodied AI Systems
Subramanyam Sahoo
Jared Junkin
81
0
0
22 Nov 2025
Reward Engineering for Spatial Epidemic Simulations: A Reinforcement Learning Platform for Individual Behavioral Learning
Radman Rakhshandehroo
Daniel Coombs
32
0
0
22 Nov 2025
Realist and Pluralist Conceptions of Intelligence and Their Implications on AI Research
Ninell Oldenburg
Ruchira Dhar
Anders Søgaard
113
0
0
19 Nov 2025
From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems
Brendan Gho
Suman Muppavarapu
Afnan Shaik
Tyson Tsay
James Begin
Kevin Zhu
Archana Vaidheeswaran
Vasu Sharma
LLMAG
116
0
0
18 Nov 2025
Robust Experimental Design via Generalised Bayesian Inference
Yasir Zubayr Barlas
Sabina J. Sloman
Samuel Kaski
72
0
0
10 Nov 2025
Large Language Models Develop Novel Social Biases Through Adaptive Exploration
Addison J. Wu
Ryan Liu
Xuechunzi Bai
Thomas Griffiths
104
0
0
08 Nov 2025
Trustworthy LLM-Mediated Communication: Evaluating Information Fidelity in LLM as a Communicator (LAAC) Framework in Multiple Application Domains
Mohammed Musthafa Rafi
Adarsh Krishnamurthy
Aditya Balu
72
0
0
06 Nov 2025
Sparse, self-organizing ensembles of local kernels detect rare statistical anomalies
Gaia Grosso
Sai Sumedh R. Hindupur
Thomas Fel
Samuel Bright-Thonney
Philip Harris
Demba Ba
217
1
0
05 Nov 2025
Trustworthy Quantum Machine Learning: A Roadmap for Reliability, Robustness, and Security in the NISQ Era
Ferhat Ozgur Catak
Jungwon Seo
Umit Cali
56
0
0
04 Nov 2025
Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences
Joshua Ashkinaze
Hua Shen
Sai Avula
Eric Gilbert
Ceren Budak
VLM
239
0
0
03 Nov 2025
Lyapunov Stability Learning with Nonlinear Control via Inductive Biases
Yupu Lu
Shijie Lin
Hao Xu
Zeqing Zhang
Jia Pan
52
0
0
03 Nov 2025
Human-AI Complementarity: A Goal for Amplified Oversight
Rishub Jain
Sophie Bridgers
Lili Janzer
Rory Greig
Tian Huey Teh
Vladimir Mikulik
77
2
0
30 Oct 2025
Enhancing ECG Classification Robustness with Lightweight Unsupervised Anomaly Detection Filters
Mustafa Fuad Rifet Ibrahim
Maurice Meijer
Alexander Schlaefer
Peer Stelldinger
122
0
0
30 Oct 2025
The Information-Theoretic Imperative: Compression and the Epistemic Foundations of Intelligence
Christian Dittrich
Jennifer Flygare Kinne
CML
167
0
0
29 Oct 2025
Decision-Making Amid Information-Based Threats in Sociotechnical Systems: A Review
Aaron R. Allred
Erin E. Richardson
Sarah R. Bostrom
James Crum
Cara Spencer
Chad Tossell
Richard E. Niemeyer
Leanne Hirshfield
Allison P.A. Hayman
44
0
0
28 Oct 2025
Learning "Partner-Aware" Collaborators in Multi-Party Collaboration
Abhijnan Nath
Nikhil Krishnaswamy
80
0
0
26 Oct 2025
Scalable Oversight via Partitioned Human Supervision
Ren Yin
Takashi Ishida
Masashi Sugiyama
76
0
0
26 Oct 2025
Weak-to-Strong Generalization under Distribution Shifts
Myeongho Jeon
Jan Sobotka
Suhwan Choi
Maria Brbić
OOD
144
0
0
24 Oct 2025
Towards Scalable Oversight with Collaborative Multi-Agent Debate in Error Detection
Yongqiang Chen
Gang Niu
James Cheng
Bo Han
Masashi Sugiyama
76
0
0
23 Oct 2025
ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases
Ziqian Zhong
Aditi Raghunathan
Nicholas Carlini
52
2
0
23 Oct 2025
Ask a Strong LLM Judge when Your Reward Model is Uncertain
Zhenghao Xu
Qin Lu
Qingru Zhang
Liang Qiu
Ilgee Hong
...
Yao Liu
Haoming Jiang
Lihong Li
Hyokun Yun
Tuo Zhao
88
0
0
23 Oct 2025
The Mirror Loop: Recursive Non-Convergence in Generative Reasoning Systems
Bentley DeVilling
ReLM, LRM
301
0
0
23 Oct 2025
Subliminal Corruption: Mechanisms, Thresholds, and Interpretability
Reya Vir
Sarvesh Bhatnagar
52
0
0
22 Oct 2025
Safe But Not Sorry: Reducing Over-Conservatism in Safety Critics via Uncertainty-Aware Modulation
Daniel Bethell
Simos Gerasimou
R. Calinescu
Calum Imrie
60
0
0
21 Oct 2025
Rectifying Shortcut Behaviors in Preference-based Reward Learning
Wenqian Ye
Guangtao Zheng
Aidong Zhang
76
0
0
21 Oct 2025
Beyond Binary Out-of-Distribution Detection: Characterizing Distributional Shifts with Multi-Statistic Diffusion Trajectories
Achref Jaziri
Martin Rogmann
Martin Mundt
Visvanathan Ramesh
189
0
0
20 Oct 2025
Consistent Zero-Shot Imitation with Contrastive Goal Inference
Kathryn Wantlin
Chongyi Zheng
Benjamin Eysenbach
132
0
0
20 Oct 2025
RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation
Yuquan Xue
Guanxing Lu
Zhenyu Wu
Chuanrui Zhang
Bofang Jia
Zhengyi Gu
Yansong Tang
Ziwei Wang
138
0
0
20 Oct 2025
MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
Yu Ying Chiu
Michael S. Lee
Rachel Calcott
Brandon Handoko
Paul de Font-Reaulx
...
Mantas Mazeika
Bing Liu
Yejin Choi
Mitchell L. Gordon
Sydney Levine
ELM, LRM
93
0
0
18 Oct 2025
Learning to Undo: Rollback-Augmented Reinforcement Learning with Reversibility Signals
Andrejs Sorstkins
Omer Tariq
Muhammad Bilal
OffRL
88
0
0
16 Oct 2025
Restoring Noisy Demonstration for Imitation Learning With Diffusion Models
IEEE Transactions on Neural Networks and Learning Systems (IEEE TNNLS), 2025
Shang-Fu Chen
Co Yong
Shao-Hua Sun
DiffM
88
0
0
16 Oct 2025
Putting on the Thinking Hats: A Survey on Chain of Thought Fine-tuning from the Perspective of Human Reasoning Mechanism
Xiaoshu Chen
Sihang Zhou
Ke Liang
Duanyang Yuan
Haoyuan Chen
Xiaoyu Sun
Linyuan Meng
Xinwang Liu
ReLM, LRM
185
0
0
15 Oct 2025
AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs
María Victoria Carro
Denise Alejandra Mester
Facundo Nieto
Oscar Agustín Stanchi
Guido Ernesto Bergman
...
Luca Nicolás Forziati Gangi
Francisca Gauna Selasco
Juan Gustavo Corvalán
Gerardo Simari
Maria Vanina Martinez
130
0
0
15 Oct 2025
Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking
Stephane Hatgis-Kessell
Logan Mondal Bhamidipaty
Emma Brunskill
81
0
0
14 Oct 2025
SafeMT: Multi-turn Safety for Multimodal Language Models
Han Zhu
Juntao Dai
Jiaming Ji
Haoran Li
Chengkun Cai
...
Chi-Min Chan
Boyuan Chen
Yaodong Yang
Sirui Han
Yike Guo
102
0
0
14 Oct 2025
From Literal to Liberal: A Meta-Prompting Framework for Eliciting Human-Aligned Exception Handling in Large Language Models
Imran Khan
LRM
72
1
0
14 Oct 2025
Cog-Rethinker: Hierarchical Metacognitive Reinforcement Learning for LLM Reasoning
Zexu Sun
Yongcheng Zeng
Erxue Min
Heyang Gao
Bokai Ji
Xu Chen
OffRL, ReLM, LRM
155
0
0
13 Oct 2025
PoU: Proof-of-Use to Counter Tool-Call Hacking in DeepResearch Agents
Shengjie Ma
Chenlong Deng
Jiaxin Mao
J. Huang
Teng Wang
Junjie Wu
Changwang Zhang
Jun Wang
72
1
0
13 Oct 2025
Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
Murad Dawood
Usama Ahmed Siddiquie
Shahram Khorshidi
Maren Bennewitz
108
0
0
13 Oct 2025
Source-Free Object Detection with Detection Transformer
IEEE Transactions on Image Processing (IEEE TIP), 2025
Huizai Yao
Sicheng Zhao
Shuo Lu
Hui Chen
Yangyang Li
Guoping Liu
Tengfei Xing
C. Yan
Jianhua Tao
Guiguang Ding
ViT
69
1
0
13 Oct 2025
Agentic Systems in Radiology: Design, Applications, Evaluation, and Challenges
Christian Bluethgen
Dave Van Veen
Daniel Truhn
Jakob Nikolas Kather
Michael Moor
...
Akshay S. Chaudhari
Thomas Frauenfelder
C. Langlotz
Michael Krauthammer
Farhad Nooralahzadeh
LM&MA, AI4CE
237
0
0
10 Oct 2025
Token Is All You Price
Weijie Zhong
41
0
0
10 Oct 2025
Label Semantics for Robust Hyperspectral Image Classification
Rafin Hassan
Zarin Tasnim Roshni
Rafiqul Bari
Alimul Islam
Nabeel Mohammed
Moshiur Farazi
Shafin Rahman
VLM
62
1
0
08 Oct 2025
Do LLMs Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT-OSS-20B
Nisar Ahmed
Muhammad Imran Zaman
Gulshan Saleem
Ali Hassan
LRM
91
0
0
08 Oct 2025
HybridFlow: Quantification of Aleatoric and Epistemic Uncertainty with a Single Hybrid Model
Peter Van Katwyk
Karianne J. Bergen
149
0
0
06 Oct 2025
Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails
Siwei Han
Jiaqi Liu
Yaofeng Su
Wenbo Duan
Xinyuan Liu
Cihang Xie
Mohit Bansal
Mingyu Ding
Linjun Zhang
Huaxiu Yao
104
1
0
06 Oct 2025
Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment
Radha Gulhane
Sathish Reddy Indurthi
OffRL, LRM
52
0
0
06 Oct 2025
Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning
Yunghwei Lai
Kaiming Liu
Ziyue Wang
Weizhi Ma
Yang Liu
LM&MA
119
0
0
05 Oct 2025
Moral Anchor System: A Predictive Framework for AI Value Alignment and Drift Prevention
Santhosh Kumar Ravindran
88
0
0
05 Oct 2025