Concrete Problems in AI Safety

21 June 2016

Papers citing "Concrete Problems in AI Safety"

50 / 1,379 papers shown

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

...

626

987

18 Mar 2025

Superalignment with Dynamic Human Values

Florian Mai

David Kaczér

Nicholas Kluge Corrêa

Lucie Flek

302

17 Mar 2025

From Autonomous Agents to Integrated Systems, A New Paradigm: Orchestrated Distributed Intelligence

Krti Tallam

AI4CE

349

17 Mar 2025

Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

...

498

16 Mar 2025

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

441

126

14 Mar 2025

NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models

259

13 Mar 2025

Ensemble Learning for Large Language Models in Text and Code Generation: A Survey

332

13 Mar 2025

RPO: Fine-Tuning Visual Generative Models via Rich Vision-Language Preferences

573

13 Mar 2025

Generating Robot Constitutions & Benchmarks for Semantic Safety

405

11 Mar 2025

Mitigating Preference Hacking in Policy Optimization with Pessimism

291

10 Mar 2025

RePO: Understanding Preference Learning Through ReLU-Based Optimization

308

10 Mar 2025

Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity

306

08 Mar 2025

Towards Improving Reward Design in RL: A Reward Alignment Metric for RL Practitioners

Calarina Muslimani

Kerrick Johnstonbaugh

186

08 Mar 2025

Blockchain As a Platform For Artificial Intelligence (AI) Transparency

230

07 Mar 2025

ValuePilot: A Two-Phase Framework for Value-Driven Decision-Making

308

06 Mar 2025

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning

523

05 Mar 2025

Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models

294

02 Mar 2025

HALO: Robust Out-of-Distribution Detection via Joint Optimisation

537

27 Feb 2025

Societal Alignment Frameworks Can Improve LLM Alignment

...

1.0K

27 Feb 2025

Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers

407

27 Feb 2025

RIZE: Adaptive Regularization for Imitation Learning

Adib Karimi

Mohammad Mehdi Ebadzadeh

OOD

273

27 Feb 2025

Reward Shaping to Mitigate Reward Hacking in RLHF

615

26 Feb 2025

Decoupled Graph Energy-based Model for Node Out-of-Distribution Detection on Heterophilic Graphs

International Conference on Learning Representations (ICLR), 2025

450

25 Feb 2025

Logit Disagreement: OoD Detection with Bayesian Neural Networks

Kevin Raina

UQCV BDL UD PER

424

24 Feb 2025

A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics

742

21 Feb 2025

Robust Concept Erasure Using Task Vectors

450

21 Feb 2025

Alignment, Agency and Autonomy in Frontier AI: A Systems Engineering Perspective

Krti Tallam

195

20 Feb 2025

Leveraging Intermediate Representations for Better Out-of-Distribution Detection

Gianluca Guglielmo

Marc Masana

OODD

274

18 Feb 2025

Transformer Dynamics: A neuroscientific approach to interpretability of large language models

Jesseba Fernando

Grigori Guitchounts

AI4CE

237

17 Feb 2025

Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?

523

16 Feb 2025

FairDropout: Using Example-Tied Dropout to Enhance Generalization of Minority Groups

Géraldin Nanfack

Eugene Belilovsky

291

10 Feb 2025

Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis

Aran Nayebi

571

09 Feb 2025

Why human-AI relationships need socioaffective alignment

Humanities and Social Sciences Communications (HSSC), 2025

236

04 Feb 2025

Process-Supervised Reinforcement Learning for Code Generation

354

03 Feb 2025

A statistically consistent measure of semantic uncertainty using Language Models

Yi Liu

332

01 Feb 2025

Constrained Hybrid Metaheuristic Algorithm for Probabilistic Neural Networks Learning

Information Sciences (Inf. Sci.), 2025

Piotr A. Kowalski

Szymon Kucharczyk

Jacek Mańdziuk

274

28 Jan 2025

The Trust Calibration Maturity Model for Characterizing and Communicating Trustworthiness of AI Systems

265

28 Jan 2025

Temporal Logic Specification-Conditioned Decision Transformer for Offline Safe Reinforcement Learning

International Conference on Machine Learning (ICML), 2024

293

28 Jan 2025

BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

Neural Information Processing Systems (NeurIPS), 2024

730

28 Jan 2025

Evolution and The Knightian Blindspot of Machine Learning

338

22 Jan 2025

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking

429

22 Jan 2025

Topology of Out-of-Distribution Examples in Deep Neural Networks

251

21 Jan 2025

A margin-based replacement for cross-entropy loss

Michael W. Spratling

Heiko H. Schütt

318

21 Jan 2025

Episodic memory in AI agents poses risks that should be studied and mitigated

Chad DeChant

457

20 Jan 2025

Two Types of AI Existential Risk: Decisive and Accumulative

Philosophical Studies (Philos. Stud.), 2024

Atoosa Kasirzadeh

490

20 Jan 2025

Learning to Assist Humans without Inferring Rewards

Neural Information Processing Systems (NeurIPS), 2024

571

17 Jan 2025

Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment

...

460

16 Jan 2025

Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision

International Conference on Learning Representations (ICLR), 2025

239

14 Jan 2025

Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models

Advanced Video and Signal Based Surveillance (AVSS), 2025

268

13 Jan 2025

Large Language Models for Bioinformatics

Quantitative Biology (QB), 2025

...

177

10 Jan 2025