A General Language Assistant as a Laboratory for Alignment
arXiv:2112.00861 (v3, latest)
1 December 2021
Amanda Askell
Yuntao Bai
Anna Chen
Dawn Drain
Deep Ganguli
Tom Henighan
Andy Jones
Nicholas Joseph
Benjamin Mann
Nova DasSarma
Nelson Elhage
Zac Hatfield-Dodds
Danny Hernandez
Jackson Kernion
Kamal Ndousse
Catherine Olsson
Dario Amodei
Tom B. Brown
Jack Clark
Sam McCandlish
Christopher Olah
Jared Kaplan
ALM
Papers citing "A General Language Assistant as a Laboratory for Alignment" (50 of 701 shown)
Exploring Human Perceptions of AI Responses: Insights from a Mixed-Methods Study on Risk Mitigation in Generative Models
Heloisa Caroline de Souza Pereira Candello
Muneeza Azmat
Uma Sushmitha Gunturi
Raya Horesh
Rogerio Abreu de Paula
Heloisa Pimentel
Marcelo Carpinette Grave
Aminat Adebiyi
Tiago Machado
M. Macedo
64
0
0
01 Dec 2025
Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models
Sitong Fang
Shiyi Hou
Kaile Wang
Boyuan Chen
Donghai Hong
Jiayi Zhou
Josef Dai
Yaodong Yang
Jiaming Ji
AAML
162
0
0
29 Nov 2025
MADRA: Multi-Agent Debate for Risk-Aware Embodied Planning
Junjian Wang
Lidan Zhao
Xi Sheryl Zhang
161
0
0
26 Nov 2025
Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction
Yusong Wu
Stephen Brade
Teng Ma
Tia-Jane Fowler
Enning Yang
Berker Banar
Aaron Courville
Natasha Jaques
Cheng-Zhi Anna Huang
AAML
128
0
0
22 Nov 2025
Why Do Language Model Agents Whistleblow?
Kushal Agrawal
Frank Xiao
Guido Bergman
Asa Cooper Stickland
LLMAG
290
0
0
21 Nov 2025
SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning
Wei Xia
Zhi-Hong Deng
ALM
249
0
0
20 Nov 2025
Operationalizing Pluralistic Values in Large Language Model Alignment Reveals Trade-offs in Safety, Inclusivity, and Model Behavior
Dalia Ali
Dora Zhao
Allison Koenecke
Orestis Papakyriakopoulos
199
0
0
18 Nov 2025
Fine-Tuned LLMs Know They Don't Know: A Parameter-Efficient Approach to Recovering Honesty
Zeyu Shi
Ziming Wang
Tianyu Chen
Shiqi Gao
Haoyi Zhou
Qingyun Sun
Jianxin Li
80
0
0
17 Nov 2025
Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments
Samuel Nathanson
Rebecca Williams
Cynthia Matuszek
AAML
253
1
0
16 Nov 2025
Multi-Value Alignment for LLMs via Value Decorrelation and Extrapolation
Hefei Xu
Le Wu
Chen Cheng
Hao Liu
114
0
0
15 Nov 2025
Steering Language Models with Weight Arithmetic
Constanza Fierro
Fabien Roger
MoMe
LLMSV
493
0
0
07 Nov 2025
DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding
Zixuan Liu
Siavash H. Khajavi
Guangkai Jiang
VLM
161
0
0
04 Nov 2025
Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences
Joshua Ashkinaze
Hua Shen
Sai Avula
Eric Gilbert
Ceren Budak
VLM
283
0
0
03 Nov 2025
Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI
Sharan Maiya
Henning Bartsch
Nathan Lambert
Evan Hubinger
110
0
0
03 Nov 2025
One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning
Renhao Li
Jianhong Tu
Yang Su
Hamid Alinejad-Rokny
Derek F. Wong
Junyang Lin
Min Yang
LRM
151
0
0
30 Oct 2025
The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity
Ali Aouad
Aymane El Gadarri
Vivek F. Farias
153
0
0
28 Oct 2025
EU-Agent-Bench: Measuring Illegal Behavior of LLM Agents Under EU Law
Ilija Lichkovski
Alexander Müller
Mariam Ibrahim
Tiwai Mhundwa
LLMAG
AILaw
ELM
268
0
0
24 Oct 2025
Large Language Models Can Hide a Text Inside Another Text of the Same Length
Antonio Norelli
Michael Bronstein
238
0
0
22 Oct 2025
Annotation-Efficient Universal Honesty Alignment
Shiyu Ni
Keping Bi
Jiafeng Guo
Minghao Tang
Jingtong Wu
Zengxin Han
Xueqi Cheng
HILM
148
0
0
20 Oct 2025
MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
Yu Ying Chiu
Michael S. Lee
Rachel Calcott
Brandon Handoko
Paul de Font-Reaulx
...
Mantas Mazeika
Bing Liu
Yejin Choi
Mitchell L. Gordon
Sydney Levine
ELM
LRM
125
0
0
18 Oct 2025
HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment
Y. Liu
Lijun Li
X. Wang
Jing Shao
LLMSV
241
0
0
17 Oct 2025
Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences
Keertana Chidambaram
Karthik Vinary Seetharaman
Vasilis Syrgkanis
88
11
0
17 Oct 2025
Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
Yuchun Miao
Liang Ding
Sen Zhang
Rong Bao
L. Zhang
Dacheng Tao
180
0
0
15 Oct 2025
Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
Rongzhi Zhang
Meghaj Tarte
Yuzhao Heng
Xiang Chen
Tong Yu
Lingkai Kong
Sudheer Chava
Chao Zhang
88
0
0
14 Oct 2025
LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens
A. Zebaze
Rachel Bawden
Benoît Sagot
LRM
124
1
0
13 Oct 2025
A Vision for Access Control in LLM-based Agent Systems
X. Li
Dong Huang
Jie Li
Hongyi Cai
Zhenhong Zhou
Wei Dong
Xiaofeng Wang
Yang Liu
213
1
0
13 Oct 2025
Exploring and Leveraging Class Vectors for Classifier Editing
Jaeik Kim
Jaeyoung Do
VLM
178
0
0
13 Oct 2025
A-IPO: Adaptive Intent-driven Preference Optimization
Wenqing Wang
Muhammad Asif Ali
Ali Shoker
Ruohan Yang
Junyang Chen
Ying Sha
Huan Wang
81
0
0
11 Oct 2025
Safety Game: Balancing Safe and Informative Conversations with Blackbox Agentic AI using LP Solvers
Tuan Nguyen
Long Tran-Thanh
LLMAG
84
0
0
10 Oct 2025
Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs
Shuzhou Yuan
Ercong Nie
Yinuo Sun
Chenxuan Zhao
William LaCroix
Michael Färber
148
0
0
09 Oct 2025
VisualDAN: Exposing Vulnerabilities in VLMs with Visual-Driven DAN Commands
Aofan Liu
Lulu Tang
MLLM
VLM
230
0
0
09 Oct 2025
Contrastive Weak-to-strong Generalization
Houcheng Jiang
Junfeng Fang
Jiaxin Wu
T. Zhang
Chen Gao
Yong Li
X. Wang
Xiangnan He
Yang Deng
128
0
0
09 Oct 2025
SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models
Huahui Yi
Kun Wang
Qiankun Li
Miao Yu
Guanbin Li
Gongli Xi
H. Wu
Xuming Hu
Kang Li
Yang-Yang Liu
OffRL
LRM
235
2
0
08 Oct 2025
PoseGaze-AHP: A Knowledge-Based 3D Dataset for AI-Driven Ocular and Postural Diagnosis
Saja Al-Dabet
Sherzod Turaev
Nazar Zaki
Arif O. Khan
Luai Eldweik
76
1
0
04 Oct 2025
AgenticRAG: Tool-Augmented Foundation Models for Zero-Shot Explainable Recommender Systems
Bo Ma
Hang Li
ZeHua Hu
XiaoFan Gui
LuYao Liu
Simon Liu
LRM
107
0
0
03 Oct 2025
InvThink: Towards AI Safety via Inverse Reasoning
Yubin Kim
Taehan Kim
Lizhou Fan
Chunjong Park
C. Breazeal
Daniel J. McDuff
Hae Won Park
ReLM
SILM
MU
LRM
AI4CE
257
1
0
02 Oct 2025
Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards
Yiran Shen
Yu Xia
Jonathan D. Chang
Prithviraj Ammanabrolu
148
0
0
01 Oct 2025
Generative Value Conflicts Reveal LLM Priorities
Andy Liu
Kshitish Ghate
Mona Diab
Daniel Fried
Atoosa Kasirzadeh
Max Kleiman-Weiner
132
1
0
29 Sep 2025
Reference-Free Rating of LLM Responses via Latent Information
Leander Girrbach
Chi-Ping Su
Tankred Saanum
Richard Socher
Eric Schulz
Zeynep Akata
120
0
0
29 Sep 2025
MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment
Yufei Li
Yu Fu
Yue Dong
Cong Liu
132
0
0
28 Sep 2025
AI Kill Switch for malicious web-based LLM agent
Sechan Lee
Sangdon Park
LLMAG
AAML
84
0
0
26 Sep 2025
We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong
Gautam Siddharth Kashyap
Mark Dras
Usman Naseem
LLMSV
171
1
0
26 Sep 2025
Preemptive Detection and Steering of LLM Misalignment via Latent Reachability
Sathwik Karnik
Somil Bansal
LLMSV
130
2
0
25 Sep 2025
Federation of Agents: A Semantics-Aware Communication Fabric for Large-Scale Agentic AI
Lorenzo Giusti
Ole Anton Werner
Riccardo Taiello
Matilde Carvalho Costa
Emre Tosun
...
Marc Molina
Rodrigo Lopes de Almeida
Paolo Cacace
Diogo Reis Santos
Luigi Serio
172
0
0
24 Sep 2025
A Good Plan is Hard to Find: Aligning Models with Preferences is Misaligned with What Helps Users
Nishant Balepur
Matthew Shu
Yoo Yeon Sung
Seraphina Goldfarb-Tarrant
Shi Feng
Fumeng Yang
Rachel Rudinger
Jordan L. Boyd-Graber
194
0
0
23 Sep 2025
Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Across the LLM Lifecycle
Keliang Liu
Dingkang Yang
Ziyun Qian
Weijie Yin
Y. Wang
Hongsheng Li
Jun Liu
Peng Zhai
Y. Liu
Lihua Zhang
OffRL
LRM
214
6
0
20 Sep 2025
Do LLMs Align Human Values Regarding Social Biases? Judging and Explaining Social Biases with LLMs
Yang Liu
Chenhui Chu
128
0
0
17 Sep 2025
Opal: An Operator Algebra View of RLHF
Madhava Gaikwad
110
0
0
14 Sep 2025
Getting In Contract with Large Language Models -- An Agency Theory Perspective On Large Language Model Alignment
Wirtschaftsinformatik (WI), 2025
Sascha Kaltenpoth
Oliver Müller
108
0
0
09 Sep 2025
EPT Benchmark: Evaluation of Persian Trustworthiness in Large Language Models
Mohammad Reza Mirbagheri
Mohammad Mahdi Mirkamali
Zahra Motoshaker Arani
Ali Javeri
A. M. Sadeghzadeh
R. Jalili
HILM
174
0
0
08 Sep 2025