Communities
Connect sessions
AI calendar
Organizations
Join Slack
Contact Sales

Terms and Conditions

Twitter GitHub LinkedIn Bluesky Youtube

© 2026 ResearchTrend.AI, All rights reserved.

Home
Papers
2112.00861
Cited By

A General Language Assistant as a Laboratory for Alignment

v1v2v3 (latest)

A General Language Assistant as a Laboratory for Alignment

1 December 2021

Deep Ganguli

Nicholas Joseph

Zac Hatfield-Dodds

Danny Hernandez

Catherine Olsson

ArXiv (abs)PDF HTML HuggingFace (2 upvotes)

Papers citing "A General Language Assistant as a Laboratory for Alignment"

50 / 701 papers shown

Exploring Human Perceptions of AI Responses: Insights from a Mixed-Methods Study on Risk Mitigation in Generative Models

Heloisa Caroline de Souza Pereira Candello

Uma Sushmitha Gunturi

Rogerio Abreu de Paula

Heloisa Pimentel

Marcelo Carpinette Grave

73

0

0

01 Dec 2025

Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models

Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models

187

0

0

29 Nov 2025

MADRA: Multi-Agent Debate for Risk-Aware Embodied Planning

MADRA: Multi-Agent Debate for Risk-Aware Embodied Planning

Xi Sheryl Zhang

192

0

0

26 Nov 2025

Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

Tia-Jane Fowler

Aaron Courville

Cheng-Zhi Anna Huang

139

0

0

22 Nov 2025

Why Do Language Model Agents Whistleblow?

Why Do Language Model Agents Whistleblow?

Asa Cooper Stickland

327

0

0

21 Nov 2025

SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning

271

0

0

20 Nov 2025

Operationalizing Pluralistic Values in Large Language Model Alignment Reveals Trade-offs in Safety, Inclusivity, and Model Behavior

Operationalizing Pluralistic Values in Large Language Model Alignment Reveals Trade-offs in Safety, Inclusivity, and Model Behavior

Allison Koenecke

Orestis Papakyriakopoulos

220

1

0

18 Nov 2025

Fine-Tuned LLMs Know They Don't Know: A Parameter-Efficient Approach to Recovering Honesty

Fine-Tuned LLMs Know They Don't Know: A Parameter-Efficient Approach to Recovering Honesty

91

0

0

17 Nov 2025

Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments

Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments

Samuel Nathanson

Rebecca Williams

Cynthia Matuszek

277

1

0

16 Nov 2025

Multi-Value Alignment for LLMs via Value Decorrelation and Extrapolation

Multi-Value Alignment for LLMs via Value Decorrelation and Extrapolation

121

0

0

15 Nov 2025

Steering Language Models with Weight Arithmetic

Steering Language Models with Weight Arithmetic

Constanza Fierro

513

0

0

07 Nov 2025

DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding

DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding

Siavash H. Khajavi

182

0

0

04 Nov 2025

Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences

Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences

Joshua Ashkinaze

291

0

0

03 Nov 2025

Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI

Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI

Henning Bartsch

116

1

0

03 Nov 2025

ToolRM: Towards Agentic Tool-Use Reward Modeling

ToolRM: Towards Agentic Tool-Use Reward Modeling

Hamid Alinejad-Rokny

Junyang Lin

Min Yang

157

1

0

30 Oct 2025

The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity

The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity

Aymane El Gadarri

Vivek F. Farias

180

0

0

28 Oct 2025

EU-Agent-Bench: Measuring Illegal Behavior of LLM Agents Under EU Law

EU-Agent-Bench: Measuring Illegal Behavior of LLM Agents Under EU Law

Ilija Lichkovski

Alexander Müller

LLMAG AILaw ELM

274

1

0

24 Oct 2025

I Large Language Models possono nascondere un testo in un altro testo della stessa lunghezza

I Large Language Models possono nascondere un testo in un altro testo della stessa lunghezza

Antonio Norelli

Michael Bronstein

287

0

0

22 Oct 2025

Annotation-Efficient Universal Honesty Alignment

Annotation-Efficient Universal Honesty Alignment

157

0

0

20 Oct 2025

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

Brandon Handoko

Paul de Font-Reaulx

...

Mitchell L. Gordon

129

0

0

18 Oct 2025

HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment

HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment

253

0

0

17 Oct 2025

Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences

Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences

Keertana Chidambaram

Karthik Vinary Seetharaman

Vasilis Syrgkanis

102

0

0

17 Oct 2025

Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking

Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking

186

0

0

15 Oct 2025

Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing

Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing

100

0

0

14 Oct 2025

LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens

LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens

139

1

0

13 Oct 2025

A Vision for Access Control in LLM-based Agent Systems

A Vision for Access Control in LLM-based Agent Systems

235

1

0

13 Oct 2025

Exploring and Leveraging Class Vectors for Classifier Editing

Exploring and Leveraging Class Vectors for Classifier Editing

193

0

0

13 Oct 2025

A-IPO: Adaptive Intent-driven Preference Optimization

A-IPO: Adaptive Intent-driven Preference Optimization

Muhammad Asif Ali

92

0

0

11 Oct 2025

Safety Game: Balancing Safe and Informative Conversations with Blackbox Agentic AI using LP Solvers

Safety Game: Balancing Safe and Informative Conversations with Blackbox Agentic AI using LP Solvers

Long Tran-Thanh

123

0

0

10 Oct 2025

Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs

Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs

William LaCroix

Michael Färber

181

1

0

09 Oct 2025

VisualDAN: Exposing Vulnerabilities in VLMs with Visual-Driven DAN Commands

VisualDAN: Exposing Vulnerabilities in VLMs with Visual-Driven DAN Commands

239

0

0

09 Oct 2025

Contrastive Weak-to-strong Generalization

Contrastive Weak-to-strong Generalization

132

0

0

09 Oct 2025

SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models

SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models

267

2

0

08 Oct 2025

PoseGaze-AHP: A Knowledge-Based 3D Dataset for AI-Driven Ocular and Postural Diagnosis

PoseGaze-AHP: A Knowledge-Based 3D Dataset for AI-Driven Ocular and Postural Diagnosis

82

1

0

04 Oct 2025

AgenticRAG: Tool-Augmented Foundation Models for Zero-Shot Explainable Recommender Systems

AgenticRAG: Tool-Augmented Foundation Models for Zero-Shot Explainable Recommender Systems

127

0

0

03 Oct 2025

InvThink: Towards AI Safety via Inverse Reasoning

InvThink: Towards AI Safety via Inverse Reasoning

Daniel J. McDuff

ReLM SILM MU LRM AI4CE

263

1

0

02 Oct 2025

Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards

Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards

Jonathan D. Chang

Prithviraj Ammanabrolu

160

0

0

01 Oct 2025

Generative Value Conflicts Reveal LLM Priorities

Generative Value Conflicts Reveal LLM Priorities

Atoosa Kasirzadeh

Max Kleiman-Weiner

148

2

0

29 Sep 2025

Reference-Free Rating of LLM Responses via Latent Information

Reference-Free Rating of LLM Responses via Latent Information

Leander Girrbach

129

0

0

29 Sep 2025

MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment

MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment

146

0

0

28 Sep 2025

AI Kill Switch for malicious web-based LLM agent

AI Kill Switch for malicious web-based LLM agent

116

0

0

26 Sep 2025

We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong

We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong

Gautam Siddharth Kashyap

194

1

0

26 Sep 2025

Preemptive Detection and Steering of LLM Misalignment via Latent Reachability

Preemptive Detection and Steering of LLM Misalignment via Latent Reachability

135

2

0

25 Sep 2025

Federation of Agents: A Semantics-Aware Communication Fabric for Large-Scale Agentic AI

Federation of Agents: A Semantics-Aware Communication Fabric for Large-Scale Agentic AI

Ole Anton Werner

Riccardo Taiello

Matilde Carvalho Costa

...

Rodrigo Lopes de Almeida

Diogo Reis Santos

179

0

0

24 Sep 2025

A Good Plan is Hard to Find: Aligning Models with Preferences is Misaligned with What Helps Users

A Good Plan is Hard to Find: Aligning Models with Preferences is Misaligned with What Helps Users

Nishant Balepur

Seraphina Goldfarb-Tarrant

Rachel Rudinger

Jordan L. Boyd-Graber

208

1

0

23 Sep 2025

Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Across the LLM Lifecycle

Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Across the LLM Lifecycle

227

8

0

20 Sep 2025

Do LLMs Align Human Values Regarding Social Biases? Judging and Explaining Social Biases with LLMs

Do LLMs Align Human Values Regarding Social Biases? Judging and Explaining Social Biases with LLMs

153

0

0

17 Sep 2025

Opal: An Operator Algebra View of RLHF

Opal: An Operator Algebra View of RLHF

Madhava Gaikwad

122

0

0

14 Sep 2025

Getting In Contract with Large Language Models -- An Agency Theory Perspective On Large Language Model Alignment

Getting In Contract with Large Language Models -- An Agency Theory Perspective On Large Language Model AlignmentWirtschaftsinformatik (WI), 2025

Sascha Kaltenpoth

112

1

0

09 Sep 2025

EPT Benchmark: Evaluation of Persian Trustworthiness in Large Language Models

EPT Benchmark: Evaluation of Persian Trustworthiness in Large Language Models

Mohammad Reza Mirbagheri

Mohammad Mahdi Mirkamali

Zahra Motoshaker Arani

A. M. Sadeghzadeh

194

0

0

08 Sep 2025

1 2 3 4...13 14 15