ResearchTrend.AI

© 2025 ResearchTrend.AI, All rights reserved.

A General Language Assistant as a Laboratory for Alignment
arXiv 2112.00861, v3 (latest)

1 December 2021
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, T. Henighan, Andy Jones, Nicholas Joseph, Benjamin Mann, Nova Dassarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, John Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, C. Olah, Jared Kaplan
ALM
arXiv (abs) · PDF · HTML · HuggingFace (2 upvotes)

Papers citing "A General Language Assistant as a Laboratory for Alignment"

50 / 698 papers shown
Imitate Optimal Policy: Prevail and Induce Action Collapse in Policy Gradient
Zhongzhu Zhou, Yibo Yang, Ziyan Chen, Fengxiang Bie, Haojun Xia, Xiaoxia Wu, Robert Wu, Ben Athiwaratkun, Bernard Ghanem, Shuaiwen Leon Song
02 Sep 2025
Generative KI für TA (Generative AI for Technology Assessment)
Wolfgang Eppler, Reinhard Heil
02 Sep 2025
Understanding Reinforcement Learning for Model Training, and future directions with GRAPE
Rohit Patel
OffRL
02 Sep 2025
CARE: Decoding Time Safety Alignment via Rollback and Introspection Intervention
Xiaomeng Hu, Fei Huang, Chenhan Yuan, Junyang Lin, Tsung-Yi Ho
01 Sep 2025
GIER: Gap-Driven Self-Refinement for Large Language Models
Rinku Dewri
LRM
30 Aug 2025
JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring
Junjie Chu, Mingjie Li, Ziqing Yang, Ye Leng, Chenhao Lin, Chao Shen, Michael Backes, Yun Shen, Yang Zhang
28 Aug 2025
Ensemble Debates with Local Large Language Models for AI Alignment
Ephraiem Sarabamoun
ELM
27 Aug 2025
SMITE: Enhancing Fairness in LLMs through Optimal In-Context Example Selection via Dynamic Validation
Garima Chhikara, Kripabandhu Ghosh, Abhijnan Chakraborty
25 Aug 2025
POT: Inducing Overthinking in LLMs via Black-Box Iterative Optimization
Xinyu Li, Tianjin Huang, Ronghui Mu, Xiaowei Huang, Gaojie Jin
LRM
23 Aug 2025
Decoding Alignment: A Critical Survey of LLM Development Initiatives through Value-setting and Data-centric Lens
Ilias Chalkidis
OffRL, ALM
23 Aug 2025
Noise, Adaptation, and Strategy: Assessing LLM Fidelity in Decision-Making
Yuanjun Feng, Vivek Choudhary, Y. Shrestha
21 Aug 2025
LM Agents May Fail to Act on Their Own Risk Knowledge
Yuzhi Tang, Tianxiao Li, Elizabeth Li, Chris J. Maddison, Honghua Dong, Yangjun Ruan
LLMAG, ELM
19 Aug 2025
Mitigating Jailbreaks with Intent-Aware LLMs
Wei Jie Yeo, Frank Xing, Erik Cambria
AAML
16 Aug 2025
Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
Xuan Qi, Rongwu Xu, Zhijing Jin
06 Aug 2025
Are Today's LLMs Ready to Explain Well-Being Concepts?
Bohan Jiang, Dawei Li, Zhen Tan, Chengshuai Zhao, Huan Liu
AI4MH
06 Aug 2025
FaST: Feature-aware Sampling and Tuning for Personalized Preference Alignment with Limited Data
Thibaut Thonet, Germán Kruszewski, Jos Rozen, Pierre Erbacher, Marc Dymetman
06 Aug 2025
TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs
A. Das, Vinija Jain, Vasu Sharma
LLMSV
04 Aug 2025
MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions
Yanxu Zhu, Shitong Duan, Xiangxu Zhang, Jitao Sang, Peng Zhang, Tun Lu, Xiao Zhou, Jing Yao, Xiaoyuan Yi, Xing Xie
29 Jul 2025
Libra: Large Chinese-based Safeguard for AI Content
Ziyang Chen, Huimu Yu, Xing Wu, Dongqin Liu, Songlin Hu
AILaw
29 Jul 2025
SDD: Self-Degraded Defense against Malicious Fine-tuning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
ZiXuan Chen, Weikai Lu, Xin Lin, Ziqian Zeng
AAML
27 Jul 2025
DxHF: Providing High-Quality Human Feedback for LLM Alignment via Interactive Decomposition
ACM Symposium on User Interface Software and Technology (UIST), 2025
Danqing Shi, Furui Cheng, Tino Weinkauf, Antti Oulasvirta, Mennatallah El-Assady
24 Jul 2025
Justifications for Democratizing AI Alignment and Their Prospects
André Steingrüber, Kevin Baum
24 Jul 2025
From Seed to Harvest: Augmenting Human Creativity with AI for Red-teaming Text-to-Image Models
Jessica Quaye, Charvi Rastogi, Alicia Parrish, Oana Inel, Minsuk Kahng, Lora Aroyo, Vijay Janapa Reddi
23 Jul 2025
PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization
Han Jiang, Dongyao Zhu, Zhihua Wei, Xiaoyuan Yi, Ziang Xiao, Xing Xie
22 Jul 2025
SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification
ZhengLin Lai, MengYao Liao, Bingzhe Wu, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, Jianqiang Li
MoE
20 Jun 2025
Self-Critique-Guided Curiosity Refinement: Enhancing Honesty and Helpfulness in Large Language Models via In-Context Learning
Duc Hieu Ho, Chenglin Fan
HILM, LRM
19 Jun 2025
GRAM: A Generative Foundation Reward Model for Reward Generalization
Chenglong Wang, Yang Gan, Yifu Huo, Yongyu Mu, Qiaozhi He, ..., Bei Li, Tong Xiao, Chunliang Zhang, Tongran Liu, Jingbo Zhu
ALM, OffRL, LRM
17 Jun 2025
AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation
Zijie Wu, Chaohui Yu, Fan Wang, Xiang Bai
AI4CE
11 Jun 2025
LeanTutor: A Formally-Verified AI Tutor for Mathematical Proofs
Manooshree Patel, Rayna Bhattacharyya, Thomas Lu, Arnav Mehta, Niels Voss, Narges Norouzi, Gireeja Ranade
10 Jun 2025
A Survey on Large Language Models for Mathematical Reasoning
Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Yi-Di Wang, Shu Yan, ..., Xu-Hui Liu, Xin-Wei Chen, Jia-Cheng Xu, Ziniu Li, Yang Yu
LRM
10 Jun 2025
GFRIEND: Generative Few-shot Reward Inference through EfficieNt DPO
Yiyang Zhao, Huiyu Bai, Xuejiao Zhao
OffRL
10 Jun 2025
HauntAttack: When Attack Follows Reasoning as a Shadow
Jingyuan Ma, Rui Li, Zheng Li, Junfeng Liu, Lei Sha, Zhifang Sui
AAML, LRM
08 Jun 2025
Debiasing Online Preference Learning via Preference Feature Preservation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Dongyoung Kim, Jinsung Yoon, Jinwoo Shin, Jaehyung Kim
06 Jun 2025
Normative Conflicts and Shallow AI Alignment
Philosophical Studies (Philos. Stud.), 2025
Raphaël Millière
05 Jun 2025
Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning
Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Xin Jin, Zhenguo Li, James T. Kwok, Yu Zhang
LRM
05 Jun 2025
Misalignment or misuse? The AGI alignment tradeoff
Philosophical Studies (Philos. Stud.), 2025
Max Hellrigel-Holderbaum, Leonard Dung
04 Jun 2025
Aligning Large Language Models with Implicit Preferences from User-Generated Content
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zhaoxuan Tan, Zheng Li, Tianyi Liu, Haodong Wang, Hyokun Yun, ..., Yifan Gao, Ruijie Wang, Priyanka Nigam, Bing Yin, Meng Jiang
04 Jun 2025
RewardAnything: Generalizable Principle-Following Reward Models
Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, Wei Ye
LRM
04 Jun 2025
Beyond Text Compression: Evaluating Tokenizers Across Scales
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jonas F. Lotz, António V. Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili
03 Jun 2025
A Trustworthiness-based Metaphysics of Artificial Intelligence Systems
Conference on Fairness, Accountability and Transparency (FAccT), 2025
Andrea Ferrario
03 Jun 2025
CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction
Yudong Lu, Yazhe Niu, Shuai Hu, Haolin Wang
AuLLM
02 Jun 2025
Doubly Robust Alignment for Large Language Models
Erhan Xu, Kai Ye, Hongyi Zhou, Luhan Zhu, Francesco Quinzan, Chengchun Shi
01 Jun 2025
Aligning Language Models with Observational Data: Opportunities and Risks from a Causal Perspective
Erfan Loghmani
30 May 2025
Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap
Wenhan Yang, Spencer Stice, Ali Payani, Baharan Mirzasoleiman
MLLM
30 May 2025
Bridging Distribution Shift and AI Safety: Conceptual and Methodological Synergies
Chenruo Liu, Kenan Tang, Yao Qin, Qi Lei
28 May 2025
Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yifan Lu, Jing Li, Yigeng Zhou, Yihui Zhang, Wenya Wang, Xiucheng Li, Meishan Zhang, Fangming Liu, Jun-chen Yu, Min Zhang
KELM, CLL
28 May 2025
Conversational Alignment with Artificial Intelligence in Context
Philosophical Perspectives (PP), 2025
Rachel Katharine Sterken, James Ravi Kirkpatrick
28 May 2025
The Multilingual Divide and Its Impact on Global AI Safety
Aidan Peppin, Julia Kreutzer, Alice Schoenauer Sebag, Kelly Marchisio, Beyza Ermis, ..., Wei-Yin Ko, Ahmet Üstün, Matthias Gallé, Marzieh Fadaee, Sara Hooker
ELM
27 May 2025
The Mirage of Multimodality: Where Truth is Tested and Honesty Unravels
Jiaming Ji, Sitong Fang, Wenjing Cao, Jiahao Li, Xuyao Wang, Juntao Dai, Chi-Min Chan, Sirui Han, Wenhan Luo, Yaodong Yang
LRM
26 May 2025
Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts
H. Kim, Minbeom Kim, Wonjun Lee, Kihyun Kim, Changick Kim
26 May 2025