A General Language Assistant as a Laboratory for Alignment

1 December 2021
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, T. Henighan, Andy Jones, Nicholas Joseph, Benjamin Mann, Nova Dassarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, John Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, C. Olah, Jared Kaplan
ALM
ArXiv (abs) · PDF · HTML · HuggingFace (2 upvotes)

Papers citing "A General Language Assistant as a Laboratory for Alignment"

50 / 701 papers shown
The Mirage of Multimodality: Where Truth is Tested and Honesty Unravels
Jiaming Ji, Sitong Fang, Wenjing Cao, Jiahao Li, Xuyao Wang, Juntao Dai, Chi-Min Chan, Sirui Han, Wenhan Luo, Yaodong Yang
LRM · 178 · 0 · 0 · 26 May 2025

Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models
Yi Liu, Dianqing Liu, Mingye Zhu, Junbo Guo, Yongdong Zhang, Zhendong Mao
341 · 0 · 0 · 26 May 2025

Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts
H. Kim, Minbeom Kim, Wonjun Lee, Kihyun Kim, Changick Kim
161 · 0 · 0 · 26 May 2025

LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, ..., Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, Chongxuan Li
417 · 92 · 0 · 25 May 2025

When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas
Steffen Backmann, David Guzman Piedrahita, Emanuel Tewolde, Amélie Reymond, Bernhard Schölkopf, Zhijing Jin
287 · 4 · 0 · 25 May 2025

Incentivizing High-Quality Human Annotations with Golden Questions
Shang Liu, Zhongze Cai, Hanzhao Wang, Zhongyao Ma, Xiaocheng Li
280 · 1 · 0 · 25 May 2025

Generative RLHF-V: Learning Principles from Multi-modal Human Preference
Jiayi Zhou, Jiaming Ji, Boyuan Chen, Jiapeng Sun, Wenqi Chen, Donghai Hong, Sirui Han, Wenhan Luo, Yaodong Yang
216 · 0 · 0 · 24 May 2025

But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors
Leon Eshuijs, Archie Chaudhury, Alan McBeth, Ethan Nguyen
LLMSV · 283 · 0 · 0 · 23 May 2025

Learning to Choose or Choosing to Learn: Best-of-N vs. Supervised Fine-Tuning for Bit String Generation
Seamus Somerstep, Vinod Raman, Unique Subedi, Yuekai Sun
252 · 0 · 0 · 22 May 2025

SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models
Roland Daynauth, Christopher Clarke, Krisztian Flautner, Lingjia Tang, Jason Mars
ALM · 150 · 0 · 0 · 21 May 2025

Direct Preference Optimization for Adaptive Concept-based Explanations
Jacopo Teneggi, Zhenzhen Wang, Paul H. Yi, Tianmin Shu, Jeremias Sulam
489 · 0 · 0 · 21 May 2025

Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems
Christian Walder, Deep Karkhanis
OffRL · 351 · 17 · 0 · 21 May 2025

Aug2Search: Enhancing Facebook Marketplace Search with LLM-Generated Synthetic Data Augmentation
Ruijie Xi, He Ba, Hao Yuan, Rishu Agrawal, Arul Prakash, Ruoyan Long, Arul T. Prakash
SyDa · 280 · 1 · 0 · 21 May 2025

Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment
Weixiang Zhao, Xingyu Sui, Yulin Hu, Jiahe Guo, Haixiao Liu, Biye Li, Yanyan Zhao, Bing Qin, Ting Liu
OffRL · 432 · 10 · 0 · 21 May 2025

YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jennifer D'Souza, Hamed Babaei Giglou, Quentin Münch
ELM · 433 · 5 · 0 · 20 May 2025

Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger
248 · 6 · 0 · 20 May 2025

Safety Alignment Can Be Not Superficial With Explicit Safety Signals
Jianwei Li, Jung-Eng Kim
AAML · 426 · 3 · 0 · 19 May 2025

J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization
Austin Xu, Yilun Zhou, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
ELM, LRM · 516 · 6 · 0 · 19 May 2025

PromptPrism: A Linguistically-Inspired Taxonomy for Prompts
Sullam Jeoung, Yueyan Chen, Yi Zhang, Shuai Wang, Haibo Ding, Lin Lee Cheong
224 · 1 · 0 · 19 May 2025

Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets
Ning Lu, Shengcai Liu, Jiahao Wu, Weiyu Chen, Zhirui Zhang, Yew-Soon Ong, Qi Wang, Ke Tang
318 · 10 · 0 · 17 May 2025

GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
Wenshu Fan, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, ..., Xuzhao Li, Kun Wang, Junfeng Fang, Jiaheng Zhang, Bryan Hooi
OffRL, LRM · 255 · 17 · 0 · 16 May 2025

WorldPM: Scaling Human Preference Modeling
Binghai Wang, Runji Lin, Keming Lu, Xiaohuan Zhou, Zizhuo Zhang, ..., Qi Zhang, Yu-Gang Jiang, Bowen Yu, Jingren Zhou, Junyang Lin
365 · 5 · 0 · 15 May 2025

FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning
Zhehao Zhang, Weijie Xu, Fanyou Wu, Chandan K. Reddy
352 · 11 · 0 · 12 May 2025

Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes
Zhuocheng Gong, Jian Guan, Wei Wu, Huishuai Zhang, Dongyan Zhao
329 · 4 · 0 · 08 May 2025

A Survey on Progress in LLM Alignment from the Perspective of Reward Design
Miaomiao Ji, Yanqiu Wu, Zhibin Wu, Shoujin Wang, Jian Yang, Mark Dras, Usman Naseem
336 · 9 · 0 · 05 May 2025

DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Jing Liu, Hangyu Guo, Ranjie Duan, Xingyuan Bu, Yancheng He, ..., Yingshui Tan, Yanan Wu, Jihao Gu, Yongbin Li, Jun Zhu
MLLM · 1.0K · 3 · 0 · 25 Apr 2025

Safety in Large Reasoning Models: A Survey
Cheng Wang, Wenshu Fan, Yangqiu Song, Duzhen Zhang, Hao Sun, ..., Shengju Yu, Xinfeng Li, Junfeng Fang, Jiaheng Zhang, Bryan Hooi
LRM · 978 · 45 · 0 · 24 Apr 2025

Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions
Saffron Huang, Esin Durmus, Miles McCain, Kunal Handa, Alex Tamkin, Jerry Hong, Michael Stern, Arushi Somani, Xiuruo Zhang, Deep Ganguli
VLM · 278 · 28 · 0 · 21 Apr 2025

Antidistillation Sampling
Yash Savani, Asher Trockman, Zhili Feng, Avi Schwarzschild, Alexander Robey, Marc Finzi, J. Zico Kolter
434 · 8 · 0 · 17 Apr 2025

Evaluating the Goal-Directedness of Large Language Models
Tom Everitt, Cristina Garbacea, Alexis Bellot, Jonathan G. Richens, Henry Papadatos, Simeon Campos, Rohin Shah
ELM, LM&MA, LM&Ro, LRM · 317 · 3 · 0 · 16 Apr 2025

AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, ..., Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
LLMSV, AAML · 365 · 11 · 0 · 13 Apr 2025

CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization
Jing Yao, Xiaoyuan Yi, Jindong Wang, Zhicheng Dou, Xing Xie
160 · 4 · 0 · 09 Apr 2025

Adversarial Training of Reward Models
Alexander Bukharin, Haifeng Qian, Shengyang Sun, Adithya Renduchintala, Soumye Singhal, Liang Luo, Oleksii Kuchaiev, Olivier Delalleau, T. Zhao
AAML · 437 · 5 · 0 · 08 Apr 2025

Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations
Pedro Ferreira, Wilker Aziz, Ivan Titov
LRM · 342 · 4 · 0 · 07 Apr 2025

Inference-Time Scaling for Generalist Reward Modeling
Zijun Liu, P. Wang, Ran Xu, Shirong Ma, Chong Ruan, Ziwei Sun, Yang Liu, Y. Wu
OffRL, LRM · 478 · 142 · 0 · 03 Apr 2025

Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning
Kai Ye, Hongyi Zhou, Jin Zhu, Francesco Quinzan, C. Shi
435 · 5 · 0 · 03 Apr 2025

Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming Tasks
Ali Al-Kaswan, Sebastian Deatc, Begüm Koç, Arie van Deursen, Maliheh Izadi
AAML · 290 · 1 · 0 · 02 Apr 2025

InfiniteICL: Breaking the Limit of Context Window Size via Long Short-term Memory Transformation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Bowen Cao, Deng Cai, W. Lam
CLL · 400 · 3 · 0 · 02 Apr 2025

PiCo: Jailbreaking Multimodal Large Language Models via Pictorial Code Contextualization
Aofan Liu, Lulu Tang, Ting Pan, Yuguo Yin, Bin Wang, Ao Yang
MLLM, AAML · 492 · 5 · 0 · 02 Apr 2025

VPO: Aligning Text-to-Video Generation Models with Prompt Optimization
Jiale Cheng, Ruiliang Lyu, Xiaohan Zhang, Xiao-Chang Liu, Jiazheng Xu, ..., Zhuoyi Yang, Yuxiao Dong, Jie Tang, Han Wang, Minlie Huang
VGen · 250 · 13 · 0 · 26 Mar 2025

Generative Linguistics, Large Language Models, and the Social Nature of Scientific Success
Sophie Hao
ELM, AI4CE · 238 · 0 · 0 · 25 Mar 2025

The Greatest Good Benchmark: Measuring LLMs' Alignment with Utilitarian Moral Dilemmas
Giovanni Franco Gabriel Marraffini, Andrés Cotton, Noe Fabian Hsueh, Axel Fridman, Juan Wisznia, Luciano Del Corro
170 · 6 · 0 · 25 Mar 2025

Understanding the Effects of RLHF on the Quality and Detectability of LLM-Generated Texts
Beining Xu, Arkaitz Zubiaga
DeLMO · 347 · 1 · 0 · 23 Mar 2025

Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback
Yalan Qin, Xiuying Chen, Rui Pan, Han Zhu, Chen Zhang, ..., Chi-Min Chan, Sirui Han, Wenhan Luo, Yiran Yang, Yaodong Yang
OffRL · 349 · 4 · 0 · 22 Mar 2025

A Survey on Personalized Alignment -- The Missing Piece for Large Language Models in Real-World Applications
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jian Guan, Jian Wu, Jia-Nan Li, Chuanqi Cheng, Wei Wu
LM&MA · 722 · 11 · 0 · 21 Mar 2025

Model Risk Management for Generative AI In Financial Institutions
Anwesha Bhattacharyya, Ye Yu, Hanyu Yang, Rahul Singh, Tarun Joshi, Jie Chen, Kiran Yalavarthy
AIFin, MedIm · 243 · 2 · 0 · 19 Mar 2025

From 1,000,000 Users to Every User: Scaling Up Personalized Preference for User-level Alignment
Jia-Nan Li, Jian Guan, Songhao Wu, Wei Wu, Rui Yan
522 · 11 · 0 · 19 Mar 2025

Training Plug-n-Play Knowledge Modules with Deep Context Distillation
Lucas Caccia, Alan Ansell, Edoardo Ponti, Ivan Vulić, Alessandro Sordoni
SyDa · 1.1K · 4 · 0 · 11 Mar 2025

A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis
European Conference on Computer Vision (ECCV), 2025
Xiang Liu, Zhaoxiang Liu, Huan Hu, Zezhou Chen, Kohou Wang, Ning Wang, Kai Wang
212 · 11 · 0 · 10 Mar 2025

RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models
Wenhui Zhu, Xin Li, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, ..., Yanxi Chen, Natasha Lepore, Oana Dumitrascu, Yi Su, Yalin Wang
LM&MA · 269 · 6 · 0 · 06 Mar 2025