2112.00861
Cited By
A General Language Assistant as a Laboratory for Alignment
1 December 2021
Amanda Askell
Yuntao Bai
Anna Chen
Dawn Drain
Deep Ganguli
T. Henighan
Andy Jones
Nicholas Joseph
Benjamin Mann
Nova DasSarma
Nelson Elhage
Zac Hatfield-Dodds
Danny Hernandez
John Kernion
Kamal Ndousse
Catherine Olsson
Dario Amodei
Tom B. Brown
Jack Clark
Sam McCandlish
C. Olah
Jared Kaplan
ALM
Papers citing
"A General Language Assistant as a Laboratory for Alignment"
50 / 701 papers shown
A safety realignment framework via subspace-oriented model fusion for large language models
Knowledge-Based Systems (KBS), 2024
Xin Yi
Shunfan Zheng
Linlin Wang
Xiaoling Wang
220
40
0
15 May 2024
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Raghuveer Peri
Sai Muralidhar Jayanthi
S. Ronanki
Anshu Bhatia
Karel Mundnich
...
Srikanth Vishnubhotla
Daniel Garcia-Romero
S. Srinivasan
Kyu J. Han
Katrin Kirchhoff
AAML
283
7
0
14 May 2024
LLM Theory of Mind and Alignment: Opportunities and Risks
Winnie Street
139
17
0
13 May 2024
Designing and Evaluating Dialogue LLMs for Co-Creative Improvised Theatre
Boyd Branch
Piotr Wojciech Mirowski
Kory W. Mathewson
Sophia Ppali
A. Covaci
235
6
0
11 May 2024
Value Augmented Sampling for Language Model Alignment and Personalization
Seungwook Han
Idan Shenfeld
Akash Srivastava
Yoon Kim
Pulkit Agrawal
OffRL
248
40
0
10 May 2024
Mitigating Exaggerated Safety in Large Language Models
Ruchi Bhalani
Ruchira Ray
202
5
0
08 May 2024
Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks
Georgios Pantazopoulos
Amit Parekh
Malvina Nikandrou
Alessandro Suglia
267
13
0
07 May 2024
Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
Piotr Padlewski
Max Bain
Matthew Henderson
Zhongkai Zhu
Nishant Relan
...
Che Zheng
Cyprien de Masson d'Autume
Dani Yogatama
Mikel Artetxe
Yi Tay
VLM
246
31
0
03 May 2024
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Seungone Kim
Juyoung Suk
Shayne Longpre
Bill Yuchen Lin
Jamin Shin
Sean Welleck
Graham Neubig
Moontae Lee
Kyungjae Lee
Minjoon Seo
MoMe
ALM
ELM
375
328
0
02 May 2024
Self-Play Preference Optimization for Language Model Alignment
Yue Wu
Zhiqing Sun
Huizhuo Yuan
Kaixuan Ji
Yiming Yang
Quanquan Gu
586
207
0
01 May 2024
RepEval: Effective Text Evaluation with LLM Representation
Shuqian Sheng
Yi Xu
Tianhang Zhang
Zanwei Shen
Luoyi Fu
Jiaxin Ding
Lei Zhou
Xinbing Wang
Cheng Zhou
189
7
0
30 Apr 2024
The AI Companion in Education: Analyzing the Pedagogical Potential of ChatGPT in Computer Science and Engineering
Z. He
Thomas Nguyen
Tahereh Miari
Mehrdad Aliasgari
S. Rafatirad
Hossein Sayadi
108
10
0
23 Apr 2024
Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels
Jan-Philipp Fränken
E. Zelikman
Rafael Rafailov
Kanishk Gandhi
Tobias Gerstenberg
Noah D. Goodman
266
19
0
22 Apr 2024
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace
Kai Y. Xiao
R. Leike
Lilian Weng
Johannes Heidecke
Alex Beutel
SILM
347
235
0
19 Apr 2024
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
Aitor Ormazabal
Che Zheng
Cyprien de Masson d'Autume
Dani Yogatama
Deyu Fu
...
Yazheng Yang
Yi Tay
Yuqi Wang
Zhongkai Zhu
Zhihui Xie
LRM
VLM
ReLM
265
64
0
18 Apr 2024
OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data
Chandeepa Dissanayake
Lahiru Lowe
Sachith Gunasekara
Yasiru Ratnayake
MoE
ALM
172
3
0
18 Apr 2024
Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs
Ruoxi Cheng
Haoxuan Ma
Shuirong Cao
Jiaqi Li
Aihua Pei
Zhiqiang Wang
Pengliang Ji
Haoyu Wang
Jiaqi Huo
AI4CE
437
21
0
15 Apr 2024
Explainable Generative AI (GenXAI): A Survey, Conceptualization, and Research Agenda
Johannes Schneider
259
78
0
15 Apr 2024
Towards Practical Tool Usage for Continually Learning LLMs
Jerry Huang
Prasanna Parthasarathi
Mehdi Rezagholizadeh
Sarath Chandar
CLL
KELM
201
8
0
14 Apr 2024
RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs
Shreyas Chaudhari
Pranjal Aggarwal
Vishvak Murahari
Tanmay Rajpurohit
Ashwin Kalyan
Karthik Narasimhan
Ameet Deshpande
Bruno Castro da Silva
407
88
0
12 Apr 2024
Best Practices and Lessons Learned on Synthetic Data for Language Models
Ruibo Liu
Jerry W. Wei
Fangyu Liu
Chenglei Si
Yanzhe Zhang
...
Steven Zheng
Daiyi Peng
Diyi Yang
Denny Zhou
Andrew M. Dai
SyDa
EgoV
304
112
0
11 Apr 2024
Laissez-Faire Harms: Algorithmic Biases in Generative Language Models
Evan Shieh
Faye-Marie Vassel
Cassidy R. Sugimoto
T. Monroe-White
207
7
0
11 Apr 2024
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Shengding Hu
Yuge Tu
Xu Han
Chaoqun He
Ganqu Cui
...
Chaochao Jia
Guoyang Zeng
Dahai Li
Zhiyuan Liu
Maosong Sun
MoE
446
557
0
09 Apr 2024
Towards Understanding the Influence of Reward Margin on Preference Model Performance
Bowen Qin
Duanyu Feng
Xi Yang
143
7
0
07 Apr 2024
Aligning Diffusion Models by Optimizing Human Utility
Shufan Li
Konstantinos Kallidromitis
Akash Gokul
Yusuke Kato
Kazuki Kozuka
305
67
0
06 Apr 2024
ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Yifan Xu
Xiao Liu
Xinghan Liu
Zhenyu Hou
Yueyan Li
...
Aohan Zeng
Zhengxiao Du
Wenyi Zhao
Jie Tang
Yuxiao Dong
LRM
231
55
0
03 Apr 2024
Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models
Haoran Sun
Lixin Liu
Junjie Li
Fengyu Wang
Baohua Dong
Ran Lin
Ruohui Huang
198
23
0
03 Apr 2024
Calibrating the Confidence of Large Language Models by Eliciting Fidelity
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Mozhi Zhang
Mianqiu Huang
Rundong Shi
Linsen Guo
Chong Peng
Peng Yan
Yaqian Zhou
Xipeng Qiu
277
33
0
03 Apr 2024
HyperCLOVA X Technical Report
Kang Min Yoo
Jaegeun Han
Sookyo In
Heewon Jeon
Jisu Jeong
...
Hyunkyung Noh
Se-Eun Choi
Sang-Woo Lee
Jung Hwa Lim
Nako Sung
VLM
232
11
0
02 Apr 2024
Efficient Prompting Methods for Large Language Models: A Survey
Kaiyan Chang
Songcheng Xu
Chenglong Wang
Yingfeng Luo
Tong Xiao
Jingbo Zhu
LRM
406
47
0
01 Apr 2024
A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias
Yuemei Xu
Ling Hu
Jiayi Zhao
Zihan Qiu
Yuqi Ye
Hanwen Gu
LRM
444
91
0
01 Apr 2024
Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization
Hritik Bansal
Ashima Suvarna
Gantavya Bhatt
Nanyun Peng
Kai-Wei Chang
Aditya Grover
ALM
416
16
0
31 Mar 2024
Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model
Qi Gou
Cam-Tu Nguyen
309
22
0
28 Mar 2024
Fine-Tuning Language Models with Reward Learning on Policy
Hao Lang
Fei Huang
Yongbin Li
ALM
239
13
0
28 Mar 2024
IterAlign: Iterative Constitutional Alignment of Large Language Models
Xiusi Chen
Hongzhi Wen
Jiapeng Liu
Chen Luo
Qingyu Yin
Ruirui Li
Zheng Li
Wei Wang
AILaw
118
7
0
27 Mar 2024
Assessment of Multimodal Large Language Models in Alignment with Human Values
Zhelun Shi
Zhipin Wang
Hongxing Fan
Zaibin Zhang
Lijun Li
Yongting Zhang
Zhen-fei Yin
Lu Sheng
Yu Qiao
Jing Shao
230
34
0
26 Mar 2024
Language Models in Dialogue: Conversational Maxims for Human-AI Interactions
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Erik Miehling
Manish Nagireddy
P. Sattigeri
Elizabeth M. Daly
David Piorkowski
John T. Richards
ALM
377
24
0
22 Mar 2024
RewardBench: Evaluating Reward Models for Language Modeling
Nathan Lambert
Valentina Pyatkin
Jacob Morrison
Lester James V. Miranda
Bill Yuchen Lin
...
Sachin Kumar
Tom Zick
Yejin Choi
Noah A. Smith
Hanna Hajishirzi
ALM
468
335
0
20 Mar 2024
Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Yi Luo
Zheng-Wen Lin
Yuhao Zhang
Jiashuo Sun
Chen Lin
Chengjin Xu
Xiangdong Su
Haoran Pan
Jian Guo
Yeyun Gong
LM&MA
ELM
ALM
AI4TS
183
2
0
18 Mar 2024
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
European Conference on Computer Vision (ECCV), 2024
Yifan Li
Hangyu Guo
Kun Zhou
Wayne Xin Zhao
Ji-Rong Wen
497
93
0
14 Mar 2024
HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback
Ang Li
Qiugen Xiao
Peng Cao
Jian Tang
Yi Yuan
...
Weidong Guo
Yukang Gan
Jeffrey Xu Yu
D. Wang
Ying Shan
VLM
ALM
334
13
0
13 Mar 2024
Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards
Wei Shen
Xiaoying Zhang
Yuanshun Yao
Rui Zheng
Hongyi Guo
Yang Liu
ALM
236
24
0
12 Mar 2024
Matrix-Transformation Based Low-Rank Adaptation (MTLoRA): A Brain-Inspired Method for Parameter-Efficient Fine-Tuning
Yao Liang
Yuwei Wang
Yang Li
Yi Zeng
224
2
0
12 Mar 2024
MEND: Meta dEmonstratioN Distillation for Efficient and Effective In-Context Learning
International Conference on Learning Representations (ICLR), 2024
Yichuan Li
Xiyao Ma
Sixing Lu
Kyumin Lee
Xiaohu Liu
Chenlei Guo
204
14
0
11 Mar 2024
On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models
Xinpeng Wang
Shitong Duan
Xiaoyuan Yi
Jing Yao
Shanlin Zhou
Zhihua Wei
Peng Zhang
Dongkuan Xu
Maosong Sun
Xing Xie
OffRL
396
22
0
07 Mar 2024
Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization
Shitong Duan
Xiaoyuan Yi
Peng Zhang
Tun Lu
Xing Xie
Ning Gu
158
6
0
06 Mar 2024
Balancing Enhancement, Harmlessness, and General Capabilities: Enhancing Conversational LLMs with Direct RLHF
Chen Zheng
Ke Sun
Hang Wu
Chenguang Xi
Xun Zhou
217
15
0
04 Mar 2024
Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models
Arijit Ghosh Chowdhury
Md. Mofijul Islam
Vaibhav Kumar
F. H. Shezan
Vaibhav Kumar
Vinija Jain
Vasu Sharma
AAML
PILM
275
46
0
03 Mar 2024
Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
Xiaomeng Hu
Pin-Yu Chen
Tsung-Yi Ho
AAML
196
53
0
01 Mar 2024
Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards
Haoxiang Wang
Yong Lin
Wei Xiong
Rui Yang
Shizhe Diao
Delin Qu
Han Zhao
Tong Zhang
428
125
0
28 Feb 2024