A General Language Assistant as a Laboratory for Alignment
arXiv:2112.00861 (v3), 1 December 2021
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Benjamin Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Chris Olah, Jared Kaplan

Papers citing "A General Language Assistant as a Laboratory for Alignment" (50 of 701 papers shown)

Constraining Participation: Affordances of Feedback Features in Interfaces to Large Language Models
Ned Cooper, Alexandra Zafiroglu. ACM Journal on Responsible Computing (ACM JRC), 2024. 27 Aug 2024.

Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
Wenxuan Zhang, Juil Sock, Mohamed Elhoseiny, Adel Bibi. International Conference on Learning Representations (ICLR), 2024. 27 Aug 2024.

Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei, Shenghua He, Tian Xia, Andy H. Wong, Jingyang Lin, Mei Han. 23 Aug 2024.

Value Alignment from Unstructured Text
Inkit Padhi, Karthikeyan N. Ramamurthy, P. Sattigeri, Manish Nagireddy, Pierre Dognin, Kush R. Varshney. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. 19 Aug 2024.

Minor DPO reject penalty to increase training robustness
Shiming Xie, Hong Chen, Fred Yu, Zeye Sun, Xiuyu Wu, Yingfan Hu. 19 Aug 2024.

Offline RLHF Methods Need More Accurate Supervision Signals
Shiqi Wang, Zhengze Zhang, Rui Zhao, Fei Tan, Cam Tu Nguyen. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. 18 Aug 2024.

Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks
Kexin Chen, Yi Liu, Donghai Hong, Jiaying Chen, Wenhai Wang. 18 Aug 2024.

SEAL: Systematic Error Analysis for Value ALignment
Manon Revel, Matteo Cargnelutti, Tyna Eloundou, Greg Leppert. AAAI Conference on Artificial Intelligence (AAAI), 2024. 16 Aug 2024.

The Future of Open Human Feedback
Shachar Don-Yehiya, Ben Burtenshaw, Ramon Fernandez Astudillo, Cailean Osborne, Mimansa Jaiswal, ..., Omri Abend, Jennifer Ding, Sara Hooker, Hannah Rose Kirk, Leshem Choshen. Nature Machine Intelligence (Nat. Mach. Intell.), 2024. 15 Aug 2024.

Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization
Yuxin Jiang, Bo Huang, Yufei Wang, Xingshan Zeng, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Wei Wang. International Conference on Learning Representations (ICLR), 2024. 14 Aug 2024.

Building Decision Making Models Through Language Model Regime
Yu Zhang, Haoxiang Liu, Feijun Jiang, Weihua Luo, Kaifu Zhang. 12 Aug 2024.

Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Jingtong Su, Mingyu Lee, SangKeun Lee. Neural Information Processing Systems (NeurIPS), 2024. 02 Aug 2024.

ABC Align: Large Language Model Alignment for Safety & Accuracy
Gareth Seneque, Lap-Hang Ho, Peter W. Glynn, Yinyu Ye, Jeffrey Molendijk. 01 Aug 2024.

LLMmap: Fingerprinting For Large Language Models
Dario Pasquini, Evgenios M. Kornaropoulos, G. Ateniese. 22 Jul 2024.

Improving Context-Aware Preference Modeling for Language Models
Silviu Pitis, Ziang Xiao, Nicolas Le Roux, Alessandro Sordoni. 20 Jul 2024.

Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, Nhathai Phan. 20 Jul 2024.

Learning Goal-Conditioned Representations for Language Reward Models
Vaskar Nath, Dylan Slack, Jeff Da, Yuntao Ma, Hugh Zhang, Spencer Whitehead, Sean Hendryx. 18 Jul 2024.

The Better Angels of Machine Personality: How Personality Relates to LLM Safety
Jie Zhang, Dongrui Liu, Chao Qian, Ziyue Gan, Yong Liu, Yu Qiao, Jing Shao. 17 Jul 2024.

How Are LLMs Mitigating Stereotyping Harms? Learning from Search Engine Studies
Alina Leidinger, Richard Rogers. 16 Jul 2024.

Thorns and Algorithms: Navigating Generative AI Challenges Inspired by Giraffes and Acacias
Waqar Hussain. 16 Jul 2024.

Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing
Huanqian Wang, Yang Yue, Rui Lu, Jingxin Shi, Andrew Zhao, Shenzhi Wang, Shiji Song, Gao Huang. 11 Jul 2024.

Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)
K. Kenthapadi, M. Sameki, Ankur Taly. 10 Jul 2024.

Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders
Jinseok Kim, Jaewon Jung, Sangyeop Kim, S. Park, Sungzoon Cho. 09 Jul 2024.

OffsetBias: Leveraging Debiased Data for Tuning Evaluators
Junsoo Park, Seungyeon Jwa, Meiying Ren, Daeyoung Kim, Sanghyuk Choi. 09 Jul 2024.

AI Safety in Generative AI Large Language Models: A Survey
Jaymari Chua, Yun Yvonna Li, Shiyi Yang, Chen Wang, Lina Yao. 06 Jul 2024.

Spontaneous Reward Hacking in Iterative Self-Refinement
Jane Pan, He He, Samuel R. Bowman, Shi Feng. 05 Jul 2024.

Improving Conversational Abilities of Quantized Large Language Models via Direct Preference Alignment
Janghwan Lee, Seongmin Park, S. Hong, Minsoo Kim, Du-Seong Chang, Jungwook Choi. 03 Jul 2024.

RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs
John Dang, Arash Ahmadian, Kelly Marchisio, Julia Kreutzer, Ahmet Üstün, Sara Hooker. 02 Jul 2024.

Purple-teaming LLMs with Adversarial Defender Training
Jingyan Zhou, Kun Li, Junan Li, Jiawen Kang, Minda Hu, Xixin Wu, Helen Meng. 01 Jul 2024.

Self-Cognition in Large Language Models: An Exploratory Study
Dongping Chen, Jiawen Shi, Yao Wan, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun. 01 Jul 2024.

DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging
Tzu-Han Lin, Chen-An Li, Hung-yi Lee, Yun-Nung Chen. 01 Jul 2024.

Badllama 3: removing safety finetuning from Llama 3 in minutes
Dmitrii Volkov. 01 Jul 2024.

BAPO: Base-Anchored Preference Optimization for Personalized Alignment in Large Language Models
Gihun Lee, Minchan Jeong, Yujin Kim, Hojung Jung, Jaehoon Oh, Sangmook Kim, Se-Young Yun. 30 Jun 2024.

Advancing Process Verification for Large Language Models via Tree-Based Preference Learning
Mingqian He, Yongliang Shen, Wenqi Zhang, Zeqi Tan, Weiming Lu. 29 Jun 2024.

Rethinking harmless refusals when fine-tuning foundation models
Florin Pop, Judd Rosenblatt, Diogo Schwerz de Lucena, Michael Vaiana. 27 Jun 2024.

Suri: Multi-constraint Instruction Following for Long-form Text Generation
Chau Minh Pham, Simeng Sun, Mohit Iyyer. 27 Jun 2024.

AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations
Adam Dahlgren Lindstrom, Leila Methnani, Lea Krause, Petter Ericson, Íñigo Martínez de Rituerto de Troya, Dimitri Coelho Mollo, Roel Dobbe. 26 Jun 2024.

PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning
Shiva K. Pentyala, Zhichao Wang, Bin Bi, Kiran Ramnath, Xiang-Bo Mao, Regunathan Radhakrishnan, S. Asur, Na Cheng. 25 Jun 2024.

DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph
Zhehao Zhang, Jiaao Chen, Diyi Yang. 25 Jun 2024.

WARP: On the Benefits of Weight Averaged Rewarded Policies
Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem. 24 Jun 2024.

Guardrails for avoiding harmful medical product recommendations and off-label promotion in generative AI models
Daniel Lopez-Martinez. 24 Jun 2024.

On the Transformations across Reward Model, Parameter Update, and In-Context Prompt
Deng Cai, Huayang Li, Tingchen Fu, Siheng Li, Weiwen Xu, ..., Leyang Cui, Yan Wang, Lemao Liu, Taro Watanabe, Shuming Shi. 24 Jun 2024.

Large Language Models Assume People are More Rational than We Really are
Ryan Liu, Jiayi Geng, Joshua C. Peterson, Ilia Sucholutsky, Thomas Griffiths. 24 Jun 2024.

How Well Do LLMs Represent Values Across Cultures? Empirical Analysis of LLM Responses Based on Hofstede Cultural Dimensions
Julia Kharchenko, Tanya Roosta, Aman Chadha, Chirag Shah. 21 Jun 2024.

Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing
Han Jiang, Xiaoyuan Yi, Zhihua Wei, Ziang Xiao, Shu Wang, Xing Xie. 20 Jun 2024.

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, Juanzi Li. 20 Jun 2024.

FoRAG: Factuality-optimized Retrieval Augmented Generation for Web-enhanced Long-form Question Answering
Tianchi Cai, Zhiwen Tan, Xierui Song, Tao Sun, Jiyan Jiang, Yunqi Xu, Yinger Zhang, Jinjie Gu. 19 Jun 2024.

In-Context Former: Lightning-fast Compressing Context for Large Language Model
Xiangfeng Wang, Zaiyi Chen, Zheyong Xie, Tong Xu, Yongyi He, Enhong Chen. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. 19 Jun 2024.

BeHonest: Benchmarking Honesty in Large Language Models
Steffi Chern, Zhulin Hu, Yuqing Yang, Ethan Chern, Yuan Guo, Jiahe Jin, Binjie Wang, Pengfei Liu. 19 Jun 2024.

Low-Redundant Optimization for Large Language Model Alignment
Zhipeng Chen, Kun Zhou, Wayne Xin Zhao, Jingyuan Wang, Ji-Rong Wen. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. 18 Jun 2024.