A General Language Assistant as a Laboratory for Alignment

1 December 2021
Amanda Askell
Yuntao Bai
Anna Chen
Dawn Drain
Deep Ganguli
T. Henighan
Andy Jones
Nicholas Joseph
Benjamin Mann
Nova Dassarma
Nelson Elhage
Zac Hatfield-Dodds
Danny Hernandez
John Kernion
Kamal Ndousse
Catherine Olsson
Dario Amodei
Tom B. Brown
Jack Clark
Sam McCandlish
C. Olah
Jared Kaplan
    ALM

Papers citing "A General Language Assistant as a Laboratory for Alignment"

50 / 701 papers shown
Human Preferences for Constructive Interactions in Language Model Alignment
Yara Kyrychenko
Jon Roozenbeek
Brandon Davidson
S. V. D. Linden
Ramit Debnath
229
1
0
05 Mar 2025
Alchemist: Towards the Design of Efficient Online Continual Learning System
Yuyang Huang
Yuhan Liu
Haryadi S. Gunawi
Beibin Li
Changho Hwang
CLL OnRL
397
1
0
03 Mar 2025
Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models
Alberto Purpura
Sahil Wadhwa
Jesse Zymet
Akshay Gupta
Andy Luo
Melissa Kazemi Rad
Swapnil Shinde
Mohammad Sorower
AAML
997
5
0
03 Mar 2025
Distributionally Robust Reinforcement Learning with Human Feedback
Debmalya Mandal
Paulius Sasnauskas
Goran Radanović
189
5
0
01 Mar 2025
Societal Alignment Frameworks Can Improve LLM Alignment
Karolina Stańczak
Nicholas Meade
Mehar Bhatia
Hattie Zhou
Konstantin Böttinger
...
Timothy P. Lillicrap
Ana Marasović
Sylvie Delacroix
Gillian K. Hadfield
Siva Reddy
1.0K
3
0
27 Feb 2025
Choices Speak Louder than Questions
Gyeongje Cho
Yeonkyoung So
Jaejin Lee
ELM
385
0
0
26 Feb 2025
Shh, don't say that! Domain Certification in LLMs
International Conference on Learning Representations (ICLR), 2025
Cornelius Emde
Alasdair Paren
Preetham Arvind
Maxime Kayser
Tom Rainforth
Thomas Lukasiewicz
Guohao Li
Juil Sock
Adel Bibi
347
4
0
26 Feb 2025
Advantage-Guided Distillation for Preference Alignment in Small Language Models
International Conference on Learning Representations (ICLR), 2025
Shiping Gao
Fanqi Wan
Jiajian Guo
Xiaojun Quan
Qifan Wang
ALM
454
4
0
25 Feb 2025
AMPO: Active Multi-Preference Optimization for Self-play Preference Selection
Taneesh Gupta
Rahul Madhavan
Xuchao Zhang
Chetan Bansal
Saravan Rajmohan
341
0
0
25 Feb 2025
Larger or Smaller Reward Margins to Select Preferences for Alignment?
Kexin Huang
Junkang Wu
Ziqian Chen
Qingsong Wen
Jinyang Gao
Bolin Ding
Jiancan Wu
Xiangnan He
Xiang Wang
212
2
0
25 Feb 2025
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley
Daniel Tan
Niels Warncke
Anna Sztyber-Betley
Xuchan Bao
Martín Soto
Nathan Labenz
Owain Evans
AAML
772
110
0
24 Feb 2025
Single-pass Detection of Jailbreaking Input in Large Language Models
Leyla Naz Candogan
Yongtao Wu
Elias Abad Rocamora
Grigorios G. Chrysos
Volkan Cevher
AAML
262
5
0
24 Feb 2025
Dynamic LLM Routing and Selection based on User Preferences: Balancing Performance, Cost, and Ethics
International Journal of Computer Applications (IJCA), 2024
Deepak Babu Piskala
Vijay Raajaa
Sachin Mishra
Bruno Bozza
233
6
0
23 Feb 2025
Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
Rui Li
Peiyi Wang
Jingyuan Ma
Di Zhang
Lei Sha
LLMAG
319
0
0
22 Feb 2025
An LLM-Based Approach for Insight Generation in Data Analysis
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Alberto Sánchez Pérez
Alaa Boukhary
Paolo Papotti
Luis Castejón Lozano
Adam Elwood
232
11
0
20 Feb 2025
Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment
International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
Tong Yang
Jincheng Mei
H. Dai
Zixin Wen
Shicong Cen
Dale Schuurmans
Yuejie Chi
Bo Dai
371
6
0
20 Feb 2025
STaR-SQL: Self-Taught Reasoner for Text-to-SQL
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Mingqian He
Yongliang Shen
Weinan Zhang
Qiuying Peng
Jun Wang
Weiming Lu
ReLM LRM
198
10
0
20 Feb 2025
Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yilei Tu
Andrew Xue
Freda Shi
394
1
0
17 Feb 2025
Insect-Foundation: A Foundation Model and Large Multimodal Dataset for Vision-Language Insect Understanding
International Journal of Computer Vision (IJCV), 2025
Thanh-Dat Truong
Hoang-Quan Nguyen
Xuan-Bac Nguyen
Ashley Dowling
Pawan Sinha
Khoa Luu
MLLM
216
6
0
17 Feb 2025
SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Fengqing Jiang
Zhangchen Xu
Yuetai Li
Luyao Niu
Zhen Xiang
Yue Liu
Bill Yuchen Lin
Radha Poovendran
KELM ELM LRM
286
76
0
17 Feb 2025
Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models
Yingshui Tan
Yilei Jiang
Yongbin Li
Qingbin Liu
Xingyuan Bu
Yuchi Xu
Xiangyu Yue
Xiaoyong Zhu
Bo Zheng
ALM
412
16
0
17 Feb 2025
Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?
Yufei He
Yuexin Li
Jiaying Wu
Yuan Sui
Yulin Chen
Bryan Hooi
ALM
523
17
0
16 Feb 2025
LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits
Zikai Zhou
Qizheng Zhang
Hermann Kumbong
Kunle Olukotun
MQ
1.2K
3
0
12 Feb 2025
Trustworthy AI: Safety, Bias, and Privacy -- A Survey
Xingli Fang
Jianwei Li
Varun Mulchandani
Jung-Eun Kim
376
0
0
11 Feb 2025
Safety Reasoning with Guidelines
Haoyu Wang
Zeyu Qin
Li Shen
Xueqian Wang
Minhao Cheng
Dacheng Tao
457
4
0
06 Feb 2025
Evaluation of Large Language Models via Coupled Token Generation
N. C. Benz
Stratis Tsirtsis
Eleni Straitouri
Ivi Chatzi
Ander Artola Velasco
Suhas Thejaswi
Manuel Gomez Rodriguez
367
3
0
03 Feb 2025
GuardReasoner: Towards Reasoning-based LLM Safeguards
Yue Liu
Hongcheng Gao
Shengfang Zhai
Jun Xia
Tianyi Wu
...
Kenji Kawaguchi
Jiaheng Zhang
Bryan Hooi
Hui Xiong
Bryan Hooi
AI4TS LRM
592
55
0
30 Jan 2025
Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking
International Conference on Learning Representations (ICLR), 2024
Benjamin Feuer
Micah Goldblum
Teresa Datta
Sanjana Nambiar
Raz Besaleli
Samuel Dooley
Max Cembalest
John P. Dickerson
ALM
378
0
0
28 Jan 2025
Multi-Modality Transformer for E-Commerce: Inferring User Purchase Intention to Bridge the Query-Product Gap
BigData Congress [Services Society] (BSS), 2024
Srivatsa Mallapragada
Ying Xie
Varsha Rani Chawan
Zeyad Hailat
Yuanbo Wang
285
2
0
28 Jan 2025
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models
Knowledge Discovery and Data Mining (KDD), 2023
Jingwei Yi
Yueqi Xie
Bin Zhu
Emre Kiciman
Guangzhong Sun
Xing Xie
Fangzhao Wu
AAML
452
149
0
28 Jan 2025
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yuhang Zang
Xiaoyi Dong
Pan Zhang
Yuhang Cao
Ziyu Liu
...
Haodong Duan
Feiyu Xiong
Kai Chen
Dahua Lin
Jiaqi Wang
VLM
593
46
0
21 Jan 2025
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
Neural Information Processing Systems (NeurIPS), 2024
Kaifeng Lyu
Haoyu Zhao
Xinran Gu
Dingli Yu
Anirudh Goyal
Sanjeev Arora
ALM
423
86
0
20 Jan 2025
FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Tong Liu
Xiao Yu
Wenxuan Zhou
Jindong Gu
Volker Tresp
422
3
0
11 Jan 2025
Predictable Artificial Intelligence
Lexin Zhou
Pablo Antonio Moreno Casares
Fernando Martínez-Plumed
John Burden
Ryan Burnell
...
Seán Ó hÉigeartaigh
Danaja Rutar
Wout Schellaert
Konstantinos Voudouris
José Hernández-Orallo
508
8
0
08 Jan 2025
Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
Yueqin Yin
Shentao Yang
Yujia Xie
Ziyi Yang
Yuting Sun
Hany Awadalla
Weizhu Chen
Mingyuan Zhou
327
5
0
07 Jan 2025
PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations
Ruosen Li
Teerth Patel
Xinya Du
LLMAG ALM
551
126
0
03 Jan 2025
Multi-PA: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models
Jie M. Zhang
Xiangkui Cao
Zhouyu Han
Shiguang Shan
Xilin Chen
ELM
254
0
0
27 Dec 2024
Lies, Damned Lies, and Distributional Language Statistics: Persuasion and Deception with Large Language Models
Cameron R. Jones
Benjamin Bergen
460
13
0
22 Dec 2024
Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs
Alexander von Recum
Christoph Schnabl
Gabor Hollbeck
Silas Alberti
Philip Blinde
Marvin von Hagen
318
5
0
22 Dec 2024
SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Xiaoning Dong
Wenbo Hu
Wei Xu
Tianxing He
589
7
0
19 Dec 2024
Generative Prompt Internalization
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Haebin Shin
Lei Ji
Yeyun Gong
Sungdong Kim
Eunbi Choi
Minjoon Seo
206
0
0
24 Nov 2024
Reward Modeling with Ordinal Feedback: Wisdom of the Crowd
Shang Liu
Yu Pan
Guanting Chen
Xiaocheng Li
385
3
0
19 Nov 2024
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
Xinyan Guan
Yanjiang Liu
Xinyu Lu
Boxi Cao
Xianpei Han
...
Le Sun
Jie Lou
Bowen Yu
Yaojie Lu
Hongyu Lin
ALM
587
8
0
18 Nov 2024
A dataset of questions on decision-theoretic reasoning in Newcomb-like problems
Caspar Oesterheld
Emery Cooper
Miles Kodama
Linh Chi Nguyen
Ethan Perez
546
1
0
15 Nov 2024
Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Khaoula Chehbouni
Jonathan Colaço-Carr
Yash More
Jackie CK Cheung
G. Farnadi
573
7
0
12 Nov 2024
Can LLMs make trade-offs involving stipulated pain and pleasure states?
Geoff Keeling
Winnie Street
Martyna Stachaczyk
Daria Zakharova
Iulia M. Comsa
Anastasiya Sakovych
Isabella Logothesis
Zejia Zhang
Blaise Agüera y Arcas
Jonathan Birch
218
11
0
01 Nov 2024
Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector
Youcheng Huang
Fengbin Zhu
Jingkun Tang
Pan Zhou
Wenqiang Lei
Jiancheng Lv
Tat-Seng Chua
AAML
164
5
0
30 Oct 2024
$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization
International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
Jiaqi Han
Mingjian Jiang
Yuxuan Song
J. Leskovec
Stefano Ermon
386
9
0
29 Oct 2024
CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants
Lize Alberts
Benjamin Ellis
Andrei Lupu
Jakob Foerster
ELM
307
2
0
28 Oct 2024
Transferable Post-training via Inverse Value Learning
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Xinyu Lu
Xueru Wen
Yaojie Lu
Bowen Yu
Hongyu Lin
Haiyang Yu
Le Sun
Jia Zheng
Yongbin Li
236
1
0
28 Oct 2024