A General Language Assistant as a Laboratory for Alignment

1 December 2021
Amanda Askell
Yuntao Bai
Anna Chen
Dawn Drain
Deep Ganguli
T. Henighan
Andy Jones
Nicholas Joseph
Benjamin Mann
Nova DasSarma
Nelson Elhage
Zac Hatfield-Dodds
Danny Hernandez
John Kernion
Kamal Ndousse
Catherine Olsson
Dario Amodei
Tom B. Brown
Jack Clark
Sam McCandlish
C. Olah
Jared Kaplan
    ALM

Papers citing "A General Language Assistant as a Laboratory for Alignment"

50 / 701 papers shown
Transferable Post-training via Inverse Value Learning
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Xinyu Lu
Xueru Wen
Yaojie Lu
Bowen Yu
Hongyu Lin
Haiyang Yu
Le Sun
Jia Zheng
Yongbin Li
236
1
0
28 Oct 2024
Rethinking the Uncertainty: A Critical Review and Analysis in the Era of Large Language Models
Mohammad Beigi
Sijia Wang
Ying Shen
Zihao Lin
Adithya Kulkarni
...
Ming Jin
Jin-Hee Cho
Dawei Zhou
Chang-Tien Lu
Lifu Huang
277
3
0
26 Oct 2024
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
International Conference on Learning Representations (ICLR), 2024
Michael Noukhovitch
Shengyi Huang
Sophie Xhonneux
Arian Hosseini
Rishabh Agarwal
Rameswar Panda
OffRL
571
38
0
23 Oct 2024
Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models
Eddie L. Ungless
Nikolas Vitsakis
Zeerak Talat
James Garforth
Bjorn Ross
Arno Onken
Atoosa Kasirzadeh
Alexandra Birch
262
3
0
17 Oct 2024
Looking Inward: Language Models Can Learn About Themselves by Introspection
Felix J Binder
James Chua
Tomek Korbak
Henry Sleight
John Hughes
Robert Long
Ethan Perez
Miles Turpin
Owain Evans
KELM AIFin LRM
251
39
0
17 Oct 2024
A Survey on Data Synthesis and Augmentation for Large Language Models
Ke Wang
Jiahui Zhu
Minjie Ren
Ziqiang Liu
Shiwei Li
...
Yiming Lei
Xiaoyu Wu
Qiqi Zhan
Qingjie Liu
Yunhong Wang
SyDa
425
37
0
16 Oct 2024
Exploring Model Kinship for Merging Large Language Models
Yedi Hu
Yunzhi Yao
Ningyu Zhang
Shumin Deng
MoMe
477
1
0
16 Oct 2024
JudgeBench: A Benchmark for Evaluating LLM-based Judges
International Conference on Learning Representations (ICLR), 2024
Sijun Tan
Siyuan Zhuang
Kyle Montgomery
William Y. Tang
Alejandro Cuadron
Chenguang Wang
Raluca A. Popa
Ion Stoica
ELM ALM
713
146
0
16 Oct 2024
Improving Instruction-Following in Language Models through Activation Steering
International Conference on Learning Representations (ICLR), 2024
Alessandro Stolfo
Vidhisha Balachandran
Safoora Yousefi
Eric Horvitz
Besmira Nushi
LLMSV
443
62
0
15 Oct 2024
DeformPAM: Data-Efficient Learning for Long-horizon Deformable Object Manipulation via Preference-based Action Alignment
IEEE International Conference on Robotics and Automation (ICRA), 2024
Wendi Chen
Han Xue
Fangyuan Zhou
Yuan Fang
Cewu Lu
323
4
0
15 Oct 2024
Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models
Juseong Jin
Chang Wook Jeong
271
8
0
13 Oct 2024
RMB: Comprehensively Benchmarking Reward Models in LLM Alignment
International Conference on Learning Representations (ICLR), 2024
Enyu Zhou
Guodong Zheng
Binghai Wang
Zhiheng Xi
Jiajun Sun
...
Yurong Mou
Rui Zheng
Tao Gui
Xuanjing Huang
ALM
428
43
0
13 Oct 2024
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Qin Liu
Chao Shang
Ling Liu
Nikolaos Pappas
Jie Ma
Neha Anna John
Srikanth Doss Kadarundalagi Raghuram Doss
Lluís Marquez
Miguel Ballesteros
Yassine Benajiba
283
15
0
11 Oct 2024
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs
Shenao Zhang
Zhihan Liu
Boyi Liu
Yanzhe Zhang
Yingxiang Yang
Yunxing Liu
Liyu Chen
Tao Sun
Ziyi Wang
634
5
0
10 Oct 2024
MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization
International Conference on Learning Representations (ICLR), 2024
Yougang Lyu
Lingyong Yan
Zihan Wang
D. Yin
Sudipta Singha Roy
Maarten de Rijke
Zhaochun Ren
588
15
0
10 Oct 2024
Self-Boosting Large Language Models with Synthetic Preference Data
International Conference on Learning Representations (ICLR), 2024
Qingxiu Dong
Li Dong
Xingxing Zhang
Zhifang Sui
Furu Wei
SyDa
242
29
0
09 Oct 2024
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
599
1
0
09 Oct 2024
WAPITI: A Watermark for Finetuned Open-Source LLMs
Lingjie Chen
Ruizhong Qiu
Siyu Yuan
Zhining Liu
Tianxin Wei
Hyunsik Yoo
Zhichen Zeng
Deqing Yang
Hanghang Tong
WaLM
299
13
0
09 Oct 2024
Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning
Neural Information Processing Systems (NeurIPS), 2024
Hao Ma
Tianyi Hu
Zhiqiang Pu
Boyin Liu
Xiaolin Ai
Yanyan Liang
Min Chen
394
22
0
08 Oct 2024
Debate, Deliberate, Decide (D3): A Cost-Aware Adversarial Framework for Reliable and Interpretable LLM Evaluation
Chaithanya Bandi
Abir Harrasse
LLMAG ELM
265
11
0
07 Oct 2024
Superficial Safety Alignment Hypothesis
Jianwei Li
Jung-Eun Kim
330
5
0
07 Oct 2024
Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Helena Bonaldi
Greta Damo
Nicolás Benjamín Ocampo
Elena Cabrio
S. Villata
Marco Guerini
172
12
0
04 Oct 2024
MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
International Conference on Learning Representations (ICLR), 2024
Yekun Chai
Haoran Sun
Huang Fang
Shuohuan Wang
Yu Sun
Hua Wu
979
5
0
03 Oct 2024
CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning
Huimu Yu
Xing Wu
Weidong Yin
Debing Zhang
Songlin Hu
LRM
314
7
0
03 Oct 2024
DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life
International Conference on Learning Representations (ICLR), 2024
Yu Ying Chiu
Liwei Jiang
Yejin Choi
326
25
0
03 Oct 2024
Erasing Conceptual Knowledge from Language Models
Rohit Gandikota
Sheridan Feucht
Samuel Marks
David Bau
ELM KELM MU
444
20
0
03 Oct 2024
Generative Reward Models
Dakota Mahan
Duy Phung
Rafael Rafailov
Chase Blagden
Nathan Lile
Louis Castricato
Jan-Philipp Fränken
Chelsea Finn
Alon Albalak
VLM SyDa OffRL
245
82
0
02 Oct 2024
Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown
Xingzhou Lou
Dong Yan
Wei Shen
Yuzi Yan
Jian Xie
Junge Zhang
413
44
0
01 Oct 2024
Wait, but Tylenol is Acetaminophen... Investigating and Improving Language Models' Ability to Resist Requests for Misinformation
Shan Chen
Mingye Gao
Kuleen Sasse
Thomas Hartvigsen
Brian Anthony
Lizhou Fan
Hugo J. W. L. Aerts
Jack Gallifant
Danielle S. Bitterman
LM&MA
252
2
0
30 Sep 2024
The Perfect Blend: Redefining RLHF with Mixture of Judges
Tengyu Xu
Eryk Helenowski
Karthik Abinav Sankararaman
Di Jin
Kaiyan Peng
...
Gabriel Cohen
Yuandong Tian
Hao Ma
Sinong Wang
Han Fang
386
26
0
30 Sep 2024
A Survey on the Honesty of Large Language Models
Siheng Li
Cheng Yang
Taiqiang Wu
Chufan Shi
Yuji Zhang
...
Jie Zhou
Yujiu Yang
Ngai Wong
Xixin Wu
Wai Lam
HILM
297
18
0
27 Sep 2024
RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking
Yifan Jiang
Kriti Aggarwal
Tanmay Laud
Kashif Munir
Jay Pujara
Subhabrata Mukherjee
AAML
239
13
0
26 Sep 2024
Exposing Assumptions in AI Benchmarks through Cognitive Modelling
Jonathan H. Rystrøm
Kenneth C. Enevoldsen
189
0
0
25 Sep 2024
Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Jinchuan Zhang
Yan Zhou
Yaxin Liu
Ziming Li
Songlin Hu
AAML
222
15
0
25 Sep 2024
RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code
International Conference on Automated Software Engineering (ASE), 2024
Jiachi Chen
Qingyuan Zhong
Yanlin Wang
Kaiwen Ning
Yongkun Liu
Zenan Xu
Zhe Zhao
Ting Chen
Zibin Zheng
AAML
121
33
0
23 Sep 2024
Direct Judgement Preference Optimization
Peifeng Wang
Austin Xu
Yilun Zhou
Caiming Xiong
Shafiq Joty
ELM
374
23
0
23 Sep 2024
Contextualized AI for Cyber Defense: An Automated Survey using LLMs
International Conference on Security of Information and Networks (SIN), 2024
Christoforus Yoga Haryanto
Anne Maria Elvira
Trung Duc Nguyen
Minh Hieu Vu
Yoshiano Hartanto
Emily Lomempow
Arathi Arakala
200
2
0
20 Sep 2024
Contextual Compression in Retrieval-Augmented Generation for Large Language Models: A Survey
Sourav Verma
RALM 3DV
261
7
0
20 Sep 2024
Aligning Language Models Using Follow-up Likelihood as Reward Signal
AAAI Conference on Artificial Intelligence (AAAI), 2024
Chen Zhang
Dading Chong
Feng Jiang
Chengguang Tang
Anningzhe Gao
Guohua Tang
Haizhou Li
ALM
321
6
0
20 Sep 2024
Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino
Pacific Asia Conference on Language, Information and Computation (PACLIC), 2024
Jann Railey Montalan
Jian Gang Ngui
Wei Qi Leong
Yosephine Susanto
Hamsawardhini Rengarajan
William-Chandra Tjhi
Alham Fikri Aji
452
6
0
20 Sep 2024
Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models
The Web Conference (WWW), 2024
Peiyi Zhang
Yazhou Zhang
Bo Wang
Lu Rong
Jing Qin
AI4Ed ELM
373
6
0
19 Sep 2024
AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Zhe Su
Xuhui Zhou
Sanketh Rangreji
Anubha Kabra
Julia Mendelsohn
Faeze Brahman
Maarten Sap
LLMAG
396
20
0
13 Sep 2024
Alignment of Diffusion Models: Fundamentals, Challenges, and Future
Buhua Liu
Shitong Shao
Bao Li
Lichen Bai
Zhiqiang Xu
Haoyi Xiong
James Kwok
Sumi Helal
Bo Han
464
22
0
11 Sep 2024
AGR: Age Group fairness Reward for Bias Mitigation in LLMs
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Shuirong Cao
Ruoxi Cheng
Zhiqiang Wang
178
12
0
06 Sep 2024
Programming Refusal with Conditional Activation Steering
International Conference on Learning Representations (ICLR), 2024
Bruce W. Lee
Inkit Padhi
Karthikeyan N. Ramamurthy
Erik Miehling
Pierre Dognin
Manish Nagireddy
Amit Dhurandhar
LLMSV
502
70
0
06 Sep 2024
Efficient LLM Context Distillation
Rajesh Upadhayayaya
Zachary Smith
Chritopher Kottmyer
Manish Raj Osti
375
3
0
03 Sep 2024
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
Ingo Ziegler
Abdullatif Köksal
Desmond Elliott
Hinrich Schütze
279
13
0
03 Sep 2024
User-Driven Value Alignment: Understanding Users' Perceptions and Strategies for Addressing Biased and Discriminatory Statements in AI Companions
International Conference on Human Factors in Computing Systems (CHI), 2024
Xianzhe Fan
Qing Xiao
Xuhui Zhou
Jiaxin Pei
Maarten Sap
Zhicong Lu
Hong Shen
314
23
0
01 Sep 2024
Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data
Spencer Whitehead
Jacob Phillips
Sean Hendryx
183
0
0
30 Aug 2024
Legilimens: Practical and Unified Content Moderation for Large Language Model Services
Conference on Computer and Communications Security (CCS), 2024
Jialin Wu
Jiangyi Deng
Shengyuan Pang
Yanjiao Chen
Jiayang Xu
Xinfeng Li
Wei Dong
356
12
0
28 Aug 2024