arXiv: 2406.11431
Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
17 June 2024
Wenkai Yang
Shiqi Shen
Guangyao Shen
Zhi Gong
Yankai Lin
Ji-Rong Wen
Papers citing
"Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization"
11 / 11 papers shown
Title
DeepCritic: Deliberate Critique with Large Language Models
Wenkai Yang
Jingwen Chen
Yankai Lin
Ji-Rong Wen
ALM
LRM
23
0
0
01 May 2025
Redefining Superalignment: From Weak-to-Strong Alignment to Human-AI Co-Alignment to Sustainable Symbiotic Society
Feifei Zhao
Y. Wang
Enmeng Lu
Dongcheng Zhao
Bing Han
...
Chao Liu
Yaodong Yang
Yi Zeng
Boyuan Chen
Jinyu Fan
80
0
0
24 Apr 2025
Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity
HyunJin Kim
Xiaoyuan Yi
Jing Yao
Muhua Huang
Jinyeong Bak
James Evans
Xing Xie
29
0
0
08 Mar 2025
How to Mitigate Overfitting in Weak-to-strong Generalization?
Junhao Shi
Qinyuan Cheng
Zhaoye Fei
Y. Zheng
Qipeng Guo
Xipeng Qiu
65
0
0
06 Mar 2025
Understanding the Capabilities and Limitations of Weak-to-Strong Generalization
Wei Yao
Wenkai Yang
Z. Wang
Yankai Lin
Yong Liu
ELM
86
1
0
03 Feb 2025
MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization
Yougang Lyu
Lingyong Yan
Zihan Wang
Dawei Yin
Pengjie Ren
Maarten de Rijke
Z. Z. Ren
53
6
0
10 Oct 2024
Provable Weak-to-Strong Generalization via Benign Overfitting
David X. Wu
A. Sahai
52
6
0
06 Oct 2024
Selective Preference Optimization via Token-Level Reward Function Estimation
Kailai Yang
Zhiwei Liu
Qianqian Xie
Jimin Huang
Erxue Min
Sophia Ananiadou
20
9
0
24 Aug 2024
Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models
Zhanhui Zhou
Zhixuan Liu
Jie Liu
Zhichen Dong
Chao Yang
Yu Qiao
ALM
33
20
0
29 May 2024
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
294
6,927
0
20 Apr 2018