2410.12621
Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning
16 October 2024
Ruimeng Ye
Yang Xiao
Bo Hui
ALM
ELM
OffRL
Papers citing
"Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning"
31 / 31 papers shown
Weak-to-Strong Generalization with Failure Trajectories: A Tree-based Approach to Elicit Optimal Policy in Strong Models
Ruimeng Ye
Zihan Wang
Yang Xiao
Zinan Ling
Manling Li
Bo Hui
OffRL
272
0
0
25 Jul 2025
How to Mitigate Overfitting in Weak-to-strong Generalization?
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Junhao Shi
Qinyuan Cheng
Zhaoye Fei
Y. Zheng
Qipeng Guo
Xipeng Qiu
346
1
0
06 Mar 2025
The Capabilities and Limitations of Weak-to-Strong Generalization: Generalization and Calibration
Wei Yao
Wenkai Yang
Liang Luo
Yankai Lin
Yong Liu
Yong Liu
ELM
1.0K
3
0
03 Feb 2025
Improving Weak-to-Strong Generalization with Reliability-Aware Alignment
Yue Guo
Yi Yang
326
15
0
27 Jun 2024
Theoretical Analysis of Weak-to-Strong Generalization
Hunter Lang
David Sontag
Aravindan Vijayaraghavan
507
43
0
25 May 2024
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
Neural Information Processing Systems (NeurIPS), 2024
Zhiqing Sun
Longhui Yu
Yikang Shen
Weiyang Liu
Yiming Yang
Sean Welleck
Chuang Gan
266
107
0
14 Mar 2024
LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views
Yuji Roh
Qingyun Liu
Huan Gui
Zhe Yuan
Yujin Tang
...
Liang Liu
Shuchao Bi
Lichan Hong
Ed H. Chi
Zhe Zhao
456
5
0
07 Feb 2024
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Jianyuan Guo
Hanting Chen
Chengcheng Wang
Kai Han
Chang Xu
Yunhe Wang
VLM
202
28
0
06 Feb 2024
Improving Weak-to-Strong Generalization with Scalable Oversight and Ensemble Learning
Jitao Sang
Yuhang Wang
Jing Zhang
Yanxu Zhu
Chao Kong
Junhong Ye
Shuyu Wei
Jinlin Xiao
314
17
0
01 Feb 2024
Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
Ming Li
Yong Zhang
Shwai He
Zhitao Li
Hongyu Zhao
Jianzong Wang
Ning Cheng
Wanrong Zhu
480
124
0
01 Feb 2024
Secrets of RLHF in Large Language Models Part II: Reward Modeling
Bing Wang
Rui Zheng
Luyao Chen
Yan Liu
Jiajun Sun
...
Tao Gui
Xipeng Qiu
Xuanjing Huang
Zuxuan Wu
Yuanyuan Jiang
ALM
400
151
0
11 Jan 2024
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
International Conference on Machine Learning (ICML), 2023
Collin Burns
Pavel Izmailov
Jan Hendrik Kirchner
Bowen Baker
Leo Gao
...
Adrien Ecoffet
Manas Joglekar
Jan Leike
Ilya Sutskever
Jeff Wu
ELM
439
420
0
14 Dec 2023
FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity
Shiyao Cui
Zhenyu Zhang
Yilong Chen
Wenyuan Zhang
Tianyun Liu
Siqi Wang
Tingwen Liu
251
24
0
30 Nov 2023
Successfully Applying Lottery Ticket Hypothesis to Diffusion Model
Chao Jiang
Bo Hui
Bohan Liu
Da Yan
DiffM
302
16
0
28 Oct 2023
Large Language Model Alignment: A Survey
Shangda Wu
Renren Jin
Yufei Huang
Chuang Liu
Weilong Dong
Zishan Guo
Xinwei Wu
Yan Liu
Deyi Xiong
LM&MA
441
302
0
26 Sep 2023
Certifying LLM Safety against Adversarial Prompting
Aounon Kumar
Chirag Agarwal
Suraj Srinivas
Aaron Jiaxun Li
Soheil Feizi
Himabindu Lakkaraju
AAML
754
290
0
06 Sep 2023
LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models
Social Science Research Network (SSRN), 2023
Neel Guha
Julian Nyarko
Mark A. Lemley
Christopher Ré
Adam Chilton
...
Spencer Williams
Sunny G. Gandhi
Tomer Zur
Varun J. Iyer
Zehua Li
AILaw
LRM
ELM
294
341
0
20 Aug 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
701
2,603
0
27 Jul 2023
Evaluating Superhuman Models with Consistency Checks
Lukas Fluri
Daniel Paleka
Florian Tramèr
ELM
443
50
0
16 Jun 2023
A Study on Knowledge Distillation from Weak Teacher for Scaling Up Pre-trained Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Hayeon Lee
Rui Hou
Jongpil Kim
Davis Liang
Sung Ju Hwang
Alexander Min
200
10
0
26 May 2023
OpenAssistant Conversations -- Democratizing Large Language Model Alignment
Neural Information Processing Systems (NeurIPS), 2023
Andreas Kopf
Yannic Kilcher
Dimitri von Rutte
Sotiris Anagnostidis
Zhi Rui Tam
...
Arnav Dantuluri
Andrew Maguire
Christoph Schuhmann
Huu Nguyen
A. Mattick
ALM
LM&MA
910
815
0
14 Apr 2023
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
International Conference on Machine Learning (ICML), 2023
Alexander Pan
Chan Jun Shern
Andy Zou
Nathaniel Li
Steven Basart
Thomas Woodside
Jonathan Ng
Hanlin Zhang
Scott Emmons
Dan Hendrycks
650
181
0
06 Apr 2023
On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Omar Shaikh
Hongxin Zhang
William B. Held
Michael S. Bernstein
Diyi Yang
ReLM
LRM
523
252
0
15 Dec 2022
Training language models to follow instructions with human feedback
Neural Information Processing Systems (NeurIPS), 2022
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
2.3K
18,946
0
04 Mar 2022
SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures
Megan Ung
Jing Xu
Y-Lan Boureau
217
52
0
14 Oct 2021
Denoising Diffusion Implicit Models
International Conference on Learning Representations (ICLR), 2020
Jiaming Song
Chenlin Meng
Stefano Ermon
VLM
DiffM
1.8K
11,047
0
06 Oct 2020
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
Findings of EMNLP (Findings), 2020
Samuel Gehman
Suchin Gururangan
Maarten Sap
Yejin Choi
Noah A. Smith
1.2K
1,557
0
24 Sep 2020
Constrained Labeling for Weakly Supervised Learning
Conference on Uncertainty in Artificial Intelligence (UAI), 2020
Chidubem Arachie
Bert Huang
385
17
0
15 Sep 2020
Denoising Diffusion Probabilistic Models
Jonathan Ho
Ajay Jain
Pieter Abbeel
DiffM
5.9K
27,989
0
19 Jun 2020
Aligning Superhuman AI with Human Behavior: Chess as a Model System
Knowledge Discovery and Data Mining (KDD), 2020
Reid McIlroy-Young
S. Sen
Jon M. Kleinberg
Ashton Anderson
GNN
517
135
0
02 Jun 2020
Deceiving Google's Perspective API Built for Detecting Toxic Comments
Hossein Hosseini
Sreeram Kannan
Baosen Zhang
Radha Poovendran
AAML
477
357
0
27 Feb 2017