
Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning

16 October 2024
Ruimeng Ye, Yang Xiao, Bo Hui
ALM, ELM, OffRL

Papers citing "Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning"

31 papers
Weak-to-Strong Generalization with Failure Trajectories: A Tree-based Approach to Elicit Optimal Policy in Strong Models
Ruimeng Ye, Zihan Wang, Yang Xiao, Zinan Ling, Manling Li, Bo Hui
OffRL · 25 Jul 2025
How to Mitigate Overfitting in Weak-to-strong Generalization? (ACL 2025)
Junhao Shi, Qinyuan Cheng, Zhaoye Fei, Y. Zheng, Qipeng Guo, Xipeng Qiu
06 Mar 2025
The Capabilities and Limitations of Weak-to-Strong Generalization: Generalization and Calibration
Wei Yao, Wenkai Yang, Liang Luo, Yankai Lin, Yong Liu
ELM · 03 Feb 2025
Improving Weak-to-Strong Generalization with Reliability-Aware Alignment
Yue Guo, Yi Yang
27 Jun 2024
Theoretical Analysis of Weak-to-Strong Generalization
Hunter Lang, David Sontag, Aravindan Vijayaraghavan
25 May 2024
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision (NeurIPS 2024)
Zhiqing Sun, Longhui Yu, Yikang Shen, Weiyang Liu, Yiming Yang, Sean Welleck, Chuang Gan
14 Mar 2024
LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views
Yuji Roh, Qingyun Liu, Huan Gui, Zhe Yuan, Yujin Tang, ..., Liang Liu, Shuchao Bi, Lichan Hong, Ed H. Chi, Zhe Zhao
07 Feb 2024
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Jianyuan Guo, Hanting Chen, Chengcheng Wang, Kai Han, Chang Xu, Yunhe Wang
VLM · 06 Feb 2024
Improving Weak-to-Strong Generalization with Scalable Oversight and Ensemble Learning
Jitao Sang, Yuhang Wang, Jing Zhang, Yanxu Zhu, Chao Kong, Junhong Ye, Shuyu Wei, Jinlin Xiao
01 Feb 2024
Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
Ming Li, Yong Zhang, Shwai He, Zhitao Li, Hongyu Zhao, Jianzong Wang, Ning Cheng, Wanrong Zhu
01 Feb 2024
Secrets of RLHF in Large Language Models Part II: Reward Modeling
Bing Wang, Rui Zheng, Luyao Chen, Yan Liu, Jiajun Sun, ..., Tao Gui, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yuanyuan Jiang
ALM · 11 Jan 2024
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (ICML 2023)
Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, ..., Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu
ELM · 14 Dec 2023
FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity
Shiyao Cui, Zhenyu Zhang, Yilong Chen, Wenyuan Zhang, Tianyun Liu, Siqi Wang, Tingwen Liu
30 Nov 2023
Successfully Applying Lottery Ticket Hypothesis to Diffusion Model
Chao Jiang, Bo Hui, Bohan Liu, Da Yan
DiffM · 28 Oct 2023
Large Language Model Alignment: A Survey
Shangda Wu, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong
LM&MA · 26 Sep 2023
Certifying LLM Safety against Adversarial Prompting
Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, Himabindu Lakkaraju
AAML · 06 Sep 2023
LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models (SSRN 2023)
Neel Guha, Julian Nyarko, Mark A. Lemley, Christopher Ré, Adam Chilton, ..., Spencer Williams, Sunny G. Gandhi, Tomer Zur, Varun J. Iyer, Zehua Li
AILaw, LRM, ELM · 20 Aug 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
27 Jul 2023
Evaluating Superhuman Models with Consistency Checks
Lukas Fluri, Daniel Paleka, Florian Tramèr
ELM · 16 Jun 2023
A Study on Knowledge Distillation from Weak Teacher for Scaling Up Pre-trained Language Models (ACL 2023)
Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Sung Ju Hwang, Alexander Min
26 May 2023
OpenAssistant Conversations -- Democratizing Large Language Model Alignment (NeurIPS 2023)
Andreas Kopf, Yannic Kilcher, Dimitri von Rutte, Sotiris Anagnostidis, Zhi Rui Tam, ..., Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, A. Mattick
ALM, LM&MA · 14 Apr 2023
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark (ICML 2023)
Alexander Pan, Chan Jun Shern, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks
06 Apr 2023
On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning (ACL 2022)
Omar Shaikh, Hongxin Zhang, William B. Held, Michael S. Bernstein, Diyi Yang
ReLM, LRM · 15 Dec 2022
Training language models to follow instructions with human feedback (NeurIPS 2022)
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM · 04 Mar 2022
SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures
Megan Ung, Jing Xu, Y-Lan Boureau
14 Oct 2021
Denoising Diffusion Implicit Models (ICLR 2020)
Jiaming Song, Chenlin Meng, Stefano Ermon
VLM, DiffM · 06 Oct 2020
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models (Findings of EMNLP 2020)
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith
24 Sep 2020
Constrained Labeling for Weakly Supervised Learning (UAI 2020)
Chidubem Arachie, Bert Huang
15 Sep 2020
Denoising Diffusion Probabilistic Models
Jonathan Ho, Ajay Jain, Pieter Abbeel
DiffM · 19 Jun 2020
Aligning Superhuman AI with Human Behavior: Chess as a Model System (KDD 2020)
Reid McIlroy-Young, S. Sen, Jon M. Kleinberg, Ashton Anderson
GNN · 02 Jun 2020
Deceiving Google's Perspective API Built for Detecting Toxic Comments
Hossein Hosseini, Sreeram Kannan, Baosen Zhang, Radha Poovendran
AAML · 27 Feb 2017