
Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning

16 October 2024
Ruimeng Ye, Yang Xiao, Bo Hui
ALM, ELM, OffRL

Papers citing "Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning"

31 papers
Weak-to-Strong Generalization with Failure Trajectories: A Tree-based Approach to Elicit Optimal Policy in Strong Models
Ruimeng Ye, Zihan Wang, Yang Xiao, Zinan Ling, Manling Li, Bo Hui
OffRL · 25 Jul 2025
How to Mitigate Overfitting in Weak-to-strong Generalization? (ACL 2025)
Junhao Shi, Qinyuan Cheng, Zhaoye Fei, Y. Zheng, Qipeng Guo, Xipeng Qiu
06 Mar 2025
The Capabilities and Limitations of Weak-to-Strong Generalization: Generalization and Calibration
Wei Yao, Wenkai Yang, Liang Luo, Yankai Lin, Yong Liu
ELM · 03 Feb 2025
Improving Weak-to-Strong Generalization with Reliability-Aware Alignment
Yue Guo, Yi Yang
27 Jun 2024
Theoretical Analysis of Weak-to-Strong Generalization
Hunter Lang, David Sontag, Aravindan Vijayaraghavan
25 May 2024
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision (NeurIPS 2024)
Zhiqing Sun, Longhui Yu, Yikang Shen, Weiyang Liu, Yiming Yang, Sean Welleck, Chuang Gan
14 Mar 2024
LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views
Yuji Roh, Qingyun Liu, Huan Gui, Zhe Yuan, Yujin Tang, ..., Liang Liu, Shuchao Bi, Lichan Hong, Ed H. Chi, Zhe Zhao
07 Feb 2024
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Jianyuan Guo, Hanting Chen, Chengcheng Wang, Kai Han, Chang Xu, Yunhe Wang
VLM · 06 Feb 2024
Improving Weak-to-Strong Generalization with Scalable Oversight and Ensemble Learning
Jitao Sang, Yuhang Wang, Jing Zhang, Yanxu Zhu, Chao Kong, Junhong Ye, Shuyu Wei, Jinlin Xiao
01 Feb 2024
Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
Ming Li, Yong Zhang, Shwai He, Zhitao Li, Hongyu Zhao, Jianzong Wang, Ning Cheng, Wanrong Zhu
01 Feb 2024
Secrets of RLHF in Large Language Models Part II: Reward Modeling
Bing Wang, Rui Zheng, Luyao Chen, Yan Liu, Jiajun Sun, ..., Tao Gui, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yuanyuan Jiang
ALM · 11 Jan 2024
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (ICML 2023)
Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, ..., Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu
ELM · 14 Dec 2023
FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity
Shiyao Cui, Zhenyu Zhang, Yilong Chen, Wenyuan Zhang, Tianyun Liu, Siqi Wang, Tingwen Liu
30 Nov 2023
Successfully Applying Lottery Ticket Hypothesis to Diffusion Model
Chao Jiang, Bo Hui, Bohan Liu, Da Yan
DiffM · 28 Oct 2023
Large Language Model Alignment: A Survey
Shangda Wu, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong
LM&MA · 26 Sep 2023
Certifying LLM Safety against Adversarial Prompting
Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, Himabindu Lakkaraju
AAML · 06 Sep 2023
LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models (SSRN 2023)
Neel Guha, Julian Nyarko, Mark A. Lemley, Christopher Ré, Adam Chilton, ..., Spencer Williams, Sunny G. Gandhi, Tomer Zur, Varun J. Iyer, Zehua Li
AILaw, LRM, ELM · 20 Aug 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
27 Jul 2023
Evaluating Superhuman Models with Consistency Checks
Lukas Fluri, Daniel Paleka, Florian Tramèr
ELM · 16 Jun 2023
A Study on Knowledge Distillation from Weak Teacher for Scaling Up Pre-trained Language Models (ACL 2023)
Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Sung Ju Hwang, Alexander Min
26 May 2023
OpenAssistant Conversations -- Democratizing Large Language Model Alignment (NeurIPS 2023)
Andreas Kopf, Yannic Kilcher, Dimitri von Rutte, Sotiris Anagnostidis, Zhi Rui Tam, ..., Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, A. Mattick
ALM, LM&MA · 14 Apr 2023
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark (ICML 2023)
Alexander Pan, Chan Jun Shern, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks
06 Apr 2023
On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning (ACL 2022)
Omar Shaikh, Hongxin Zhang, William B. Held, Michael S. Bernstein, Diyi Yang
ReLM, LRM · 15 Dec 2022
Training language models to follow instructions with human feedback (NeurIPS 2022)
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM · 04 Mar 2022
SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures
Megan Ung, Jing Xu, Y-Lan Boureau
14 Oct 2021
Denoising Diffusion Implicit Models (ICLR 2020)
Jiaming Song, Chenlin Meng, Stefano Ermon
VLM, DiffM · 06 Oct 2020
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models (Findings of EMNLP 2020)
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith
24 Sep 2020
Constrained Labeling for Weakly Supervised Learning (UAI 2020)
Chidubem Arachie, Bert Huang
15 Sep 2020
Denoising Diffusion Probabilistic Models
Jonathan Ho, Ajay Jain, Pieter Abbeel
DiffM · 19 Jun 2020
Aligning Superhuman AI with Human Behavior: Chess as a Model System (KDD 2020)
Reid McIlroy-Young, S. Sen, Jon M. Kleinberg, Ashton Anderson
GNN · 02 Jun 2020
Deceiving Google's Perspective API Built for Detecting Toxic Comments
Hossein Hosseini, Sreeram Kannan, Baosen Zhang, Radha Poovendran
AAML · 27 Feb 2017