Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2407.15286
Cited By
Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis
21 July 2024
Guang-Da Liu
Haitao Mao
Jiliang Tang
K. Johnson
LRM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis"
7 / 7 papers shown
Title
Smaller Large Language Models Can Do Moral Self-Correction
Guangliang Liu
Zhiyu Xue
Rongrong Wang
K. Johnson
Kristen Marie Johnson
LRM
23
0
0
30 Oct 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee
Xiaoyan Bai
Itamar Pres
Martin Wattenberg
Jonathan K. Kummerfeld
Rada Mihalcea
64
95
0
03 Jan 2024
GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems
Kaya Stechly
Matthew Marquez
Subbarao Kambhampati
LRM
155
84
0
19 Oct 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Mor Geva
Jasmijn Bastings
Katja Filippova
Amir Globerson
KELM
189
261
0
28 Apr 2023
BBQ: A Hand-Built Bias Benchmark for Question Answering
Alicia Parrish
Angelica Chen
Nikita Nangia
Vishakh Padmakumar
Jason Phang
Jana Thompson
Phu Mon Htut
Sam Bowman
212
367
0
15 Oct 2021
Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
Timo Schick
Sahana Udupa
Hinrich Schütze
257
374
0
28 Feb 2021
Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov
221
402
0
24 Feb 2021
1