ResearchTrend.AI

Constitutional AI: Harmlessness from AI Feedback (arXiv:2212.08073)

15 December 2022
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
Jackson Kernion
Andy Jones
A. Chen
Anna Goldie
Azalia Mirhoseini
C. McKinnon
Carol Chen
Catherine Olsson
C. Olah
Danny Hernandez
Dawn Drain
Deep Ganguli
Dustin Li
Eli Tran-Johnson
E. Perez
Jamie Kerr
J. Mueller
Jeff Ladish
J. Landau
Kamal Ndousse
Kamilė Lukošiūtė
Liane Lovitt
Michael Sellitto
Nelson Elhage
Nicholas Schiefer
Noemí Mercado
Nova DasSarma
R. Lasenby
Robin Larson
Sam Ringer
Scott R. Johnston
Shauna Kravec
S. E. Showk
Stanislav Fort
Tamera Lanham
Timothy Telleen-Lawton
Tom Conerly
T. Henighan
Tristan Hume
Sam Bowman
Zac Hatfield-Dodds
Benjamin Mann
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
    SyDa, MoMe
ArXiv (abs) · PDF · HTML · HuggingFace (3 upvotes)

Papers citing "Constitutional AI: Harmlessness from AI Feedback"

50 / 1,527 papers shown
Environment Scaling for Interactive Agentic Experience Collection: A Survey
Y. Huang
S. Li
Minghao Liu
Wei Liu
Shijue Huang
Zhiyuan Fan
Hou Pong Chan
Yi R. Fung
273
0
0
24 Dec 2025
Enhancing Uncertainty Estimation in LLMs with Expectation of Aggregated Internal Belief
Zeguan Xiao
Diyang Dou
Boya Xiong
Yun-Nung Chen
Guanhua Chen
121
0
0
24 Dec 2025
Dynamic Alignment for Collective Agency: Toward a Scalable Self-Improving Framework for Open-Ended LLM Alignment
Panatchakorn Anantaprayoon
Nataliia Babina
Jad Tarifi
Nima Asgharbeygi
118
1
0
05 Dec 2025
From Symptoms to Systems: An Expert-Guided Approach to Understanding Risks of Generative AI for Eating Disorders
Amy Winecoff
Kevin Klyman
143
1
0
04 Dec 2025
Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment
Huy Nghiem
Swetasudha Panda
Devashish Khatwani
Huy Nguyen
Krishnaram Kenthapadi
Hal Daumé III
LM&MA
166
0
0
03 Dec 2025
Self-Improving AI Agents through Self-Play
Przemyslaw Chojecki
173
2
0
02 Dec 2025
WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate
A. Cherian
River Doyle
Eyal Ben-Dov
Suhas Lohit
Kuan-Chuan Peng
LLMAG, MoE
140
0
0
02 Dec 2025
Invasive Context Engineering to Control Large Language Models
Thomas Rivasseau
105
0
0
02 Dec 2025
Many-to-One Adversarial Consensus: Exposing Multi-Agent Collusion Risks in AI-Based Healthcare
Adeela Bashir
Anh Han
Zia Ush Shamszaman
AAML
109
0
0
01 Dec 2025
Do Large Language Models Walk Their Talk? Measuring the Gap Between Implicit Associations, Self-Report, and Behavioral Altruism
Sandro Andric
54
0
0
01 Dec 2025
Exploring Human Perceptions of AI Responses: Insights from a Mixed-Methods Study on Risk Mitigation in Generative Models
Heloisa Caroline de Souza Pereira Candello
Muneeza Azmat
Uma Sushmitha Gunturi
Raya Horesh
Rogerio Abreu de Paula
Heloisa Pimentel
Marcelo Carpinette Grave
Aminat Adebiyi
Tiago Machado
M. Macedo
93
0
0
01 Dec 2025
DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks
Hyunjun Kim
Sooyoung Ryu
96
0
0
01 Dec 2025
ART: Adaptive Response Tuning Framework -- A Multi-Agent Tournament-Based Approach to LLM Response Optimization
Omer Jauhar Khan
LLMAG
222
0
0
29 Nov 2025
Password-Activated Shutdown Protocols for Misaligned Frontier Agents
Kai Williams
Rohan Subramani
Francis Rhys Ward
113
0
0
29 Nov 2025
Towards Continuous Intelligence Growth: Self-Training, Continual Learning, and Dual-Scale Memory in SuperIntelliAgent
Jianzhe Lin
Zeyu Pan
Yun Zhu
Ruiqi Song
Jining Yang
LRM
164
0
0
28 Nov 2025
Are LLMs Good Safety Agents or a Propaganda Engine?
Neemesh Yadav
Francesco Ortu
Jiarui Liu
Joeun Yook
Bernhard Schölkopf
Rada Mihalcea
Alberto Cazzaniga
Zhijing Jin
116
0
0
28 Nov 2025
Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks
Richard J. Young
ELM
178
0
0
27 Nov 2025
Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning
Jingchu Gai
Guanning Zeng
Huaqing Zhang
Aditi Raghunathan
176
5
0
25 Nov 2025
DRAFT-RL: Multi-Agent Chain-of-Draft Reasoning for Reinforcement Learning-Enhanced LLMs
Yuanhao Li
Mingshan Liu
Hongbo Wang
Yiding Zhang
Yifei Ma
Wei Tan
AI4TS, KELM, LRM, AI4CE
438
0
0
25 Nov 2025
Large Language Models' Complicit Responses to Illicit Instructions across Socio-Legal Contexts
Xing Wang
Huiyuan Xie
Y. Wang
Chaojun Xiao
Huimin Chen
Holli Sargeant
Felix Steffek
Jie Shao
Zhiyuan Liu
Maosong Sun
AILaw, ELM
395
0
0
25 Nov 2025
RoguePrompt: Dual-Layer Ciphering for Self-Reconstruction to Circumvent LLM Moderation
Benyamin Tafreshian
85
0
0
24 Nov 2025
No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases
Shireen Chand
Faith Baca
Emilio Ferrara
151
2
0
23 Nov 2025
Foundations of Artificial Intelligence Frameworks: Notion and Limits of AGI
Khanh Gia Bui
NAI, AI4CE
412
0
0
23 Nov 2025
Alignment Faking - the Train -> Deploy Asymmetry: Through a Game-Theoretic Lens with Bayesian-Stackelberg Equilibria
Kartik Garg
Shourya Mishra
Kartikeya Sinha
Ojaswi Pratap Singh
Ayush Chopra
...
Ammar Sheikh
Raghav Maheshwari
Aman Chadha
Vinija Jain
Amitava Das
OffRL
177
1
0
22 Nov 2025
Curvature-Aware Safety Restoration In LLMs Fine-Tuning
Thong Bach
Thanh Nguyen-Tang
D. Nguyen
T. Hoang Ngan Le
Truyen Tran
MoMe
170
1
0
22 Nov 2025
Evaluating Adversarial Vulnerabilities in Modern Large Language Models
Tom Perel
AAML, SILM, ELM
329
0
0
21 Nov 2025
Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
Piercosma Bisconti
Matteo Prandi
Federico Pierucci
Francesco Giarrusso
Marcantonio Bracale
Marcello Galisai
Vincenzo Suriani
Olga E. Sorokoletova
Federico Sartore
Daniele Nardi
AAML
944
10
0
19 Nov 2025
Efficiency Will Not Lead to Sustainable Reasoning AI
Philipp Wiesner
Daniel W. O'Neill
Francesca Larosa
O. Kao
LRM
242
2
0
19 Nov 2025
The Specification Trap: Why Content-Based AI Value Alignment Cannot Produce Robust Alignment
Austin Spizzirri
59
0
0
19 Nov 2025
GEM: Generative Entropy-Guided Preference Modeling for Few-shot Alignment of LLMs
Yiyang Zhao
Huiyu Bai
Xuejiao Zhao
163
0
0
17 Nov 2025
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
Yunhao Chen
Xin Wang
Juncheng Li
Yixu Wang
Jie Li
Yan Teng
Yingchun Wang
Xingjun Ma
AAML
333
1
0
16 Nov 2025
Rethinking Deep Alignment Through The Lens Of Incomplete Learning
Thong Bach
D. Nguyen
T. Le
T. Tran
150
0
0
15 Nov 2025
Differences in the Moral Foundations of Large Language Models
Peter Kirgis
97
2
0
14 Nov 2025
SERL: Self-Examining Reinforcement Learning on Open-Domain
Weixuan Ou
Yanzhao Zheng
Shuoshuo Sun
Wei Zhang
B. Dong
Hangcheng Zhu
Ruohui Huang
Gang Yu
Pengwei Yan
Yifan Qiao
LRM
311
1
0
11 Nov 2025
Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
Jon Saad-Falcon
A. Narayan
Hakki Orhun Akengin
J. Wes Griffin
Herumb Shandilya
...
Shang Zhu
Ben Athiwaratkun
John Hennessy
Azalia Mirhoseini
Christopher Ré
302
6
0
11 Nov 2025
Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison
Yoonho Lee
Joseph Boen
Chelsea Finn
255
15
0
11 Nov 2025
A Self-Improving Architecture for Dynamic Safety in Large Language Models
Tyler Slater
KELM, ELM
97
0
0
10 Nov 2025
EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers
Yilin Jiang
Mingzi Zhang
Xuanyu Yin
Sheng Jin
Suyu Lu
Zuocan Ying
Zengyi Yu
Xiangjie Kong
ELM
213
0
0
10 Nov 2025
Large Language Models Develop Novel Social Biases Through Adaptive Exploration
Addison J. Wu
Ryan Liu
Xuechunzi Bai
Thomas Griffiths
244
0
0
08 Nov 2025
Catching Contamination Before Generation: Spectral Kill Switches for Agents
Valentin Noël
141
0
0
08 Nov 2025
Who Gets Heard? Rethinking Fairness in AI for Music Systems
Atharva Mehta
Shivam Chauhan
Megha Sharma
Gus Xia
Kaustuv Kanti Ganguli
Nishanth Chandran
Zeerak Talat
Monojit Choudhury
131
0
0
08 Nov 2025
Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies
Prasoon Varshney
Makesh Narsimhan Sreedhar
Liwei Jiang
Traian Rebedea
Christopher Parisien
167
0
0
07 Nov 2025
Quantifying the Climate Risk of Generative AI: Region-Aware Carbon Accounting with G-TRACE and the AI Sustainability Pyramid
Zahida Kausar
Seemab Latif
Raja Khurrum Shahzad
Mehwish Fatima
140
0
0
06 Nov 2025
GRAD: Graph-Retrieved Adaptive Decoding for Hallucination Mitigation
Manh Trong Nguyen
Sunil R. Gupta
Dai Do
Hung Le
192
1
0
05 Nov 2025
Control Barrier Function for Aligning Large Language Models
Yuya Miyaoka
Masaki Inoue
260
2
0
05 Nov 2025
Systematizing LLM Persona Design: A Four-Quadrant Technical Taxonomy for AI Companion Applications
Esther Sun
Zichu Wu
210
0
0
04 Nov 2025
An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks
Xu Liu
Yan Chen
Kan Ling
Yichi Zhu
Hengrun Zhang
Guisheng Fan
Huiqun Yu
AAML
167
2
0
04 Nov 2025
Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI
Sharan Maiya
Henning Bartsch
Nathan Lambert
Evan Hubinger
147
6
0
03 Nov 2025
Plan-and-Write: Structure-Guided Length Control for LLMs without Model Retraining
Adewale Akinfaderin
Shreyas Subramanian
Akarsha Sehwag
OffRL
84
2
0
03 Nov 2025
Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning
Wenjin Liu
Haoran Luo
X. Lin
Haoming Liu
Tiesunlong Shen
Jiapu Wang
Rui Mao
Erik Cambria
LLMAG, OffRL, LRM
584
4
0
02 Nov 2025
Page 1 of 31