ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2311.05915
  4. Cited By
Fake Alignment: Are LLMs Really Aligned Well?

Fake Alignment: Are LLMs Really Aligned Well?

10 November 2023
Yixu Wang
Yan Teng
Kexin Huang
Chengqi Lyu
Songyang Zhang
Wenwei Zhang
Xingjun Ma
Yu-Gang Jiang
Yu Qiao
Yingchun Wang
ArXivPDFHTML

Papers citing "Fake Alignment: Are LLMs Really Aligned Well?"

13 / 13 papers shown
Title
How to Detect and Defeat Molecular Mirage: A Metric-Driven Benchmark for Hallucination in LLM-based Molecular Comprehension
How to Detect and Defeat Molecular Mirage: A Metric-Driven Benchmark for Hallucination in LLM-based Molecular Comprehension
Hao Li
Liuzhenghao Lv
He Cao
Zijing Liu
Zhiyuan Yan
Yu Wang
Yonghong Tian
Y. Li
Li Yuan
27
0
0
10 Apr 2025
Bypassing Safety Guardrails in LLMs Using Humor
Bypassing Safety Guardrails in LLMs Using Humor
Pedro Cisneros-Velarde
29
0
0
09 Apr 2025
Large Language Models Often Say One Thing and Do Another
Ruoxi Xu
Hongyu Lin
Xianpei Han
Jia Zheng
Weixiang Zhou
Le Sun
Yingfei Sun
39
1
0
10 Mar 2025
Shifting Perspectives: Steering Vector Ensembles for Robust Bias Mitigation in LLMs
Zara Siddique
Irtaza Khalid
Liam D. Turner
Luis Espinosa-Anke
LLMSV
56
0
0
07 Mar 2025
LongSafety: Enhance Safety for Long-Context LLMs
LongSafety: Enhance Safety for Long-Context LLMs
Mianqiu Huang
Xiaoran Liu
Shaojun Zhou
Mozhi Zhang
Chenkun Tan
...
Zhikai Lei
Linlin Li
Q. Liu
Yaqian Zhou
Xipeng Qiu
ELM
ALM
32
0
0
11 Nov 2024
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking
  System for Evaluating Chinese Medical Large Language Models
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models
Mianxin Liu
Jinru Ding
Jie Xu
Weiguo Hu
Xiaoyang Li
...
Haofen Wang
Tong Ruan
Xuanjing Huang
Xin Sun
Shaoting Zhang
ELM
AI4MH
LM&MA
19
9
0
24 Jun 2024
BeHonest: Benchmarking Honesty in Large Language Models
BeHonest: Benchmarking Honesty in Large Language Models
Steffi Chern
Zhulin Hu
Yuqing Yang
Ethan Chern
Yuan Guo
Jiahe Jin
Binjie Wang
Pengfei Liu
HILM
ALM
76
3
0
19 Jun 2024
Evaluating the External and Parametric Knowledge Fusion of Large
  Language Models
Evaluating the External and Parametric Knowledge Fusion of Large Language Models
Hao Zhang
Yuyang Zhang
Xiaoguang Li
Wenxuan Shi
Haonan Xu
...
Yasheng Wang
Lifeng Shang
Qun Liu
Yong-jin Liu
Ruiming Tang
KELM
30
4
0
29 May 2024
Red-Teaming for Generative AI: Silver Bullet or Security Theater?
Red-Teaming for Generative AI: Silver Bullet or Security Theater?
Michael Feffer
Anusha Sinha
Wesley Hanwen Deng
Zachary Chase Lipton
Hoda Heidari
AAML
25
66
0
29 Jan 2024
Don't Make Your LLM an Evaluation Benchmark Cheater
Don't Make Your LLM an Evaluation Benchmark Cheater
Kun Zhou
Yutao Zhu
Zhipeng Chen
Wentong Chen
Wayne Xin Zhao
Xu Chen
Yankai Lin
Ji-Rong Wen
Jiawei Han
ELM
105
136
0
03 Nov 2023
Denevil: Towards Deciphering and Navigating the Ethical Values of Large
  Language Models via Instruction Learning
Denevil: Towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning
Shitong Duan
Xiaoyuan Yi
Peng Zhang
T. Lu
Xing Xie
Ning Gu
11
9
0
17 Oct 2023
Training language models to follow instructions with human feedback
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
Multitask Prompted Training Enables Zero-Shot Task Generalization
Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh
Albert Webson
Colin Raffel
Stephen H. Bach
Lintang Sutawika
...
T. Bers
Stella Biderman
Leo Gao
Thomas Wolf
Alexander M. Rush
LRM
203
1,651
0
15 Oct 2021
1