Fundamental Limitations of Alignment in Large Language Models
Yotam Wolf, Noam Wies, Oshri Avnery, Yoav Levine, Amnon Shashua
19 April 2023 · arXiv:2304.11082 · [ALM]
Papers citing "Fundamental Limitations of Alignment in Large Language Models" (24 / 24 papers shown)
Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary
Yakai Li, Jiekang Hu, Weiduan Sang, Luping Ma, Jing Xie, Weijuan Zhang, Aimin Yu, Shijie Zhao, Qingjia Huang, Qihang Zhou
[AAML] · cited by 0 · 28 Apr 2025

Robust Concept Erasure Using Task Vectors
Minh Pham, Kelly O. Marshall, Chinmay Hegde, Niv Cohen
cited by 17 · 21 Feb 2025

REFA: Reference Free Alignment for multi-preference optimization
Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, Saravan Rajmohan
cited by 1 · 20 Dec 2024

Evaluating the Prompt Steerability of Large Language Models
Erik Miehling, Michael Desmond, K. Ramamurthy, Elizabeth M. Daly, Pierre L. Dognin, Jesus Rios, Djallel Bouneffouf, Miao Liu
[LLMSV] · cited by 3 · 19 Nov 2024

SPIN: Self-Supervised Prompt INjection
Leon Zhou, Junfeng Yang, Chengzhi Mao
[AAML, SILM] · cited by 0 · 17 Oct 2024

Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang
[PILM, AAML] · cited by 1 · 05 Sep 2024

Exploring the Potential of Large Language Models for Improving Digital Forensic Investigation Efficiency
Akila Wickramasekara, F. Breitinger, Mark Scanlon
cited by 8 · 29 Feb 2024

Evaluating Language Model Agency through Negotiations
Tim R. Davidson, V. Veselovsky, Martin Josifoski, Maxime Peyrard, Antoine Bosselut, Michal Kosinski, Robert West
[LLMAG] · cited by 22 · 09 Jan 2024

FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, Xiaoyun Wang
[MLLM] · cited by 118 · 09 Nov 2023

Can We Rely on AI?
D. Higham
[AAML] · cited by 0 · 29 Aug 2023

Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions
Pouya Pezeshkpour, Estevam R. Hruschka
[LRM] · cited by 126 · 22 Aug 2023

MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots
Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, Yang Liu
[SILM] · cited by 118 · 16 Jul 2023

Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, Jacob Steinhardt
cited by 839 · 05 Jul 2023

Playing repeated games with Large Language Models
Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, Eric Schulz
cited by 122 · 26 May 2023

In-Context Impersonation Reveals Large Language Models' Strengths and Biases
Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, Zeynep Akata
cited by 149 · 24 May 2023

A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation
Xiaowei Huang, Wenjie Ruan, Wei Huang, Gao Jin, Yizhen Dong, ..., Sihao Wu, Peipei Xu, Dengyu Wu, André Freitas, Mustafa A. Mustafa
[ALM] · cited by 82 · 19 May 2023

Generative Agents: Interactive Simulacra of Human Behavior
J. Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein
[LM&Ro, AI4CE] · cited by 1,742 · 07 Apr 2023

Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, J. Gehrke, Eric Horvitz, ..., Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang
[ELM, AI4MH, AI4CE, ALM] · cited by 3,007 · 22 Mar 2023

The Learnability of In-Context Learning
Noam Wies, Yoav Levine, Amnon Shashua
cited by 91 · 14 Mar 2023

On the Provable Advantage of Unsupervised Pretraining
Jiawei Ge, Shange Tang, Jianqing Fan, Chi Jin
[SSL] · cited by 16 · 02 Mar 2023

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, C. Endres, Thorsten Holz, Mario Fritz
[SILM] · cited by 436 · 23 Feb 2023

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
[OSLM, ALM] · cited by 11,953 · 04 Mar 2022

Unsolved Problems in ML Safety
Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt
cited by 273 · 28 Sep 2021

Automatically Exposing Problems with Neural Dialog Models
Dian Yu, Kenji Sagae
cited by 9 · 14 Sep 2021