arXiv:2406.02619
Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits
3 June 2024
Andis Draguns, Andrew Gritsevskiy, S. Motwani, Charlie Rogers-Smith, Jeffrey Ladish, Christian Schroeder de Witt
Papers citing "Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits"
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart J. Russell, Max Tegmark, ..., Clark Barrett, Ding Zhao, Zhi-Xuan Tan, Jeannette Wing, Joshua Tenenbaum
10 May 2024
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
01 Nov 2022