Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2312.11671
Cited By
Evaluating Language-Model Agents on Realistic Autonomous Tasks
18 December 2023
Megan Kinniment
Lucas Jun Koba Sato
Haoxing Du
Brian Goodrich
Max Hasin
Lawrence Chan
Luke Harold Miles
Tao R. Lin
H. Wijk
Joel Burget
Aaron Ho
Elizabeth Barnes
Paul Christiano
ELM
LLMAG
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Evaluating Language-Model Agents on Realistic Autonomous Tasks"
14 / 14 papers shown
Title
What Is AI Safety? What Do We Want It to Be?
Jacqueline Harding
Cameron Domenico Kirk-Giannini
66
0
0
05 May 2025
RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents
Sid Black
Asa Cooper Stickland
Jake Pencharz
Oliver Sourbut
Michael Schmatz
Jay Bailey
Ollie Matthews
Ben Millwood
Alex Remedios
Alan Cooney
ELM
134
0
0
21 Apr 2025
Measuring AI agent autonomy: Towards a scalable approach with code inspection
Peter Cihon
Merlin Stein
Gagan Bansal
Sam Manning
Kevin Xu
36
0
0
21 Feb 2025
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij
Felix Hofstätter
Ollie Jaffe
Samuel F. Brown
Francis Rhys Ward
ELM
37
23
0
11 Jun 2024
Stress-Testing Capability Elicitation With Password-Locked Models
Ryan Greenblatt
Fabien Roger
Dmitrii Krasheninnikov
David M. Krueger
30
13
0
29 May 2024
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Christopher Rawles
Sarah Clinckemaillie
Yifan Chang
Jonathan Waltz
Gabrielle Lau
...
Daniel Toyama
Robert Berry
Divya Tyamagundlu
Timothy Lillicrap
Oriana Riva
LLMAG
62
44
0
23 May 2024
Securing the Future of GenAI: Policy and Technology
Mihai Christodorescu
Craven
S. Feizi
Neil Zhenqiang Gong
Mia Hoffmann
...
Jessica Newman
Emelia Probasco
Yanjun Qi
Khawaja Shams
Turek
SILM
38
3
0
21 May 2024
Towards Unified Alignment Between Agents, Humans, and Environment
Zonghan Yang
An Liu
Zijun Liu
Kai Liu
Fangzhou Xiong
...
Zhenhe Zhang
Fuwen Luo
Zhicheng Guo
Peng Li
Yang Liu
24
4
0
12 Feb 2024
Black-Box Access is Insufficient for Rigorous AI Audits
Stephen Casper
Carson Ezell
Charlotte Siegmann
Noam Kolt
Taylor Lynn Curtis
...
Michael Gerovitch
David Bau
Max Tegmark
David M. Krueger
Dylan Hadfield-Menell
AAML
20
76
0
25 Jan 2024
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents
Chang Ma
Junlei Zhang
Zhihao Zhu
Cheng Yang
Yujiu Yang
Yaohui Jin
Zhenzhong Lan
Lingpeng Kong
Junxian He
ELM
LLMAG
32
54
0
24 Jan 2024
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
Simon Lermen
Charlie Rogers-Smith
Jeffrey Ladish
ALM
20
80
0
31 Oct 2023
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao
Jeffrey Zhao
Dian Yu
Nan Du
Izhak Shafran
Karthik Narasimhan
Yuan Cao
LLMAG
ReLM
LRM
233
2,479
0
06 Oct 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
347
8,457
0
28 Jan 2022
Measuring Coding Challenge Competence With APPS
Dan Hendrycks
Steven Basart
Saurav Kadavath
Mantas Mazeika
Akul Arora
...
Collin Burns
Samir Puranik
Horace He
D. Song
Jacob Steinhardt
ELM
AIMat
ALM
194
624
0
20 May 2021
1