Evaluating Language-Model Agents on Realistic Autonomous Tasks

18 December 2023
Megan Kinniment
Lucas Jun Koba Sato
Haoxing Du
Brian Goodrich
Max Hasin
Lawrence Chan
Luke Harold Miles
Tao Lin
H. Wijk
Joel Burget
Aaron Ho
Elizabeth Barnes
Paul Christiano
ELM, LLMAG

Papers citing "Evaluating Language-Model Agents on Realistic Autonomous Tasks"

18 / 68 papers shown
AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy
P. Schoenegger
Peter S. Park
Ezra Karger
P. Tetlock
315
29
0
12 Feb 2024
Towards Unified Alignment Between Agents, Humans, and Environment
Zonghan Yang
An Liu
Zijun Liu
Wenbing Huang
Fangzhou Xiong
...
Zhenhe Zhang
Ziyue Wang
Zhicheng Guo
Peng Li
Yang Liu
286
5
0
12 Feb 2024
Reducing Selection Bias in Large Language Models
Jonathan E. Eicher
Rafael F. Irgolič
241
1
0
29 Jan 2024
Black-Box Access is Insufficient for Rigorous AI Audits
Conference on Fairness, Accountability and Transparency (FAccT), 2024
Stephen Casper
Carson Ezell
Charlotte Siegmann
Noam Kolt
Taylor Lynn Curtis
...
Michael Gerovitch
David Bau
Max Tegmark
David M. Krueger
Dylan Hadfield-Menell
AAML
556
128
0
25 Jan 2024
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents
Neural Information Processing Systems (NeurIPS), 2024
Chang Ma
Junlei Zhang
Zhihao Zhu
Cheng Yang
Yujiu Yang
Yaohui Jin
Zhenzhong Lan
Lingpeng Kong
Junxian He
ELM, LLMAG
227
130
0
24 Jan 2024
Visibility into AI Agents
Conference on Fairness, Accountability and Transparency (FAccT), 2024
Alan Chan
Carson Ezell
Max Kaufmann
K. Wei
Lewis Hammond
...
Nitarshan Rajkumar
David M. Krueger
Noam Kolt
Lennart Heim
Markus Anderljung
774
76
0
23 Jan 2024
Tell, don't show: Declarative facts influence how LLMs generalize
Alexander Meinke
Owain Evans
223
9
0
12 Dec 2023
Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
Zhuosheng Zhang
Yao Yao
Aston Zhang
Xiangru Tang
Xinbei Ma
...
Yiming Wang
Mark B. Gerstein
Rui Wang
Gongshen Liu
Hai Zhao
LLMAG, LM&Ro, LRM
363
91
0
20 Nov 2023
Testing Language Model Agents Safely in the Wild
Silen Naihin
David Atkinson
Marc Green
Merwane Hamadi
Craig Swift
Douglas Schonholtz
Adam Tauman Kalai
David Bau
LLMAG
263
36
0
17 Nov 2023
Large Language Models can Strategically Deceive their Users when Put Under Pressure
Jérémy Scheurer
Mikita Balesni
Marius Hobbhahn
LLMAG
434
91
0
09 Nov 2023
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
Simon Lermen
Charlie Rogers-Smith
Jeffrey Ladish
ALM
248
143
0
31 Oct 2023
Managing extreme AI risks amid rapid progress
Yoshua Bengio
Geoffrey Hinton
Andrew Yao
Dawn Song
Pieter Abbeel
...
Juil Sock
Stuart J. Russell
Daniel Kahneman
J. Brauner
Sören Mindermann
341
30
0
26 Oct 2023
An International Consortium for Evaluations of Societal-Scale Risks from Advanced AI
Ross Gruetzemacher
Alan Chan
Kevin Frazier
Christy Manning
Stepán Los
...
Clíodhna Ní Ghuidhir
Mark M. Bailey
Daniel Eth
Toby D. Pilditch
Kyle A. Kilian
257
8
0
22 Oct 2023
Welfare Diplomacy: Benchmarking Language Model Cooperation
Gabriel Mukobi
Hannah Erlebach
Niklas Lauffer
Lewis Hammond
Alan Chan
Jesse Clifton
LM&Ro
346
39
0
13 Oct 2023
MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation
International Conference on Machine Learning (ICML), 2023
Qian Huang
Jian Vora
Abigail Z. Jacobs
J. Leskovec
ELM, LLMAG
268
160
0
05 Oct 2023
Open-Sourcing Highly Capable Foundation Models: An evaluation of risks, benefits, and alternative methods for pursuing open-source objectives
Social Science Research Network (SSRN), 2023
Elizabeth Seger
Noemi Dreksler
Richard Moulange
Emily Dardaman
Jonas Schuett
...
Emma Bluemke
Michael Aird
Patrick Levermore
Julian Hazell
Abhishek Gupta
206
56
0
29 Sep 2023
Identifying the Risks of LM Agents with an LM-Emulated Sandbox
International Conference on Learning Representations (ICLR), 2023
Yangjun Ruan
Honghua Dong
Andrew Wang
Silviu Pitis
Yongchao Zhou
Jimmy Ba
Yann Dubois
Chris J. Maddison
Tatsunori Hashimoto
LLMAG, ELM
248
188
0
25 Sep 2023
AI Deception: A Survey of Examples, Risks, and Potential Solutions
Peter S. Park
Simon Goldstein
Aidan O'Gara
Michael Chen
Dan Hendrycks
303
239
0
28 Aug 2023