Evaluating Language-Model Agents on Realistic Autonomous Tasks

18 December 2023

Papers citing "Evaluating Language-Model Agents on Realistic Autonomous Tasks"

18 / 68 papers shown

AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy

315

12 Feb 2024

Towards Unified Alignment Between Agents, Humans, and Environment

...

Peng Li

Yang Liu

286

12 Feb 2024

Reducing Selection Bias in Large Language Models

Jonathan E. Eicher

Rafael F. Irgolič

241

29 Jan 2024

Black-Box Access is Insufficient for Rigorous AI Audits

Conference on Fairness, Accountability and Transparency (FAccT), 2024

...

Dylan Hadfield-Menell

AAML

556

128

25 Jan 2024

AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents

Neural Information Processing Systems (NeurIPS), 2024

Yujiu Yang

Lingpeng Kong

227

130

24 Jan 2024

Visibility into AI Agents

Conference on Fairness, Accountability and Transparency (FAccT), 2024

...

774

23 Jan 2024

Tell, don't show: Declarative facts influence how LLMs generalize

Alexander Meinke

Owain Evans

223

12 Dec 2023

Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents

...

Rui Wang

363

20 Nov 2023

Testing Language Model Agents Safely in the Wild

Adam Tauman Kalai

263

17 Nov 2023

Large Language Models can Strategically Deceive their Users when Put Under Pressure

434

09 Nov 2023

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

248

143

31 Oct 2023

Managing extreme AI risks amid rapid progress

...

341

26 Oct 2023

An International Consortium for Evaluations of Societal-Scale Risks from Advanced AI

...

257

22 Oct 2023

Welfare Diplomacy: Benchmarking Language Model Cooperation

346

13 Oct 2023

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

International Conference on Machine Learning (ICML), 2023

268

160

05 Oct 2023

Open-Sourcing Highly Capable Foundation Models: An evaluation of risks, benefits, and alternative methods for pursuing open-source objectives

Social Science Research Network (SSRN), 2023

Elizabeth Seger

...

206

29 Sep 2023

Identifying the Risks of LM Agents with an LM-Emulated Sandbox

International Conference on Learning Representations (ICLR), 2023

Silviu Pitis

Jimmy Ba

Tatsunori Hashimoto

248

188

25 Sep 2023

AI Deception: A Survey of Examples, Risks, and Potential Solutions

303

239

28 Aug 2023