AI Agents That Matter

1 July 2024

Papers citing "AI Agents That Matter"

Cost-of-Pass: An Economic Framework for Evaluating Language Models Mehmet Hamza Erol Batu El Mirac Suzgun Mert Yuksekgonul J. Zou ELM 31 0 0 17 Apr 2025
Evaluating the Goal-Directedness of Large Language Models Tom Everitt Cristina Garbacea Alexis Bellot Jonathan G. Richens Henry Papadatos Simeon Campos Rohin Shah ELM LM&MA LM&Ro LRM 68 0 0 16 Apr 2025
MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges? Yunxiang Zhang Muhammad Khalifa Shitanshu Bhushan Grant D Murphy Lajanugen Logeswaran Jaekyeom Kim Moontae Lee Honglak Lee Lu Wang LLMAG ELM 62 0 0 13 Apr 2025
Attention-Aware Multi-View Pedestrian Tracking Reef Alturki Adrian Hilton Jean-Yves Guillemaut 28 0 0 03 Apr 2025
OvercookedV2: Rethinking Overcooked for Zero-Shot Coordination Tobias Gessler Tin Dizdarevic Ani Calinescu Benjamin Ellis Andrei Lupu Jakob Foerster 49 0 0 22 Mar 2025
Multi-Agent Systems Execute Arbitrary Malicious Code Harold Triedman Rishi Jha Vitaly Shmatikov LLMAG AAML 89 2 0 15 Mar 2025
Evaluating the Process Modeling Abilities of Large Language Models -- Preliminary Foundations and Results Peter Fettke Constantin Houy ELM 35 0 0 14 Mar 2025
Agentic AI for Scientific Discovery: A Survey of Progress, Challenges, and Future Directions Mourad Gridach Jay Nanavati Khaldoun Zine El Abidine Lenon Mendes Christina Mack 48 3 0 12 Mar 2025
Measuring AI agent autonomy: Towards a scalable approach with code inspection Peter Cihon Merlin Stein Gagan Bansal Sam Manning Kevin Xu 26 0 0 21 Feb 2025
The AI Agent Index Stephen Casper Luke Bailey Rosco Hunter Carson Ezell Emma Cabalé ... Phillip J. K. Christoffersen A. Pinar Ozisik Rakshit Trivedi Dylan Hadfield-Menell Noam Kolt 66 4 0 03 Feb 2025
Cocoa: Co-Planning and Co-Execution with AI Agents K. J. Kevin Feng Kevin Pu Matt Latzke Tal August Pao Siangliulue Jonathan Bragg Daniel S. Weld Amy X. Zhang Joseph Chee Chang LM&Ro LLMAG 87 4 0 14 Dec 2024
Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications Raphael Shu Nilaksh Das Michelle Yuan Monica Sunkara Yi Zhang LLMAG 69 2 0 06 Dec 2024
Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models Yanxi Chen Xuchen Pan Yaliang Li Bolin Ding Jingren Zhou LRM 70 1 0 29 Nov 2024
Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers Benedikt Stroebl Sayash Kapoor Arvind Narayanan LRM 82 6 0 26 Nov 2024
ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents Ido Levy Ben Wiesel Sami Marreed Alon Oved Avi Yaeli Segev Shlomov LLMAG 29 13 0 09 Oct 2024
RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning Jonas Gehring Kunhao Zheng Jade Copet Vegard Mella Taco Cohen Gabriel Synnaeve LLMAG 27 20 0 02 Oct 2024
SEAL: Suite for Evaluating API-use of LLMs Woojeong Kim Ashish Jagmohan Aditya Vempaty ELM ALM LLMAG 30 0 0 23 Sep 2024
CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark Zachary S. Siegel Sayash Kapoor Nitya Nagdir Benedikt Stroebl Arvind Narayanan 27 8 0 17 Sep 2024
TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering Yiqing Shen Zan Chen Michail Mamalakis Yungeng Liu Tianbin Li Yanzhou Su Junjun He Pietro Liò Yu Guang Wang LLMAG 30 8 0 27 Aug 2024
Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems Tamer Abuelsaad Deepak Akkil Prasenjit Dey Ashish Jagmohan Aditya Vempaty Ravi Kokku 39 23 0 17 Jul 2024
Questionable practices in machine learning Gavin Leech Juan J. Vazquez Misha Yagudin Niclas Kupper Laurence Aitchison 42 2 0 17 Jul 2024
From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models Sean Welleck Amanda Bertsch Matthew Finlayson Hailey Schoelkopf Alex Xie Graham Neubig Ilia Kulikov Zaid Harchaoui 33 45 0 24 Jun 2024
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering John Yang Carlos E. Jimenez Alexander Wettig K. Lieret Shunyu Yao Karthik Narasimhan Ofir Press LLMAG 99 188 0 06 May 2024
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? Alexandre Drouin Maxime Gasse Massimo Caccia I. Laradji Manuel Del Verme ... Megh Thakkar Quentin Cappart David Vazquez Nicolas Chapados Alexandre Lacoste LLMAG 48 51 0 12 Mar 2024
More Agents Is All You Need Junyou Li Qin Zhang Yangbin Yu Qiang Fu Deheng Ye LLMAG 133 57 0 03 Feb 2024
LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks Subbarao Kambhampati Karthik Valmeekam L. Guan Mudit Verma Kaya Stechly Siddhant Bhambri Lucas Saldyt Anil Murthy LRM 78 107 0 02 Feb 2024
MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback Xingyao Wang Zihan Wang Jiateng Liu Yangyi Chen Lifan Yuan Hao Peng Heng Ji LRM 125 137 0 19 Sep 2023