AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability

14 February 2024

Papers citing "AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability"

DEVAL: A Framework for Evaluating and Improving the Derivation Capability of Large Language Models

18 Nov 2025

Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior

22 Mar 2025

Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination

23 Oct 2024

HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Yao Mu

18 Aug 2024

When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives

Xiaoyang Wang

17 Jun 2024

EconLogicQA: A Question-Answering Benchmark for Evaluating Large Language Models in Economic Sequential Reasoning

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Yinzhu Quan

Zefang Liu

13 May 2024

Large AI Model-Based Semantic Communications

IEEE Wireless Communications (IEEE Wireless Commun.), 2023

07 Jul 2023