ResearchTrend.AI
NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes

22 December 2023
Lizhou Fan
Wenyue Hua
Lingyao Li
Haoyang Ling
Yongfeng Zhang
    LRM

Papers citing "NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes"

40 / 40 papers shown
Title
PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving
Zeyu Zhang
Z. Chen
Zicheng Zhang
Yuze Sun
Yuan Tian
Ziheng Jia
Chunyi Li
Xiaohong Liu
Xiongkuo Min
Guangtao Zhai
MLLM
36
0
0
15 Apr 2025
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization
Weiwei Sun
Shengyu Feng
Shanda Li
Yiming Yang
LLMAG
37
1
0
06 Apr 2025
Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead
Vidhisha Balachandran
Jingya Chen
Lingjiao Chen
Shivam Garg
Neel Joshi
...
John Langford
Besmira Nushi
Vibhav Vineet
Yue Wu
Safoora Yousefi
ReLM
LRM
50
3
0
31 Mar 2025
Reasoning Effort and Problem Complexity: A Scaling Analysis in LLMs
Benjamin Estermann
Roger Wattenhofer
LRM
41
1
0
19 Mar 2025
Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More
Arvid Frydenlund
LRM
48
0
0
13 Mar 2025
ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition
H. A. Alyahya
Haidar Khan
Yazeed Alnumay
M Saiful Bari
B. Yener
LRM
63
1
0
10 Mar 2025
RefuteBench 2.0 -- Agentic Benchmark for Dynamic Evaluation of LLM Responses to Refutation Instruction
Jianhao Yan
Yun Luo
Yue Zhang
LLMAG
50
1
0
25 Feb 2025
Recent Advances in Large Language Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
Simin Chen
Yiming Chen
Zexin Li
Yifan Jiang
Zhongwei Wan
...
Dezhi Ran
Tianle Gu
H. Li
Tao Xie
Baishakhi Ray
41
2
0
23 Feb 2025
InductionBench: LLMs Fail in the Simplest Complexity Class
Wenyue Hua
Tyler Wong
Sun Fei
Liangming Pan
Adam Jardine
William Yang Wang
LRM
70
2
0
20 Feb 2025
Unbiased Evaluation of Large Language Models from a Causal Perspective
Meilin Chen
Jian Tian
Liang Ma
Di Xie
Weijie Chen
Jiang Zhu
ALM
ELM
52
0
0
10 Feb 2025
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
Maohao Shen
Guangtao Zeng
Zhenting Qi
Zhang-Wei Hong
Zhenfang Chen
Wei Lu
G. Wornell
Subhro Das
David D. Cox
Chuang Gan
LLMAG
LRM
94
5
0
04 Feb 2025
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping
Yue Yang
S. Zhang
Wenqi Shao
Kaipeng Zhang
Yi Bin
Yu Wang
Ping Luo
28
3
0
11 Oct 2024
Quantifying Generalization Complexity for Large Language Models
Zhenting Qi
Hongyin Luo
Xuliang Huang
Zhuokai Zhao
Yibo Jiang
Xiangjun Fan
Himabindu Lakkaraju
James Glass
LRM
ELM
26
5
0
02 Oct 2024
Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface
Wenyue Hua
Mengting Wan
Shashank Vadrevu
Ryan Nadel
Yongfeng Zhang
Chi Wang
LLMAG
24
1
0
30 Sep 2024
AIPatient: Simulating Patients with EHRs and LLM Powered Agentic Workflow
Huizi Yu
Jiayan Zhou
Lingyao Li
Shan Chen
Jack Gallifant
...
Themistocles L. Assimes
Xin Ma
Danielle S. Bitterman
Lin Lu
Lizhou Fan
55
4
0
27 Sep 2024
Can Large Language Models Reason? A Characterization via 3-SAT
Rishi Hazra
Gabriele Venturato
Pedro Zuidberg Dos Martires
Luc de Raedt
ELM
ReLM
LRM
25
4
0
13 Aug 2024
Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models
Qingcheng Zeng
Mingyu Jin
Qinkai Yu
Zhenting Wang
Wenyue Hua
...
Felix Juefei Xu
Kaize Ding
Fan Yang
Ruixiang Tang
Yongfeng Zhang
AAML
31
9
0
15 Jul 2024
UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models
Siyuan Wu
Yue Huang
Chujie Gao
Dongping Chen
Qihui Zhang
...
Tianyi Zhou
Xiangliang Zhang
Jianfeng Gao
Chaowei Xiao
Lichao Sun
SyDa
33
22
0
27 Jun 2024
SEC-QA: A Systematic Evaluation Corpus for Financial QA
Viet Dac Lai
Michael Krumdick
Charles Lovering
Varshini Reddy
Craig W. Schmidt
Chris Tanner
43
3
0
20 Jun 2024
Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing
Han Jiang
Xiaoyuan Yi
Zhihua Wei
Shu Wang
Xing Xie
ALM
ELM
50
5
0
20 Jun 2024
Benchmark Data Contamination of Large Language Models: A Survey
Cheng Xu
Shuhao Guan
Derek Greene
Mohand-Tahar Kechadi
ELM
ALM
36
38
0
06 Jun 2024
HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits
Tim Franzmeyer
Aleksandar Shtedritski
Samuel Albanie
Philip H. S. Torr
João F. Henriques
Jakob N. Foerster
22
1
0
05 Jun 2024
Disentangling Logic: The Role of Context in Large Language Model Reasoning Capabilities
Wenyue Hua
Kaijie Zhu
Lingyao Li
Lizhou Fan
Shuhang Lin
Mingyu Jin
Haochen Xue
Zelong Li
Jindong Wang
Yongfeng Zhang
LRM
59
8
0
04 Jun 2024
MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures
Jinjie Ni
Fuzhao Xue
Xiang Yue
Yuntian Deng
Mahir Shah
Kabir Jain
Graham Neubig
Yang You
ELM
30
35
0
03 Jun 2024
Towards Incremental Learning in Large Language Models: A Critical Review
M. Jovanovic
Peter Voss
ELM
CLL
KELM
26
5
0
28 Apr 2024
BattleAgent: Multi-modal Dynamic Emulation on Historical Battles to Complement Historical Analysis
Shuhang Lin
Wenyue Hua
Lingyao Li
Che-Jui Chang
Lizhou Fan
Jianchao Ji
Hang Hua
Mingyu Jin
Jiebo Luo
Yongfeng Zhang
LM&Ro
LLMAG
46
9
0
23 Apr 2024
A Survey of Large Language Models on Generative Graph Analytics: Query, Learning, and Applications
Wenbo Shang
Xin Huang
27
9
0
23 Apr 2024
CausalBench: A Comprehensive Benchmark for Causal Learning Capability of Large Language Models
Yu Zhou
Xingyu Wu
Beichen Huang
Jibin Wu
Liang Feng
Kay Chen Tan
ELM
CML
40
2
0
09 Apr 2024
Norm Violation Detection in Multi-Agent Systems using Large Language Models: A Pilot Study
Shawn He
Surangika Ranathunga
Stephen Cranefield
B. Savarimuthu
LLMAG
19
1
0
25 Mar 2024
Large Language Models in Biomedical and Health Informatics: A Bibliometric Review
Huizi Yu
Lizhou Fan
Lingyao Li
Jiayan Zhou
Zihui Ma
...
Sijia He
Mingyu Jin
Yongfeng Zhang
Ashvin Gandhi
Xin Ma
LM&MA
32
11
0
24 Mar 2024
NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models
Lizhou Fan
Wenyue Hua
Xiang Li
Kaijie Zhu
Mingyu Jin
...
Haoyang Ling
Jinkui Chi
Jindong Wang
Xin Ma
Yongfeng Zhang
LRM
35
14
0
04 Mar 2024
Dynamic Evaluation of Large Language Models by Meta Probing Agents
Kaijie Zhu
Jindong Wang
Qinlin Zhao
Ruochen Xu
Xing Xie
40
30
0
21 Feb 2024
Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation
Siyuan Wang
Zhuohan Long
Zhihao Fan
Zhongyu Wei
Xuanjing Huang
LLMAG
10
26
0
18 Feb 2024
On Catastrophic Inheritance of Large Foundation Models
Hao Chen
Bhiksha Raj
Xing Xie
Jindong Wang
AI4CE
48
12
0
02 Feb 2024
War and Peace (WarAgent): Large Language Model-based Multi-Agent Simulation of World Wars
Wenyue Hua
Lizhou Fan
Lingyao Li
Kai Mei
Jianchao Ji
Yingqiang Ge
Libby Hemphill
Yongfeng Zhang
LM&Ro
LLMAG
125
87
0
28 Nov 2023
GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems
Kaya Stechly
Matthew Marquez
Subbarao Kambhampati
LRM
155
84
0
19 Oct 2023
A Bibliometric Review of Large Language Models Research from 2017 to 2023
Lizhou Fan
Lingyao Li
Zihui Ma
Sanggyu Lee
Huizi Yu
Libby Hemphill
26
146
0
03 Apr 2023
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao
Jeffrey Zhao
Dian Yu
Nan Du
Izhak Shafran
Karthik Narasimhan
Yuan Cao
LLMAG
ReLM
LRM
233
2,413
0
06 Oct 2022
Large Language Models are Zero-Shot Reasoners
Takeshi Kojima
S. Gu
Machel Reid
Yutaka Matsuo
Yusuke Iwasawa
ReLM
LRM
291
4,048
0
24 May 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
315
8,261
0
28 Jan 2022