DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks

29 September 2023
Kaijie Zhu
Jiaao Chen
Jindong Wang
Neil Zhenqiang Gong
Diyi Yang
Xing Xie
    ELM
    LRM
arXiv 2309.17167 · PDF · HTML

Papers citing "DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks"

42 / 42 papers shown
MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks
Jaime Raldua Veuthey
Zainab Ali Majid
Suhas Hariharan
Jacob Haimes
ELM
21
0
0
18 Apr 2025
THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models
Xiao Pu
Michael Stephen Saxon
Wenyue Hua
William Yang Wang
LRM
24
0
0
17 Apr 2025
Large language models could be rote learners
Yuyang Xu
Renjun Hu
Haochao Ying
J. Wu
Xing Shi
Wei Lin
ELM
43
0
0
11 Apr 2025
DeduCE: Deductive Consistency as a Framework to Evaluate LLM Reasoning
Atharva Pandey
Kshitij Dubey
Rahul Sharma
Amit Sharma
ReLM
ELM
LRM
47
0
0
09 Apr 2025
Generative Evaluation of Complex Reasoning in Large Language Models
Haowei Lin
X. Wang
Ruilin Yan
Baizhou Huang
Haotian Ye
Jianhua Zhu
Zihao Wang
James Y. Zou
Jianzhu Ma
Yitao Liang
ReLM
ELM
LRM
76
0
0
03 Apr 2025
Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base
Linxin Song
Xuwei Ding
Jieyu Zhang
Taiwei Shi
Ryotaro Shimizu
Rahul Gupta
Y. Liu
Jian Kang
Jieyu Zhao
KELM
54
0
0
30 Mar 2025
The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination
Yifan Sun
Han Wang
Dongbai Li
Gang Wang
Huan Zhang
AAML
43
0
0
20 Mar 2025
Classification of User Reports for Detection of Faulty Computer Components using NLP Models: A Case Study
Maria de Lourdes M. Silva
André L. C. Mendonça
Eduardo R. D. Neto
Iago C. Chaves
Felipe T. Brito
V. A. E. Farias
Javam C. Machado
35
0
0
20 Mar 2025
Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More
Arvid Frydenlund
LRM
44
0
0
13 Mar 2025
Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination
Simin Chen
Pranav Pusarla
Baishakhi Ray
61
0
0
06 Mar 2025
Dynamic-KGQA: A Scalable Framework for Generating Adaptive Question Answering Datasets
Preetam Prabhu Srikar Dammu
Himanshu Naidu
Chirag Shah
42
0
0
06 Mar 2025
CLDyB: Towards Dynamic Benchmarking for Continual Learning with Pre-trained Models
Shengzhuang Chen
Yikai Liao
Xiaoxiao Sun
Kede Ma
Ying Wei
55
0
0
06 Mar 2025
Framing the Game: How Context Shapes LLM Decision-Making
Isaac Robinson
John Burden
38
0
0
05 Mar 2025
Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training
Yihang Yao
Zhepeng Cen
Miao Li
William Jongwon Han
Yuyou Zhang
Emerson Liu
Zuxin Liu
Chuang Gan
Ding Zhao
ReLM
LRM
67
0
0
25 Feb 2025
Recent Advances in Large Language Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
Simin Chen
Yiming Chen
Zexin Li
Yifan Jiang
Zhongwei Wan
...
Dezhi Ran
Tianle Gu
H. Li
Tao Xie
Baishakhi Ray
30
2
0
23 Feb 2025
InductionBench: LLMs Fail in the Simplest Complexity Class
Wenyue Hua
Tyler Wong
Sun Fei
Liangming Pan
Adam Jardine
William Yang Wang
LRM
48
2
0
20 Feb 2025
SegSub: Evaluating Robustness to Knowledge Conflicts and Hallucinations in Vision-Language Models
Peter Carragher
Nikitha Rao
Abhinand Jha
R Raghav
Kathleen M. Carley
VLM
47
0
0
19 Feb 2025
Stress Testing Generalization: How Minor Modifications Undermine Large Language Model Performance
Guangxiang Zhao
Saier Hu
Xiaoqi Jian
Jinzhu Wu
Yuhan Wu
Change Jia
Lin Sun
Xiangzheng Zhang
66
0
0
18 Feb 2025
LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient
Peiwen Yuan
Shaoxiong Feng
Yiwei Li
X. U. Wang
Y. Zhang
Jiayi Shi
Chuyi Tan
Boyuan Pan
Yao Hu
Kan Li
59
2
0
02 Feb 2025
MMGenBench: Fully Automatically Evaluating LMMs from the Text-to-Image Generation Perspective
Hailang Huang
Yong Wang
Zixuan Huang
Huaqiu Li
Tongwen Huang
Xiangxiang Chu
Richong Zhang
MLLM
LM&MA
EGVM
83
0
0
21 Nov 2024
BENCHAGENTS: Automated Benchmark Creation with Agent Interaction
Natasha Butt
Varun Chandrasekaran
Neel Joshi
Besmira Nushi
Vidhisha Balachandran
23
0
0
29 Oct 2024
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?
Han Bao
Yue Huang
Yanbo Wang
Jiayi Ye
Xiangqi Wang
Xiuying Chen
Mohamed Elhoseiny
Xiangliang Zhang
42
7
0
28 Oct 2024
Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs
Wanying Wang
Zeyu Ma
Pengfei Liu
Mingang Chen
LLMAG
43
1
0
15 Oct 2024
One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks
Fangru Lin
Shaoguang Mao
Emanuele La Malfa
Valentin Hofmann
Adrian de Wynter
Jing Yao
Si-Qing Chen
Michael Wooldridge
Furu Wei
38
2
0
14 Oct 2024
Clean Evaluations on Contaminated Visual Language Models
Hongyuan Lu
Shujie Miao
Wai Lam
MLLM
28
0
0
09 Oct 2024
TypedThinker: Diversify Large Language Model Reasoning with Typed Thinking
Danqing Wang
Jianxin Ma
Fei Fang
Lei Li
LLMAG
LRM
45
0
0
02 Oct 2024
Reliable and diverse evaluation of LLM medical knowledge mastery
Yuxuan Zhou
Xien Liu
Chen Ning
Xiao Zhang
Ji Wu
MedIm
24
0
0
22 Sep 2024
Co-occurrence is not Factual Association in Language Models
Xiao Zhang
Miao Li
Ji Wu
KELM
51
2
0
21 Sep 2024
Benchmarking Language Model Creativity: A Case Study on Code Generation
Yining Lu
Dixuan Wang
Tianjian Li
Dongwei Jiang
Daniel Khashabi
Meng Jiang
LRM
49
10
0
12 Jul 2024
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist
Zihao Zhou
Shudong Liu
Maizhen Ning
Wei Liu
Jindong Wang
Derek F. Wong
Xiaowei Huang
Qiufeng Wang
Kaizhu Huang
ELM
LRM
47
2
0
11 Jul 2024
CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
Tianqi Xu
Linyao Chen
Dai-Jie Wu
Yanjun Chen
Zecheng Zhang
...
Shilong Liu
Bochen Qian
Philip H. S. Torr
Bernard Ghanem
G. Li
30
14
0
01 Jul 2024
FRoG: Evaluating Fuzzy Reasoning of Generalized Quantifiers in Large Language Models
Yiyuan Li
Shichao Sun
Pengfei Liu
LRM
41
0
0
01 Jul 2024
MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula
Shubhra Mishra
Gabriel Poesia
Belinda Mo
Noah D. Goodman
26
3
0
01 Jul 2024
DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning
Shangqing Tu
Kejian Zhu
Yushi Bai
Zijun Yao
Lei Hou
Juanzi Li
31
4
0
06 Jun 2024
Benchmarking Benchmark Leakage in Large Language Models
Ruijie Xu
Zengzhi Wang
Run-Ze Fan
Pengfei Liu
53
42
0
29 Apr 2024
The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey
Tula Masterman
Sandi Besen
Mason Sawtell
Alex Chao
LM&Ro
LLMAG
24
42
0
17 Apr 2024
TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning
Xiang Li
Yunshi Lan
Chao Yang
ELM
32
7
0
20 Feb 2024
Don't Make Your LLM an Evaluation Benchmark Cheater
Kun Zhou
Yutao Zhu
Zhipeng Chen
Wentong Chen
Wayne Xin Zhao
Xu Chen
Yankai Lin
Ji-Rong Wen
Jiawei Han
ELM
99
136
0
03 Nov 2023
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck
Varun Chandrasekaran
Ronen Eldan
J. Gehrke
Eric Horvitz
...
Scott M. Lundberg
Harsha Nori
Hamid Palangi
Marco Tulio Ribeiro
Yi Zhang
ELM
AI4MH
AI4CE
ALM
197
2,953
0
22 Mar 2023
GraphWorld: Fake Graphs Bring Real Insights for GNNs
John Palowitch
Anton Tsitsulin
Brandon Mayer
Bryan Perozzi
GNN
177
59
0
28 Feb 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
315
8,261
0
28 Jan 2022
Memorisation versus Generalisation in Pre-trained Language Models
Michael Tänzer
Sebastian Ruder
Marek Rei
78
50
0
16 Apr 2021