Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey

2 April 2024
Philipp Mondorf
Barbara Plank
ELM, LRM, LM&MA

Papers citing "Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey"

33 / 33 papers shown
Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning
Zhonghao He
Tianyi Qiu
Hirokazu Shirado
Maarten Sap
LRM
72
0
0
02 Dec 2025
"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios
Zhen Sun
Zongmin Zhang
Deqi Liang
Han Sun
Yule Liu
...
Xiangshan Gao
Yilong Yang
Shuai Liu
Yutao Yue
Xinlei He
AAML
120
1
0
20 Nov 2025
On the Notion that Language Models Reason
Bertram Højer
ReLM, LRM
167
0
0
14 Nov 2025
How Well Do LLMs Understand Drug Mechanisms? A Knowledge + Reasoning Evaluation Dataset
Sunil Mohan
Theofanis Karaletsos
92
0
0
09 Nov 2025
Hybrid Models for Natural Language Reasoning: The Case of Syllogistic Logic
Manuel Vargas Guzmán
Jakub Szymanik
Maciej Malicki
NAI, LRM, ELM
54
0
0
10 Oct 2025
Bridging Reasoning to Learning: Unmasking Illusions using Complexity Out of Distribution Generalization
Mohammad Mahdi Samiei Paqaleh
Arash Marioriyad
Arman Tahmasebi-Zadeh
Mohamadreza Fereydooni
Mahdi Ghaznavai
Mahdieh Soleymani Baghshah
116
0
0
06 Oct 2025
LLMs and their Limited Theory of Mind: Evaluating Mental State Annotations in Situated Dialogue
Katharine Kowalyshyn
Matthias Scheutz
88
0
0
02 Sep 2025
Beyond Ethical Alignment: Evaluating LLMs as Artificial Moral Assistants
Alessio Galatolo
Luca Alberto Rappuoli
Katie Winkle
Meriem Beloucif
ELM
122
1
0
18 Aug 2025
HALO: Human Preference Aligned Offline Reward Learning for Robot Navigation
Gershom Seneviratne
Jianyu An
Sahire Ellahy
K. Weerakoon
Mohamed Bashir Elnoor
Jonathan Deepak Kannan
Amogha Thalihalla Sunil
Dinesh Manocha
OffRL
163
1
0
03 Aug 2025
SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models
Wonjun Jeong
Dongseok Kim
Taegkeun Whangbo
199
1
0
24 Jul 2025
Mechanistic Indicators of Understanding in Large Language Models
Pierre Beckmann
Matthieu Queloz
207
1
0
07 Jul 2025
Large Language Models Don't Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective
Anselm R. Strohmaier
Wim Van Dooren
Kathrin Seßler
Brian Greer
Lieven Verschaffel
LRM
92
0
0
30 Jun 2025
Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models
Younwoo Choi
Changling Li
Yongjin Yang
Zhijing Jin
LLMAG
148
1
0
28 Jun 2025
Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis
Yuan Gao
Mattia Piccinini
Yuchen Zhang
Dingrui Wang
Korbinian Moller
...
Steven Peters
Andrea Stocco
Bassam Alrifaee
Marco Pavone
Johannes Betz
223
17
0
13 Jun 2025
BF-Max: an Efficient Bit Flipping Decoder with Predictable Decoding Failure Rate
International Symposium on Information Theory (ISIT), 2025
Alessio Baldelli
Marco Baldi
F. Chiaraluce
Paolo Santini
328
2
0
11 Jun 2025
Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning
Jiayu Wang
Yifei Ming
Zixuan Ke
Caiming Xiong
Shafiq Joty
Aws Albarghouthi
Frederic Sala
OffRL, ReLM, LRM
317
0
0
05 Jun 2025
Counterfactual reasoning: an analysis of in-context emergence
Moritz Miller
Bernhard Schölkopf
Siyuan Guo
ReLM, LRM
339
0
0
05 Jun 2025
Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns
Xiang Li
Haiyang Yu
Xinghua Zhang
Ziyang Huang
Shizhu He
Kang Liu
Jun Zhao
Fei Huang
Yongbin Li
LRM
166
2
0
29 May 2025
Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning
Adnan Oomerjee
Zafeirios Fountas
Haitham Bou-Ammar
324
0
0
22 May 2025
Teaching Small Language Models to Learn Logic through Meta-Learning
Leonardo Bertolazzi
Manuel Vargas Guzmán
Raffaella Bernardi
Maciej Malicki
Jakub Szymanik
ReLM, LRM
278
0
0
20 May 2025
A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions
Emre Can Acikgoz
Cheng Qian
Hongru Wang
Vardhan Dongre
Xiusi Chen
Heng Ji
Dilek Hakkani-Tur
Gokhan Tur
LM&Ro, ELM
510
4
0
07 Apr 2025
Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey
Xiaoou Liu
Tiejin Chen
Longchao Da
Chacha Chen
Zhen Lin
Hua Wei
HILM
464
37
0
20 Mar 2025
Zero-Shot Commonsense Validation and Reasoning with Large Language Models: An Evaluation on SemEval-2020 Task 4 Dataset
Rawand Alfugaha
Mohammad AL-Smadi
LRM, ELM
197
0
0
19 Feb 2025
Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking
Alireza S. Ziabari
Nona Ghazizadeh
Zhivar Sourati
Farzan Karimi-Malekabadi
Payam Piray
Morteza Dehghani
LRM
235
13
0
18 Feb 2025
Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment
Cheryl Li
Tianyuan Xu
Yiwen Guo
LRM
1.1K
8
0
05 Feb 2025
Mathematical Language Models: A Survey
Wen Liu
Hanglei Hu
Jie Zhou
Yuyang Ding
Junsong Li
...
Mengliang He
Qin Chen
Bo Jiang
Aimin Zhou
Liang He
LRM
544
21
0
03 Jan 2025
LLM-based Discriminative Reasoning for Knowledge Graph Question Answering
Mufan Xu
Kai Chen
Xuefeng Bai
Muyun Yang
Tiejun Zhao
Min Zhang
362
5
0
17 Dec 2024
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM, AILaw
1.1K
251
0
25 Nov 2024
DeFine: Decision-Making with Analogical Reasoning over Factor Profiles
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yebowen Hu
Xiaoyang Wang
Wenlin Yao
Yiming Lu
Daoan Zhang
H. Foroosh
Dong Yu
Fei Liu
324
4
0
02 Oct 2024
Uncovering Latent Chain of Thought Vectors in Language Models
Jason Zhang
Scott Viteri
LLMSV, LRM
395
8
0
21 Sep 2024
ChatBCG: Can AI Read Your Slide Deck?
Nikita Singh
Rob Balian
Lukas Martinelli
108
0
0
16 Jul 2024
SeqMate: A Novel Large Language Model Pipeline for Automating RNA Sequencing
Devam Mondal
Atharva Inamdar
119
2
0
02 Jul 2024
CALF: Aligning LLMs for Time Series Forecasting via Cross-modal Fine-Tuning
AAAI Conference on Artificial Intelligence (AAAI), 2024
Peiyuan Liu
Hang Guo
Tao Dai
Naiqi Li
Jigang Bao
Xudong Ren
Yong Jiang
Shu-Tao Xia
AI4TS
382
78
0
12 Mar 2024