
GLoRE: Evaluating Logical Reasoning of Large Language Models

Abstract

Large language models (LLMs) have shown significant general language understanding abilities. However, few efforts have assessed the logical reasoning capacities of these LLMs, an essential facet of natural language understanding. To encourage further investigation in this area, we introduce GLoRE, a General Logical Reasoning Evaluation platform that not only consolidates diverse datasets but also standardizes them into a unified format suitable for evaluating large language models across zero-shot and few-shot scenarios. Our experimental results show that, compared to the performance of humans and supervised fine-tuning models, the logical reasoning capabilities of large reasoning models, such as OpenAI's o1-mini, DeepSeek R1, and QwQ-32B, have seen remarkable improvements, with QwQ-32B achieving the highest benchmark performance to date. GLoRE is designed as a living project that continuously integrates new datasets and models, facilitating robust and comparative assessments of model performance across both commercial models and the Hugging Face community.

@article{liu2025_2310.09107,
  title={GLoRE: Evaluating Logical Reasoning of Large Language Models},
  author={Hanmeng Liu and Zhiyang Teng and Ruoxi Ning and Yiran Ding and Xiulai Li and Xiaozhang Liu and Yue Zhang},
  journal={arXiv preprint arXiv:2310.09107},
  year={2025}
}