arXiv: 2305.12474

Evaluating the Performance of Large Language Models on GAOKAO Benchmark

21 May 2023

Xipeng Qiu

Papers citing "Evaluating the Performance of Large Language Models on GAOKAO Benchmark"

50 / 66 papers shown

RedOne 2.0: Rethinking Domain-specific LLM Post-Training in Social Networking Services

...

203

0

0

10 Nov 2025

EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs

Abdellah El Mekki

Muhammad Abdul-Mageed

242

0

0

20 Oct 2025

Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models

...

AIMat AI4TS LRM

325

0

0

16 Oct 2025

FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis

...

225

2

0

15 Oct 2025

Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math

Xuan-Phi Nguyen

183

3

0

15 Oct 2025

MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites

...

248

0

0

14 Oct 2025

Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning

136

5

0

13 Oct 2025

SwarmSys: Decentralized Swarm-Inspired Agents for Scalable and Adaptive Reasoning

141

0

0

11 Oct 2025

Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards

Jonathan D. Chang

Prithviraj Ammanabrolu

160

0

0

01 Oct 2025

Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?

154

5

0

01 Oct 2025

Atomic Thinking of LLMs: Decoupling and Exploring Mathematical Reasoning Abilities

...

171

5

0

30 Sep 2025

PiERN: Token-Level Routing for Integrating High-Precision Computation and Reasoning

188

0

0

17 Sep 2025

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

...

304

279

0

25 Aug 2025

Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

236

24

0

19 Aug 2025

From Answers to Questions: EQGBench for Evaluating LLMs' Educational Question Generation

Chengliang Zhou

230

1

0

05 Aug 2025

CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

...

151

15

0

05 Aug 2025

Technical Report of TeleChat2, TeleChat2.5 and T1

...

Shuangyong Song

426

6

0

24 Jul 2025

Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models

385

12

0

23 Jul 2025

WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training

259

6

0

23 Jul 2025

RedOne: Revealing Domain-specific LLM Post-Training in Social Networking Services

...

219

0

0

13 Jul 2025

MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

162

0

0

18 Jun 2025

Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic

Workshop on Innovative Use of NLP for Building Educational Applications (UNBEA), 2025

Rohith Reddy Nama

170

6

0

09 Jun 2025

VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs

219

4

0

07 Jun 2025

STORM-BORN: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

...

300

4

0

02 Jun 2025

From Objectives to Questions: A Planning-based Framework for Educational Mathematical Question Generation

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

236

1

0

01 Jun 2025

Concise Reasoning, Big Gains: Pruning Long Reasoning Trace with Difficulty-Aware Prompting

292

9

0

26 May 2025

Assessing the Capability of LLMs in Solving POSCOMP Questions

Márcio Ribeiro

102

1

0

24 May 2025

T$^2$: An Adaptive Test-Time Scaling Strategy for Contextual Question Answering

298

2

0

23 May 2025

TemplateRL: Structured Template-Guided Reinforcement Learning for LLM Reasoning

547

13

0

21 May 2025

Rethinking Reward Model Evaluation Through the Lens of Reward Overoptimization

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

332

2

0

19 May 2025

SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models

...

479

0

0

12 May 2025

QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation

310

7

0

08 May 2025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

...

613

806

1

14 Apr 2025

Can the capability of Large Language Models be described by human ability? A Meta Study

256

1

0

13 Apr 2025

Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

...

535

13

0

17 Mar 2025

ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning

...

LLMAG KELM LRM AI4CE

515

35

0

12 Mar 2025

MMSciBench: Benchmarking Language Models on Chinese Multimodal Scientific Problems

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

300

1

0

27 Feb 2025

Baichuan-M1: Pushing the Medical Capability of Large Language Models

...

LM&MA ELM AI4MH

384

32

0

18 Feb 2025

Improving Natural Language Understanding for LLMs via Large-Scale Instruction Synthesis

AAAI Conference on Artificial Intelligence (AAAI), 2025

842

2

0

06 Feb 2025

UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models

809

23

0

01 Feb 2025

Baichuan-Omni-1.5 Technical Report

Tao Zhang

...

328

66

0

28 Jan 2025

O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning

524

185

0

22 Jan 2025

Recursive Decomposition of Logical Thoughts: Framework for Superior Reasoning and Knowledge Propagation in Large Language Models

Journal of Artificial Intelligence Research (JAIR), 2025

Kaleem Ullah Qasim

Ateeq Ur Rehman Butt

309

4

0

03 Jan 2025

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

...

524

184

1

15 Nov 2024

UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts

Runtao Liu

LRM ReLM ELM AIMat

394

7

0

11 Nov 2024

Number Cookbook: Number Understanding of Language Models and How to Improve It

International Conference on Learning Representations (ICLR), 2024

498

31

0

06 Nov 2024

Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies

533

14

0

24 Oct 2024

Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models

The Web Conference (WWW), 2024

Yazhou Zhang

Jing Qin

373

6

0

19 Sep 2024

See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses

Yang Liu

Ming Zhong

Yinghao Yang

Ziyi Yang

Yue Zhang

205

18

0

16 Aug 2024

OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

Zengzhi Wang

...

Yuxiang Zheng

Shaoting Zhang

Dahua Lin

Yu Qiao

299

72

0

18 Jun 2024