PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving

15 April 2025
Zeyu Zhang
Zijian Chen
Zicheng Zhang
Yuze Sun
Yuan Tian
Ziheng Jia
Chunyi Li
Xiaohong Liu
Xiongkuo Min
Guangtao Zhai
    MLLM
Abstract

Large Multimodal Models (LMMs) have demonstrated impressive capabilities across a wide range of multimodal tasks, achieving ever-increasing performance on various evaluation benchmarks. However, existing benchmarks are typically static and often overlap with pre-training datasets, leading to fixed complexity constraints and substantial data contamination. Meanwhile, manually annotated datasets are labor-intensive, time-consuming, and subject to human bias and inconsistency, raising reliability and reproducibility concerns. To address these problems, we propose a fully dynamic multimodal evaluation framework, named Open-ended Visual Puzzle Generation (OVPG), which automatically generates fresh, diverse, and verifiable evaluation data for puzzle-solving tasks. Specifically, the OVPG pipeline consists of a raw material sampling module, a visual content generation module, and a puzzle rule design module, which together ensure that each evaluation instance is primitive, highly randomized, and uniquely solvable, enabling continual adaptation to the evolving capabilities of LMMs. Built upon OVPG, we construct PuzzleBench, a dynamic and scalable benchmark comprising 11,840 VQA samples. It features six carefully designed puzzle tasks targeting three core LMM competencies: visual recognition, logical reasoning, and context understanding. Unlike static benchmarks that quickly become outdated, PuzzleBench supports ongoing dataset refreshing through OVPG and a rich set of open-ended puzzle designs, allowing it to keep pace with rapidly advancing LMMs.
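
To make the three-module pipeline structure concrete, below is a minimal, hypothetical Python sketch of how raw material sampling, visual content generation, and puzzle rule design could compose into a generator of fresh, uniquely solvable instances. All names here are illustrative assumptions rather than the authors' actual implementation, and a text grid stands in for the rendered puzzle image; the counting rule is chosen so the ground-truth answer is verifiable by construction.

import random
from dataclasses import dataclass

@dataclass
class PuzzleInstance:
    image: str       # stand-in for the generated puzzle image
    question: str    # VQA-style question
    answer: str      # unique ground-truth answer, checkable automatically

def sample_raw_material(rng: random.Random) -> list:
    """Raw material sampling module: draw primitive visual elements."""
    pool = ["circle", "square", "triangle", "star"]
    return rng.choices(pool, k=9)

def generate_visual_content(materials: list) -> str:
    """Visual content generation module: lay the sampled elements out on a
    randomized 3x3 grid (here rendered as text instead of an image)."""
    rows = [materials[i:i + 3] for i in range(0, 9, 3)]
    return "\n".join(" | ".join(row) for row in rows)

def design_puzzle_rule(materials: list, rng: random.Random):
    """Puzzle rule design module: choose a rule whose answer is uniquely
    determined by the generated content."""
    target = rng.choice(materials)
    question = f"How many occurrences of '{target}' appear in the grid?"
    answer = str(materials.count(target))
    return question, answer

def ovpg_generate(seed: int) -> PuzzleInstance:
    """Compose the three modules into one evaluation instance; reseeding
    yields a fresh, randomized sample free of pre-training contamination."""
    rng = random.Random(seed)
    materials = sample_raw_material(rng)
    image = generate_visual_content(materials)
    question, answer = design_puzzle_rule(materials, rng)
    return PuzzleInstance(image, question, answer)

if __name__ == "__main__":
    puzzle = ovpg_generate(seed=42)
    print(puzzle.image, puzzle.question, puzzle.answer, sep="\n")

Because each instance is produced from a seed rather than drawn from a fixed dataset, the benchmark can be refreshed indefinitely, which is the property the abstract emphasizes for avoiding data contamination.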

View on arXiv
@article{zhang2025_2504.10885,
  title={PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving},
  author={Zeyu Zhang and Zijian Chen and Zicheng Zhang and Yuze Sun and Yuan Tian and Ziheng Jia and Chunyi Li and Xiaohong Liu and Xiongkuo Min and Guangtao Zhai},
  journal={arXiv preprint arXiv:2504.10885},
  year={2025}
}