
Can MLLMs generate human-like feedback in grading multimodal short answers?

Main: 11 pages, 4 figures, 5 tables; Appendix: 3 pages
Abstract

In education, the traditional Automatic Short Answer Grading (ASAG) with feedback problem has focused primarily on evaluating text-only responses. However, real-world assessments often include multimodal responses containing both diagrams and text. To address this limitation, we introduce the Multimodal Short Answer Grading with Feedback (MMSAF) problem, which requires jointly evaluating textual and diagrammatic content while also providing explanatory feedback. Collecting data representative of such multimodal responses is challenging due to both scale and logistical constraints. To mitigate this, we develop an automated data generation framework that leverages LLM hallucinations to mimic common student errors, thereby constructing a dataset of 2,197 instances. We evaluate 4 Multimodal Large Language Models (MLLMs) across 3 STEM subjects, showing that MLLMs achieve accuracies of up to 62.5% in predicting answer correctness (correct/partially correct/incorrect) and up to 80.36% in assessing image relevance. Our evaluation also includes a human study with 9 annotators across 5 parameters, including a rubric-based approach; the rubrics assess feedback quality semantically rather than through overlap-based measures. Our findings highlight which MLLMs are better suited for such tasks while also pointing out drawbacks of the remaining MLLMs.
