Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images -- resulting in limited coverage and inheriting biases from prior generative models -- or (ii) rely solely on zero-shot vision--language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise.

View on arXiv

Comments on this paper