CoTBal: Comprehensive Task Balancing for Multi-Task Visual Instruction Tuning

Visual instruction tuning is an important training stage for large multimodal models. However, when multiple visual tasks are learned simultaneously, this approach may yield suboptimal and imbalanced overall performance because of latent knowledge conflicts across tasks. To mitigate this issue, we introduce a novel Comprehensive Task Balancing (CoTBal) algorithm tailored for multi-task visual instruction tuning. To our knowledge, this is the first work to explore multi-task optimization in visual instruction tuning. Specifically, we consider two critical dimensions for task balancing: (1) Inter-Task Contribution, the phenomenon where learning one task can improve performance on other tasks because their knowledge domains overlap, and (2) Intra-Task Difficulty, the inherent learning difficulty of a single task. We quantify both dimensions with performance-based metrics and achieve comprehensive task balancing by assigning greater weight to tasks that contribute substantially to others, receive minimal contributions from others, and are highly difficult to learn. Extensive experiments on three benchmarks demonstrate that CoTBal achieves superior and more balanced overall performance in multi-task visual instruction tuning.
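The weighting rule described in the abstract can be illustrated with a small sketch. The abstract does not give the actual CoTBal formulas, so everything below is an assumption made for illustration only: the function names `task_weights` and `weighted_loss`, the contribution matrix layout (`contribution[i][j]` is how much learning task i helps task j), the additive score (contribution given minus contribution received plus difficulty), and the softmax normalization with a `temperature` parameter. It only shows the qualitative behavior: tasks that give much, receive little, and are hard get larger weights.

```python
import math


def task_weights(contribution, difficulty, temperature=1.0):
    """Hypothetical task-weighting rule (not the paper's formula).

    contribution[i][j]: assumed score for how much learning task i helps task j.
    difficulty[i]: assumed score for how hard task i is to learn.
    Returns one weight per task; the weights sum to the number of tasks.
    """
    n = len(difficulty)
    scores = []
    for i in range(n):
        # Contribution that task i gives to the other tasks.
        given = sum(contribution[i][j] for j in range(n) if j != i)
        # Contribution that task i receives from the other tasks.
        received = sum(contribution[j][i] for j in range(n) if j != i)
        # Assumed combination: reward giving and difficulty, penalize receiving.
        scores.append(given - received + difficulty[i])

    # Numerically stable softmax over the scores, rescaled to sum to n.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [n * e / total for e in exps]


def weighted_loss(task_losses, weights):
    """Combine per-task losses into a single training loss."""
    return sum(w * loss for w, loss in zip(weights, task_losses))
```

In this sketch, three tasks with identical scores all receive weight 1, which recovers plain unweighted multi-task training; the weights move away from 1 only when the contribution or difficulty scores differ across tasks.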
@article{dai2025_2403.04343,
  title={CoTBal: Comprehensive Task Balancing for Multi-Task Visual Instruction Tuning},
  author={Yanqi Dai and Zebin You and Dong Jing and Yutian Luo and Nanyi Fei and Guoxing Yang and Zhiwu Lu},
  journal={arXiv preprint arXiv:2403.04343},
  year={2025}
}