Dongba pictographs are the only pictographic writing system still in active use in the world. Their symbols combine pictorial and ideographic features and carry rich cultural and contextual information. Due to the lack of relevant datasets, existing research has struggled to advance the semantic understanding of Dongba pictographs. To this end, we propose \textbf{DongbaMIE}, the first multimodal dataset for semantic understanding and extraction of Dongba pictographs, consisting of Dongba pictograph images paired with Chinese semantic annotations. DongbaMIE contains 23,530 sentence-level and 2,539 paragraph-level images, annotated along four semantic dimensions: objects, actions, relations, and attributes. We systematically evaluate multimodal large language models (MLLMs) such as GPT-4o, Gemini-2.0, and Qwen2-VL. Experimental results show that the best F1 scores of the proprietary models GPT-4o and Gemini on the object extraction task are only 3.16 and 3.11, respectively, while the open-source model Qwen2-VL reaches only 11.49 even after supervised fine-tuning. These results suggest that current MLLMs still face significant challenges in accurately recognizing the diverse semantic information in Dongba pictographs.
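The abstract does not specify the matching protocol behind the reported F1 scores, but one plausible reading is a set-based F1 over the semantic labels extracted per image. The Python sketch below shows this scoring scheme under that assumption; the function name and the example labels are hypothetical, not taken from the paper.

```python
# Minimal sketch (assumption: F1 computed over the multiset of labels
# extracted per image; the paper's exact protocol is not stated here).
from collections import Counter

def f1_for_extraction(predicted: list[str], gold: list[str]) -> float:
    """F1 between predicted and gold semantic labels for one image."""
    pred_counts, gold_counts = Counter(predicted), Counter(gold)
    # True positives: label occurrences present in both prediction and gold
    # (Counter intersection takes the minimum count per label).
    tp = sum((pred_counts & gold_counts).values())
    if tp == 0:
        return 0.0
    precision = tp / sum(pred_counts.values())
    recall = tp / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: objects extracted from one pictograph image.
print(f1_for_extraction(["sun", "man", "horse"], ["sun", "horse", "tree"]))
# -> 0.666..., since 2 of 3 predicted labels match 2 of 3 gold labels.
```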
@article{bi2025_2503.03644,
  title={DongbaMIE: A Multimodal Information Extraction Dataset for Evaluating Semantic Understanding of Dongba Pictograms},
  author={Xiaojun Bi and Shuo Li and Ziyue Wang and Fuwen Luo and Weizheng Qiao and Lu Han and Ziwei Sun and Peng Li and Yang Liu},
  journal={arXiv preprint arXiv:2503.03644},
  year={2025}
}