ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2308.11592
  4. Cited By
UniDoc: A Universal Large Multimodal Model for Simultaneous Text
  Detection, Recognition, Spotting and Understanding
v1v2 (latest)

UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding

19 August 2023
Hao Feng
Zijian Wang
Jingqun Tang
Jinghui Lu
Wen-gang Zhou
Houqiang Li
Can Huang
    MLLMVLM
ArXiv (abs)PDFHTML

Papers citing "UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding"

17 / 17 papers shown
Title
Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
Somraj Gautam
Abhirama Subramanyam Penamakuri
Abhishek Bhandari
Gaurav Harit
LMTDLRM
242
2
0
24 Aug 2025
DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
Wenwen Yu
Zhibo Yang
Yuliang Liu
Xiang Bai
MLLMOffRLLRM
72
4
0
12 Aug 2025
HAVT-IVD: Heterogeneity-Aware Cross-Modal Network for Audio-Visual Surveillance: Idling Vehicles Detection With Multichannel Audio and Multiscale Visual Cues
HAVT-IVD: Heterogeneity-Aware Cross-Modal Network for Audio-Visual Surveillance: Idling Vehicles Detection With Multichannel Audio and Multiscale Visual Cues
Xiwen Li
Ross T. Whitaker
Tolga Tasdizen
214
0
0
15 Apr 2025
Towards Visual Text Grounding of Multimodal Large Language Model
Towards Visual Text Grounding of Multimodal Large Language Model
Ming Li
Ruiyi Zhang
Jian Chen
Jiuxiang Gu
Jiuxiang Gu
Franck Dernoncourt
Wanrong Zhu
Wanrong Zhu
Tianyi Zhou
Tong Sun
411
12
0
07 Apr 2025
TDRI: Two-Phase Dialogue Refinement and Co-Adaptation for Interactive Image Generation
TDRI: Two-Phase Dialogue Refinement and Co-Adaptation for Interactive Image Generation
Yuheng Feng
Jianhui Wang
Kun Li
Sida Li
Tianyu Shi
Haoyue Han
Miao Zhang
Xueqian Wang
DiffM
1.1K
0
0
22 Mar 2025
A Token-level Text Image Foundation Model for Document Understanding
A Token-level Text Image Foundation Model for Document Understanding
Tongkun Guan
Zining Wang
Pei Fu
Zhengtao Guo
Wei Shen
...
Chen Duan
Hao Sun
Qianyi Jiang
Junfeng Luo
Yunbo Wang
VLM
510
4
0
04 Mar 2025
Multimodal Large Language Models for Text-rich Image Understanding: A Comprehensive Review
Multimodal Large Language Models for Text-rich Image Understanding: A Comprehensive ReviewAnnual Meeting of the Association for Computational Linguistics (ACL), 2025
Pei Fu
Tongkun Guan
Zining Wang
Zhentao Guo
Chen Duan
...
Boming Chen
Jiayao Ma
Qianyi Jiang
Kai Zhou
Junfeng Luo
VLM
383
2
0
23 Feb 2025
Cross-Modal Synergies: Unveiling the Potential of Motion-Aware Fusion Networks in Handling Dynamic and Static ReID Scenarios
Cross-Modal Synergies: Unveiling the Potential of Motion-Aware Fusion Networks in Handling Dynamic and Static ReID Scenarios
Fuxi Ling
Hongye Liu
Guoqiang Huang
Jing Li
Hong Wu
Zhihao Tang
407
0
0
02 Feb 2025
Visual Large Language Models for Generalized and Specialized Applications
Jiayi Zhang
Zhixin Lai
Wentao Bao
Zhen Tan
Anh Dao
Kewei Sui
Jiayi Shen
Dong Liu
Huan Liu
Yu Kong
VLM
430
33
0
06 Jan 2025
First-place Solution for Streetscape Shop Sign Recognition Competition
First-place Solution for Streetscape Shop Sign Recognition Competition
Bin Wang
Li Jing
947
0
0
06 Jan 2025
MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition
  and Analysis
MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis
Lei Chen
Feng Yan
Yujie Zhong
Shaoxiang Chen
Zequn Jie
Lin Ma
334
3
0
03 Jul 2024
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
Jinghui Lu
Haiyang Yu
Yanjie Wang
Yongjie Ye
Jingqun Tang
...
Qi Liu
Hao Feng
Han Wang
Hao Liu
Can Huang
576
34
0
02 Jul 2024
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
Jingqun Tang
Qi-dong Liu
Yongjie Ye
Jinghui Lu
Shubo Wei
...
Hao Liu
Xiang Bai
Can Huang
Xiang Bai
Can Huang
686
48
0
20 May 2024
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Jingqun Tang
Chunhui Lin
Zhen Zhao
Shubo Wei
Binghong Wu
...
Yuliang Liu
Xiang Bai
Can Huang
Xiang Bai
Can Huang
LRMVLMMLLM
432
42
0
19 Apr 2024
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal
  Large Language Models
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
Ya-Qi Yu
Minghui Liao
Jihao Wu
Yongxin Liao
Xiaoyu Zheng
Wei Zeng
VLM
197
21
0
14 Apr 2024
UPOCR: Towards Unified Pixel-Level OCR Interface
UPOCR: Towards Unified Pixel-Level OCR InterfaceInternational Conference on Machine Learning (ICML), 2023
Dezhi Peng
Zhenhua Yang
Jiaxin Zhang
Chongyu Liu
Yongxin Shi
Kai Ding
Fengjun Guo
Lianwen Jin
329
13
0
05 Dec 2023
Deep Unrestricted Document Image Rectification
Deep Unrestricted Document Image RectificationIEEE transactions on multimedia (IEEE TMM), 2023
Hao Feng
Shaokai Liu
Jiajun Deng
Wen-gang Zhou
Houqiang Li
ViT
269
24
0
18 Apr 2023
1