ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2308.11592
  4. Cited By
UniDoc: A Universal Large Multimodal Model for Simultaneous Text
  Detection, Recognition, Spotting and Understanding

UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding

19 August 2023
Hao Feng
Zijian Wang
Jingqun Tang
Jinghui Lu
Wen-gang Zhou
Houqiang Li
Can Huang
    MLLM
    VLM
ArXivPDFHTML

Papers citing "UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding"

40 / 40 papers shown
Title
Representation Learning for Tabular Data: A Comprehensive Survey
Representation Learning for Tabular Data: A Comprehensive Survey
Jun-Peng Jiang
Si-Yang Liu
Hao-Run Cai
Qile Zhou
Han-Jia Ye
LMTD
38
0
0
17 Apr 2025
Audio and Multiscale Visual Cues Driven Cross-modal Transformer for Idling Vehicle Detection
Audio and Multiscale Visual Cues Driven Cross-modal Transformer for Idling Vehicle Detection
Xiwen Li
Ross T. Whitaker
Tolga Tasdizen
25
0
0
15 Apr 2025
Towards Visual Text Grounding of Multimodal Large Language Model
Towards Visual Text Grounding of Multimodal Large Language Model
Ming Li
Ruiyi Zhang
Jian Chen
Jiuxiang Gu
Yufan Zhou
Franck Dernoncourt
Wanrong Zhu
Tianyi Zhou
Tong Sun
30
2
0
07 Apr 2025
DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning
DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning
Xiao-Hui Li
Fei Yin
Cheng-Lin Liu
23
0
0
05 Apr 2025
TDRI: Two-Phase Dialogue Refinement and Co-Adaptation for Interactive Image Generation
TDRI: Two-Phase Dialogue Refinement and Co-Adaptation for Interactive Image Generation
Yuheng Feng
Jianhui Wang
Kun Li
Sida Li
Tianyu Shi
Haoyue Han
Miao Zhang
Xueqian Wang
DiffM
53
0
0
22 Mar 2025
A Token-level Text Image Foundation Model for Document Understanding
A Token-level Text Image Foundation Model for Document Understanding
Tongkun Guan
Zining Wang
Pei Fu
Zhengtao Guo
Wei-Ming Shen
...
Chen Duan
Hao Sun
Qianyi Jiang
Junfeng Luo
Xiaokang Yang
VLM
43
0
0
04 Mar 2025
Multimodal Large Language Models for Text-rich Image Understanding: A Comprehensive Review
Multimodal Large Language Models for Text-rich Image Understanding: A Comprehensive Review
Pei Fu
Tongkun Guan
Zining Wang
Zhentao Guo
Chen Duan
...
Boming Chen
Jiayao Ma
Qianyi Jiang
Kai Zhou
Junfeng Luo
VLM
53
0
0
23 Feb 2025
OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models
OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models
Wenwen Yu
Zhibo Yang
Jianqiang Wan
Sibo Song
J. Tang
Wenqing Cheng
Y. Liu
Xiang Bai
43
1
0
22 Feb 2025
Cross-Modal Synergies: Unveiling the Potential of Motion-Aware Fusion Networks in Handling Dynamic and Static ReID Scenarios
Cross-Modal Synergies: Unveiling the Potential of Motion-Aware Fusion Networks in Handling Dynamic and Static ReID Scenarios
Fuxi Ling
Hongye Liu
Guoqiang Huang
Jing Li
Hong Wu
Zhihao Tang
56
0
0
02 Feb 2025
Visual Large Language Models for Generalized and Specialized Applications
Yifan Li
Zhixin Lai
Wentao Bao
Zhen Tan
Anh Dao
Kewei Sui
Jiayi Shen
Dong Liu
Huan Liu
Yu Kong
VLM
83
10
0
06 Jan 2025
First-place Solution for Streetscape Shop Sign Recognition Competition
First-place Solution for Streetscape Shop Sign Recognition Competition
Bin Wang
Li Jing
64
0
0
06 Jan 2025
MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark
MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark
Bin Shan
Xiang Fei
Wei Shi
An-Lan Wang
Guozhi Tang
Lei Liao
Jingqun Tang
Xiang Bai
Can Huang
VLM
23
5
0
15 Oct 2024
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene
  Understanding
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding
Yonghui Wang
Wengang Zhou
Hao Feng
Houqiang Li
VLM
22
0
0
30 Aug 2024
Platypus: A Generalized Specialist Model for Reading Text in Various
  Forms
Platypus: A Generalized Specialist Model for Reading Text in Various Forms
Peng Wang
Zhaohai Li
Jun Tang
Humen Zhong
Fei Huang
Zhibo Yang
Cong Yao
VLM
ObjD
31
2
0
27 Aug 2024
DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding
DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding
Wenhui Liao
Jiapeng Wang
Hongliang Li
Chengyu Wang
Jun Huang
Lianwen Jin
33
0
0
27 Aug 2024
Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language
  Models
Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models
Mingxin Huang
Yuliang Liu
Dingkang Liang
Lianwen Jin
Xiang Bai
37
9
0
04 Aug 2024
Harmonizing Visual Text Comprehension and Generation
Harmonizing Visual Text Comprehension and Generation
Zhen Zhao
Jingqun Tang
Binghong Wu
Chunhui Lin
Shubo Wei
Hao Liu
Xin Tan
Zhizhong Zhang
Can Huang
Yuan Xie
VLM
26
21
0
23 Jul 2024
MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition
  and Analysis
MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis
Lei Chen
Feng Yan
Yujie Zhong
Shaoxiang Chen
Zequn Jie
Lin Ma
34
3
0
03 Jul 2024
A Bounding Box is Worth One Token: Interleaving Layout and Text in a
  Large Language Model for Document Understanding
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
Jinghui Lu
Haiyang Yu
Yanjie Wang
Yongjie Ye
Jingqun Tang
...
Qi Liu
Hao Feng
Han Wang
Hao Liu
Can Huang
48
17
0
02 Jul 2024
DocKylin: A Large Multimodal Model for Visual Document Understanding
  with Efficient Visual Slimming
DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
Jiaxin Zhang
Wentao Yang
Songxuan Lai
Zecheng Xie
Lianwen Jin
32
15
0
27 Jun 2024
TabPedia: Towards Comprehensive Visual Table Understanding with Concept
  Synergy
TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy
Weichao Zhao
Hao Feng
Qi Liu
Jingqun Tang
Shubo Wei
...
Lei Liao
Yongjie Ye
Hao Liu
Houqiang Li
Can Huang
LMTD
26
17
0
03 Jun 2024
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
Jingqun Tang
Qi Liu
Yongjie Ye
Jinghui Lu
Shubo Wei
...
Yanjie Wang
Yuliang Liu
Hao Liu
Xiang Bai
Can Huang
32
21
0
20 May 2024
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Jingqun Tang
Chunhui Lin
Zhen Zhao
Shubo Wei
Binghong Wu
...
Yuliang Liu
Hao Liu
Yuan Xie
Xiang Bai
Can Huang
LRM
VLM
MLLM
61
26
0
19 Apr 2024
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal
  Large Language Models
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
Ya-Qi Yu
Minghui Liao
Jihao Wu
Yongxin Liao
Xiaoyu Zheng
Wei Zeng
VLM
19
15
0
14 Apr 2024
OmniParser: A Unified Framework for Text Spotting, Key Information
  Extraction and Table Recognition
OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition
Jianqiang Wan
Sibo Song
Wenwen Yu
Yuliang Liu
Wenqing Cheng
Fei Huang
Xiang Bai
Cong Yao
Zhibo Yang
37
26
0
28 Mar 2024
TextMonkey: An OCR-Free Large Multimodal Model for Understanding
  Document
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
Yuliang Liu
Biao Yang
Qiang Liu
Zhang Li
Zhiyin Ma
Shuo Zhang
Xiang Bai
MLLM
VLM
36
87
0
07 Mar 2024
Enhancing Visual Document Understanding with Contrastive Learning in
  Large Visual-Language Models
Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models
Xin Li
Yunfei Wu
Xinghua Jiang
Zhihao Guo
Ming Gong
Haoyu Cao
Yinsong Liu
Deqiang Jiang
Xing Sun
VLM
29
12
0
29 Feb 2024
LAPDoc: Layout-Aware Prompting for Documents
LAPDoc: Layout-Aware Prompting for Documents
Marcel Lamott
Yves-Noel Weweler
A. Ulges
Faisal Shafait
Dirk Krechel
Darko Obradovic
38
5
0
15 Feb 2024
Lumos : Empowering Multimodal LLMs with Scene Text Recognition
Lumos : Empowering Multimodal LLMs with Scene Text Recognition
Ashish Shenoy
Yichao Lu
Srihari Jayakumar
Debojeet Chatterjee
Mohsen Moslehpour
...
Shicong Zhao
Longfang Zhao
Ankit Ramchandani
Xin Luna Dong
Anuj Kumar
MLLM
22
1
0
12 Feb 2024
PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity
  Recognition
PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition
Jinghui Lu
Ziwei Yang
Yanjie Wang
Xuejing Liu
Brian Mac Namee
Can Huang
MoE
45
4
0
07 Feb 2024
UPOCR: Towards Unified Pixel-Level OCR Interface
UPOCR: Towards Unified Pixel-Level OCR Interface
Dezhi Peng
Zhenhua Yang
Jiaxin Zhang
Chongyu Liu
Yongxin Shi
Kai Ding
Fengjun Guo
Lianwen Jin
21
10
0
05 Dec 2023
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large
  Language Model
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model
Anwen Hu
Yaya Shi
Haiyang Xu
Jiabo Ye
Qinghao Ye
Mingshi Yan
Chenliang Li
Qi Qian
Ji Zhang
Fei Huang
MLLM
30
24
0
30 Nov 2023
DocPedia: Unleashing the Power of Large Multimodal Model in the
  Frequency Domain for Versatile Document Understanding
DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
Hao Feng
Qi Liu
Hao Liu
Wen-gang Zhou
Houqiang Li
Can Huang
VLM
20
58
0
20 Nov 2023
UReader: Universal OCR-free Visually-situated Language Understanding
  with Multimodal Large Language Model
UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model
Jiabo Ye
Anwen Hu
Haiyang Xu
Qinghao Ye
Mingshi Yan
...
Ji Zhang
Qin Jin
Liang He
Xin Lin
Feiyan Huang
VLM
MLLM
121
83
0
08 Oct 2023
On the Hidden Mystery of OCR in Large Multimodal Models
On the Hidden Mystery of OCR in Large Multimodal Models
Yuliang Liu
Zhang Li
Mingxin Huang
Chunyuan Li
Dezhi Peng
Mingyu Liu
Lianwen Jin
Xiang Bai
VLM
MLLM
16
47
0
13 May 2023
mPLUG-Owl: Modularization Empowers Large Language Models with
  Multimodality
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye
Haiyang Xu
Guohai Xu
Jiabo Ye
Ming Yan
...
Junfeng Tian
Qiang Qi
Ji Zhang
Feiyan Huang
Jingren Zhou
VLM
MLLM
203
883
0
27 Apr 2023
Deep Unrestricted Document Image Rectification
Deep Unrestricted Document Image Rectification
Hao Feng
Shaokai Liu
Jiajun Deng
Wen-gang Zhou
Houqiang Li
ViT
8
13
0
18 Apr 2023
Grab What You Need: Rethinking Complex Table Structure Recognition with
  Flexible Components Deliberation
Grab What You Need: Rethinking Complex Table Structure Recognition with Flexible Components Deliberation
Hao Liu
Xin Li
Ming Gong
Bin Liu
Yunfei Wu
Deqiang Jiang
Yinsong Liu
Xing Sun
LMTD
17
5
0
16 Mar 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
  Encoders and Large Language Models
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
244
4,186
0
30 Jan 2023
Pix2seq: A Language Modeling Framework for Object Detection
Pix2seq: A Language Modeling Framework for Object Detection
Ting-Li Chen
Saurabh Saxena
Lala Li
David J. Fleet
Geoffrey E. Hinton
MLLM
ViT
VLM
233
341
0
22 Sep 2021
1