2407.01976
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
2 July 2024
Jinghui Lu
Haiyang Yu
Yanjie Wang
Yongjie Ye
Jingqun Tang
Ziwei Yang
Binghong Wu
Qi Liu
Hao Feng
Han Wang
Hao Liu
Can Huang
Papers citing "A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding"
22 / 22 papers shown
RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning
Alexander Vogel
Omar Moured
Yufan Chen
Jiaming Zhang
Rainer Stiefelhagen
35
0
0
29 Mar 2025
A Simple yet Effective Layout Token in Large Language Models for Document Understanding
Zhaoqing Zhu
Chuwei Luo
Zirui Shao
Feiyu Gao
Hangdi Xing
Qi Zheng
Ji Zhang
44
0
0
24 Mar 2025
TDRI: Two-Phase Dialogue Refinement and Co-Adaptation for Interactive Image Generation
Yuheng Feng
Jianhui Wang
Kun Li
Sida Li
Tianyu Shi
Haoyue Han
Miao Zhang
Xueqian Wang
DiffM
50
0
0
22 Mar 2025
Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding
Zining Wang
Tongkun Guan
Pei Fu
Chen Duan
Qianyi Jiang
Zhentao Guo
Shan Guo
Junfeng Luo
Wei-Ming Shen
Xiaokang Yang
MLLM
VLM
69
0
0
18 Mar 2025
CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model
Yuxuan Luo
Jiaqi Tang
Chenyi Huang
Feiyang Hao
Zhouhui Lian
VLM
56
0
0
13 Mar 2025
A Token-level Text Image Foundation Model for Document Understanding
Tongkun Guan
Zining Wang
Pei Fu
Zhengtao Guo
Wei-Ming Shen
...
Chen Duan
Hao Sun
Qianyi Jiang
Junfeng Luo
Xiaokang Yang
VLM
43
0
0
04 Mar 2025
Task-Oriented 6-DoF Grasp Pose Detection in Clutters
An-Lan Wang
Nuo Chen
Kun-Yu Lin
Li Yuan-Ming
Wei-Shi Zheng
46
2
0
24 Feb 2025
Cross-Modal Synergies: Unveiling the Potential of Motion-Aware Fusion Networks in Handling Dynamic and Static ReID Scenarios
Fuxi Ling
Hongye Liu
Guoqiang Huang
Jing Li
Hong Wu
Zhihao Tang
53
0
0
02 Feb 2025
First-place Solution for Streetscape Shop Sign Recognition Competition
Bin Wang
Li Jing
55
0
0
06 Jan 2025
SAIL: Sample-Centric In-Context Learning for Document Information Extraction
Jinyu Zhang
Zhiyuan You
Jize Wang
Xinyi Le
64
0
0
22 Dec 2024
DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness
Ahmad Mohammadshirazi
Pinaki Prasad Guha Neogi
Ser-Nam Lim
R. Ramnath
65
1
0
29 Nov 2024
DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding
Wenhui Liao
Jiapeng Wang
Hongliang Li
Chengyu Wang
Jun Huang
Lianwen Jin
23
0
0
27 Aug 2024
TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy
Weichao Zhao
Hao Feng
Qi Liu
Jingqun Tang
Shubo Wei
...
Lei Liao
Yongjie Ye
Hao Liu
Houqiang Li
Can Huang
LMTD
26
17
0
03 Jun 2024
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Jingqun Tang
Chunhui Lin
Zhen Zhao
Shubo Wei
Binghong Wu
...
Yuliang Liu
Hao Liu
Yuan Xie
Xiang Bai
Can Huang
LRM
VLM
MLLM
50
26
0
19 Apr 2024
VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization
Dongsheng Zhu
Xunzhu Tang
Weidong Han
Jinghui Lu
Yukun Zhao
Guoliang Xing
Junfeng Wang
Dawei Yin
VLM
MLLM
46
7
0
12 Feb 2024
PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition
Jinghui Lu
Ziwei Yang
Yanjie Wang
Xuejing Liu
Brian Mac Namee
Can Huang
MoE
42
4
0
07 Feb 2024
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
Xiao-wen Dong
Pan Zhang
Yuhang Zang
Yuhang Cao
Bin Wang
...
Conghui He
Xingcheng Zhang
Yu Qiao
Dahua Lin
Jiaqi Wang
VLM
MLLM
73
242
0
29 Jan 2024
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
135
895
0
21 Dec 2023
LMDX: Language Model-based Document Information Extraction and Localization
Vincent Perot
Kai Kang
Florian Luisier
Guolong Su
Xiaoyu Sun
...
Zifeng Wang
Jiaqi Mu
Hao Zhang
Chen-Yu Lee
Nan Hua
48
29
0
19 Sep 2023
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Kenton Lee
Mandar Joshi
Iulia Turc
Hexiang Hu
Fangyu Liu
Julian Martin Eisenschlos
Urvashi Khandelwal
Peter Shaw
Ming-Wei Chang
Kristina Toutanova
CLIP
VLM
148
259
0
07 Oct 2022
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
Yang Xu
Yiheng Xu
Tengchao Lv
Lei Cui
Furu Wei
...
D. Florêncio
Cha Zhang
Wanxiang Che
Min Zhang
Lidong Zhou
ViT
MLLM
137
492
0
29 Dec 2020
FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents
Guillaume Jaume
H. K. Ekenel
Jean-Philippe Thiran
109
259
0
27 May 2019