arXiv:2305.11175
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
18 May 2023
Wen Wang
Zhe Chen
Xiaokang Chen
Jiannan Wu
Xizhou Zhu
Gang Zeng
Ping Luo
Tong Lu
Jie Zhou
Yu Qiao
Jifeng Dai
MLLM
VLM
Papers citing
"VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks"
50 / 79 papers shown
Title
Vision and Intention Boost Large Language Model in Long-Term Action Anticipation
Congqi Cao
Lanshu Hu
Yating Yu
Y. Zhang
VLM
73
0
0
03 May 2025
RESAnything: Attribute Prompting for Arbitrary Referring Segmentation
Ruiqi Wang
Hao Zhang
VLM
52
0
0
03 May 2025
Foundation Model-Driven Framework for Human-Object Interaction Prediction with Segmentation Mask Integration
Juhan Park
Kyungjae Lee
Hyung Jin Chang
Jungchan Cho
VLM
66
0
0
28 Apr 2025
Learning Streaming Video Representation via Multitask Training
Yibin Yan
Jilan Xu
Shangzhe Di
Yikun Liu
Yudi Shi
Qirui Chen
Zeqian Li
Yifei Huang
Weidi Xie
CLL
76
0
0
28 Apr 2025
SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding
Qianqian Sun
Jixiang Luo
Dell Zhang
Xuelong Li
DiffM
50
0
0
17 Apr 2025
TAGC: Optimizing Gradient Communication in Distributed Transformer Training
Igor Polyakov
Alexey Dukhanov
Egor Spirin
39
0
0
08 Apr 2025
On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices
Bosung Kim
Kyuhwan Lee
Isu Jeong
Jungmin Cheon
Yeojin Lee
Seulki Lee
VGen
45
1
0
31 Mar 2025
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Shehreen Azad
Vibhav Vineet
Y. S. Rawat
VLM
75
1
0
11 Mar 2025
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Zhangquan Chen
Xufang Luo
Dongsheng Li
OffRL
LRM
64
3
0
10 Mar 2025
Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks
Liming Lu
Shuchao Pang
Siyuan Liang
Haotian Zhu
Xiyu Zeng
Aishan Liu
Yunhuai Liu
Yongbin Zhou
AAML
49
1
0
05 Mar 2025
SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding
Liangtao Shi
Ting Liu
Xiantao Hu
Yue Hu
Quanjun Yin
Richang Hong
ObjD
46
0
0
24 Feb 2025
LOVA3: Learning to Visual Question Answering, Asking and Assessment
Henry Hengyuan Zhao
Pan Zhou
Difei Gao
Zechen Bai
Mike Zheng Shou
77
8
0
21 Feb 2025
Large Language Models for Anomaly and Out-of-Distribution Detection: A Survey
Ruiyao Xu
Kaize Ding
53
5
0
17 Feb 2025
MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation
Haibo Tong
Zhaoyang Wang
Z. Chen
Haonian Ji
Shi Qiu
...
Peng Xia
Mingyu Ding
Rafael Rafailov
Chelsea Finn
Huaxiu Yao
EGVM
VGen
95
2
0
03 Feb 2025
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
Yi Wang
Xinhao Li
Ziang Yan
Yinan He
Jiashuo Yu
...
Kai Chen
Wenhai Wang
Yu Qiao
Yali Wang
Limin Wang
73
19
0
21 Jan 2025
Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection
Yuanze Li
Haolin Wang
Shihao Yuan
Ming-Yu Liu
Debin Zhao
Yiwen Guo
Chen Xu
Guangming Shi
Wangmeng Zuo
79
28
0
20 Jan 2025
DriveLM: Driving with Graph Visual Question Answering
Chonghao Sima
Katrin Renz
Kashyap Chitta
L. Chen
Hanxue Zhang
Chengen Xie
Jens Beißwenger
Ping Luo
Andreas Geiger
Hongyang Li
84
160
0
17 Jan 2025
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Miran Heo
Min-Hung Chen
De-An Huang
Sifei Liu
Subhashree Radhakrishnan
Seon Joo Kim
Yu-Chun Wang
Ryo Hachiuma
ObjD
VLM
121
2
0
14 Jan 2025
VideoAuteur: Towards Long Narrative Video Generation
Junfei Xiao
Feng Cheng
Lu Qi
Liangke Gui
Jiepeng Cen
Zhibei Ma
Alan L. Yuille
Lu Jiang
VGen
56
2
0
10 Jan 2025
MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension
Ting Liu
Zunnan Xu
Yue Hu
Liangtao Shi
Zhiqiang Wang
Quanjun Yin
57
2
0
03 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM
VLM
LRM
91
46
0
03 Jan 2025
SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation
Hang Zhang
Zhuoling Li
Jun Liu
LRM
100
1
0
15 Dec 2024
Empowering LLMs to Understand and Generate Complex Vector Graphics
Ximing Xing
Juncheng Hu
Guotao Liang
Jing Zhang
Dong Xu
Qian Yu
92
7
0
15 Dec 2024
SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model
Chunlin Yu
Hanqing Wang
Ye Shi
Haoyang Luo
Sibei Yang
Jingyi Yu
Jingya Wang
LRM
LM&Ro
79
1
0
02 Dec 2024
ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model
Kunyang Han
Yibo Hu
Mengxue Qu
Hailin Shi
Yao Zhao
Y. X. Wei
MLLM
VLM
3DV
83
1
0
29 Nov 2024
Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Large Language Models through Egocentric Instruction Tuning
Ji Hyeok Jung
Eun Tae Kim
S. Kim
Joo Ho Lee
Bumsoo Kim
Buru Chang
VLM
115
0
0
24 Nov 2024
Large Language Model with Region-guided Referring and Grounding for CT Report Generation
Z. Chen
Yequan Bie
Haibo Jin
Hao Chen
110
0
0
23 Nov 2024
MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Ruichuan An
Sihan Yang
Ming Lu
Kai Zeng
Yulin Luo
...
Hao Liang
Qi She
Shanghang Zhang
W. Zhang
Wentao Zhang
78
5
0
18 Nov 2024
GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding
Y. Zhou
Mengcheng Lan
Xiang Li
Yiping Ke
Xue Jiang
Litong Feng
Qingyun Li
Xue Yang
Wayne Zhang
ObjD
VLM
112
4
0
16 Nov 2024
MambaPEFT: Exploring Parameter-Efficient Fine-Tuning for Mamba
Masakazu Yoshimura
Teruaki Hayashi
Yota Maeda
Mamba
69
2
0
06 Nov 2024
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
Xiangyu Zeng
Kunchang Li
Chenting Wang
Xinhao Li
Tianxiang Jiang
...
Zhengrong Yue
Yi Wang
Yali Wang
Yu Qiao
Limin Wang
MLLM
VLM
AI4TS
64
14
0
25 Oct 2024
Locality Alignment Improves Vision-Language Models
Ian Covert
Tony Sun
James Y. Zou
Tatsunori Hashimoto
VLM
64
3
0
14 Oct 2024
From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning
Yang Bai
Yang Zhou
Jun Zhou
Rick Siow Mong Goh
Daniel Ting
Yong Liu
VLM
44
0
0
09 Oct 2024
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
Boyu Gou
Ruohan Wang
Boyuan Zheng
Yanan Xie
Cheng Chang
Yiheng Shu
Huan Sun
Yu Su
LM&Ro
LLMAG
76
48
0
07 Oct 2024
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension
Junzhuo Liu
X. Yang
Weiwei Li
Peng Wang
ObjD
39
3
0
23 Sep 2024
Multi-OCT-SelfNet: Integrating Self-Supervised Learning with Multi-Source Data Fusion for Enhanced Multi-Class Retinal Disease Classification
Fatema Jannat
Sina Gholami
Jennifer I. Lim
Theodore Leng
Minhaj Nur Alam
Hamed Tabkhi
28
0
0
17 Sep 2024
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Yunze Man
Shuhong Zheng
Zhipeng Bao
M. Hebert
Liang-Yan Gui
Yu-xiong Wang
70
15
0
05 Sep 2024
Exploring the Potential of Large Language Models for Heterophilic Graphs
Yuxia Wu
Shujie Li
Yuan Fang
Chuan Shi
32
1
0
26 Aug 2024
RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data
Chenglong Wang
Yang Gan
Yifu Huo
Yongyu Mu
Murun Yang
...
Chunliang Zhang
Tongran Liu
Quan Du
Di Yang
Jingbo Zhu
VLM
64
4
0
22 Aug 2024
Visual Agents as Fast and Slow Thinkers
Guangyan Sun
Mingyu Jin
Zhenting Wang
Cheng-Long Wang
Siqi Ma
Qifan Wang
Ying Nian Wu
Dongfang Liu
LLMAG
LRM
77
12
0
16 Aug 2024
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
Ofir Abramovich
Niv Nayman
Sharon Fogel
I. Lavi
Ron Litman
Shahar Tsiper
Royee Tichauer
Srikar Appalaraju
Shai Mazor
R. Manmatha
VLM
28
3
0
17 Jul 2024
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Yusu Qian
Hanrong Ye
J. Fauconnier
Peter Grasch
Yinfei Yang
Zhe Gan
108
13
0
01 Jul 2024
It's Morphing Time: Unleashing the Potential of Multiple LLMs via Multi-objective Optimization
Bingdong Li
Zixiang Di
Yanting Yang
Hong Qian
Peng Yang
Hao Hao
Ke Tang
Aimin Zhou
MoMe
19
5
0
29 Jun 2024
SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding
Junwei Luo
Zhen Pang
Yongjun Zhang
Tingzhu Wang
Linlin Wang
...
Jiangwei Lao
Jian Wang
Jingdong Chen
Yihua Tan
Yansheng Li
28
20
0
14 Jun 2024
Generalizable Disaster Damage Assessment via Change Detection with Vision Foundation Model
Kyeongjin Ahn
Sungwon Han
Sungwon Park
Jihee Kim
Sangyoon Park
Meeyoung Cha
18
2
0
12 Jun 2024
Are Large Language Models the New Interface for Data Pipelines?
Sylvio Barbon Junior
Paolo Ceravolo
Sven Groppe
Mustafa Jarrar
S. Maghool
Florence Sèdes
S. Sahri
M. van Keulen
LM&MA
29
8
0
06 Jun 2024
Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models
Yue Zhang
Hehe Fan
Yi Yang
43
3
0
24 May 2024
V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM
Abdur Rahman
Rajat Chawla
Muskaan Kumar
Arkajit Datta
Adarsh Jha
NS Mukunda
Ishaan Bhola
40
2
0
24 May 2024
ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing
Ying Jin
Pengyang Ling
Xiao-wen Dong
Pan Zhang
Jiaqi Wang
Dahua Lin
24
2
0
18 May 2024
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
An Yan
Zhengyuan Yang
Junda Wu
Wanrong Zhu
Jianwei Yang
...
K. Lin
Jianfeng Wang
Julian McAuley
Jianfeng Gao
Lijuan Wang
LRM
34
12
0
25 Apr 2024