Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2206.08916
Cited By
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
17 June 2022
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks"
50 / 327 papers shown
Title
Visual Instruction Tuning with Chain of Region-of-Interest
Yixin Chen
Shuai Zhang
Boran Han
Bernie Wang
18
0
0
11 May 2025
The Moon's Many Faces: A Single Unified Transformer for Multimodal Lunar Reconstruction
Tom Sander
Moritz Tenthoff
Kay Wohlfarth
Christian Wöhler
19
0
0
08 May 2025
Learning Streaming Video Representation via Multitask Training
Yibin Yan
Jilan Xu
Shangzhe Di
Yikun Liu
Yudi Shi
Qirui Chen
Zeqian Li
Yifei Huang
Weidi Xie
CLL
76
0
0
28 Apr 2025
Symbolic Representation for Any-to-Any Generative Tasks
J. Chen
Xiaoye Zhu
Y. Wang
Tianyang Liu
Xinhui Chen
...
Yifei Ke
J. Liu
Yiwen Yuan
Julian McAuley
Li Li
DiffM
36
0
0
24 Apr 2025
SignX: The Foundation Model for Sign Recognition
Sen Fang
Chunyu Sui
Hongwei Yi
C. Neidle
Dimitris N. Metaxas
SLR
30
0
0
22 Apr 2025
Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception
Ziqi Pang
Xin Xu
Yu-Xiong Wang
DiffM
57
0
0
15 Apr 2025
GeoUni: A Unified Model for Generating Geometry Diagrams, Problems and Problem Solutions
Jo-Ku Cheng
Zeren Zhang
Ran Chen
Jingyang Deng
Ziran Qin
Jinwen Ma
28
0
0
14 Apr 2025
Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization
Jialu Li
Shoubin Yu
Han Lin
Jaemin Cho
Jaehong Yoon
Mohit Bansal
DiffM
VGen
48
0
0
11 Apr 2025
Towards Visual Text Grounding of Multimodal Large Language Model
Ming Li
Ruiyi Zhang
Jian Chen
Jiuxiang Gu
Yufan Zhou
Franck Dernoncourt
Wanrong Zhu
Tianyi Zhou
Tong Sun
30
2
0
07 Apr 2025
Continual Cross-Modal Generalization
Yan Xia
Hai Huang
Minghui Fang
Zhou Zhao
CLL
52
0
0
01 Apr 2025
Efficient Token Compression for Vision Transformer with Spatial Information Preserved
Junzhu Mao
Yang Shen
Jinyang Guo
Yazhou Yao
Xiansheng Hua
ViT
31
0
0
30 Mar 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
36
0
0
29 Mar 2025
Unified Multimodal Discrete Diffusion
Alexander Swerdlow
Mihir Prabhudesai
Siddharth Gandhi
Deepak Pathak
Katerina Fragkiadaki
DiffM
68
0
0
26 Mar 2025
MMGen: Unified Multi-modal Image Generation and Understanding in One Go
Jiepeng Wang
Zhaoqing Wang
H. Pan
Yuan Liu
Dongdong Yu
Changhu Wang
Wenping Wang
DiffM
76
0
0
26 Mar 2025
FullDiT: Multi-Task Video Generative Foundation Model with Full Attention
Xuan Ju
Weicai Ye
Quande Liu
Qiulin Wang
Xintao Wang
Pengfei Wan
Di Zhang
Kun Gai
Qiang Xu
VGen
39
1
0
25 Mar 2025
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
Tao Wang
Changxu Cheng
Lingfeng Wang
Senda Chen
Wuyue Zhao
VLM
64
0
0
17 Mar 2025
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
Henghui Du
Guangyao Li
Chang Zhou
Chunjie Zhang
Alan Zhao
D. Hu
54
0
0
17 Mar 2025
UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
Tsu-jui Fu
Yusu Qian
Chen Chen
Wenze Hu
Zhe Gan
Y. Yang
85
1
0
16 Mar 2025
Direction-Aware Diagonal Autoregressive Image Generation
Yijia Xu
Jianzhong Ju
Jian Luan
J. Cui
47
0
0
14 Mar 2025
HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding
Rui Yang
Lin Song
Yicheng Xiao
Runhui Huang
Yixiao Ge
Ying Shan
Hengshuang Zhao
MLLM
62
0
0
12 Mar 2025
Foundation X: Integrating Classification, Localization, and Segmentation through Lock-Release Pretraining Strategy for Chest X-ray Analysis
N. Islam
Dongao Ma
Jiaxuan Pang
Shivasakthi Senthil Velan
Michael B. Gotway
Jianming Liang
53
0
0
12 Mar 2025
Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment
Xing Xie
Jiawei Liu
Ziyue Lin
Huijie Fan
Zhi-Long Han
Yandong Tang
Liangqiong Qu
40
0
0
10 Mar 2025
USP: Unified Self-Supervised Pretraining for Image Generation and Understanding
Xiangxiang Chu
Renda Li
Yong Wang
60
0
0
08 Mar 2025
Underlying Semantic Diffusion for Effective and Efficient In-Context Learning
Zhong Ji
Weilong Cao
Yan Zhang
Yanwei Pang
Jungong Han
X. Li
DiffM
VLM
37
0
0
06 Mar 2025
FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction
Siyu Jiao
Gengwei Zhang
Yinlong Qian
Jiancheng Huang
Yao Zhao
Humphrey Shi
Lin Ma
Y. X. Wei
Zequn Jie
VLM
39
1
0
27 Feb 2025
Vision-Language Models for Edge Networks: A Comprehensive Survey
Ahmed Sharshar
Latif U. Khan
Waseem Ullah
Mohsen Guizani
VLM
62
2
0
11 Feb 2025
Foundation Models for Anomaly Detection: Vision and Challenges
Jing Ren
Tao Tang
Hong Jia
Haytham Fayek
Xiaodong Li
Suyu Ma
Xiwei Xu
Feng Xia
53
0
0
10 Feb 2025
Adaptive Perception for Unified Visual Multi-modal Object Tracking
Xiantao Hu
Bineng Zhong
Qihua Liang
Zhiyi Mo
Liangtao Shi
Ying Tai
Jian Yang
33
1
0
10 Feb 2025
PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction
Hammad A. Ayyubi
Xuande Feng
Junzhang Liu
Xudong Lin
Zhecan Wang
Shih-Fu Chang
35
0
0
24 Jan 2025
MASS: Overcoming Language Bias in Image-Text Matching
Jiwan Chung
Seungwon Lim
Sangkyu Lee
Youngjae Yu
VLM
30
0
0
20 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM
VLM
LRM
88
45
0
03 Jan 2025
Towards Visual Grounding: A Survey
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
44
3
0
31 Dec 2024
Advanced Knowledge Transfer: Refined Feature Distillation for Zero-Shot Quantization in Edge Computing
Inpyo Hong
Youngwan Jo
Hyojeong Lee
Sunghyun Ahn
Sanghyun Park
MQ
49
2
0
26 Dec 2024
Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering
Zhongjian Hu
Peng Yang
Bing Li
Zhenqi Wang
37
0
0
24 Dec 2024
Next Patch Prediction for Autoregressive Visual Generation
Yatian Pang
Peng Jin
Shuo Yang
Bin Lin
Bin Zhu
...
Liuhan Chen
Francis E. H. Tay
Ser-Nam Lim
Harry Yang
Li Yuan
120
8
0
19 Dec 2024
MAL: Cluster-Masked and Multi-Task Pretraining for Enhanced xLSTM Vision Performance
Wenjun Huang
Jianguo Hu
79
0
0
14 Dec 2024
[MASK] is All You Need
Vincent Tao Hu
Bjorn Ommer
DiffM
135
2
0
09 Dec 2024
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Kaixiong Gong
Kaituo Feng
B. Li
Yibing Wang
Mofan Cheng
...
Jiaming Han
Benyou Wang
Yutong Bai
Z. Yang
Xiangyu Yue
MLLM
AuLLM
VLM
79
5
0
03 Dec 2024
One Diffusion to Generate Them All
Duong H. Le
Tuan Pham
Sangho Lee
Christopher Clark
Aniruddha Kembhavi
Stephan Mandt
Ranjay Krishna
Jiasen Lu
VLM
59
5
0
25 Nov 2024
BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching
Yilong Zhao
Shuo Yang
Kan Zhu
Lianmin Zheng
Baris Kasikci
Yang Zhou
Jiarong Xing
Ion Stoica
106
5
0
25 Nov 2024
Efficient Online Inference of Vision Transformers by Training-Free Tokenization
Leonidas Gee
Wing Yan Li
V. Sharmanska
Novi Quadrianto
ViT
82
0
0
23 Nov 2024
Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning
Jiange Yang
Haoyi Zhu
Y. Wang
Gangshan Wu
Tong He
Limin Wang
86
2
0
21 Nov 2024
LaVin-DiT: Large Vision Diffusion Transformer
Zhaoqing Wang
Xiaobo Xia
Runnan Chen
Dongdong Yu
Changhu Wang
M. Gong
Tongliang Liu
92
6
0
18 Nov 2024
Autoregressive Models in Vision: A Survey
Jing Xiong
Gongye Liu
Lun Huang
Chengyue Wu
Taiqiang Wu
...
M. Zhang
Guillermo Sapiro
Jiebo Luo
Ping Luo
Ngai Wong
VGen
46
9
0
08 Nov 2024
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities
Zhaofeng Wu
Xinyan Velocity Yu
Dani Yogatama
Jiasen Lu
Yoon Kim
AIFin
41
10
0
07 Nov 2024
VQ-Map: Bird's-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization
Yiwei Zhang
Jin Gao
Fudong Ge
Guan Luo
Bing Li
Z. Zhang
Haibin Ling
Weiming Hu
47
0
0
03 Nov 2024
EMMA: End-to-End Multimodal Model for Autonomous Driving
Jyh-Jing Hwang
Runsheng Xu
Hubert Lin
Wei-Chih Hung
Jingwei Ji
...
Benjamin Sapp
Yin Zhou
James Guo
Dragomir Anguelov
Mingxing Tan
VLM
LM&Ro
36
28
0
30 Oct 2024
Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation
Zhaochong An
Guolei Sun
Yun Liu
Runjia Li
Min Wu
Ming-Ming Cheng
Ender Konukoglu
Serge J. Belongie
57
4
0
29 Oct 2024
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior
Hanyu Wang
Saksham Suri
Yixuan Ren
Hao Chen
Abhinav Shrivastava
VGen
21
9
0
28 Oct 2024
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
Zhiwei Hao
Jianyuan Guo
Li Shen
Yong Luo
Han Hu
Yonggang Wen
VLM
21
0
0
23 Oct 2024
1
2
3
4
5
6
7
Next