2206.08916
Cited By
v1
v2 (latest)
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
International Conference on Learning Representations (ICLR), 2022
17 June 2022
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD
VLM
MLLM
Papers citing
"Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks"
50 / 352 papers shown
Title
Direction-Aware Diagonal Autoregressive Image Generation
Yijia Xu
Jianzhong Ju
Jian Luan
J. Cui
334
4
0
14 Mar 2025
HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding
Rui Yang
Lin Song
Yicheng Xiao
Runhui Huang
Yixiao Ge
Mingyu Ding
Hengshuang Zhao
MLLM
196
6
0
12 Mar 2025
Foundation X: Integrating Classification, Localization, and Segmentation through Lock-Release Pretraining Strategy for Chest X-ray Analysis
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2025
N. Islam
Dongao Ma
Jiaxuan Pang
Shivasakthi Senthil Velan
Michael B. Gotway
Jianming Liang
203
0
0
12 Mar 2025
Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment
Xing Xie
Jiawei Liu
Ziyue Lin
Huijie Fan
Zhi Han
Yandong Tang
Liangqiong Qu
360
0
0
10 Mar 2025
USP: Unified Self-Supervised Pretraining for Image Generation and Understanding
Xiangxiang Chu
Renda Li
Yong Wang
466
14
0
08 Mar 2025
Underlying Semantic Diffusion for Effective and Efficient In-Context Learning
Zhong Ji
Weilong Cao
Yan Zhang
Yanwei Pang
Jungong Han
Xuelong Li
DiffM
VLM
238
0
0
06 Mar 2025
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
Hao Tang
Chenwei Xie
Haiyang Wang
Xiaoyi Bao
Tingyu Weng
Nianzu Yang
Yun Zheng
Liwei Wang
ObjD
VLM
325
12
0
03 Mar 2025
FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction
Siyu Jiao
Gengwei Zhang
Yinlong Qian
Jiancheng Huang
Yao Zhao
Humphrey Shi
Lin Ma
Y. X. Wei
Zequn Jie
VLM
206
15
0
27 Feb 2025
Vision-Language Models for Edge Networks: A Comprehensive Survey
IEEE Internet of Things Journal (IEEE IoT J.), 2025
Ahmed Sharshar
Latif U. Khan
Waseem Ullah
Mohsen Guizani
VLM
313
3
0
11 Feb 2025
Adaptive Perception for Unified Visual Multi-modal Object Tracking
IEEE Transactions on Artificial Intelligence (IEEE TAI), 2025
Xiantao Hu
Bineng Zhong
Qihua Liang
Zhiyi Mo
Liangtao Shi
Ying Tai
Jian Yang
225
8
0
10 Feb 2025
Foundation Models for Anomaly Detection: Vision and Challenges
Jing Ren
Tao Tang
Hong Jia
Haytham Fayek
Xiaodong Li
Suyu Ma
Xiwei Xu
Feng Xia
409
2
0
10 Feb 2025
PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Hammad A. Ayyubi
Xuande Feng
Junzhang Liu
Xudong Lin
Zhecan Wang
Shih-Fu Chang
149
1
0
24 Jan 2025
MASS: Overcoming Language Bias in Image-Text Matching
AAAI Conference on Artificial Intelligence (AAAI), 2025
Jiwan Chung
Seungwon Lim
Sangkyu Lee
Youngjae Yu
VLM
189
0
0
20 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Neural Information Processing Systems (NeurIPS), 2024
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM
VLM
LRM
707
116
0
03 Jan 2025
Towards Visual Grounding: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
803
26
0
28 Dec 2024
Advanced Knowledge Transfer: Refined Feature Distillation for Zero-Shot Quantization in Edge Computing
ACM Symposium on Applied Computing (SAC), 2024
Inpyo Hong
Youngwan Jo
Hyojeong Lee
Sunghyun Ahn
Sanghyun Park
MQ
313
6
0
26 Dec 2024
Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering
Zhongjian Hu
Peng Yang
Bing Li
Zhenqi Wang
213
2
0
24 Dec 2024
Next Patch Prediction for Autoregressive Visual Generation
Yatian Pang
Peng Jin
Shuo Yang
Bin Lin
Bin Zhu
...
Liuhan Chen
Francis E. H. Tay
Ser-Nam Lim
Harry Yang
Li Yuan
511
19
0
19 Dec 2024
MAL: Cluster-Masked and Multi-Task Pretraining for Enhanced xLSTM Vision Performance
Wenjun Huang
Jianguo Hu
191
0
0
14 Dec 2024
[MASK] is All You Need
Vincent Tao Hu
Bjorn Ommer
DiffM
447
8
0
09 Dec 2024
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Kaixiong Gong
Kaituo Feng
Yangqiu Song
Yibing Wang
Mofan Cheng
...
Jiaming Han
Benyou Wang
Yutong Bai
Zhiyong Yang
Xiangyu Yue
MLLM
AuLLM
VLM
235
25
0
03 Dec 2024
BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching
Yilong Zhao
Shuo Yang
Kan Zhu
Lianmin Zheng
Baris Kasikci
Yang Zhou
Jiarong Xing
Eric Liang
337
15
0
25 Nov 2024
One Diffusion to Generate Them All
Computer Vision and Pattern Recognition (CVPR), 2024
Duong H. Le
Tuan Pham
Sangho Lee
Christopher Clark
Aniruddha Kembhavi
Stephan Mandt
Ranjay Krishna
Jiasen Lu
VLM
408
32
0
25 Nov 2024
Visual-Word Tokenizer: Beyond Fixed Sets of Tokens in Vision Transformers
Leonidas Gee
Wing Yan Li
V. Sharmanska
Novi Quadrianto
ViT
585
0
0
23 Nov 2024
Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning
Computer Vision and Pattern Recognition (CVPR), 2024
Jiange Yang
Haoyi Zhu
Yanjie Wang
Gangshan Wu
Tong He
Limin Wang
360
11
0
21 Nov 2024
LaVin-DiT: Large Vision Diffusion Transformer
Computer Vision and Pattern Recognition (CVPR), 2024
Zhaoqing Wang
Xiaobo Xia
Runnan Chen
Dongdong Yu
Changhu Wang
Mingming Gong
Tongliang Liu
485
19
0
18 Nov 2024
Autoregressive Models in Vision: A Survey
Jing Xiong
Gongye Liu
Lun Huang
Chengyue Wu
Taiqiang Wu
...
Hao Fei
Guillermo Sapiro
Jiebo Luo
Ping Luo
Ngai Wong
VGen
422
37
0
08 Nov 2024
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities
International Conference on Learning Representations (ICLR), 2024
Zhaofeng Wu
Xinyan Velocity Yu
Dani Yogatama
Jiasen Lu
Yoon Kim
AIFin
421
35
0
07 Nov 2024
VQ-Map: Bird's-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization
Neural Information Processing Systems (NeurIPS), 2024
Yiwei Zhang
Jin Gao
Fudong Ge
Guan Luo
Bing Li
Zheng Zhang
Haibin Ling
Weiming Hu
139
1
0
03 Nov 2024
EMMA: End-to-End Multimodal Model for Autonomous Driving
Jyh-Jing Hwang
Runsheng Xu
Hubert Lin
Wei-Chih Hung
Jingwei Ji
...
Benjamin Sapp
Yin Zhou
James Guo
Dragomir Anguelov
Mingxing Tan
VLM
LM&Ro
360
108
0
30 Oct 2024
Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation
International Conference on Learning Representations (ICLR), 2024
Zhaochong An
Guolei Sun
Yun Liu
Runjia Li
Min Wu
Ming-Ming Cheng
Ender Konukoglu
Serge Belongie
365
15
0
29 Oct 2024
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior
International Conference on Learning Representations (ICLR), 2024
Hanyu Wang
Saksham Suri
Yixuan Ren
Hao Chen
Abhinav Shrivastava
VGen
279
26
0
28 Oct 2024
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
International Journal of Computer Vision (IJCV), 2024
Zhiwei Hao
Jianyuan Guo
Li Shen
Yong Luo
Han Hu
Yonggang Wen
VLM
255
2
0
23 Oct 2024
Locality Alignment Improves Vision-Language Models
International Conference on Learning Representations (ICLR), 2024
Ian Covert
Tony Sun
James Zou
Tatsunori Hashimoto
VLM
517
11
0
14 Oct 2024
Skipping Computations in Multimodal LLMs
Mustafa Shukor
Matthieu Cord
195
6
0
12 Oct 2024
CAR: Controllable Autoregressive Modeling for Visual Generation
Ziyu Yao
Jialin Li
Yifeng Zhou
Yong Liu
Xi Jiang
Chengjie Wang
Feng Zheng
Yuexian Zou
Lei Li
DiffM
280
27
0
07 Oct 2024
Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning
International Conference on Learning Representations (ICLR), 2024
Minheng Ni
Yutao Fan
Lei Zhang
Wangmeng Zuo
LRM
AI4CE
203
19
0
04 Oct 2024
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation
International Conference on Learning Representations (ICLR), 2024
Liang Chen
Sinan Tan
Zefan Cai
Weichu Xie
Haozhe Zhao
Yichi Zhang
Junyang Lin
Jinze Bai
Tianyu Liu
Baobao Chang
ViT
216
7
0
02 Oct 2024
Universal Medical Image Representation Learning with Compositional Decoders
Kaini Wang
Ling Yang
Siping Zhou
Guangquan Zhou
Wentao Zhang
Bin Cui
Shuo Li
SSL
MedIm
246
1
0
30 Sep 2024
Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks
Min Yang
Zichen Zhang
Limin Wang
AI4TS
167
0
0
27 Sep 2024
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE
Neural Information Processing Systems (NeurIPS), 2024
Xun Zhu
Ying Hu
Fanbin Mo
Chenyi Guo
Ji Wu
245
15
0
26 Sep 2024
ChatCam: Empowering Camera Control through Conversational AI
Neural Information Processing Systems (NeurIPS), 2024
Xinhang Liu
Yu-Wing Tai
Chi-Keung Tang
VGen
216
10
0
25 Sep 2024
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
International Conference on Learning Representations (ICLR), 2024
Weifeng Lin
Xinyu Wei
Renrui Zhang
Le Zhuo
Shitian Zhao
...
Junlin Xie
Yu Qiao
Peng Gao
Hongsheng Li
MLLM
DiffM
477
24
0
23 Sep 2024
@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Xin Jiang
Junwei Zheng
Ruiping Liu
Jiahang Li
Jiaming Zhang
Sven Matthiesen
Rainer Stiefelhagen
VLM
151
2
0
21 Sep 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
381
5
0
19 Sep 2024
TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
IEEE Robotics and Automation Letters (RA-L), 2024
Junjie Wen
Yinlin Zhu
Jinming Li
Minjie Zhu
Kun Wu
...
Ran Cheng
Yaxin Peng
Chaomin Shen
Feifei Feng
Jian Tang
LM&Ro
621
195
0
19 Sep 2024
DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Shota Nakada
Taichi Nishimura
Hokuto Munakata
Masayoshi Kondo
Tatsuya Komatsu
CLIP
VLM
139
2
0
18 Sep 2024
What to align in multimodal contrastive learning?
International Conference on Learning Representations (ICLR), 2024
Benoit Dufumier
J. Castillo-Navarro
D. Tuia
Jean-Philippe Thiran
293
24
0
11 Sep 2024
Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
Yujie Wang
Shenhan Zhu
Fangcheng Fu
Xupeng Miao
Jie Zhang
Juan Zhu
Fan Hong
Yongbin Li
Bin Cui
89
0
0
05 Sep 2024
AWRaCLe: All-Weather Image Restoration using Visual In-Context Learning
AAAI Conference on Artificial Intelligence (AAAI), 2024
Sudarshan Rajagopalan
Vishal M. Patel
206
10
0
30 Aug 2024