Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

International Conference on Learning Representations (ICLR), 2022
17 June 2022
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD, VLM, MLLM

Papers citing "Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks"

50 / 352 papers shown
Unified Language Representation for Question Answering over Text, Tables, and Images
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Yu Bowen
Cheng Fu
Haiyang Yu
Fei Huang
Yongbin Li
LMTD
257
30
0
29 Jun 2023
Semi-supervised Multimodal Representation Learning through a Global Workspace
IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2023
Benjamin Devillers
Léopold Maytié
R. V. Rullen
SSL
186
10
0
27 Jun 2023
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
International Conference on Learning Representations (ICLR), 2023
Sihan Chen
Xingjian He
Handong Li
Xiaojie Jin
Jiashi Feng
Qingbin Liu
VLM, CLIP
200
11
0
15 Jun 2023
Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models
Lingxi Xie
Longhui Wei
Xiaopeng Zhang
Kaifeng Bi
Xiaotao Gu
Jianlong Chang
Qi Tian
254
9
0
14 Jun 2023
AVIS: Autonomous Visual Information Seeking with Large Language Model Agent
Neural Information Processing Systems (NeurIPS), 2023
Ziniu Hu
Ahmet Iscen
Chen Sun
Kai-Wei Chang
Luke Huan
David A. Ross
Cordelia Schmid
Alireza Fathi
298
12
0
13 Jun 2023
Global and Local Semantic Completion Learning for Vision-Language Pre-training
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Rong-Cheng Tu
Yatai Ji
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
252
8
0
12 Jun 2023
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Neural Information Processing Systems (NeurIPS), 2023
Alexandre Ramé
Guillaume Couairon
Mustafa Shukor
Corentin Dancette
Jean-Baptiste Gaya
Laure Soulier
Matthieu Cord
MoMe
360
201
0
07 Jun 2023
Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
Computer Vision and Pattern Recognition (CVPR), 2023
Zaid Khan
B. Vijaykumar
S. Schulter
Xiang Yu
Y. Fu
Manmohan Chandraker
VLM, MLLM
248
24
0
06 Jun 2023
Unifying (Machine) Vision via Counterfactual World Modeling
Daniel M. Bear
Kevin T. Feigelis
Honglin Chen
Wanhee Lee
R. Venkatesh
Klemen Kotar
Alex Durango
Daniel L. K. Yamins
VGen
190
20
0
02 Jun 2023
Bytes Are All You Need: Transformers Operating Directly On File Bytes
Maxwell Horton
Sachin Mehta
Ali Farhadi
Mohammad Rastegari
VLM
204
11
0
31 May 2023
There is more to graphs than meets the eye: Learning universal features with self-supervision
L. Das
Sai Munikoti
M. Halappanavar
SSL, OOD
202
1
0
31 May 2023
Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Xingyu Fu
Shenmin Zhang
Gukyeong Kwon
Pramuditha Perera
Henghui Zhu
...
Zhiguo Wang
Vittorio Castelli
Patrick Ng
Dan Roth
Bing Xiang
193
31
0
30 May 2023
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
Neural Information Processing Systems (NeurIPS), 2023
Rui Yang
Lin Song
Yanwei Li
Sijie Zhao
Yixiao Ge
Xiu Li
Ying Shan
SyDa, MLLM
249
285
0
30 May 2023
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Xi Chen
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Soravit Changpinyo
...
Mojtaba Seyedhosseini
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
VLM
334
252
0
29 May 2023
Deeply Coupled Cross-Modal Prompt Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Xuejing Liu
Wei Tang
Jinghui Lu
Rui Zhao
Zhaojun Guo
Fei Tan
VLM
209
21
0
29 May 2023
Generating Images with Multimodal Language Models
Neural Information Processing Systems (NeurIPS), 2023
Jing Yu Koh
Daniel Fried
Ruslan Salakhutdinov
MLLM
359
326
0
26 May 2023
BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal Tasks
Nature Network Boston (NNB), 2023
Kai Zhang
Jun Yu
Eashan Adhikarla
Rong Zhou
Zhilin Yan
...
Hang Zhang
Yong Chen
Shijie Zhao
Hongfang Liu
Lichao Sun
LM&MA, MedIm
314
11
0
26 May 2023
LANISTR: Multimodal Learning from Structured and Unstructured Data
Sayna Ebrahimi
Sercan O. Arik
Yihe Dong
Tomas Pfister
236
7
0
26 May 2023
Exploring Diverse In-Context Configurations for Image Captioning
Neural Information Processing Systems (NeurIPS), 2023
Xu Yang
Yongliang Wu
Mingzhuo Yang
Haokun Chen
Xin Geng
MLLM
299
77
0
24 May 2023
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Emanuele Bugliarello
Aida Nematzadeh
Lisa Anne Hendricks
SSL
295
6
0
23 May 2023
i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data
Ziyi Yang
Mahmoud Khademi
Yichong Xu
Reid Pryzant
Yuwei Fang
...
Yu Shi
Lu Yuan
Takuya Yoshioka
Michael Zeng
Xuedong Huang
154
4
0
21 May 2023
Multimodal Web Navigation with Instruction-Finetuned Foundation Models
International Conference on Learning Representations (ICLR), 2023
Hiroki Furuta
Kuang-Huei Lee
Ofir Nachum
Yutaka Matsuo
Aleksandra Faust
S. Gu
Izzeddin Gur
LM&Ro
413
142
0
19 May 2023
TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding
Chenchi Zhang
Jun Xiao
Lei Chen
Jian Shao
Long Chen
VLM, LRM
171
3
0
19 May 2023
LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation
International Conference on Learning Representations (ICLR), 2023
Suhyeon Lee
Won Jun Kim
Jinho Chang
Jong Chul Ye
MedIm
579
70
0
19 May 2023
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
Neural Information Processing Systems (NeurIPS), 2023
Wen Wang
Zhe Chen
Xiaokang Chen
Jiannan Wu
Xizhou Zhu
...
Ping Luo
Tong Lu
Jie Zhou
Yu Qiao
Jifeng Dai
MLLM, VLM
302
617
0
18 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM, MLLM, ObjD
576
153
0
18 May 2023
Segment Any Anomaly without Training via Hybrid Prompt Regularization
IEEE Transactions on Cybernetics (IEEE Trans. Cybern.), 2023
Yunkang Cao
Xiaohao Xu
Chen Sun
Y. Cheng
Zongwei Du
Liang Gao
Nong Sang
VLM
270
88
0
18 May 2023
Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts
Zhaoyang Zhang
Yantao Shen
Kunyu Shi
Zhaowei Cai
Jun Fang
Siqi Deng
Hao Yang
Davide Modolo
Zhuowen Tu
Stefano Soatto
VLM
243
3
0
11 May 2023
Self-Chained Image-Language Model for Video Localization and Question Answering
Neural Information Processing Systems (NeurIPS), 2023
Shoubin Yu
Jaemin Cho
Prateek Yadav
Joey Tianyi Zhou
395
199
0
11 May 2023
OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese
Information Fusion (Inf. Fusion), 2023
Nghia Hieu Nguyen
Duong T.D. Vo
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
194
27
0
07 May 2023
An Empirical Study of Multimodal Model Merging
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yi-Lin Sung
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Joey Tianyi Zhou
Lijuan Wang
MoMe
329
52
0
28 Apr 2023
π-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation
International Conference on Machine Learning (ICML), 2023
Chengyue Wu
Teng Wang
Yixiao Ge
Zeyu Lu
Rui-Zhi Zhou
Ying Shan
Ping Luo
MoMe
213
43
0
27 Apr 2023
Transformer-Based Visual Segmentation: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Xiangtai Li
Henghui Ding
Haobo Yuan
Wenwei Zhang
Jiangmiao Pang
Guangliang Cheng
Kai-xiang Chen
Ziwei Liu
Chen Change Loy
ViT, MedIm
370
244
0
19 Apr 2023
Pretrained Language Models as Visual Planners for Human Assistance
IEEE International Conference on Computer Vision (ICCV), 2023
Dhruvesh Patel
H. Eghbalzadeh
Nitin Kamra
Michael L. Iuzzolino
Unnat Jain
Ruta Desai
LM&Ro
326
35
0
17 Apr 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
VLM
383
150
0
17 Apr 2023
Segment Everything Everywhere All at Once
Neural Information Processing Systems (NeurIPS), 2023
Xueyan Zou
Jianwei Yang
Hao Zhang
Feng Li
Linjie Li
Jianfeng Wang
Lijuan Wang
Jianfeng Gao
Yong Jae Lee
MLLM, VLM
410
674
0
13 Apr 2023
Exploring Effective Factors for Improving Visual In-Context Learning
IEEE Transactions on Image Processing (IEEE TIP), 2023
Yanpeng Sun
Qiang Chen
Xiaofan Li
Jian Wang
Jingdong Wang
Zechao Li
VLM, LRM
247
45
0
10 Apr 2023
Towards Unified Scene Text Spotting based on Sequence Generation
Computer Vision and Pattern Recognition (CVPR), 2023
Taeho Kil
Seonghyeon Kim
Sukmin Seo
Yoon Kim
Daehee Kim
160
29
0
07 Apr 2023
SegGPT: Segmenting Everything In Context
Xinlong Wang
Xiaosong Zhang
Yue Cao
Wen Wang
Chunhua Shen
Tiejun Huang
VOS, MLLM, VLM
203
244
0
06 Apr 2023
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
Vladislav Lialin
Stephen Rawls
David M. Chan
Shalini Ghosh
Anna Rumshisky
Wael Hamza
VLM, AI4TS
262
8
0
04 Apr 2023
Towards Flexible Multi-modal Document Models
Computer Vision and Pattern Recognition (CVPR), 2023
Naoto Inoue
Kotaro Kikuchi
E. Simo-Serra
Mayu Otani
Kota Yamaguchi
229
31
0
31 Mar 2023
Self-Supervised Multimodal Learning: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
319
89
0
31 Mar 2023
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Lucas Beyer
Bo Wan
Gagan Madan
Filip Pavetić
Andreas Steiner
...
Emanuele Bugliarello
Tianlin Li
Qihang Yu
Liang-Chieh Chen
Xiaohua Zhai
243
9
0
30 Mar 2023
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
Weicheng Kuo
A. Piergiovanni
Dahun Kim
Xiyang Luo
Benjamin Caine
...
Luowei Zhou
Andrew M. Dai
Zhifeng Chen
Claire Cui
A. Angelova
MLLM, VLM
379
30
0
29 Mar 2023
Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models
A. Maharana
Amita Kamath
Christopher Clark
Joey Tianyi Zhou
Aniruddha Kembhavi
247
3
0
28 Mar 2023
WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation
Computer Vision and Pattern Recognition (CVPR), 2023
Jongheon Jeong
Yang Zou
Taewan Kim
Dongqing Zhang
Avinash Ravichandran
Onkar Dabeer
VLM
399
347
0
26 Mar 2023
Train/Test-Time Adaptation with Retrieval
Computer Vision and Pattern Recognition (CVPR), 2023
Luca Zancato
Alessandro Achille
Tian Yu Liu
Matthew Trager
Pramuditha Perera
Stefano Soatto
TTA, OOD
195
14
0
25 Mar 2023
CoBIT: A Contrastive Bi-directional Image-Text Generation Model
International Conference on Learning Representations (ICLR), 2023
Haoxuan You
Mandy Guo
Zhecan Wang
Kai-Wei Chang
Jason Baldridge
Jiahui Yu
DiffM
210
14
0
23 Mar 2023
Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning
International Conference on Learning Representations (ICLR), 2023
Zaid Khan
Yun Fu
VLM
167
20
0
21 Mar 2023
Human Pose as Compositional Tokens
Computer Vision and Pattern Recognition (CVPR), 2023
Zigang Geng
Chunyu Wang
Yixuan Wei
Ze Liu
Houqiang Li
Han Hu
194
69
0
21 Mar 2023