ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2206.08916
  4. Cited By
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

17 June 2022
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
    ObjD
    VLM
    MLLM
ArXivPDFHTML

Papers citing "Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks"

50 / 327 papers shown
Title
Pretrained Language Models as Visual Planners for Human Assistance
Pretrained Language Models as Visual Planners for Human Assistance
Dhruvesh Patel
H. Eghbalzadeh
Nitin Kamra
Michael L. Iuzzolino
Unnat Jain
Ruta Desai
LM&Ro
11
24
0
17 Apr 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
Jinhui Tang
VLM
26
99
0
17 Apr 2023
Segment Everything Everywhere All at Once
Segment Everything Everywhere All at Once
Xueyan Zou
Jianwei Yang
Hao Zhang
Feng Li
Linjie Li
Jianfeng Wang
Lijuan Wang
Jianfeng Gao
Yong Jae Lee
MLLM
VLM
9
453
0
13 Apr 2023
Exploring Effective Factors for Improving Visual In-Context Learning
Exploring Effective Factors for Improving Visual In-Context Learning
Yanpeng Sun
Qiang Chen
Jian Wang
Jingdong Wang
Zechao Li
LRM
VLM
43
24
0
10 Apr 2023
Towards Unified Scene Text Spotting based on Sequence Generation
Towards Unified Scene Text Spotting based on Sequence Generation
Taeho Kil
Seonghyeon Kim
Sukmin Seo
Yoon Kim
Daehee Kim
70
19
0
07 Apr 2023
SegGPT: Segmenting Everything In Context
SegGPT: Segmenting Everything In Context
Xinlong Wang
Xiaosong Zhang
Yue Cao
Wen Wang
Chunhua Shen
Tiejun Huang
VOS
MLLM
VLM
16
198
0
06 Apr 2023
Scalable and Accurate Self-supervised Multimodal Representation Learning
  without Aligned Video and Text Data
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
Vladislav Lialin
Stephen Rawls
David M. Chan
Shalini Ghosh
Anna Rumshisky
Wael Hamza
VLM
AI4TS
17
6
0
04 Apr 2023
Towards Flexible Multi-modal Document Models
Towards Flexible Multi-modal Document Models
Naoto Inoue
Kotaro Kikuchi
E. Simo-Serra
Mayu Otani
Kota Yamaguchi
32
20
0
31 Mar 2023
Self-Supervised Multimodal Learning: A Survey
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
16
43
0
31 Mar 2023
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Lucas Beyer
Bo Wan
Gagan Madan
Filip Pavetić
Andreas Steiner
...
Emanuele Bugliarello
Xiao Wang
Qihang Yu
Liang-Chieh Chen
Xiaohua Zhai
49
8
0
30 Mar 2023
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
Weicheng Kuo
A. Piergiovanni
Dahun Kim
Xiyang Luo
Benjamin Caine
...
Luowei Zhou
Andrew M. Dai
Zhifeng Chen
Claire Cui
A. Angelova
MLLM
VLM
20
23
0
29 Mar 2023
Exposing and Addressing Cross-Task Inconsistency in Unified
  Vision-Language Models
Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models
A. Maharana
Amita Kamath
Christopher Clark
Mohit Bansal
Aniruddha Kembhavi
22
3
0
28 Mar 2023
WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation
WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation
Jongheon Jeong
Yang Zou
Taewan Kim
Dongqing Zhang
Avinash Ravichandran
O. Dabeer
VLM
67
184
0
26 Mar 2023
Train/Test-Time Adaptation with Retrieval
Train/Test-Time Adaptation with Retrieval
L. Zancato
Alessandro Achille
Tian Yu Liu
Matthew Trager
Pramuditha Perera
Stefano Soatto
TTA
OOD
8
11
0
25 Mar 2023
CoBIT: A Contrastive Bi-directional Image-Text Generation Model
CoBIT: A Contrastive Bi-directional Image-Text Generation Model
Haoxuan You
Mandy Guo
Zhecan Wang
Kai-Wei Chang
Jason Baldridge
Jiahui Yu
DiffM
37
12
0
23 Mar 2023
Contrastive Alignment of Vision to Language Through Parameter-Efficient
  Transfer Learning
Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning
Zaid Khan
Yun Fu
VLM
28
11
0
21 Mar 2023
Human Pose as Compositional Tokens
Human Pose as Compositional Tokens
Zigang Geng
Chunyu Wang
Yixuan Wei
Ze Liu
Houqiang Li
Han Hu
23
47
0
21 Mar 2023
eP-ALM: Efficient Perceptual Augmentation of Language Models
eP-ALM: Efficient Perceptual Augmentation of Language Models
Mustafa Shukor
Corentin Dancette
Matthieu Cord
MLLM
VLM
24
29
0
20 Mar 2023
Generative Semantic Segmentation
Generative Semantic Segmentation
Jia-Qing Chen
Jiachen Lu
Xiatian Zhu
Li Zhang
GAN
ISeg
VLM
22
38
0
20 Mar 2023
CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos
CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos
Seungju Han
Jack Hessel
Nouha Dziri
Yejin Choi
Youngjae Yu
VGen
25
16
0
17 Mar 2023
ViM: Vision Middleware for Unified Downstream Transferring
ViM: Vision Middleware for Unified Downstream Transferring
Yutong Feng
Biao Gong
Jianwen Jiang
Yiliang Lv
Yujun Shen
Deli Zhao
Jingren Zhou
14
1
0
13 Mar 2023
Universal Instance Perception as Object Discovery and Retrieval
Universal Instance Perception as Object Discovery and Retrieval
B. Yan
Yi-Xin Jiang
Jiannan Wu
D. Wang
Ping Luo
Zehuan Yuan
Huchuan Lu
VOS
VLM
LRM
27
161
0
12 Mar 2023
UniHCP: A Unified Model for Human-Centric Perceptions
UniHCP: A Unified Model for Human-Centric Perceptions
Yuanzheng Ci
Yizhou Wang
Meilin Chen
Shixiang Tang
Lei Bai
Feng Zhu
Rui Zhao
F. Yu
Donglian Qi
Wanli Ouyang
77
50
0
06 Mar 2023
Prismer: A Vision-Language Model with Multi-Task Experts
Prismer: A Vision-Language Model with Multi-Task Experts
Shikun Liu
Linxi Fan
Edward Johns
Zhiding Yu
Chaowei Xiao
Anima Anandkumar
VLM
MLLM
34
21
0
04 Mar 2023
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion
  Tasks
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks
Xiaoping Han
Xiatian Zhu
Licheng Yu
Li Zhang
Yi-Zhe Song
Tao Xiang
VLM
11
38
0
04 Mar 2023
Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering
Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering
Zhou Yu
Xuecheng Ouyang
Zhenwei Shao
Mei Wang
Jun Yu
MLLM
86
11
0
03 Mar 2023
StraIT: Non-autoregressive Generation with Stratified Image Transformer
StraIT: Non-autoregressive Generation with Stratified Image Transformer
Shengju Qian
Huiwen Chang
Yuanzhen Li
Zizhao Zhang
Jiaya Jia
Han Zhang
31
10
0
01 Mar 2023
Language-Driven Representation Learning for Robotics
Language-Driven Representation Learning for Robotics
Siddharth Karamcheti
Suraj Nair
Annie S. Chen
Thomas Kollar
Chelsea Finn
Dorsa Sadigh
Percy Liang
LM&Ro
SSL
6
142
0
24 Feb 2023
Can Pre-trained Vision and Language Models Answer Visual
  Information-Seeking Questions?
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?
Yang Chen
Hexiang Hu
Yi Luan
Haitian Sun
Soravit Changpinyo
Alan Ritter
Ming-Wei Chang
27
79
0
23 Feb 2023
Backdoor Attacks to Pre-trained Unified Foundation Models
Backdoor Attacks to Pre-trained Unified Foundation Models
Zenghui Yuan
Yixin Liu
Kai Zhang
Pan Zhou
Lichao Sun
AAML
11
10
0
18 Feb 2023
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Raghav Goyal
E. Mavroudi
Xitong Yang
Sainbayar Sukhbaatar
Leonid Sigal
Matt Feiszli
Lorenzo Torresani
Du Tran
8
7
0
16 Feb 2023
PolyFormer: Referring Image Segmentation as Sequential Polygon
  Generation
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation
Jiang Liu
Hui Ding
Zhaowei Cai
Yuting Zhang
R. Satzoda
Vijay Mahadevan
R. Manmatha
ObjD
15
120
0
14 Feb 2023
Grounding Large Language Models in Interactive Environments with Online
  Reinforcement Learning
Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning
Thomas Carta
Clément Romac
Thomas Wolf
Sylvain Lamprier
Olivier Sigaud
Pierre-Yves Oudeyer
LM&Ro
LLMAG
14
178
0
06 Feb 2023
See, Think, Confirm: Interactive Prompting Between Vision and Language
  Models for Knowledge-based Visual Reasoning
See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning
Zhenfang Chen
Qinhong Zhou
Yikang Shen
Yining Hong
Hao Zhang
Chuang Gan
LRM
VLM
29
35
0
12 Jan 2023
All in Tokens: Unifying Output Space of Visual Tasks via Soft Token
All in Tokens: Unifying Output Space of Visual Tasks via Soft Token
Jia Ning
Chen Li
Zheng-Wei Zhang
Zigang Geng
Qi Dai
Kun He
Han Hu
33
42
0
05 Jan 2023
Do DALL-E and Flamingo Understand Each Other?
Do DALL-E and Flamingo Understand Each Other?
Hang Li
Jindong Gu
Rajat Koner
Sahand Sharifzadeh
Volker Tresp
MLLM
16
12
0
23 Dec 2022
Generalized Decoding for Pixel, Image, and Language
Generalized Decoding for Pixel, Image, and Language
Xueyan Zou
Zi-Yi Dou
Jianwei Yang
Zhe Gan
Linjie Li
...
Lu Yuan
Nanyun Peng
Lijuan Wang
Yong Jae Lee
Jianfeng Gao
VLM
MLLM
ObjD
13
238
0
21 Dec 2022
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction
  Tuning
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
Zhiyang Xu
Ying Shen
Lifu Huang
MLLM
16
110
0
21 Dec 2022
Universal Object Detection with Large Vision Model
Universal Object Detection with Large Vision Model
Feng-Huei Lin
Wenze Hu
Yaowei Wang
Yonghong Tian
Guangming Lu
Fanglin Chen
Yong-mei Xu
Xiaoyu Wang
VLM
ObjD
27
8
0
19 Dec 2022
Transferring General Multimodal Pretrained Models to Text Recognition
Transferring General Multimodal Pretrained Models to Text Recognition
Junyang Lin
Xuancheng Ren
Yichang Zhang
Gao Liu
Peng Wang
An Yang
Chang Zhou
32
4
0
19 Dec 2022
Egocentric Video Task Translation
Egocentric Video Task Translation
Zihui Xue
Yale Song
Kristen Grauman
Lorenzo Torresani
EgoV
21
13
0
13 Dec 2022
OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist
  Models
OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models
Jinze Bai
Rui Men
Han Yang
Xuancheng Ren
Kai Dang
...
Wenhang Ge
Jianxin Ma
Junyang Lin
Jingren Zhou
Chang Zhou
37
15
0
08 Dec 2022
Hierarchical multimodal transformers for Multi-Page DocVQA
Hierarchical multimodal transformers for Multi-Page DocVQA
Rubèn Pérez Tito
Dimosthenis Karatzas
Ernest Valveny
11
54
0
07 Dec 2022
Unifying Vision, Text, and Layout for Universal Document Processing
Unifying Vision, Text, and Layout for Universal Document Processing
Zineng Tang
Ziyi Yang
Guoxin Wang
Yuwei Fang
Yang Liu
Chenguang Zhu
Michael Zeng
Chao-Yue Zhang
Mohit Bansal
VLM
28
105
0
05 Dec 2022
Images Speak in Images: A Generalist Painter for In-Context Visual
  Learning
Images Speak in Images: A Generalist Painter for In-Context Visual Learning
Xinlong Wang
Wen Wang
Yue Cao
Chunhua Shen
Tiejun Huang
VLM
MLLM
45
244
0
05 Dec 2022
Localization vs. Semantics: Visual Representations in Unimodal and
  Multimodal Models
Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models
Zhuowan Li
Cihang Xie
Benjamin Van Durme
Alan L. Yuille
VLM
SSL
12
2
0
01 Dec 2022
Perceive, Ground, Reason, and Act: A Benchmark for General-purpose
  Visual Representation
Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation
Jiangyong Huang
William Zhu
Baoxiong Jia
Zan Wang
Xiaojian Ma
Qing Li
Siyuan Huang
24
5
0
28 Nov 2022
Seeing What You Miss: Vision-Language Pre-training with Semantic
  Completion Learning
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning
Yatai Ji
Rong-Cheng Tu
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
32
13
0
24 Nov 2022
Unifying Vision-Language Representation Space with Single-tower
  Transformer
Unifying Vision-Language Representation Space with Single-tower Transformer
Jiho Jang
Chaerin Kong
D. Jeon
Seonhoon Kim
Nojun Kwak
19
19
0
21 Nov 2022
Visual Programming: Compositional visual reasoning without training
Visual Programming: Compositional visual reasoning without training
Tanmay Gupta
Aniruddha Kembhavi
ReLM
VLM
LRM
43
397
0
18 Nov 2022
Previous
1234567
Next