Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
International Conference on Learning Representations (ICLR), 2022
17 June 2022
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD, VLM, MLLM

Papers citing "Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks"

50 / 352 papers shown
A Simple and Generalist Approach for Panoptic Segmentation
Nedyalko Prisadnikov
Wouter Van Gansbeke
Danda Pani Paudel
Luc Van Gool
VLM
368
1
0
29 Aug 2024
Surprisingly Fragile: Assessing and Addressing Prompt Instability in Multimodal Foundation Models
Ian Stewart
Sameera Horawalavithana
Brendan Kennedy
Sai Munikoti
Karl Pazdernik
AAML
213
2
0
26 Aug 2024
MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model
Neural Information Processing Systems (NeurIPS), 2024
Chaoya Jiang
Jia Hongrui
Haiyang Xu
Wei Ye
Mengfan Dong
Ming Yan
Ji Zhang
Fei Huang
Shikun Zhang
VLM
149
3
0
22 Aug 2024
Universal Novelty Detection Through Adaptive Contrastive Learning
Computer Vision and Pattern Recognition (CVPR), 2024
Hossein Mirzaei
Mojtaba Nafez
Mohammad Jafari
Mohammad Bagher Soltani
Mohammad Azizmalayeri
Jafar Habibi
Mohammad Sabokrou
M. Rohban
210
11
0
20 Aug 2024
DIVE: Towards Descriptive and Diverse Visual Commonsense Generation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Jun-Hyung Park
Hyuntae Park
Youjin Kang
Eojin Jeon
SangKeun Lee
158
0
0
15 Aug 2024
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Dongyang Liu
Shitian Zhao
Le Zhuo
Weifeng Lin
Ping Luo
Xinyue Li
Qi Qin
Yu Qiao
Hongsheng Li
Peng Gao
MLLM
373
106
0
05 Aug 2024
XMeCap: Meme Caption Generation with Sub-Image Adaptability
Yuyan Chen
Songzhou Yan
Zhihong Zhu
Zhixu Li
Yanghua Xiao
VLM
386
16
0
24 Jul 2024
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
Junyi Li
Junfeng Wu
Weizhi Zhao
Song Bai
Xiang Bai
196
13
0
23 Jul 2024
Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models
Wenbin An
Feng Tian
Jiahao Nie
Wenkai Shi
Haonan Lin
Yan Chen
Qianying Wang
Y. Wu
Guang Dai
Ping Chen
VLM
198
9
0
22 Jul 2024
Learning Visual Grounding from Generative Vision and Language Model
Shijie Wang
Dahun Kim
A. Taalimi
Chen Sun
Weicheng Kuo
ObjD
245
17
0
18 Jul 2024
ViLLa: Video Reasoning Segmentation with Large Language Model
Rongkun Zheng
Lu Qi
Xi Chen
Yi Wang
Kun Wang
Yu Qiao
Hengshuang Zhao
VOS, LRM
453
16
0
18 Jul 2024
Compositional Structures in Neural Embedding and Interaction Decompositions
Matthew Trager
Alessandro Achille
Pramuditha Perera
Luca Zancato
Stefano Soatto
CoGe
266
0
0
12 Jul 2024
SoupLM: Model Integration in Large Language and Multi-Modal Models
Yue Bai
Zichen Zhang
Jiasen Lu
Yun Fu
MoMe
130
1
0
11 Jul 2024
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
Yatai Ji
Shilong Zhang
Jie Wu
Peize Sun
Weifeng Chen
Xuefeng Xiao
Sidi Yang
Yanting Yang
Ping Luo
VLM
181
6
0
10 Jul 2024
Multi-Object Hallucination in Vision-Language Models
Xuweiyi Chen
Ziqiao Ma
Xuejun Zhang
Sihan Xu
Shengyi Qian
Jianing Yang
David Fouhey
Joyce Chai
276
42
0
08 Jul 2024
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation
Ethan Chern
Jiadi Su
Yan Ma
Pengfei Liu
MLLM
218
69
0
08 Jul 2024
VCHAR: Variance-Driven Complex Human Activity Recognition framework with Generative Representation
Yuan Sun
Navid Salami Pargoo
Taqiya Ehsan
Zhao Zhang
Jorge Ortiz
HAI
165
4
0
03 Jul 2024
SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation
Sayan Nag
Koustava Goswami
Srikrishna Karanam
262
6
0
02 Jul 2024
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Sanjoy Chowdhury
Sayan Nag
Subhrajyoti Dasgupta
Jun Chen
Mohamed Elhoseiny
Ruohan Gao
Dinesh Manocha
VLM, MLLM
355
21
0
01 Jul 2024
Toward a Diffusion-Based Generalist for Dense Vision Tasks
Yue Fan
Yongqin Xian
Xiaohua Zhai
Alexander Kolesnikov
Muhammad Ferjad Naeem
Bernt Schiele
Federico Tombari
VLM, MDE, DiffM
108
2
0
29 Jun 2024
MACAROON: Training Vision-Language Models To Be Your Engaged Partners
Shujin Wu
Yi R. Fung
Sha Li
Yixin Wan
Kai-Wei Chang
Heng Ji
199
10
0
20 Jun 2024
Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%
Lei Zhu
Fangyun Wei
Yanye Lu
Dong Chen
VLM
203
64
0
17 Jun 2024
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Roman Bachmann
Oğuzhan Fatih Kar
David Mizrahi
Ali Garjani
Mingfei Gao
David Griffiths
Jiaming Hu
Afshin Dehghan
Amir Zamir
MoE, VLM, MLLM
234
33
0
13 Jun 2024
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Peize Sun
Yi Jiang
Shoufa Chen
Shilong Zhang
Bingyue Peng
Ping Luo
Zehuan Yuan
VLM
481
519
0
10 Jun 2024
Medical Vision Generalist: Unifying Medical Imaging Tasks in Context
Sucheng Ren
Xiaoke Huang
Xianhang Li
Junfei Xiao
Jieru Mei
Zeyu Wang
Alan Yuille
Yuyin Zhou
MedIm
186
11
0
08 Jun 2024
Generalist Multimodal AI: A Review of Architectures, Challenges and Opportunities
Sai Munikoti
Ian Stewart
Sameera Horawalavithana
Henry Kvinge
Tegan H. Emerson
Sandra E Thompson
Karl Pazdernik
199
4
0
08 Jun 2024
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
Junho Kim
Hyunjun Kim
Yeonju Kim
Yong Man Ro
MLLM
192
30
0
04 Jun 2024
X-VILA: Cross-Modality Alignment for Large Language Model
Hanrong Ye
De-An Huang
Yao Lu
Zhiding Yu
Ming-Yu Liu
...
Jan Kautz
Song Han
Dan Xu
Pavlo Molchanov
Hongxu Yin
MLLM, VLM
244
43
0
29 May 2024
Multi-Modal Generative Embedding Model
Feipeng Ma
Hongwei Xue
Guangting Wang
Yizhou Zhou
Fengyun Rao
Shilin Yan
Yueyi Zhang
Siying Wu
Mike Zheng Shou
Xiaoyan Sun
VLM
137
6
0
29 May 2024
Multi-modal Generation via Cross-Modal In-Context Learning
Amandeep Kumar
Muzammal Naseer
Sanath Narayan
Rao Muhammad Anwer
Salman Khan
Hisham Cholakkal
MLLM
156
2
0
28 May 2024
The Evolution of Multimodal Model Architectures
S. Wadekar
Abhishek Chaurasia
Vasu Sharma
Eugenio Culurciello
285
26
0
28 May 2024
TrojFM: Resource-efficient Backdoor Attacks against Very Large Foundation Models
Yuzhou Nie
Yanting Wang
Jinyuan Jia
Michael J. De Lucia
Nathaniel D. Bastian
Wenbo Guo
Dawn Song
SILM, AAML
198
8
0
27 May 2024
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Mustafa Shukor
Matthieu Cord
265
11
0
26 May 2024
Activator: GLU Activation Function as the Core Component of a Vision Transformer
Abdullah Nazhat Abdullah
Tarkan Aydin
ViT
260
0
0
24 May 2024
Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning
Zishan Gu
Fenglin Liu
Changchang Yin
Ping Zhang
LRM, LM&MA
219
2
0
19 May 2024
Libra: Building Decoupled Vision System on Large Language Models
International Conference on Machine Learning (ICML), 2024
Yifan Xu
Xiaoshan Yang
Y. Song
Changsheng Xu
MLLM, VLM
166
10
0
16 May 2024
UniCorn: A Unified Contrastive Learning Approach for Multi-view Molecular Representation Learning
International Conference on Machine Learning (ICML), 2024
Shikun Feng
Yuyan Ni
Minghao Li
Yanwen Huang
Zhiming Ma
Wei-Ying Ma
Yanyan Lan
SSL
258
17
0
15 May 2024
DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks
Jiaxin Zhang
Dezhi Peng
Chongyu Liu
Peirong Zhang
Lianwen Jin
VLM
161
26
0
07 May 2024
One-Stage Open-Vocabulary Temporal Action Detection Leveraging Temporal Multi-scale and Action Label Features
Trung Thanh Nguyen
Yasutomo Kawanishi
Takahiro Komamizu
Ichiro Ide
VLM
162
6
0
30 Apr 2024
UniFS: Universal Few-shot Instance Perception with Point Representations
Sheng Jin
Ruijie Yao
Lumin Xu
Wentao Liu
Chao Qian
Ji Wu
Ping Luo
238
2
0
30 Apr 2024
Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild
Donggyun Kim
Seongwoong Cho
Semin Kim
Chong Luo
Seunghoon Hong
VLM
219
5
0
29 Apr 2024
What Makes Multimodal In-Context Learning Work?
Folco Bertini Baldassini
Mustafa Shukor
Matthieu Cord
Laure Soulier
Benjamin Piwowarski
380
36
0
24 Apr 2024
In-Context Translation: Towards Unifying Image Recognition, Processing, and Generation
Han Xue
Qianru Sun
Li Song
Wenjun Zhang
Zhiwu Huang
MLLM
157
0
0
15 Apr 2024
A Survey on Multimodal Wearable Sensor-based Human Action Recognition
Jianyuan Ni
Hao Tang
Syed Tousiful Haque
Yan Yan
A. Ngu
264
21
0
14 Apr 2024
Connecting NeRFs, Images, and Text
Francesco Ballerini
Pierluigi Zama Ramirez
Roberto Mirabella
Samuele Salti
Luigi Di Stefano
294
7
0
11 Apr 2024
GLID: Pre-training a Generalist Encoder-Decoder Vision Model
Jihao Liu
Jinliang Zheng
Yu Liu
Jiaming Song
VLM
171
6
0
11 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models
European Conference on Computer Vision (ECCV), 2024
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM, VLM
268
56
0
10 Apr 2024
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
Neural Information Processing Systems (NeurIPS), 2024
Keyu Tian
Yi Jiang
Zehuan Yuan
Bingyue Peng
Liwei Wang
VGen
363
685
0
03 Apr 2024
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiaohua Zhai
VLM
340
20
0
28 Mar 2024
Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation
Zhitong Xiong
Yi Wang
Fahong Zhang
Adam J. Stewart
Joelle Hanna
Damian Borth
Ioannis Papoutsis
B. L. Saux
Gustau Camps-Valls
Xiao Xiang Zhu
AI4CE
269
51
0
22 Mar 2024