ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2206.08916
  4. Cited By
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
v1v2 (latest)

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

International Conference on Learning Representations (ICLR), 2022
17 June 2022
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
    ObjDVLMMLLM
ArXiv (abs)PDFHTMLHuggingFace (1 upvotes)

Papers citing "Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks"

50 / 352 papers shown
Title
T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs
Shao-Jun Xia
Huixin Zhang
Zhengzhong Tu
MLLMVLM
337
0
0
20 Nov 2025
Seg-VAR: Image Segmentation with Visual Autoregressive Modeling
Seg-VAR: Image Segmentation with Visual Autoregressive Modeling
Rongkun Zheng
Lu Qi
Xi Chen
Yi Wang
K. Wang
Hengshuang Zhao
116
0
0
16 Nov 2025
Visual Bridge: Universal Visual Perception Representations Generating
Visual Bridge: Universal Visual Perception Representations Generating
Yilin Gao
Shuguang Dou
Junzhou Li
Zhiheng Yu
Yin Li
Dongsheng Jiang
Shugong Xu
DiffMVOS
302
0
0
11 Nov 2025
ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology
ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology
Srikumar Sastry
Subash Khanal
Aayush Dhakal
Jiayu Lin
Dan Cher
Phoenix Jarosz
Nathan Jacobs
132
0
0
04 Nov 2025
Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
Rahul Raja
A. Vats
139
1
0
23 Oct 2025
UniMedVL: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis
UniMedVL: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis
Junzhi Ning
Wei Li
Cheng Tang
Jiashi Lin
Chenglong Ma
...
Yanzhou Su
Jin Ye
Shixiang Tang
Ming Hu
Junjun He
MedIm
290
1
0
17 Oct 2025
MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning
MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning
Mattia Segu
Marta Tintore Gazulla
Yongqin Xian
Luc Van Gool
Federico Tombari
74
0
0
16 Oct 2025
Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded Prompts
Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded PromptsChinese Conference on Pattern Recognition and Computer Vision (CPRCV), 2025
Yanning Hou
K. Xu
J. Li
Yanran Ruan
Jianfeng Qiu
VLM
88
3
0
13 Oct 2025
Text-Scene: A Scene-to-Language Parsing Framework for 3D Scene Understanding
Text-Scene: A Scene-to-Language Parsing Framework for 3D Scene Understanding
Haoyuan Li
Rui Liu
Hehe Fan
Yi Yang
LM&Ro
86
0
0
20 Sep 2025
AToken: A Unified Tokenizer for Vision
AToken: A Unified Tokenizer for Vision
Jiasen Lu
Liangchen Song
Mingze Xu
Byeongjoo Ahn
Yanjun Wang
Chen Chen
Afshin Dehghan
Yinfei Yang
ViT
212
7
0
17 Sep 2025
Towards Understanding Visual Grounding in Visual Language Models
Towards Understanding Visual Grounding in Visual Language Models
Georgios Pantazopoulos
Eda B. Özyiğit
ObjD
280
3
0
12 Sep 2025
MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning
MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning
Junha Song
Yongsik Jo
So Yeon Min
Quanting Xie
Taehwan Kim
Yonatan Bisk
Jaegul Choo
VLM
136
0
0
29 Aug 2025
From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations
From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations
Anthony Bisulco
Rahul Ramesh
Randall Balestriero
Pratik Chaudhari
110
0
0
21 Aug 2025
Seeing Further on the Shoulders of Giants: Knowledge Inheritance for Vision Foundation Models
Seeing Further on the Shoulders of Giants: Knowledge Inheritance for Vision Foundation Models
Jiabo Huang
Chen Chen
Lingjuan Lyu
VLM
187
1
0
20 Aug 2025
ExpVG: Investigating the Design Space of Visual Grounding in Multimodal Large Language Model
ExpVG: Investigating the Design Space of Visual Grounding in Multimodal Large Language Model
Weitai Kang
Weiming Zhuang
Zhizhong Li
Yan Yan
Lingjuan Lyu
102
1
0
11 Aug 2025
AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning
AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning
Siminfar Samakoush Galougah
Rishie Raj
Sanjoy Chowdhury
Sayan Nag
Ramani Duraiswami
169
3
0
10 Aug 2025
Phi-Ground Tech Report: Advancing Perception in GUI Grounding
Phi-Ground Tech Report: Advancing Perception in GUI Grounding
Miaosen Zhang
Ziqiang Xu
Jialiang Zhu
Qi Dai
Kai Qiu
...
Chong Luo
Tianyi Chen
Justin Wagle
Tim Franklin
Baining Guo
LRM
200
9
0
31 Jul 2025
Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning
Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning
Zedong Wang
Siyuan Li
Dan Xu
155
1
0
28 Jul 2025
CIMR: Contextualized Iterative Multimodal Reasoning for Robust Instruction Following in LVLMs
CIMR: Contextualized Iterative Multimodal Reasoning for Robust Instruction Following in LVLMs
Yangshu Yuan
Heng Chen
Xinyi Jiang
Christian Ng
Kexin Qiu
LRM
74
0
0
22 Jul 2025
Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey
Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey
Jindong Li
Yali Fu
Jiahong Liu
Linxiao Cao
Wei Ji
Menglin Yang
Irwin King
Ming-Hsuan Yang
OffRL
138
2
0
21 Jul 2025
FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation
FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation
Fan Yang
Yousong Zhu
Xin Li
Yufei Zhan
Hongyin Zhao
Shurong Zheng
Yaowei Wang
Ming Tang
Jinqiao Wang
MLLMVLM
218
0
0
20 Jun 2025
UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting
UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian SplattingComputer Vision and Pattern Recognition (CVPR), 2025
Ziyi Wang
Yanran Zhang
Jie Zhou
Jiwen Lu
3DPC3DGS
204
4
0
11 Jun 2025
Vision Generalist Model: A Survey
Vision Generalist Model: A SurveyInternational Journal of Computer Vision (IJCV), 2025
Ziyi Wang
Yongming Rao
Shuofeng Sun
Xinrun Liu
Yi Wei
...
Zuyan Liu
Yanbo Wang
Hongmin Liu
Jie Zhou
Jiwen Lu
277
0
0
11 Jun 2025
EgoM2P: Egocentric Multimodal Multitask Pretraining
EgoM2P: Egocentric Multimodal Multitask Pretraining
Gen Li
Yutong Chen
Yiqian Wu
Kaifeng Zhao
Marc Pollefeys
Siyu Tang
EgoVVLM
367
4
0
09 Jun 2025
MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks
MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks
Sanjoy Chowdhury
Mohamed Elmoghany
Yohan Abeysinghe
Mahmoud Ahmed
Sayan Nag
Salman Khan
Mohamed Elhoseiny
Dinesh Manocha
321
4
0
08 Jun 2025
RecGPT: A Foundation Model for Sequential Recommendation
RecGPT: A Foundation Model for Sequential Recommendation
Yangqin Jiang
Xubin Ren
Lianghao Xia
Da Luo
Kangyi Lin
Chao Huang
LRM
311
0
0
06 Jun 2025
Is Extending Modality The Right Path Towards Omni-Modality?
Is Extending Modality The Right Path Towards Omni-Modality?
Tinghui Zhu
Kai Zhang
Muhao Chen
Eric Fosler-Lussier
VLM
254
3
0
02 Jun 2025
Taming LLMs by Scaling Learning Rates with Gradient Grouping
Taming LLMs by Scaling Learning Rates with Gradient Grouping
Siyuan Li
Juanxi Tian
Zedong Wang
Xin Jin
Zicheng Liu
Wentao Zhang
Dan Xu
218
0
0
01 Jun 2025
BaryIR: Learning Multi-Source Unified Representation in Continuous Barycenter Space for Generalizable All-in-One Image Restoration
BaryIR: Learning Multi-Source Unified Representation in Continuous Barycenter Space for Generalizable All-in-One Image Restoration
Xiaole Tang
Xiaoyi He
X. Gu
Jian Sun
153
2
0
27 May 2025
LlamaSeg: Image Segmentation via Autoregressive Mask Generation
LlamaSeg: Image Segmentation via Autoregressive Mask Generation
Jiru Deng
Tengjin Weng
Tianyu Yang
Tong Lu
Zhiheng Li
Wenhao Jiang
VLM
318
0
0
26 May 2025
FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities
FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities
Jin Wang
Yao Lai
Aoxue Li
Shifeng Zhang
Jiacheng Sun
Ning Kang
Chengyue Wu
Zhenguo Li
Ping Luo
346
17
0
26 May 2025
Visual Instruction Tuning with Chain of Region-of-Interest
Visual Instruction Tuning with Chain of Region-of-Interest
Yixin Chen
Shuai Zhang
Boran Han
Bernie Wang
246
2
0
11 May 2025
The Moon's Many Faces: A Single Unified Transformer for Multimodal Lunar Reconstruction
The Moon's Many Faces: A Single Unified Transformer for Multimodal Lunar Reconstruction
Tom Sander
Moritz Tenthoff
Kay Wohlfarth
Christian Wöhler
334
0
0
08 May 2025
JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers
JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers
Kwon Byung-Ki
Jingdong Sun
Lee Hyoseok
Chong Luo
Tae-Hyun Oh
574
4
0
01 May 2025
Learning Streaming Video Representation via Multitask Training
Learning Streaming Video Representation via Multitask Training
Yibin Yan
Jilan Xu
Shangzhe Di
Yikun Liu
Yudi Shi
Qirui Chen
Zeqian Li
Yifei Huang
Weidi Xie
CLL
444
3
0
28 Apr 2025
Symbolic Representation for Any-to-Any Generative Tasks
Symbolic Representation for Any-to-Any Generative TasksComputer Vision and Pattern Recognition (CVPR), 2025
Jianfei Chen
Xiaoye Zhu
Yanjie Wang
Tianyang Liu
Xinhui Chen
...
Yifei Ke
Qingbin Liu
Yiwen Yuan
Julian McAuley
Li Li
DiffM
206
0
0
24 Apr 2025
SignX: The Foundation Model for Sign Recognition
SignX: The Foundation Model for Sign Recognition
Sen Fang
Chunyu Sui
Hongwei Yi
C. Neidle
Dimitris N. Metaxas
SLR
271
3
0
22 Apr 2025
Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception
Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual PerceptionInternational Conference on Learning Representations (ICLR), 2025
Ziqi Pang
Xin Xu
Yu-Xiong Wang
DiffM
453
1
0
15 Apr 2025
GeoUni: A Unified Model for Generating Geometry Diagrams, Problems and Problem Solutions
GeoUni: A Unified Model for Generating Geometry Diagrams, Problems and Problem Solutions
Jo-Ku Cheng
Zeren Zhang
Ran Chen
Jingyang Deng
Ziran Qin
Jinwen Ma
295
6
0
14 Apr 2025
Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization
Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization
Jialu Li
Shoubin Yu
Han Lin
Jaemin Cho
Jaehong Yoon
Joey Tianyi Zhou
DiffMVGen
302
5
0
11 Apr 2025
Towards Visual Text Grounding of Multimodal Large Language Model
Towards Visual Text Grounding of Multimodal Large Language Model
Ming Li
Ruiyi Zhang
Jian Chen
Jiuxiang Gu
Jiuxiang Gu
Franck Dernoncourt
Wanrong Zhu
Wanrong Zhu
Tianyi Zhou
Tong Sun
411
12
0
07 Apr 2025
Continual Cross-Modal Generalization
Continual Cross-Modal Generalization
Yan Xia
Hai Huang
Minghui Fang
Zhou Zhao
CLL
247
1
0
01 Apr 2025
Efficient Token Compression for Vision Transformer with Spatial Information Preserved
Efficient Token Compression for Vision Transformer with Spatial Information Preserved
Junzhu Mao
Yang Shen
Jinyang Guo
Yazhou Yao
Xiansheng Hua
ViT
303
2
0
30 Mar 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
404
6
0
29 Mar 2025
Unified Multimodal Discrete Diffusion
Unified Multimodal Discrete Diffusion
Alexander Swerdlow
Mihir Prabhudesai
Siddharth Gandhi
Deepak Pathak
Katerina Fragkiadaki
DiffM
288
23
0
26 Mar 2025
MMGen: Unified Multi-modal Image Generation and Understanding in One Go
MMGen: Unified Multi-modal Image Generation and Understanding in One Go
Jiepeng Wang
Zhaoqing Wang
H. Pan
Yuan Liu
Dongdong Yu
Changhu Wang
Wenping Wang
DiffM
286
6
0
26 Mar 2025
FullDiT: Multi-Task Video Generative Foundation Model with Full Attention
FullDiT: Multi-Task Video Generative Foundation Model with Full Attention
Xuan Ju
Weicai Ye
Quande Liu
Qiulin Wang
Xintao Wang
Pengfei Wan
Di Zhang
Kun Gai
Qiang Xu
VGen
291
21
0
25 Mar 2025
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit CooperationComputer Vision and Pattern Recognition (CVPR), 2025
Henghui Du
Guangyao Li
Chang Zhou
Chunjie Zhang
Alan Zhao
D. Hu
212
11
0
17 Mar 2025
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
Tao Wang
Changxu Cheng
Lingfeng Wang
Senda Chen
Wuyue Zhao
VLM
293
8
0
17 Mar 2025
UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
Tsu-Jui Fu
Yusu Qian
Chen Chen
Wenze Hu
Zhe Gan
Yue Yang
529
8
0
16 Mar 2025
12345678
Next