ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2206.08916
  4. Cited By
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
v1v2 (latest)

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

International Conference on Learning Representations (ICLR), 2022
17 June 2022
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
    ObjDVLMMLLM
ArXiv (abs)PDFHTMLHuggingFace (1 upvotes)

Papers citing "Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks"

50 / 352 papers shown
Title
Direction-Aware Diagonal Autoregressive Image Generation
Direction-Aware Diagonal Autoregressive Image Generation
Yijia Xu
Jianzhong Ju
Jian Luan
J. Cui
334
4
0
14 Mar 2025
HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding
HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding
Rui Yang
Lin Song
Yicheng Xiao
Runhui Huang
Yixiao Ge
Mingyu Ding
Hengshuang Zhao
MLLM
196
6
0
12 Mar 2025
Foundation X: Integrating Classification, Localization, and Segmentation through Lock-Release Pretraining Strategy for Chest X-ray AnalysisIEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2025
N. Islam
Dongao Ma
Jiaxuan Pang
Shivasakthi Senthil Velan
Michael B. Gotway
Jianming Liang
203
0
0
12 Mar 2025
Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment
Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment
Xing Xie
Jiawei Liu
Ziyue Lin
Huijie Fan
Zhi Han
Yandong Tang
Liangqiong Qu
360
0
0
10 Mar 2025
USP: Unified Self-Supervised Pretraining for Image Generation and Understanding
USP: Unified Self-Supervised Pretraining for Image Generation and Understanding
Xiangxiang Chu
Renda Li
Yong Wang
466
14
0
08 Mar 2025
Underlying Semantic Diffusion for Effective and Efficient In-Context Learning
Zhong Ji
Weilong Cao
Yan Zhang
Yanwei Pang
Jungong Han
Xuelong Li
DiffMVLM
238
0
0
06 Mar 2025
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
Hao Tang
Chenwei Xie
Haiyang Wang
Xiaoyi Bao
Tingyu Weng
Nianzu Yang
Yun Zheng
Liwei Wang
ObjDVLM
325
12
0
03 Mar 2025
FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction
FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction
Siyu Jiao
Gengwei Zhang
Yinlong Qian
Jiancheng Huang
Yao Zhao
Humphrey Shi
Lin Ma
Y. X. Wei
Zequn Jie
VLM
206
15
0
27 Feb 2025
Vision-Language Models for Edge Networks: A Comprehensive Survey
Vision-Language Models for Edge Networks: A Comprehensive SurveyIEEE Internet of Things Journal (IEEE IoT J.), 2025
Ahmed Sharshar
Latif U. Khan
Waseem Ullah
Mohsen Guizani
VLM
313
3
0
11 Feb 2025
Adaptive Perception for Unified Visual Multi-modal Object TrackingIEEE Transactions on Artificial Intelligence (IEEE TAI), 2025
Xiantao Hu
Bineng Zhong
Qihua Liang
Zhiyi Mo
Liangtao Shi
Ying Tai
Jian Yang
225
8
0
10 Feb 2025
Foundation Models for Anomaly Detection: Vision and Challenges
Foundation Models for Anomaly Detection: Vision and Challenges
Jing Ren
Tao Tang
Hong Jia
Haytham Fayek
Haytham Fayek
Xiaodong Li
Suyu Ma
Xiwei Xu
Feng Xia
409
2
0
10 Feb 2025
PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction
PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location PredictionNorth American Chapter of the Association for Computational Linguistics (NAACL), 2025
Hammad A. Ayyubi
Xuande Feng
Junzhang Liu
Xudong Lin
Zhecan Wang
Shih-Fu Chang
149
1
0
24 Jan 2025
MASS: Overcoming Language Bias in Image-Text Matching
MASS: Overcoming Language Bias in Image-Text MatchingAAAI Conference on Artificial Intelligence (AAAI), 2025
Jiwan Chung
Seungwon Lim
Sangkyu Lee
Youngjae Yu
VLM
189
0
0
20 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language TasksNeural Information Processing Systems (NeurIPS), 2024
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLMVLMLRM
707
116
0
03 Jan 2025
Towards Visual Grounding: A Survey
Towards Visual Grounding: A SurveyIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
803
26
0
28 Dec 2024
Advanced Knowledge Transfer: Refined Feature Distillation for Zero-Shot Quantization in Edge Computing
Advanced Knowledge Transfer: Refined Feature Distillation for Zero-Shot Quantization in Edge ComputingACM Symposium on Applied Computing (SAC), 2024
Inpyo Hong
Youngwan Jo
Hyojeong Lee
Sunghyun Ahn
Sanghyun Park
MQ
313
6
0
26 Dec 2024
Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering
Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering
Zhongjian Hu
Peng Yang
Bing Li
Zhenqi Wang
213
2
0
24 Dec 2024
Next Patch Prediction for Autoregressive Visual Generation
Next Patch Prediction for Autoregressive Visual Generation
Yatian Pang
Peng Jin
Shuo Yang
Bin Lin
Bin Zhu
...
Liuhan Chen
Francis E. H. Tay
Ser-Nam Lim
Harry Yang
Li Yuan
511
19
0
19 Dec 2024
MAL: Cluster-Masked and Multi-Task Pretraining for Enhanced xLSTM Vision
  Performance
MAL: Cluster-Masked and Multi-Task Pretraining for Enhanced xLSTM Vision Performance
Wenjun Huang
Jianguo Hu
191
0
0
14 Dec 2024
[MASK] is All You Need
[MASK] is All You Need
Vincent Tao Hu
Bjorn Ommer
DiffM
447
8
0
09 Dec 2024
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand
  Audio-Visual Information?
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Kaixiong Gong
Kaituo Feng
Yangqiu Song
Yibing Wang
Mofan Cheng
...
Jiaming Han
Benyou Wang
Yutong Bai
Zhiyong Yang
Xiangyu Yue
MLLMAuLLMVLM
235
25
0
03 Dec 2024
BlendServe: Optimizing Offline Inference for Auto-regressive Large
  Models with Resource-aware Batching
BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching
Yilong Zhao
Shuo Yang
Kan Zhu
Lianmin Zheng
Baris Kasikci
Yang Zhou
Jiarong Xing
Eric Liang
337
15
0
25 Nov 2024
One Diffusion to Generate Them All
One Diffusion to Generate Them AllComputer Vision and Pattern Recognition (CVPR), 2024
Duong H. Le
Tuan Pham
Sangho Lee
Christopher Clark
Aniruddha Kembhavi
Stephan Mandt
Ranjay Krishna
Jiasen Lu
VLM
408
32
0
25 Nov 2024
Visual-Word Tokenizer: Beyond Fixed Sets of Tokens in Vision Transformers
Visual-Word Tokenizer: Beyond Fixed Sets of Tokens in Vision Transformers
Leonidas Gee
Wing Yan Li
V. Sharmanska
Novi Quadrianto
ViT
585
0
0
23 Nov 2024
Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning
Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy ConditioningComputer Vision and Pattern Recognition (CVPR), 2024
Jiange Yang
Haoyi Zhu
Yanjie Wang
Gangshan Wu
Tong He
Limin Wang
360
11
0
21 Nov 2024
LaVin-DiT: Large Vision Diffusion TransformerComputer Vision and Pattern Recognition (CVPR), 2024
Zhaoqing Wang
Xiaobo Xia
Runnan Chen
Dongdong Yu
Changhu Wang
Mingming Gong
Tongliang Liu
485
19
0
18 Nov 2024
Autoregressive Models in Vision: A Survey
Autoregressive Models in Vision: A Survey
Jing Xiong
Gongye Liu
Lun Huang
Chengyue Wu
Taiqiang Wu
...
Hao Fei
Guillermo Sapiro
Jiebo Luo
Ping Luo
Ngai Wong
VGen
422
37
0
08 Nov 2024
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and ModalitiesInternational Conference on Learning Representations (ICLR), 2024
Zhaofeng Wu
Xinyan Velocity Yu
Dani Yogatama
Jiasen Lu
Yoon Kim
AIFin
421
35
0
07 Nov 2024
VQ-Map: Bird's-Eye-View Map Layout Estimation in Tokenized Discrete
  Space via Vector Quantization
VQ-Map: Bird's-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector QuantizationNeural Information Processing Systems (NeurIPS), 2024
Yiwei Zhang
Jin Gao
Fudong Ge
Guan Luo
Bing Li
Zheng Zhang
Haibin Ling
Weiming Hu
139
1
0
03 Nov 2024
EMMA: End-to-End Multimodal Model for Autonomous Driving
EMMA: End-to-End Multimodal Model for Autonomous Driving
Jyh-Jing Hwang
Runsheng Xu
Hubert Lin
Wei-Chih Hung
Jingwei Ji
...
Benjamin Sapp
Yin Zhou
James Guo
Dragomir Anguelov
Mingxing Tan
VLMLM&Ro
360
108
0
30 Oct 2024
Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation
Multimodality Helps Few-shot 3D Point Cloud Semantic SegmentationInternational Conference on Learning Representations (ICLR), 2024
Zhaochong An
Guolei Sun
Yun Liu
Runjia Li
Min Wu
Ming-Ming Cheng
Ender Konukoglu
Serge Belongie
365
15
0
29 Oct 2024
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior
LARP: Tokenizing Videos with a Learned Autoregressive Generative PriorInternational Conference on Learning Representations (ICLR), 2024
Hanyu Wang
Saksham Suri
Yixuan Ren
Hao Chen
Abhinav Shrivastava
VGen
279
26
0
28 Oct 2024
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language
  Tuning
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language TuningInternational Journal of Computer Vision (IJCV), 2024
Zhiwei Hao
Jianyuan Guo
Li Shen
Yong Luo
Han Hu
Yonggang Wen
VLM
255
2
0
23 Oct 2024
Locality Alignment Improves Vision-Language Models
Locality Alignment Improves Vision-Language ModelsInternational Conference on Learning Representations (ICLR), 2024
Ian Covert
Tony Sun
James Zou
Tatsunori Hashimoto
VLM
517
11
0
14 Oct 2024
Skipping Computations in Multimodal LLMs
Skipping Computations in Multimodal LLMs
Mustafa Shukor
Matthieu Cord
195
6
0
12 Oct 2024
CAR: Controllable Autoregressive Modeling for Visual Generation
CAR: Controllable Autoregressive Modeling for Visual Generation
Ziyu Yao
Jialin Li
Yifeng Zhou
Yong Liu
Xi Jiang
Chengjie Wang
Feng Zheng
Yuexian Zou
Lei Li
DiffM
280
27
0
07 Oct 2024
Visual-O1: Understanding Ambiguous Instructions via Multi-modal
  Multi-turn Chain-of-thoughts Reasoning
Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts ReasoningInternational Conference on Learning Representations (ICLR), 2024
Minheng Ni
Yutao Fan
Lei Zhang
Wangmeng Zuo
LRMAI4CE
203
19
0
04 Oct 2024
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive
  Transformer for Efficient Finegrained Image Generation
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image GenerationInternational Conference on Learning Representations (ICLR), 2024
Liang Chen
Sinan Tan
Zefan Cai
Weichu Xie
Haozhe Zhao
Yichi Zhang
Junyang Lin
Jinze Bai
Tianyu Liu
Baobao Chang
ViT
216
7
0
02 Oct 2024
Universal Medical Image Representation Learning with Compositional
  Decoders
Universal Medical Image Representation Learning with Compositional Decoders
Kaini Wang
Ling Yang
Siping Zhou
Guangquan Zhou
Wentao Zhang
Bin Cui
Shuo Li
SSLMedIm
246
1
0
30 Sep 2024
Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks
Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks
Min Yang
Zichen Zhang
Limin Wang
AI4TS
167
0
0
27 Sep 2024
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task
  Learning Via Connector-MoE
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoENeural Information Processing Systems (NeurIPS), 2024
Xun Zhu
Ying Hu
Fanbin Mo
Chenyi Guo
Ji Wu
245
15
0
26 Sep 2024
ChatCam: Empowering Camera Control through Conversational AI
ChatCam: Empowering Camera Control through Conversational AINeural Information Processing Systems (NeurIPS), 2024
Xinhang Liu
Yu-Wing Tai
Chi-Keung Tang
VGen
216
10
0
25 Sep 2024
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language InstructionsInternational Conference on Learning Representations (ICLR), 2024
Weifeng Lin
Xinyu Wei
Renrui Zhang
Le Zhuo
Shitian Zhao
...
Junlin Xie
Junlin Xie
Yu Qiao
Peng Gao
Hongsheng Li
MLLMDiffM
477
24
0
23 Sep 2024
@Bench: Benchmarking Vision-Language Models for Human-centered Assistive
  Technology
@Bench: Benchmarking Vision-Language Models for Human-centered Assistive TechnologyIEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Xin Jiang
Junwei Zheng
Ruiping Liu
Jiahang Li
Jiaming Zhang
Sven Matthiesen
Rainer Stiefelhagen
VLM
151
2
0
21 Sep 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal
  Reasoning with Large Language Models
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
381
5
0
19 Sep 2024
TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic ManipulationIEEE Robotics and Automation Letters (RA-L), 2024
Junjie Wen
Yinlin Zhu
Jinming Li
Minjie Zhu
Kun Wu
...
Ran Cheng
Yaxin Peng
Chaomin Shen
Feifei Feng
Jian Tang
LM&Ro
621
195
0
19 Sep 2024
DETECLAP: Enhancing Audio-Visual Representation Learning with Object
  Information
DETECLAP: Enhancing Audio-Visual Representation Learning with Object InformationIEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Shota Nakada
Taichi Nishimura
Hokuto Munakata
Masayoshi Kondo
Tatsuya Komatsu
CLIPVLM
139
2
0
18 Sep 2024
What to align in multimodal contrastive learning?
What to align in multimodal contrastive learning?International Conference on Learning Representations (ICLR), 2024
Benoit Dufumier
J. Castillo-Navarro
D. Tuia
Jean-Philippe Thiran
293
24
0
11 Sep 2024
Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling
Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront SchedulingInternational Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
Yujie Wang
Shenhan Zhu
Fangcheng Fu
Xupeng Miao
Jie Zhang
Juan Zhu
Fan Hong
Yongbin Li
Bin Cui
89
0
0
05 Sep 2024
AWRaCLe: All-Weather Image Restoration using Visual In-Context Learning
AWRaCLe: All-Weather Image Restoration using Visual In-Context LearningAAAI Conference on Artificial Intelligence (AAAI), 2024
Sudarshan Rajagopalan
Vishal M. Patel
206
10
0
30 Aug 2024
Previous
12345678
Next