Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

International Conference on Learning Representations (ICLR), 2022

17 June 2022

Papers citing "Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks"

50 / 352 papers shown

Single-Model and Any-Modality for Video Object Tracking

Computer Vision and Pattern Recognition (CVPR), 2023

Zongwei Wu

Jilai Zheng

Xiangxuan Ren

Florin-Alexandru Vasluianu

Chao Ma

D. Paudel

Luc Van Gool

Radu Timofte

340

27 Nov 2023

Enhancing Visual Grounding and Generalization: A Multi-Task Cycle Training Approach for Vision-Language Models

429

21 Nov 2023

An Embodied Generalist Agent in 3D World

Jiangyong Huang

Silong Yong

Xiaojian Ma

Xiongkun Linghu

Baoxiong Jia

337

291

18 Nov 2023

DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models

220

15 Nov 2023

Vision-Language Instruction Tuning: A Review and Analysis

Ying Shan

319

14 Nov 2023

PerceptionGPT: Effectively Fusing Visual Perception into LLM

Computer Vision and Pattern Recognition (CVPR), 2023

Jiahui Gao

Tong Zhang

191

11 Nov 2023

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Computer Vision and Pattern Recognition (CVPR), 2023

Yuliang Liu

489

382

11 Nov 2023

Analyzing Modular Approaches for Visual Question Decomposition

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Apoorv Khandelwal

Ellie Pavlick

Chen Sun

258

10 Nov 2023

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Computer Vision and Pattern Recognition (CVPR), 2023

Lu Yuan

377

371

10 Nov 2023

DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets

167

08 Nov 2023

TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models

196

08 Nov 2023

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Computer Vision and Pattern Recognition (CVPR), 2023

Jiabo Ye

Ji Zhang

Fei Huang

Jingren Zhou

MLLM VLM

455

596

07 Nov 2023

RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches

International Conference on Learning Representations (ICLR), 2023

Montse Gonzalez Arenas

...

231

111

03 Nov 2023

LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Jianwei Yang

186

01 Nov 2023

From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities

Information Fusion (Inf. Fusion), 2023

Md Farhan Ishmam

Md Sakib Hossain Shovon

M. F. Mridha

Nilanjan Dey

397

01 Nov 2023

Object-centric Video Representation for Long-term Action Anticipation

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023

Shijie Wang

279

31 Oct 2023

Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

220

31 Oct 2023

Exploring Question Decomposition for Zero-Shot VQA

Neural Information Processing Systems (NeurIPS), 2023

206

25 Oct 2023

Apollo: Zero-shot MultiModal Reasoning with Multiple Experts

175

25 Oct 2023

CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

199

21 Oct 2023

Visual Grounding Helps Learn Word Meanings in Low-Data Regimes

Chengxu Zhuang

Evelina Fedorenko

Jacob Andreas

281

20 Oct 2023

Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning

348

14 Oct 2023

PolyTask: Learning Unified Policies through Behavior Distillation

Siddhant Haldar

Lerrel Pinto

252

12 Oct 2023

Ferret: Refer and Ground Anything Anywhere at Any Granularity

International Conference on Learning Representations (ICLR), 2023

Xianzhi Du

415

451

11 Oct 2023

Lightweight In-Context Tuning for Multimodal Unified Models

144

08 Oct 2023

VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models

Neural Information Processing Systems (NeurIPS), 2023

366

07 Oct 2023

Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs

Computer Vision and Pattern Recognition (CVPR), 2023

Ming Yang

257

01 Oct 2023

InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

International Conference on Learning Representations (ICLR), 2023

314

30 Sep 2023

SCOB: Universal Text Understanding via Character-wise Supervised Contrastive Learning with Online Text Rendering for Bridging Domain Gap

IEEE International Conference on Computer Vision (ICCV), 2023

279

21 Sep 2023

DreamLLM: Synergistic Multimodal Comprehension and Creation

International Conference on Learning Representations (ICLR), 2023

Runpei Dong

Chunrui Han

Yuang Peng

...

Xiangyu Zhang

294

272

20 Sep 2023

RMT: Retentive Networks Meet Vision Transformers

Computer Vision and Pattern Recognition (CVPR), 2023

589

171

20 Sep 2023

Frequency-Aware Masked Autoencoders for Multimodal Pretraining on Biosignals

303

12 Sep 2023

InstructDiffusion: A Generalist Modeling Interface for Vision Tasks

Computer Vision and Pattern Recognition (CVPR), 2023

...

Jianmin Bao

297

158

07 Sep 2023

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai

Shuai Bai

Shusheng Yang

Shijie Wang

Sinan Tan

Peng Wang

Junyang Lin

Chang Zhou

Jingren Zhou

MLLM VLM ObjD

513

1,565

24 Aug 2023

DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability

IEEE International Conference on Computer Vision (ICCV), 2023

Runhu Huang

Jianhua Han

Guansong Lu

Xiaodan Liang

Yihan Zeng

Wei Zhang

Hang Xu

DiffM

161

18 Aug 2023

FocusFlow: Boosting Key-Points Optical Flow Estimation for Autonomous Driving

IEEE Transactions on Intelligent Vehicles (TIV), 2023

Zhonghua Yi

Kailun Yang

Kaiwei Wang

206

14 Aug 2023

Learning to Model the World with Language

International Conference on Machine Learning (ICML), 2023

Pieter Abbeel

279

31 Jul 2023

UnIVAL: Unified Model for Image, Video, Audio and Language Tasks

303

30 Jul 2023

Towards Generalist Biomedical AI

...

Yossi Matias

K. Singhal

Peter R. Florence

Alan Karthikesalingam

Vivek Natarajan

LM&MA MedIm AI4MH

270

407

26 Jul 2023

Described Object Detection: Liberating Object Detection with Flexible Expressions

Neural Information Processing Systems (NeurIPS), 2023

242

24 Jul 2023

UniFormaly: Towards Task-Agnostic Unified Framework for Visual Anomaly Detection

Pattern Recognition (Pattern Recogn.), 2023

223

24 Jul 2023

Multimodal LLMs for health grounded in individual-specific data

233

18 Jul 2023

PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese

International Conference on Multimedia Analysis and Pattern Recognition (ICMAPR), 2023

Nghia Hieu Nguyen

Kiet Van Nguyen

200

17 Jul 2023

Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks

Emanuele Dalsasso

151

13 Jul 2023

Objaverse-XL: A Universe of 10M+ 3D Objects

Neural Information Processing Systems (NeurIPS), 2023

...

Carl Vondrick

286

637

11 Jul 2023

Emu: Generative Pretraining in Multimodality

International Conference on Learning Representations (ICLR), 2023

Hongcheng Gao

358

155

11 Jul 2023

Building Cooperative Embodied Agents Modularly with Large Language Models

International Conference on Learning Representations (ICLR), 2023

Chuang Gan

546

259

05 Jul 2023

Visual Instruction Tuning with Polite Flamingo

AAAI Conference on Artificial Intelligence (AAAI), 2023

388

03 Jul 2023

JourneyDB: A Benchmark for Generative Image Understanding

Neural Information Processing Systems (NeurIPS), 2023

Keqiang Sun

...

Yi Wang

Jifeng Dai

Yu Qiao

Limin Wang

Jiaming Song

340

166

03 Jul 2023

An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training

Mingyu Ding

Wei Zhan

Chuang Gan

135

29 Jun 2023