
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

International Conference on Learning Representations (ICLR), 2022

17 June 2022


Papers citing "Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks"

50 / 352 papers shown

Unified Language Representation for Question Answering over Text, Tables, and Images

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Fei Huang

257

29 Jun 2023

Semi-supervised Multimodal Representation Learning through a Global Workspace

IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2023

Benjamin Devillers

Léopold Maytié

R. V. Rullen

SSL

186

27 Jun 2023

COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

International Conference on Learning Representations (ICLR), 2023

200

15 Jun 2023

Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models

Xiaotao Gu

254

14 Jun 2023

AVIS: Autonomous Visual Information Seeking with Large Language Model Agent

Neural Information Processing Systems (NeurIPS), 2023

298

13 Jun 2023

Global and Local Semantic Completion Learning for Vision-Language Pre-training

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Wenzhe Zhao

Hongfa Wang

Yujiu Yang

Wei Liu

VLM

252

12 Jun 2023

Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

Neural Information Processing Systems (NeurIPS), 2023

360

201

07 Jun 2023

Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!

Computer Vision and Pattern Recognition (CVPR), 2023

248

06 Jun 2023

Unifying (Machine) Vision via Counterfactual World Modeling

190

02 Jun 2023

Bytes Are All You Need: Transformers Operating Directly On File Bytes

204

31 May 2023

There is more to graphs than meets the eye: Learning universal features with self-supervision

202

31 May 2023

Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

...

193

30 May 2023

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

Neural Information Processing Systems (NeurIPS), 2023

Sijie Zhao

Ying Shan

249

285

30 May 2023

PaLI-X: On Scaling up a Multilingual Vision and Language Model

...

Mojtaba Seyedhosseini

334

252

29 May 2023

Deeply Coupled Cross-Modal Prompt Learning

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Wei Tang

209

29 May 2023

Generating Images with Multimodal Language Models

Neural Information Processing Systems (NeurIPS), 2023

Jing Yu Koh

Daniel Fried

Ruslan Salakhutdinov

MLLM

359

326

26 May 2023

BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal Tasks

Nature Network Boston (NNB), 2023

Kai Zhang

...

Lichao Sun

314

26 May 2023

LANISTR: Multimodal Learning from Structured and Unstructured Data

Sayna Ebrahimi

Sercan O. Arik

Yihe Dong

Tomas Pfister

236

26 May 2023

Exploring Diverse In-Context Configurations for Image Captioning

Neural Information Processing Systems (NeurIPS), 2023

Mingzhuo Yang

299

24 May 2023

Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

295

23 May 2023

i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data

...

Lu Yuan

154

21 May 2023

Multimodal Web Navigation with Instruction-Finetuned Foundation Models

International Conference on Learning Representations (ICLR), 2023

Hiroki Furuta

413

142

19 May 2023

TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding

Lei Chen

171

19 May 2023

LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation

International Conference on Learning Representations (ICLR), 2023

579

19 May 2023

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Neural Information Processing Systems (NeurIPS), 2023

...

Yu Qiao

302

617

18 May 2023

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

Peng Wang

Shijie Wang

Junyang Lin

Shuai Bai

Xiaohuan Zhou

Jingren Zhou

Xinggang Wang

Chang Zhou

VLM MLLM ObjD

576

153

18 May 2023

Segment Any Anomaly without Training via Hybrid Prompt Regularization

IEEE Transactions on Cybernetics (IEEE Trans. Cybern.), 2023

270

18 May 2023

Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts

Yantao Shen

243

11 May 2023

Self-Chained Image-Language Model for Video Localization and Question Answering

Neural Information Processing Systems (NeurIPS), 2023

395

199

11 May 2023

OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese

Information Fusion (Inf. Fusion), 2023

194

07 May 2023

An Empirical Study of Multimodal Model Merging

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

329

28 Apr 2023

π-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation

International Conference on Machine Learning (ICML), 2023

Zeyu Lu

Ying Shan

Ping Luo

MoMe

213

27 Apr 2023

Transformer-Based Visual Segmentation: A Survey

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Xiangtai Li

370

244

19 Apr 2023

Pretrained Language Models as Visual Planners for Human Assistance

IEEE International Conference on Computer Vision (ICCV), 2023

Ruta Desai

326

17 Apr 2023

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

383

150

17 Apr 2023

Segment Everything Everywhere All at Once

Neural Information Processing Systems (NeurIPS), 2023

Jianwei Yang

410

674

13 Apr 2023

Exploring Effective Factors for Improving Visual In-Context Learning

IEEE Transactions on Image Processing (IEEE TIP), 2023

247

10 Apr 2023

Towards Unified Scene Text Spotting based on Sequence Generation

Computer Vision and Pattern Recognition (CVPR), 2023

160

07 Apr 2023

SegGPT: Segmenting Everything In Context

Chunhua Shen

203

244

06 Apr 2023

Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

262

04 Apr 2023

Towards Flexible Multi-modal Document Models

Computer Vision and Pattern Recognition (CVPR), 2023

229

31 Mar 2023

Self-Supervised Multimodal Learning: A Survey

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Yongshuo Zong

Oisin Mac Aodha

Timothy M. Hospedales

SSL

319

31 Mar 2023

A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision

...

243

30 Mar 2023

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

...

379

29 Mar 2023

Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models

247

28 Mar 2023

WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation

Computer Vision and Pattern Recognition (CVPR), 2023

399

347

26 Mar 2023

Train/Test-Time Adaptation with Retrieval

Computer Vision and Pattern Recognition (CVPR), 2023

Matthew Trager

195

25 Mar 2023

CoBIT: A Contrastive Bi-directional Image-Text Generation Model

International Conference on Learning Representations (ICLR), 2023

210

23 Mar 2023

Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning

International Conference on Learning Representations (ICLR), 2023

Zaid Khan

Yun Fu

VLM

167

21 Mar 2023

Human Pose as Compositional Tokens

Computer Vision and Pattern Recognition (CVPR), 2023

194

21 Mar 2023