Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2202.03052
Cited By
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
7 February 2022
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
Re-assign community
ArXiv
PDF
HTML
Papers citing
"OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework"
50 / 114 papers shown
Title
Learning Streaming Video Representation via Multitask Training
Yibin Yan
Jilan Xu
Shangzhe Di
Yikun Liu
Yudi Shi
Qirui Chen
Zeqian Li
Yifei Huang
Weidi Xie
CLL
76
0
0
28 Apr 2025
TextTIGER: Text-based Intelligent Generation with Entity Prompt Refinement for Text-to-Image Generation
Shintaro Ozaki
Kazuki Hayashi
Yusuke Sakai
Jingun Kwon
Hidetaka Kamigaito
Katsuhiko Hayashi
Manabu Okumura
Taro Watanabe
VLM
81
0
0
25 Apr 2025
Perception-R1: Pioneering Perception Policy with Reinforcement Learning
En Yu
Kangheng Lin
Liang Zhao
Jisheng Yin
Yana Wei
...
Zheng Ge
Xiangyu Zhang
Daxin Jiang
Jingyu Wang
Wenbing Tao
VLM
OffRL
LRM
35
0
0
10 Apr 2025
Efficient Adaptation For Remote Sensing Visual Grounding
Hasan Moughnieh
Mohamad Chalhoub
Hasan Nasrallah
Cristiano Nattero
Paolo Campanella
Giovanni Nico
A. Ghandour
44
0
0
29 Mar 2025
SPNeRF: Open Vocabulary 3D Neural Scene Segmentation with Superpoints
Weiwen Hu
Niccolò Parodi
Marcus Zepp
I. Feldmann
O. Schreer
Peter Eisert
VLM
69
0
0
19 Mar 2025
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Zhangquan Chen
Xufang Luo
Dongsheng Li
OffRL
LRM
62
3
0
10 Mar 2025
Generative Artificial Intelligence in Robotic Manipulation: A Survey
Kun Zhang
Peng Yun
Jun Cen
Junhao Cai
DiDi Zhu
...
Qifeng Chen
Jia Pan
Wei K. Zhang
Bo Yang
Hua Chen
59
1
0
05 Mar 2025
Can Hallucination Correction Improve Video-Language Alignment?
Lingjun Zhao
Mingyang Xie
Paola Cascante-Bonilla
Hal Daumé III
Kwonjoon Lee
HILM
VLM
57
0
0
20 Feb 2025
Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition
Jielong Tang
Zhenxing Wang
Ziyang Gong
Jianxing Yu
Shuang Wang
Jian Yin
38
0
0
28 Jan 2025
MASS: Overcoming Language Bias in Image-Text Matching
Jiwan Chung
Seungwon Lim
Sangkyu Lee
Youngjae Yu
VLM
30
0
0
20 Jan 2025
Instruction-Guided Scene Text Recognition
Yongkun Du
Z. Chen
Yuchen Su
Caiyan Jia
Yu-Gang Jiang
71
3
0
03 Jan 2025
Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization
Yue Zhang
Liqiang Jing
Vibhav Gogate
116
2
0
19 Dec 2024
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
Shantanu Jaiswal
Debaditya Roy
Basura Fernando
Cheston Tan
ReLM
LRM
66
2
0
20 Nov 2024
Spider: Any-to-Many Multimodal LLM
Jinxiang Lai
Jie Zhang
Jun Liu
Jian Li
Xiaocheng Lu
Song Guo
MLLM
54
2
0
14 Nov 2024
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
Haiyang Wang
Yue Fan
Muhammad Ferjad Naeem
Yongqin Xian
J. E. Lenssen
Liwei Wang
F. Tombari
Bernt Schiele
41
2
0
30 Oct 2024
Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies
L. Wang
Sheng Chen
Linnan Jiang
Shu Pan
Runze Cai
Sen Yang
Fei Yang
44
3
0
24 Oct 2024
Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model
Li Yuan
Yi Cai
Junsheng Huang
VLM
26
2
0
18 Oct 2024
Locality Alignment Improves Vision-Language Models
Ian Covert
Tony Sun
James Y. Zou
Tatsunori Hashimoto
VLM
64
3
0
14 Oct 2024
Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization
Kento Kawaharazuka
Yoshiki Obinata
Naoaki Kanazawa
Kei Okada
Masayuki Inaba
LM&Ro
16
0
0
26 Sep 2024
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
Weifeng Lin
Xinyu Wei
Renrui Zhang
Le Zhuo
Shitian Zhao
...
Junlin Xie
Junlin Xie
Yu Qiao
Peng Gao
Hongsheng Li
MLLM
DiffM
50
10
0
23 Sep 2024
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur
Darshan Singh
Makarand Tapaswi
62
1
0
04 Sep 2024
IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities
Bin Wang
Chunyu Xie
Dawei Leng
Yuhui Yin
MLLM
45
1
0
23 Aug 2024
CROME: Cross-Modal Adapters for Efficient Multimodal LLM
Sayna Ebrahimi
Sercan Ö. Arik
Tejas Nama
Tomas Pfister
37
1
0
13 Aug 2024
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Dongyang Liu
Shitian Zhao
Le Zhuo
Weifeng Lin
Yu Qiao
Xinyue Li
Qi Qin
Yu Qiao
Hongsheng Li
Peng Gao
MLLM
62
48
0
05 Aug 2024
Answerability Fields: Answerable Location Estimation via Diffusion Models
Daich Azuma
Taiki Miyanishi
Shuhei Kurita
Koya Sakamoto
M. Kawanabe
DiffM
31
0
0
26 Jul 2024
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning
Zhecan Wang
Garrett Bingham
Adams Wei Yu
Quoc V. Le
Thang Luong
Golnaz Ghiasi
MLLM
LRM
35
9
0
22 Jul 2024
ViLLa: Video Reasoning Segmentation with Large Language Model
Rongkun Zheng
Lu Qi
Xi Chen
Yi Wang
Kun Wang
Yu Qiao
Hengshuang Zhao
VOS
LRM
52
2
0
18 Jul 2024
Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights
Shunqi Mao
Chaoyi Zhang
Hang Su
Hwanjun Song
Igor Shalyminov
Weidong Cai
28
1
0
16 Jul 2024
Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge
Xiangyu Wu
Zhouyang Chi
Yang Yang
Jianfeng Lu
19
0
0
05 Jul 2024
MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification
Laura Fieback
Jakob Spiegelberg
Hanno Gottschalk
MLLM
57
5
0
29 May 2024
Imagery as Inquiry: Exploring A Multimodal Dataset for Conversational Recommendation
Se-eun Yoon
Hyunsik Jeon
Julian McAuley
38
0
0
23 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
64
41
0
23 May 2024
Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions
Junzhang Liu
Zhecan Wang
Hammad A. Ayyubi
Haoxuan You
Chris Thomas
Rui Sun
Shih-Fu Chang
Kai-Wei Chang
29
0
0
18 May 2024
Modeling Caption Diversity in Contrastive Vision-Language Pretraining
Samuel Lavoie
Polina Kirichenko
Mark Ibrahim
Mahmoud Assran
Andrew Gordon Wilson
Aaron Courville
Nicolas Ballas
CLIP
VLM
59
19
0
30 Apr 2024
Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering
Dongze Hao
Qunbo Wang
Longteng Guo
Jie Jiang
Jing Liu
31
0
0
22 Apr 2024
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
Shenghai Yuan
Jinfa Huang
Yujun Shi
Yongqi Xu
Ruijie Zhu
Bin Lin
Xinhua Cheng
Li-xin Yuan
Jiebo Luo
VGen
73
33
0
07 Apr 2024
FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues
Shuang Li
Jiahua Wang
Lijie Wen
LRM
21
0
0
29 Mar 2024
Explore In-Context Segmentation via Latent Diffusion Models
Chaoyang Wang
Xiangtai Li
Henghui Ding
Lu Qi
Jiangning Zhang
Yunhai Tong
Chen Change Loy
Shuicheng Yan
DiffM
63
6
0
14 Mar 2024
Enhancing Vision-Language Pre-training with Rich Supervisions
Yuan Gao
Kunyu Shi
Pengkai Zhu
Edouard Belval
Oren Nuriel
Srikar Appalaraju
Shabnam Ghadar
Vijay Mahadevan
Zhuowen Tu
Stefano Soatto
VLM
CLIP
62
12
0
05 Mar 2024
Non-autoregressive Sequence-to-Sequence Vision-Language Models
Kunyu Shi
Qi Dong
Luis Goncalves
Zhuowen Tu
Stefano Soatto
VLM
35
3
0
04 Mar 2024
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
Yuiga Wada
Kanta Kaneda
Daichi Saito
Komei Sugiura
29
24
0
28 Feb 2024
Multimodal Transformer With a Low-Computational-Cost Guarantee
Sungjin Park
Edward Choi
28
1
0
23 Feb 2024
CLIP-Loc: Multi-modal Landmark Association for Global Localization in Object-based Maps
Shigemichi Matsuzaki
Takuma Sugino
Kazuhito Tanaka
Zijun Sha
Shintaro Nakaoka
Shintaro Yoshizawa
Kazuhiro Shintani
VLM
11
5
0
08 Feb 2024
Convincing Rationales for Visual Question Answering Reasoning
Kun Li
G. Vosselman
Michael Ying Yang
34
1
0
06 Feb 2024
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLM
MLLM
38
29
0
19 Dec 2023
Uni3DL: Unified Model for 3D and Language Understanding
Xiang Li
Jian Ding
Zhaoyang Chen
Mohamed Elhoseiny
26
3
0
05 Dec 2023
Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction
Zilin Du
Haoxin Li
Xu Guo
Boyang Li
25
1
0
05 Dec 2023
InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation
Rongyao Fang
Shilin Yan
Zhaoyang Huang
Jingqiu Zhou
Hao Tian
Jifeng Dai
Hongsheng Li
MLLM
22
8
0
30 Nov 2023
IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers
Chenglin Yang
Siyuan Qiao
Yuan Cao
Yu Zhang
Tao Zhu
Alan L. Yuille
Jiahui Yu
VLM
11
3
0
27 Nov 2023
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Bin Xiao
Haiping Wu
Weijian Xu
Xiyang Dai
Houdong Hu
Yumao Lu
Michael Zeng
Ce Liu
Lu Yuan
VLM
31
140
0
10 Nov 2023
1
2
3
Next