ResearchTrend.AI
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
arXiv: 2104.03135

7 April 2021
Zhicheng Huang
Zhaoyang Zeng
Yupan Huang
Bei Liu
Dongmei Fu
Jianlong Fu
    VLM
    ViT

Papers citing "Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning"

50 / 175 papers shown
Title
NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval
Zengrong Lin
Zheng Wang
Tianwen Qian
Pan Mu
Sixian Chan
Cong Bai
42
0
0
13 Mar 2025
Vision-Language Model IP Protection via Prompt-based Learning
Lianyu Wang
M. Wang
Huazhu Fu
Daoqiang Zhang
VLM
Presented at ResearchTrend Connect | VLM on 28 Mar 2025
127
0
0
04 Mar 2025
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints
Ming Dai
Jian Li
Jiedong Zhuang
Xian Zhang
Wankou Yang
ObjD
42
1
0
12 Jan 2025
CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
Zhiyuan Ma
Jianjun Li
Guohui Li
Kaiyan Huang
VLM
52
9
0
16 Oct 2024
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
Ming Dai
Lingfeng Yang
Yihao Xu
Zhenhua Feng
Wankou Yang
ObjD
27
9
0
26 Sep 2024
VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery
Mohammadmahdi Honarmand
Muhammad Abdullah Jamal
Omid Mohareri
58
1
0
07 Sep 2024
Pixels to Prose: Understanding the art of Image Captioning
Hrishikesh Singh
Aarti Sharma
Millie Pant
3DV
VLM
25
0
0
28 Aug 2024
From Attributes to Natural Language: A Survey and Foresight on Text-based Person Re-identification
Fanzhi Jiang
Su Yang
Mark W. Jones
Liumei Zhang
47
1
0
31 Jul 2024
Multimodal Label Relevance Ranking via Reinforcement Learning
Taian Guo
Taolin Zhang
Haoqian Wu
Hanjun Li
Ruizhi Qiao
Xing Sun
OffRL
14
0
0
18 Jul 2024
ConceptHash: Interpretable Fine-Grained Hashing via Concept Discovery
Kam Woh Ng
Xiatian Zhu
Yi-Zhe Song
Tao Xiang
33
2
0
12 Jun 2024
Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model
Elaheh Baharlouei
Mahsa Shafaei
Yigeng Zhang
Hugo Jair Escalante
Thamar Solorio
34
0
0
12 Jun 2024
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang
Jiawei Kong
Wenbo Yu
Bin Chen
Jiawei Li
Hao Wu
Ke Xu
AAML
VLM
30
13
0
08 Jun 2024
Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR
Zhenyang Li
Yangyang Guo
Ke-Jyun Wang
Xiaolin Chen
Liqiang Nie
Mohan S. Kankanhalli
LRM
19
7
0
27 May 2024
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
Jintao Sun
Zhedong Zheng
Gangyi Ding
32
7
0
16 Apr 2024
SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining
Chull Hwan Song
Taebaek Hwang
Jooyoung Yoon
Shunghyun Choi
Yeong Hyeon Gu
21
4
0
01 Apr 2024
Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework
Vu Minh Hieu Phan
Yutong Xie
Yuankai Qi
Lingqiao Liu
Liyang Liu
Bowen Zhang
Zhibin Liao
Qi Wu
Minh Nguyen Nhat To
Johan W. Verjans
51
11
0
12 Mar 2024
Enhancing Vision-Language Pre-training with Rich Supervisions
Yuan Gao
Kunyu Shi
Pengkai Zhu
Edouard Belval
Oren Nuriel
Srikar Appalaraju
Shabnam Ghadar
Vijay Mahadevan
Zhuowen Tu
Stefano Soatto
VLM
CLIP
62
12
0
05 Mar 2024
GS-CLIP: Gaussian Splatting for Contrastive Language-Image-3D Pretraining from Real-World Data
Haoyuan Li
Yanpeng Zhou
Yihan Zeng
Hang Xu
Xiaodan Liang
3DGS
CLIP
16
0
0
09 Feb 2024
Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation
Wenhao Xu
Rongtao Xu
Changwei Wang
Shibiao Xu
Li Guo
Man Zhang
Xiaopeng Zhang
VLM
20
10
0
20 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLM
MLLM
38
29
0
19 Dec 2023
Context Disentangling and Prototype Inheriting for Robust Visual Grounding
Wei Tang
Liang Li
Xuejing Liu
Lu Jin
Jinhui Tang
Zechao Li
33
24
0
19 Dec 2023
Prompt-based Distribution Alignment for Unsupervised Domain Adaptation
Shuanghao Bai
Min Zhang
Wanqi Zhou
Siteng Huang
Zhirong Luan
Donglin Wang
Badong Chen
OOD
VLM
11
32
0
15 Dec 2023
MAFA: Managing False Negatives for Vision-Language Pre-training
Jaeseok Byun
Dohoon Kim
Taesup Moon
VLM
13
3
0
11 Dec 2023
SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation
Bangyan He
Xiaojun Jia
Siyuan Liang
Tianrui Lou
Yang Liu
Xiaochun Cao
AAML
VLM
19
23
0
08 Dec 2023
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
Ruyang Liu
Jingjia Huang
Wei-Nan Gao
Thomas H. Li
Ge Li
VLM
27
3
0
25 Nov 2023
ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation
Yangyi Chen
Xingyao Wang
Manling Li
Derek Hoiem
Heng Ji
25
10
0
22 Nov 2023
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
24
56
0
16 Nov 2023
Improving Vision-and-Language Reasoning via Spatial Relations Modeling
Cheng Yang
Rui Xu
Ye Guo
Peixiang Huang
Yiru Chen
Wenkui Ding
Zhongyuan Wang
Hong Zhou
LRM
8
5
0
09 Nov 2023
Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents
Tofik Ali
Partha Pratim Roy
11
0
0
25 Oct 2023
Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation
Anastasia Kritharoula
Maria Lymperaiou
Giorgos Stamou
17
4
0
21 Oct 2023
Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation
Siyu Zhang
Ye-Ting Chen
Fang Wang
Yaoru Sun
Jun Yang
Lizhi Bai
SSL
17
0
0
20 Oct 2023
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
Yanyang Guo
Fangkai Jiao
Zhiqi Shen
Liqiang Nie
Mohan S. Kankanhalli
MLLM
14
5
0
17 Oct 2023
Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction
Yiren Jian
Tingkai Liu
Yunzhe Tao
Chunhui Zhang
Soroush Vosoughi
HX Yang
VLM
15
7
0
05 Oct 2023
ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
Yangyang Guo
Haoyu Zhang
Yongkang Wong
Liqiang Nie
Mohan S. Kankanhalli
VLM
14
3
0
28 Sep 2023
VidChapters-7M: Video Chapters at Scale
Antoine Yang
Arsha Nagrani
Ivan Laptev
Josef Sivic
Cordelia Schmid
VGen
13
26
0
25 Sep 2023
Kosmos-2.5: A Multimodal Literate Model
Tengchao Lv
Yupan Huang
Jingye Chen
Lei Cui
Shuming Ma
...
Weiyao Luo
Shaoxiang Wu
Guoxin Wang
Cha Zhang
Furu Wei
VLM
MLLM
21
63
0
20 Sep 2023
StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding
Renqiu Xia
Bo-Wen Zhang
Hao Peng
Hancheng Ye
Xiangchao Yan
Peng Ye
Botian Shi
Yu Qiao
Junchi Yan
14
0
0
20 Sep 2023
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
LRM
34
24
0
08 Sep 2023
Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models
Qiong Wu
Wei Yu
Yiyi Zhou
Shubin Huang
Xiaoshuai Sun
R. Ji
VLM
13
7
0
04 Sep 2023
Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Yupan Huang
Zaiqiao Meng
Fangyu Liu
Yixuan Su
Nigel Collier
Yutong Lu
MLLM
28
22
0
31 Aug 2023
Whether you can locate or not? Interactive Referring Expression Generation
Fulong Ye
Yuxing Long
Fangxiang Feng
Xiaojie Wang
19
4
0
19 Aug 2023
Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning
Ye-Ting Chen
Siyu Zhang
Yaoru Sun
Weijian Liang
Haoran Wang
33
0
0
18 Aug 2023
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
Kousik Rajesh
Mrigank Raman
M. A. Karim
Pranit Chawla
VLM
23
2
0
31 Jul 2023
Self-Supervised and Semi-Supervised Polyp Segmentation using Synthetic Data
Enric Moreu
Eric Arazo
Kevin McGuinness
Noel E. O'Connor
19
1
0
22 Jul 2023
SINC: Self-Supervised In-Context Learning for Vision-Language Tasks
Yi-Syuan Chen
Yun-Zhu Song
Cheng Yu Yeo
Bei Liu
Jianlong Fu
Hong-Han Shuai
VLM
LRM
24
4
0
15 Jul 2023
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Yiren Jian
Chongyang Gao
Soroush Vosoughi
VLM
MLLM
19
25
0
13 Jul 2023
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
15
5
0
06 Jul 2023
Approximated Prompt Tuning for Vision-Language Pre-trained Models
Qiong Wu
Shubin Huang
Yiyi Zhou
Pingyang Dai
Annan Shu
Guannan Jiang
Rongrong Ji
VLM
VPVLM
14
2
0
27 Jun 2023
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing
Zilun Zhang
Tiancheng Zhao
Yulong Guo
Jianwei Yin
DiffM
VLM
18
52
0
20 Jun 2023
M3PT: A Multi-Modal Model for POI Tagging
Jingsong Yang
Guanzhou Han
Deqing Yang
Jingping Liu
Yanghua Xiao
Xiang Xu
Baohua Wu
Shenghua Ni
48
3
0
16 Jun 2023