ResearchTrend.AI
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
    SSL
    VLM

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,088 papers shown
Title
Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation
Change Che
Qunwei Lin
Xinyu Zhao
Jiaxin Huang
Liqiang Yu
VLM
17
37
0
02 Jan 2024
Generating Enhanced Negatives for Training Language-Based Object Detectors
Shiyu Zhao
Long Zhao
Vijay Kumar B.G
Yumin Suh
Dimitris N. Metaxas
Manmohan Chandraker
S. Schulter
ObjD
VLM
32
5
0
29 Dec 2023
Towards a Unified Multimodal Reasoning Framework
Abhinav Arun
Dipendra Singh Mal
Mehul Soni
Tomohiro Sawada
LRM
17
0
0
22 Dec 2023
Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pretraining
Bumsoo Kim
Yeonsik Jo
Jinhyung Kim
S. Kim
VLM
14
6
0
19 Dec 2023
Expediting Contrastive Language-Image Pretraining via Self-distilled Encoders
Bumsoo Kim
Jinhyung Kim
Yeonsik Jo
S. Kim
VLM
21
3
0
19 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLM
MLLM
38
29
0
19 Dec 2023
Context Disentangling and Prototype Inheriting for Robust Visual Grounding
Wei Tang
Liang Li
Xuejing Liu
Lu Jin
Jinhui Tang
Zechao Li
33
24
0
19 Dec 2023
Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion
Xiao Wang
Jiandong Jin
Chenglong Li
Jin Tang
Cheng Zhang
Wei Wang
VLM
15
13
0
17 Dec 2023
Data-Efficient Multimodal Fusion on a Single GPU
Noël Vouitsis
Zhaoyan Liu
S. Gorti
Valentin Villecroze
Jesse C. Cresswell
Guangwei Yu
G. Loaiza-Ganem
M. Volkovs
43
3
0
15 Dec 2023
SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models
Lee Hyun
Kim Sung-Bin
Seungju Han
Youngjae Yu
Tae-Hyun Oh
25
13
0
15 Dec 2023
Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment
Xiaoxu Xu
Yitian Yuan
Qiudan Zhang
Wen-Bin Wu
Zequn Jie
Lin Ma
Xu Wang
56
4
0
15 Dec 2023
Text-Guided Face Recognition using Multi-Granularity Cross-Modal Contrastive Learning
Md Golam Moula Mehedi Hasan
S. Sami
Nasser M. Nasrabadi
23
4
0
14 Dec 2023
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
Chaoya Jiang
Wei Ye
Haiyang Xu
Qinghao Ye
Mingshi Yan
Ji Zhang
Shikun Zhang
CLIP
VLM
14
4
0
14 Dec 2023
Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis
Yafei Hu
Quanting Xie
Vidhi Jain
Jonathan M Francis
Jay Patrikar
...
Xiaolong Wang
Sebastian A. Scherer
Z. Kira
Fei Xia
Yonatan Bisk
LM&Ro
AI4CE
30
63
0
14 Dec 2023
A Foundational Multimodal Vision Language AI Assistant for Human Pathology
Ming Y. Lu
Bowen Chen
Drew F. K. Williamson
Richard J. Chen
Kenji Ikamura
...
Ivy Liang
L. Le
Tong Ding
Anil V. Parwani
Faisal Mahmood
MedIm
LM&MA
26
20
0
13 Dec 2023
Multimodal Pretraining of Medical Time Series and Notes
Ryan N. King
Tianbao Yang
Bobak J. Mortazavi
23
12
0
11 Dec 2023
Medical Vision Language Pretraining: A survey
Prashant Shrestha
Sanskar Amgain
Bidur Khanal
Cristian A. Linte
Binod Bhattarai
VLM
32
14
0
11 Dec 2023
MAFA: Managing False Negatives for Vision-Language Pre-training
Jaeseok Byun
Dohoon Kim
Taesup Moon
VLM
13
3
0
11 Dec 2023
Identifying and Mitigating Model Failures through Few-shot CLIP-aided Diffusion Generation
Atoosa Malemir Chegini
S. Feizi
VLM
33
4
0
09 Dec 2023
Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models
Hongzhan Lin
Ziyang Luo
Jing Ma
Long Chen
27
9
0
09 Dec 2023
Cross-BERT for Point Cloud Pretraining
Xin Li
Peng Li
Zeyong Wei
Zhe Zhu
Mingqiang Wei
Junhui Hou
Liangliang Nan
J. Qin
H. Xie
F. Wang
SSL
3DPC
28
0
0
08 Dec 2023
Localized Symbolic Knowledge Distillation for Visual Commonsense Models
J. Park
Jack Hessel
Khyathi Raghavi Chandu
Paul Pu Liang
Ximing Lu
...
Youngjae Yu
Qiuyuan Huang
Jianfeng Gao
Ali Farhadi
Yejin Choi
VLM
19
11
0
08 Dec 2023
Visual Grounding of Whole Radiology Reports for 3D CT Images
Akimichi Ichinose
Taro Hatsutani
Keigo Nakamura
Yoshiro Kitamura
S. Iizuka
E. Simo-Serra
Shoji Kido
Noriyuki Tomiyama
13
7
0
08 Dec 2023
Improved Visual Grounding through Self-Consistent Explanations
Ruozhen He
Paola Cascante-Bonilla
Ziyan Yang
Alexander C. Berg
Vicente Ordonez
ReLM
ObjD
LRM
FAtt
16
8
0
07 Dec 2023
Adventures of Trustworthy Vision-Language Models: A Survey
Mayank Vatsa
Anubhooti Jain
Richa Singh
22
4
0
07 Dec 2023
Open-Vocabulary Segmentation with Semantic-Assisted Calibration
Yong Liu
Sule Bai
Guanbin Li
Yitong Wang
Yansong Tang
VLM
26
28
0
07 Dec 2023
SequencePAR: Understanding Pedestrian Attributes via A Sequence Generation Paradigm
Jiandong Jin
Xiao Wang
Chenglong Li
Lili Huang
Jin Tang
AI4TS
24
6
0
04 Dec 2023
Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment
Cong-Duy Nguyen
The-Anh Vu-Le
Thong Nguyen
Tho Quan
A. Luu
23
5
0
04 Dec 2023
How to Configure Good In-Context Sequence for Visual Question Answering
Li Li
Jiawei Peng
Huiyi Chen
Chongyang Gao
Xu Yang
MLLM
15
20
0
04 Dec 2023
Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning
Cong Yang
Zuchao Li
Lefei Zhang
29
23
0
02 Dec 2023
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Mu Cai
Haotian Liu
Dennis Park
Siva Karthik Mustikovela
Gregory P. Meyer
Yuning Chai
Yong Jae Lee
VLM
LRM
MLLM
43
85
0
01 Dec 2023
Which way is `right'?: Uncovering limitations of Vision-and-Language Navigation model
Meera Hahn
Amit Raj
James M. Rehg
30
3
0
30 Nov 2023
A Lightweight Clustering Framework for Unsupervised Semantic Segmentation
Yau Shing Jonathan Cheung
Xi Chen
Lihe Yang
Hengshuang Zhao
12
1
0
30 Nov 2023
Contrastive Vision-Language Alignment Makes Efficient Instruction Learner
Lizhao Liu
Xinyu Sun
Tianhang Xiang
Zhuangwei Zhuang
Liuren Yin
Mingkui Tan
VLM
24
2
0
29 Nov 2023
PALM: Predicting Actions through Language Models
Sanghwan Kim
Daoji Huang
Yongqin Xian
Otmar Hilliges
Luc Van Gool
Xi Wang
VLM
19
10
0
29 Nov 2023
Does VLN Pretraining Work with Nonsensical or Irrelevant Instructions?
Wang Zhu
Ishika Singh
Yuan Huang
Robin Jia
Jesse Thomason
31
2
0
28 Nov 2023
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue
Yuansheng Ni
Kai Zhang
Tianyu Zheng
Ruoqi Liu
...
Yibo Liu
Wenhao Huang
Huan Sun
Yu-Chuan Su
Wenhu Chen
OSLM
ELM
VLM
71
731
0
27 Nov 2023
InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery
He Cao
Zijing Liu
Xingyu Lu
Yuan Yao
Yu Li
22
58
0
27 Nov 2023
C-SAW: Self-Supervised Prompt Learning for Image Generalization in Remote Sensing
Avigyan Bhattacharya
Mainak Singha
Ankit Jha
Biplab Banerjee
SSL
VLM
19
6
0
27 Nov 2023
SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation
Bin Xie
Jiale Cao
Jin Xie
Fahad Shahbaz Khan
Yanwei Pang
VLM
20
42
0
27 Nov 2023
Generalized Graph Prompt: Toward a Unification of Pre-Training and Downstream Tasks on Graphs
Xingtong Yu
Zhenghao Liu
Yuan Fang
Zemin Liu
Sihong Chen
Xinming Zhang
33
24
0
26 Nov 2023
Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training
Cheng Tan
Jingxuan Wei
Zhangyang Gao
Linzhuang Sun
Siyuan Li
Ruifeng Guo
Xihong Yang
Stan Z. Li
LRM
16
7
0
23 Nov 2023
ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation
Yangyi Chen
Xingyao Wang
Manling Li
Derek Hoiem
Heng Ji
30
11
0
22 Nov 2023
A Survey on Multimodal Large Language Models for Autonomous Driving
Can Cui
Yunsheng Ma
Xu Cao
Wenqian Ye
Yang Zhou
...
Xinrui Yan
Shuqi Mei
Jianguo Cao
Ziran Wang
Chao Zheng
38
249
0
21 Nov 2023
Active Prompt Learning in Vision Language Models
Jihwan Bang
Sumyeong Ahn
Jae-Gil Lee
VLM
9
9
0
18 Nov 2023
Fuse It or Lose It: Deep Fusion for Multimodal Simulation-Based Inference
Marvin Schmitt
Stefan T. Radev
Paul-Christian Burkner
44
5
0
17 Nov 2023
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
24
58
0
16 Nov 2023
Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search
Hefeng Wu
Weifeng Chen
Zhibin Liu
Tianshui Chen
Zhiguang Chen
Liang Lin
28
11
0
15 Nov 2023
Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing
Yating Xu
Conghui Hu
Gim Hee Lee
17
2
0
14 Nov 2023
Learning Mutually Informed Representations for Characters and Subwords
Yilin Wang
Xinyi Hu
Matthew R. Gormley
31
0
0
14 Nov 2023