ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1908.03557
  4. Cited By
VisualBERT: A Simple and Performant Baseline for Vision and Language

VisualBERT: A Simple and Performant Baseline for Vision and Language

9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
    VLM
ArXiv (abs)PDFHTML

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

50 / 1,258 papers shown
Title
Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal
  Prompt Learning
Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal Prompt Learning
Mainak Singha
Ankit Jha
Divyam Gupta
Pranav Singla
Biplab Banerjee
VLM
379
2
0
05 Jul 2024
HEMM: Holistic Evaluation of Multimodal Foundation Models
HEMM: Holistic Evaluation of Multimodal Foundation Models
Paul Pu Liang
Akshay Goindani
Talha Chafekar
Leena Mathur
Haofei Yu
Ruslan Salakhutdinov
Louis-Philippe Morency
310
24
0
03 Jul 2024
ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for
  Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback
ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback
Ju-Seung Byun
Jiyun Chun
Jihyung Kil
Andrew Perrault
ReLMLRM
270
12
0
25 Jun 2024
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
Michal Golovanevsky
William Rudman
Vedant Palit
Ritambhara Singh
Carsten Eickhoff
430
10
0
24 Jun 2024
Towards Natural Language-Driven Assembly Using Foundation Models
Towards Natural Language-Driven Assembly Using Foundation Models
O. Joglekar
Tal Lancewicki
Dotan Di Castro
Vladimir Tchuiev
Zohar Feldman
Dotan Di Castro
LM&Ro
186
0
0
23 Jun 2024
RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote
  Sensing Image Understanding
RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding
Linrui Xu
Ling Zhao
Wang Guo
Qiujun Li
Kewang Long
Kaiqi Zou
Yuhan Wang
Haifeng Li
AI4TS
241
8
0
18 Jun 2024
WildVision: Evaluating Vision-Language Models in the Wild with Human
  Preferences
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
Yujie Lu
Dongfu Jiang
Wenhu Chen
William Yang Wang
Yejin Choi
Bill Yuchen Lin
VLM
428
56
0
16 Jun 2024
Multimodal Large Language Models with Fusion Low Rank Adaptation for
  Device Directed Speech Detection
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech DetectionInterspeech (Interspeech), 2024
Shruti Palaskar
Oggi Rudovic
Sameer Dharur
Florian Pesce
G. Krishna
Aswin Sivaraman
Jack Berkowitz
Ahmed Hussen Abdelaziz
Saurabh N. Adya
Ahmed H. Tewfik
VLM
156
3
0
13 Jun 2024
MirrorCheck: Efficient Adversarial Defense for Vision-Language Models
MirrorCheck: Efficient Adversarial Defense for Vision-Language Models
Samar Fares
Klea Ziu
Toluwani Aremu
Nikita Durasov
Martin Takáč
Pascal Fua
Karthik Nandakumar
Ivan Laptev
VLMAAML
223
9
0
13 Jun 2024
ConceptHash: Interpretable Fine-Grained Hashing via Concept Discovery
ConceptHash: Interpretable Fine-Grained Hashing via Concept Discovery
Kam Woh Ng
Xiatian Zhu
Yi-Zhe Song
Tao Xiang
218
2
0
12 Jun 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent
  Compression Learning
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Chenyu Yang
Xizhou Zhu
Jinguo Zhu
Weijie Su
Junjie Wang
...
Lewei Lu
Bin Li
Jie Zhou
Yu Qiao
Jifeng Dai
VLMCLIP
171
8
0
11 Jun 2024
Learning Domain-Invariant Features for Out-of-Context News Detection
Learning Domain-Invariant Features for Out-of-Context News Detection
Yimeng Gu
Mengqi Zhang
Ignacio Castro
Shu Wu
Gareth Tyson
271
6
0
11 Jun 2024
Aligning Human Knowledge with Visual Concepts Towards Explainable
  Medical Image Classification
Aligning Human Knowledge with Visual Concepts Towards Explainable Medical Image ClassificationInternational Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2024
Yunhe Gao
Difei Gu
Mu Zhou
Dimitris N. Metaxas
214
12
0
08 Jun 2024
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang
Jiawei Kong
Wenbo Yu
Bin Chen
Jiawei Li
Hao Wu
Ke Xu
Ke Xu
AAMLVLM
389
27
0
08 Jun 2024
Interpretable Multimodal Out-of-context Detection with Soft Logic
  Regularization
Interpretable Multimodal Out-of-context Detection with Soft Logic RegularizationIEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Huanhuan Ma
Jinghao Zhang
Qiang Liu
Shu Wu
Liang Wang
190
3
0
07 Jun 2024
ArMeme: Propagandistic Content in Arabic Memes
ArMeme: Propagandistic Content in Arabic Memes
Firoj Alam
A. Hasnat
Fatema Ahmed
Md. Arid Hasan
Maram Hasanain
170
11
0
06 Jun 2024
Multimodal Reasoning with Multimodal Knowledge Graph
Multimodal Reasoning with Multimodal Knowledge Graph
Junlin Lee
Yequan Wang
Jing Li
Min Zhang
234
47
0
04 Jun 2024
Progressive Confident Masking Attention Network for Audio-Visual Segmentation
Progressive Confident Masking Attention Network for Audio-Visual Segmentation
Yuxuan Wang
Feng Dong
Jinchao Zhu
Shuyue Zhu
VOS
349
1
0
04 Jun 2024
Hard Cases Detection in Motion Prediction by Vision-Language Foundation
  Models
Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models
Yi Yang
Qingwen Zhang
Kei Ikemura
Nazre Batool
John Folkesson
VLM
203
3
0
31 May 2024
Can't make an Omelette without Breaking some Eggs: Plausible Action
  Anticipation using Large Video-Language Models
Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models
Himangi Mittal
Nakul Agarwal
Shao-Yuan Lo
Kwonjoon Lee
284
29
0
30 May 2024
Enhancing Large Vision Language Models with Self-Training on Image
  Comprehension
Enhancing Large Vision Language Models with Self-Training on Image Comprehension
Yihe Deng
Pan Lu
Fan Yin
Ziniu Hu
Sheng Shen
James Zou
Kai-Wei Chang
Wei Wang
SyDaVLMLRM
222
71
0
30 May 2024
MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification
MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification
Laura Fieback
Jakob Spiegelberg
Hanno Gottschalk
MLLM
438
7
0
29 May 2024
FinEmbedDiff: A Cost-Effective Approach of Classifying Financial
  Documents with Vector Sampling using Multi-modal Embedding Models
FinEmbedDiff: A Cost-Effective Approach of Classifying Financial Documents with Vector Sampling using Multi-modal Embedding Models
Anjanava Biswas
Wrick Talukdar
85
3
0
28 May 2024
Diagnosing the Compositional Knowledge of Vision Language Models from a
  Game-Theoretic View
Diagnosing the Compositional Knowledge of Vision Language Models from a Game-Theoretic View
Jin Wang
Shichao Dong
Yapeng Zhu
Kelu Yao
Weidong Zhao
Chao Li
Ping Luo
CoGeLRM
245
5
0
27 May 2024
Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning
Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning
Neha Kalibhat
Priyatham Kattakinda
Arman Zarei
Nikita Seleznev
Sam Sharpe
Samuel Sharpe
Senthil Kumar
Soheil Feizi
ViT
220
0
0
26 May 2024
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
Xiyao Wang
Jiuhai Chen
Zhaoyang Wang
Yuhang Zhou
Yiyang Zhou
...
Wanrong Zhu
Tom Goldstein
Parminder Bhatia
Furong Huang
Cao Xiao
434
62
0
24 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
848
162
0
23 May 2024
PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering
  in Pituitary Surgery
PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery
Runlong He
Mengya Xu
Adrito Das
Danyal Z. Khan
Sophia Bano
Hani J. Marcus
Danail Stoyanov
Matthew J. Clarkson
Mobarakol Islam
141
11
0
22 May 2024
A Novel Fusion Architecture for PD Detection Using Semi-Supervised
  Speech Embeddings
A Novel Fusion Architecture for PD Detection Using Semi-Supervised Speech Embeddings
Tariq Adnan
Abdelrahman Abdelkader
Zipei Liu
Ekram Hossain
Sooyong Park
Md. Saiful Islam
Ehsan Hoque
107
5
0
21 May 2024
Resolving Word Vagueness with Scenario-guided Adapter for Natural
  Language Inference
Resolving Word Vagueness with Scenario-guided Adapter for Natural Language Inference
Yuqi Liu
Mengyu Li
Di Liang
Ximing Li
Fausto Giunchiglia
Lan Huang
Xiaoyue Feng
Renchu Guan
170
10
0
21 May 2024
Enhancing Fine-Grained Image Classifications via Cascaded Vision
  Language Models
Enhancing Fine-Grained Image Classifications via Cascaded Vision Language ModelsConference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Canshi Wei
VLM
194
2
0
18 May 2024
MemeMQA: Multimodal Question Answering for Memes via Rationale-Based
  Inferencing
MemeMQA: Multimodal Question Answering for Memes via Rationale-Based InferencingAnnual Meeting of the Association for Computational Linguistics (ACL), 2024
Siddhant Agarwal
Shivam Sharma
Preslav Nakov
Tanmoy Chakraborty
238
8
0
18 May 2024
Review of Deep Representation Learning Techniques for Brain-Computer
  Interfaces and Recommendations
Review of Deep Representation Learning Techniques for Brain-Computer Interfaces and RecommendationsJournal of Neural Engineering (J. Neural Eng.), 2024
Pierre Guetschel
Sara Ahmadi
Michael Tangermann
292
5
0
17 May 2024
STAR: A Benchmark for Situated Reasoning in Real-World Videos
STAR: A Benchmark for Situated Reasoning in Real-World Videos
Bo Wu
Shoubin Yu
Zhenfang Chen
Joshua B. Tenenbaum
Chuang Gan
453
254
0
15 May 2024
Self-supervised vision-langage alignment of deep learning
  representations for bone X-rays analysis
Self-supervised vision-langage alignment of deep learning representations for bone X-rays analysis
A. Englebert
Anne-Sophie Collin
O. Cornu
Christophe De Vleeschouwer
220
1
0
14 May 2024
Unified Video-Language Pre-training with Synchronized Audio
Unified Video-Language Pre-training with Synchronized Audio
Shentong Mo
Haofan Wang
Huaxia Li
Xu Tang
256
2
0
12 May 2024
Realizing Visual Question Answering for Education: GPT-4V as a
  Multimodal AI
Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI
Gyeong-Geon Lee
Xiaoming Zhai
146
22
0
12 May 2024
Similarity Guided Multimodal Fusion Transformer for Semantic Location
  Prediction in Social Media
Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media
Zhizhen Zhang
Ning Wang
Haojie Li
Zhihui Wang
183
0
0
09 May 2024
POV Learning: Individual Alignment of Multimodal Models using Human Perception
POV Learning: Individual Alignment of Multimodal Models using Human Perception
Simon Werner
Katharina Christ
Laura Bernardy
Marion G. Müller
Achim Rettinger
100
2
0
07 May 2024
Enhanced Textual Feature Extraction for Visual Question Answering: A Simple Convolutional Approach
Enhanced Textual Feature Extraction for Visual Question Answering: A Simple Convolutional Approach
Zhilin Zhang
Fangyu Wu
144
0
0
01 May 2024
ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question
  Answering by Understanding Vietnamese Text in Images
ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images
Huy Quang Pham
Thang Kien-Bao Nguyen
Quan Van Nguyen
Dan Quang Tran
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
197
10
0
29 Apr 2024
Medical Vision-Language Pre-Training for Brain Abnormalities
Medical Vision-Language Pre-Training for Brain Abnormalities
Masoud Monajatipoor
Zi-Yi Dou
Aichi Chien
Nanyun Peng
Kai-Wei Chang
VLM
208
2
0
27 Apr 2024
NTIRE 2024 Quality Assessment of AI-Generated Content Challenge
NTIRE 2024 Quality Assessment of AI-Generated Content Challenge
Xiaohong Liu
Xiongkuo Min
Guangtao Zhai
Chunyi Li
Tengchuan Kou
...
Qi Yan
Youran Qu
Xiaohui Zeng
Lele Wang
Renjie Liao
362
43
0
25 Apr 2024
What Makes Multimodal In-Context Learning Work?
What Makes Multimodal In-Context Learning Work?
Folco Bertini Baldassini
Mustafa Shukor
Matthieu Cord
Laure Soulier
Benjamin Piwowarski
408
36
0
24 Apr 2024
Leveraging Speech for Gesture Detection in Multimodal Communication
Leveraging Speech for Gesture Detection in Multimodal Communication
E. Ghaleb
I. Burenko
Marlou Rasenberg
Wim Pouw
Ivan Toni
Peter Uhrig
Anna Wilson
Judith Holler
Asli Ozyurek
Raquel Fernández
SLR
177
5
0
23 Apr 2024
PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based
  Visual Question Answering
PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering
Yihao Ding
Kaixuan Ren
Jiabin Huang
Siwen Luo
S. Han
206
4
0
19 Apr 2024
Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language
  Pre-training Models
Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models
Shouwei Ruan
Yinpeng Dong
Hanqing Liu
Yao Huang
Hang Su
Xingxing Wei
VLM
328
3
0
18 Apr 2024
Variational Multi-Modal Hypergraph Attention Network for Multi-Modal
  Relation Extraction
Variational Multi-Modal Hypergraph Attention Network for Multi-Modal Relation Extraction
Qian Li
Cheng Ji
Shu Guo
Yong Zhao
Qianren Mao
Shangguang Wang
Yuntao Wei
Jianxin Li
123
3
0
18 Apr 2024
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Jie Ma
Min Hu
Pinghui Wang
Wangchun Sun
Lingyun Song
Hongbin Pei
Jun Liu
Youtian Du
471
15
0
18 Apr 2024
Towards a Foundation Model for Partial Differential Equations: Multi-Operator Learning and Extrapolation
Towards a Foundation Model for Partial Differential Equations: Multi-Operator Learning and Extrapolation
Jingmin Sun
Yuxuan Liu
Zecheng Zhang
Hayden Schaeffer
AI4CE
371
37
0
18 Apr 2024
Previous
123456...242526
Next