ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1908.03557
  4. Cited By
VisualBERT: A Simple and Performant Baseline for Vision and Language

VisualBERT: A Simple and Performant Baseline for Vision and Language

9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
    VLM
ArXiv (abs)PDFHTML

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

50 / 1,260 papers shown
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Jie Ma
Min Hu
Pinghui Wang
Wangchun Sun
Lingyun Song
Hongbin Pei
Jun Liu
Youtian Du
475
15
0
18 Apr 2024
Towards a Foundation Model for Partial Differential Equations: Multi-Operator Learning and Extrapolation
Towards a Foundation Model for Partial Differential Equations: Multi-Operator Learning and Extrapolation
Jingmin Sun
Yuxuan Liu
Zecheng Zhang
Hayden Schaeffer
AI4CE
402
39
0
18 Apr 2024
Octopus v3: Technical Report for On-device Sub-billion Multimodal AI
  Agent
Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent
Wei Chen
Zhiyuan Li
LLMAG
128
9
0
17 Apr 2024
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
Jintao Sun
Zhedong Zheng
Gangyi Ding
Gangyi Ding
434
19
0
16 Apr 2024
Evolving Interpretable Visual Classifiers with Large Language Models
Evolving Interpretable Visual Classifiers with Large Language Models
Mia Chiquier
Utkarsh Mall
Carl Vondrick
VLM
254
20
0
15 Apr 2024
Conditional Prototype Rectification Prompt Learning
Conditional Prototype Rectification Prompt Learning
Haoxing Chen
Yaohui Li
Zizheng Huang
Yan Hong
Zhuoer Xu
Zhangxuan Gu
Jun Lan
Huijia Zhu
Weiqiang Wang
VLM
231
3
0
15 Apr 2024
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
Lewei Yao
Renjie Pi
Jianhua Han
Xiaodan Liang
Hang Xu
Wei Zhang
Zhenguo Li
Dan Xu
VLMObjD
295
44
0
14 Apr 2024
AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic
  Segmentation
AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation
Jiannan Ge
Lingxi Xie
Hongtao Xie
Nianzu Yang
Xiaopeng Zhang
Yongdong Zhang
Qi Tian
VLM
292
5
0
08 Apr 2024
Contextual Chart Generation for Cyber Deception
Contextual Chart Generation for Cyber Deception
David D. Nguyen
David Liebowitz
Surya Nepal
S. Kanhere
Sharif Abuadbba
250
1
0
07 Apr 2024
Vision Transformers in Domain Adaptation and Generalization: A Study of
  Robustness
Vision Transformers in Domain Adaptation and Generalization: A Study of Robustness
Shadi Alijani
Jamil Fayyad
Homayoun Najjaran
OOD
313
1
0
05 Apr 2024
DeViDe: Faceted medical knowledge for improved medical vision-language
  pre-training
DeViDe: Faceted medical knowledge for improved medical vision-language pre-training
Haozhe Luo
Ziyu Zhou
Corentin Royer
Anjany Sekuboyina
Bjoern Menze
VLMViTMedIm
258
11
0
04 Apr 2024
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with
  Interleaved Visual-Textual Tokens
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
Kirolos Ataallah
Xiaoqian Shen
Eslam Abdelrahman
Essam Sleiman
Deyao Zhu
Jian Ding
Mohamed Elhoseiny
VLM
234
116
0
04 Apr 2024
Cross-Modality Gait Recognition: Bridging LiDAR and Camera Modalities
  for Human Identification
Cross-Modality Gait Recognition: Bridging LiDAR and Camera Modalities for Human Identification
Rui Wang
Chuanfu Shen
M. Marín-Jiménez
George Q. Huang
Shiqi Yu
CVBM
232
9
0
04 Apr 2024
BCAmirs at SemEval-2024 Task 4: Beyond Words: A Multimodal and
  Multilingual Exploration of Persuasion in Memes
BCAmirs at SemEval-2024 Task 4: Beyond Words: A Multimodal and Multilingual Exploration of Persuasion in MemesInternational Workshop on Semantic Evaluation (SemEval), 2024
Amirhossein Abaskohi
AmirHossein Dabiri Aghdam
Lele Wang
Giuseppe Carenini
226
1
0
03 Apr 2024
Bi-LORA: A Vision-Language Approach for Synthetic Image Detection
Bi-LORA: A Vision-Language Approach for Synthetic Image Detection
Mamadou Keita
W. Hamidouche
Hessen Bougueffa Eutamene
Abdenour Hadid
Abdelmalik Taleb-Ahmed
306
22
0
02 Apr 2024
CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal
  Model Inference
CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference
Ruqi Liao
Chuqing Zhao
Jin Li
Weiqi Feng
67
0
0
02 Apr 2024
VideoDistill: Language-aware Vision Distillation for Video Question
  Answering
VideoDistill: Language-aware Vision Distillation for Video Question Answering
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
VGen
239
3
0
01 Apr 2024
LLaMA-Excitor: General Instruction Tuning via Indirect Feature
  Interaction
LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
236
8
0
01 Apr 2024
Unknown Prompt, the only Lacuna: Unveiling CLIP's Potential for Open
  Domain Generalization
Unknown Prompt, the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization
Mainak Singha
Ankit Jha
Shirsha Bose
Ashwin Nair
Moloud Abdar
Biplab Banerjee
VLM
206
23
0
31 Mar 2024
Learn "No" to Say "Yes" Better: Improving Vision-Language Models via
  Negations
Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations
Jaisidh Singh
Ishaan Shrivastava
Mayank Vatsa
Richa Singh
Aparna Bharati
VLMCoGe
221
31
0
29 Mar 2024
FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint
  Textual and Visual Clues
FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues
Shuang Li
Jiahua Wang
Lijie Wen
LRM
151
0
0
29 Mar 2024
Semantic Map-based Generation of Navigation Instructions
Semantic Map-based Generation of Navigation Instructions
Chengzu Li
Chao Zhang
Simone Teufel
R. Doddipatla
Svetlana Stoyanchev
211
5
0
28 Mar 2024
Scaling Vision-and-Language Navigation With Offline RL
Scaling Vision-and-Language Navigation With Offline RL
Valay Bundele
Mahesh Bhupati
Biplab Banerjee
Aditya Grover
OffRL
183
1
0
27 Mar 2024
Predicate Debiasing in Vision-Language Models Integration for Scene Graph Generation Enhancement
Predicate Debiasing in Vision-Language Models Integration for Scene Graph Generation EnhancementConference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Yuxuan Wang
Xiaoyuan Liu
VLM
275
0
0
24 Mar 2024
Not All Attention is Needed: Parameter and Computation Efficient
  Transfer Learning for Multi-modal Large Language Models
Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models
Qiong Wu
Yiyi Zhou
Weihao Ye
Xiaoshuai Sun
Rongrong Ji
MoE
184
2
0
22 Mar 2024
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery
Guan-Feng Wang
Long Bai
Wan Jun Nah
Jie Wang
Zhaoxi Zhang
Zhen Chen
Jinlin Wu
Mobarakol Islam
Hongbin Liu
Hongliang Ren
350
28
0
22 Mar 2024
Grounding Spatial Relations in Text-Only Language Models
Grounding Spatial Relations in Text-Only Language Models
Gorka Azkune
Ander Salaberria
Eneko Agirre
192
2
0
20 Mar 2024
As Firm As Their Foundations: Can open-sourced foundation models be used
  to create adversarial examples for downstream tasks?
As Firm As Their Foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks?
Anjun Hu
Jindong Gu
Francesco Pinto
Konstantinos Kamnitsas
Juil Sock
AAMLSILM
251
9
0
19 Mar 2024
Modality-Agnostic fMRI Decoding of Vision and Language
Modality-Agnostic fMRI Decoding of Vision and Language
Mitja Nikolaus
Milad Mozafari
Nicholas Asher
Leila Reddy
Rufin VanRullen
189
4
0
18 Mar 2024
Prioritized Semantic Learning for Zero-shot Instance Navigation
Prioritized Semantic Learning for Zero-shot Instance NavigationEuropean Conference on Computer Vision (ECCV), 2024
Xander Sun
Louis Lau
Hoyard Zhi
Ronghe Qiu
Junwei Liang
250
20
0
18 Mar 2024
Deciphering Hate: Identifying Hateful Memes and Their Targets
Deciphering Hate: Identifying Hateful Memes and Their Targets
E. Hossain
Omar Sharif
M. M. Hoque
S. Preum
202
8
0
16 Mar 2024
GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery
GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category DiscoveryComputer Vision and Pattern Recognition (CVPR), 2024
Enguang Wang
Zhimao Peng
Zhengyuan Xie
Fei Yang
Xialei Liu
Ming-Ming Cheng
471
14
0
15 Mar 2024
PosSAM: Panoptic Open-vocabulary Segment Anything
PosSAM: Panoptic Open-vocabulary Segment Anything
VS Vibashan
Shubhankar Borse
Hyojin Park
Debasmit Das
Vishal M. Patel
Munawar Hayat
Fatih Porikli
VLMMLLM
193
8
0
14 Mar 2024
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Brandon McKinzie
Zhe Gan
J. Fauconnier
Sam Dodge
Bowen Zhang
...
Zirui Wang
Ruoming Pang
Peter Grasch
Alexander Toshev
Yinfei Yang
MLLM
516
244
0
14 Mar 2024
Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI
Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI
Bo Shu
Zhouyao Zhu
Dong Shu
371
3
0
14 Mar 2024
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan
Yousong Zhu
Hongyin Zhao
Fan Yang
Fan Yang
Jinqiao Wang
Jinqiao Wang
ObjD
294
26
0
14 Mar 2024
Efficient Prompt Tuning of Large Vision-Language Model for Fine-Grained
  Ship Classification
Efficient Prompt Tuning of Large Vision-Language Model for Fine-Grained Ship ClassificationIEEE Transactions on Geoscience and Remote Sensing (TGRS), 2024
Long Lan
Fengxiang Wang
Shuyan Li
Xiangtao Zheng
Zengmao Wang
Xinwang Liu
VLM
202
14
0
13 Mar 2024
Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model
  Performance and Annotation Cost
Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation CostInternational Conference on Language Resources and Evaluation (LREC), 2024
Oana Ignat
Longju Bai
Joan Nwatu
Amélie Reymond
201
8
0
12 Mar 2024
Noise-powered Multi-modal Knowledge Graph Representation Framework
Noise-powered Multi-modal Knowledge Graph Representation FrameworkInternational Conference on Computational Linguistics (COLING), 2024
Zhuo Chen
Yin Fang
Yichi Zhang
Lingbing Guo
Jiaoyan Chen
Hua-zeng Chen
Wen Zhang
Wen Zhang
194
0
0
11 Mar 2024
SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context
  Misinformation Detection
SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection
Peng Qi
Zehong Yan
Wynne Hsu
Yang Deng
MLLM
294
86
0
05 Mar 2024
Vision-Language Models for Medical Report Generation and Visual Question
  Answering: A Review
Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review
Iryna Hartsock
Ghulam Rasool
373
166
0
04 Mar 2024
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast
  Decoding
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding
Zhaorun Chen
Zhuokai Zhao
Hongyin Luo
Huaxiu Yao
Bo Li
Jiawei Zhou
MLLM
264
132
0
01 Mar 2024
Acquiring Linguistic Knowledge from Multimodal Input
Acquiring Linguistic Knowledge from Multimodal Input
Theodor Amariucai
Alexander Scott Warstadt
CLL
284
4
0
27 Feb 2024
Vision Transformers with Natural Language Semantics
Vision Transformers with Natural Language Semantics
Young-Kyung Kim
Matías Di Martino
Guillermo Sapiro
ViT
153
7
0
27 Feb 2024
Demonstrating and Reducing Shortcuts in Vision-Language Representation
  Learning
Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
Maurits J. R. Bleeker
Mariya Hendriksen
Andrew Yates
Maarten de Rijke
VLM
322
9
0
27 Feb 2024
CARZero: Cross-Attention Alignment for Radiology Zero-Shot
  Classification
CARZero: Cross-Attention Alignment for Radiology Zero-Shot Classification
Zihang Jiang
Qingsong Yao
Zihang Jiang
Rongsheng Wang
Zhiyang He
Xiaodong Tao
S. Kevin Zhou
MedIm
299
36
0
27 Feb 2024
ACTrack: Adding Spatio-Temporal Condition for Visual Object Tracking
ACTrack: Adding Spatio-Temporal Condition for Visual Object Tracking
Yushan Han
Kaer Huang
163
1
0
27 Feb 2024
How Can LLM Guide RL? A Value-Based Approach
How Can LLM Guide RL? A Value-Based Approach
Shenao Zhang
Sirui Zheng
Shuqi Ke
Zhihan Liu
Wanxin Jin
Jianbo Yuan
Yingxiang Yang
Hongxia Yang
Zhaoran Wang
246
12
0
25 Feb 2024
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language
  Navigation
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
JIazhao Zhang
Kunyu Wang
Rongtao Xu
Gengze Zhou
Yicong Hong
Xiaomeng Fang
Qi Wu
Dongbin Zhao
Wang He
LM&Ro
657
153
0
24 Feb 2024
CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora
CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora
Zijun Long
Xuri Ge
R. McCreadie
Joemon M. Jose
331
12
0
23 Feb 2024
Previous
123...567...242526
Next