ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1908.02265
  4. Cited By
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for
  Vision-and-Language Tasks

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
    SSL
    VLM
ArXivPDFHTML

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,088 papers shown
Title
MIO: A Foundation Model on Multimodal Tokens
MIO: A Foundation Model on Multimodal Tokens
Zekun Wang
King Zhu
Chunpu Xu
Wangchunshu Zhou
Jiaheng Liu
...
Yuanxing Zhang
Ge Zhang
Ke Xu
Jie Fu
Wenhao Huang
MLLM
AuLLM
51
11
0
26 Sep 2024
Path-adaptive Spatio-Temporal State Space Model for Event-based
  Recognition with Arbitrary Duration
Path-adaptive Spatio-Temporal State Space Model for Event-based Recognition with Arbitrary Duration
Jiazhou Zhou
Kanghao Chen
Lei Zhang
Lin Wang
24
0
0
25 Sep 2024
DIAL: Dense Image-text ALignment for Weakly Supervised Semantic
  Segmentation
DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation
Soojin Jang
Jungmin Yun
Junehyoung Kwon
Eunju Lee
Youngbin Kim
38
3
0
24 Sep 2024
Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond
Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond
Hong Chen
Xin Wang
Yuwei Zhou
Bin Huang
Yipeng Zhang
Wei Feng
Houlun Chen
Zeyang Zhang
Siao Tang
Wenwu Zhu
DiffM
47
7
0
23 Sep 2024
LARE: Latent Augmentation using Regional Embedding with Vision-Language
  Model
LARE: Latent Augmentation using Regional Embedding with Vision-Language Model
Kosuke Sakurai
Tatsuya Ishii
Ryotaro Shimizu
Linxin Song
Masayuki Goto
VLM
26
0
0
19 Sep 2024
Multi-Cohort Framework with Cohort-Aware Attention and Adversarial
  Mutual-Information Minimization for Whole Slide Image Classification
Multi-Cohort Framework with Cohort-Aware Attention and Adversarial Mutual-Information Minimization for Whole Slide Image Classification
Sharon Peled
Y. Maruvka
Moti Freiman
26
1
0
17 Sep 2024
Resolving Inconsistent Semantics in Multi-Dataset Image Segmentation
Resolving Inconsistent Semantics in Multi-Dataset Image Segmentation
Qilong Zhangli
Di Liu
Abhishek Aich
Dimitris Metaxas
S. Schulter
31
0
0
15 Sep 2024
PROSE-FD: A Multimodal PDE Foundation Model for Learning Multiple
  Operators for Forecasting Fluid Dynamics
PROSE-FD: A Multimodal PDE Foundation Model for Learning Multiple Operators for Forecasting Fluid Dynamics
Yuxuan Liu
Jingmin Sun
Xinjie He
Griffin Pinney
Zecheng Zhang
Hayden Schaeffer
AI4CE
35
5
0
15 Sep 2024
ComAlign: Compositional Alignment in Vision-Language Models
ComAlign: Compositional Alignment in Vision-Language Models
Ali Abdollah
Amirmohammad Izadi
Armin Saghafian
Reza Vahidimajd
Mohammad Mozafari
Amirreza Mirzaei
Mohammadmahdi Samiei
M. Baghshah
CoGe
VLM
30
0
0
12 Sep 2024
Recent Trends of Multimodal Affective Computing: A Survey from NLP
  Perspective
Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective
Guimin Hu
Yi Xin
Weimin Lyu
Haojian Huang
Chang Sun
Z. Zhu
Lin Gui
Ruichu Cai
Erik Cambria
Hasti Seifi
30
5
0
11 Sep 2024
What to align in multimodal contrastive learning?
What to align in multimodal contrastive learning?
Benoit Dufumier
J. Castillo-Navarro
D. Tuia
Jean-Philippe Thiran
22
3
0
11 Sep 2024
MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large
  Language Model
MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model
Zhen Yang
Jinhao Chen
Zhengxiao Du
Wenmeng Yu
Weihan Wang
Wenyi Hong
Zhihuan Jiang
Bin Xu
Yuxiao Dong
Jie Tang
VLM
LRM
32
8
0
10 Sep 2024
VidLPRO: A $\underline{Vid}$eo-$\underline{L}$anguage
  $\underline{P}$re-training Framework for $\underline{Ro}$botic and
  Laparoscopic Surgery
VidLPRO: A Vid‾\underline{Vid}Vid​eo-L‾\underline{L}L​anguage P‾\underline{P}P​re-training Framework for Ro‾\underline{Ro}Ro​botic and Laparoscopic Surgery
Mohammadmahdi Honarmand
Muhammad Abdullah Jamal
Omid Mohareri
58
1
0
07 Sep 2024
MuAP: Multi-step Adaptive Prompt Learning for Vision-Language Model with
  Missing Modality
MuAP: Multi-step Adaptive Prompt Learning for Vision-Language Model with Missing Modality
Ruiting Dai
Yuqiao Tan
Lisi Mo
Tao He
Ke Qin
Shuang Liang
VLM
28
2
0
07 Sep 2024
Vision-Language Navigation with Continual Learning
Vision-Language Navigation with Continual Learning
Zhiyuan Li
Yanfeng Lv
Ziqin Tu
Di Shang
Hong Qiao
32
2
0
04 Sep 2024
CV-Probes: Studying the interplay of lexical and world knowledge in
  visually grounded verb understanding
CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding
Ivana Beňová
Michal Gregor
Albert Gatt
32
0
0
02 Sep 2024
HiTSR: A Hierarchical Transformer for Reference-based Super-Resolution
HiTSR: A Hierarchical Transformer for Reference-based Super-Resolution
Masoomeh Aslahishahri
Jordan R. Ubbens
Ian Stavness
29
0
0
30 Aug 2024
VLM-KD: Knowledge Distillation from VLM for Long-Tail Visual Recognition
VLM-KD: Knowledge Distillation from VLM for Long-Tail Visual Recognition
Zaiwei Zhang
Gregory P. Meyer
Zhichao Lu
Ashish Shrivastava
Avinash Ravichandran
Eric M. Wolff
VLM
31
2
0
29 Aug 2024
Benchmarking Japanese Speech Recognition on ASR-LLM Setups with
  Multi-Pass Augmented Generative Error Correction
Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction
Yuka Ko
Sheng Li
Chao-Han Huck Yang
Tatsuya Kawahara
AuLLM
29
3
0
29 Aug 2024
Pixels to Prose: Understanding the art of Image Captioning
Pixels to Prose: Understanding the art of Image Captioning
Hrishikesh Singh
Aarti Sharma
Millie Pant
3DV
VLM
25
0
0
28 Aug 2024
Meta-Learn Unimodal Signals with Weak Supervision for Multimodal
  Sentiment Analysis
Meta-Learn Unimodal Signals with Weak Supervision for Multimodal Sentiment Analysis
Sijie Mai
Yu Zhao
Ying Zeng
Jianhua Yao
Haifeng Hu
25
2
0
28 Aug 2024
LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages
  in Multimodal Image Retrieval Task
LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task
Ali Asgarov
Samir Rustamov
VLM
21
1
0
25 Aug 2024
Probing the Robustness of Vision-Language Pretrained Models: A
  Multimodal Adversarial Attack Approach
Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach
Jiwei Guan
Tianyu Ding
Longbing Cao
Lei Pan
Chen Wang
Xi Zheng
AAML
31
1
0
24 Aug 2024
CluMo: Cluster-based Modality Fusion Prompt for Continual Learning in
  Visual Question Answering
CluMo: Cluster-based Modality Fusion Prompt for Continual Learning in Visual Question Answering
Yuliang Cai
Mohammad Rostami
CLL
VLM
MLLM
35
2
0
21 Aug 2024
C${^2}$RL: Content and Context Representation Learning for Gloss-free
  Sign Language Translation and Retrieval
C2{^2}2RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval
Zhigang Chen
Benjia Zhou
Yiqing Huang
Jun Wan
Yibo Hu
Hailin Shi
Yanyan Liang
Zhen Lei
Du Zhang
VLM
SLR
29
1
0
19 Aug 2024
Attribution Analysis Meets Model Editing: Advancing Knowledge Correction in Vision Language Models with VisEdit
Attribution Analysis Meets Model Editing: Advancing Knowledge Correction in Vision Language Models with VisEdit
Qizhou Chen
Taolin Zhang
Chengyu Wang
Xiaofeng He
Dakan Wang
Tingting Liu
KELM
44
2
0
19 Aug 2024
Ask, Attend, Attack: A Effective Decision-Based Black-Box Targeted
  Attack for Image-to-Text Models
Ask, Attend, Attack: A Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models
Qingyuan Zeng
Zhenzhong Wang
Yiu-ming Cheung
Min Jiang
AAML
40
1
0
16 Aug 2024
Multi-Modal Dialogue State Tracking for Playing GuessWhich Game
Multi-Modal Dialogue State Tracking for Playing GuessWhich Game
Wei Pang
Ruixue Duan
Jinfu Yang
Ning Li
25
0
0
15 Aug 2024
Adaptive Learning of Consistency and Inconsistency Information for Fake
  News Detection
Adaptive Learning of Consistency and Inconsistency Information for Fake News Detection
Aohan Li
Jiaxin Chen
Xin Liao
Dengyong Zhang
21
1
0
15 Aug 2024
IIU: Independent Inference Units for Knowledge-based Visual Question
  Answering
IIU: Independent Inference Units for Knowledge-based Visual Question Answering
Yili Li
Jing Yu
Keke Gai
Gang Xiong
16
0
0
15 Aug 2024
Modality Invariant Multimodal Learning to Handle Missing Modalities: A
  Single-Branch Approach
Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach
Muhammad Saad Saeed
Shah Nawaz
Muhammad Zaigham Zaheer
Muhammad Haris Khan
Karthik Nandakumar
Muhammad Haroon Yousaf
Hassan Sajjad
Tom De Schepper
Markus Schedl
18
0
0
14 Aug 2024
Dual-Domain CLIP-Assisted Residual Optimization Perception Model for
  Metal Artifact Reduction
Dual-Domain CLIP-Assisted Residual Optimization Perception Model for Metal Artifact Reduction
Xinrui Zhang
Ailong Cai
Shaoyu Wang
Linyuan Wang
Zhizhong Zheng
Lei Li
Bin Yan
MedIm
19
0
0
14 Aug 2024
LLMI3D: MLLM-based 3D Perception from a Single 2D Image
LLMI3D: MLLM-based 3D Perception from a Single 2D Image
Fan Yang
Sicheng Zhao
Yanhao Zhang
Haoxiang Chen
Hui Chen
Wenbo Tang
Guiguang Ding
33
4
0
14 Aug 2024
ASR-enhanced Multimodal Representation Learning for Cross-Domain Product
  Retrieval
ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval
Ruixiang Zhao
Jian Jia
Yan Li
Xuehan Bai
Quan Chen
Han Li
Peng Jiang
Xirong Li
28
0
0
06 Aug 2024
Towards Coarse-grained Visual Language Navigation Task Planning Enhanced
  by Event Knowledge Graph
Towards Coarse-grained Visual Language Navigation Task Planning Enhanced by Event Knowledge Graph
Zhao Kaichen
Song Yaoxian
Zhao Haiquan
Liu Haoyu
Li Tiefeng
Li Zhixu
28
0
0
05 Aug 2024
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Dongyang Liu
Shitian Zhao
Le Zhuo
Weifeng Lin
Yu Qiao
Xinyue Li
Qi Qin
Yu Qiao
Hongsheng Li
Peng Gao
MLLM
62
48
0
05 Aug 2024
A Novel Evaluation Framework for Image2Text Generation
A Novel Evaluation Framework for Image2Text Generation
Jia-Hong Huang
Hongyi Zhu
Yixian Shen
S. Rudinac
A. M. Pacces
Evangelos Kanoulas
37
7
0
03 Aug 2024
Actra: Optimized Transformer Architecture for Vision-Language-Action
  Models in Robot Learning
Actra: Optimized Transformer Architecture for Vision-Language-Action Models in Robot Learning
Yueen Ma
Dafeng Chi
Shiguang Wu
Yuecheng Liu
Yuzheng Zhuang
Jianye Hao
Irwin King
34
5
0
02 Aug 2024
An Efficient and Effective Transformer Decoder-Based Framework for
  Multi-Task Visual Grounding
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding
Wei Chen
Mahdieh Hatamian
Yu Wu
43
3
0
02 Aug 2024
MarvelOVD: Marrying Object Recognition and Vision-Language Models for
  Robust Open-Vocabulary Object Detection
MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection
Kuo Wang
Lechao Cheng
Weikai Chen
Pingping Zhang
Liang Lin
Fan Zhou
Guanbin Li
VLM
ObjD
26
1
0
31 Jul 2024
Advancing Vietnamese Visual Question Answering with Transformer and
  Convolutional Integration
Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration
Ngoc Son Nguyen
Van Son Nguyen
Tung Le
ViT
38
0
0
30 Jul 2024
BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger
  Visual Cues
BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
Sara Sarto
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
33
6
0
29 Jul 2024
FlexAttention for Efficient High-Resolution Vision-Language Models
FlexAttention for Efficient High-Resolution Vision-Language Models
Junyan Li
Delin Chen
Tianle Cai
Peihao Chen
Yining Hong
Zhenfang Chen
Yikang Shen
Chuang Gan
VLM
67
4
0
29 Jul 2024
Multi-modal Crowd Counting via Modal Emulation
Multi-modal Crowd Counting via Modal Emulation
Chenhao Wang
Xiaopeng Hong
Zhiheng Ma
Yupeng Wei
Yabin Wang
Xiaopeng Fan
27
1
0
28 Jul 2024
MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training
MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training
Biao Wu
Yutong Xie
Zeyu Zhang
Minh Hieu Phan
Qi Chen
Ling-Hao Chen
Qi Wu
LM&MA
32
0
0
28 Jul 2024
FakingRecipe: Detecting Fake News on Short Video Platforms from the
  Perspective of Creative Process
FakingRecipe: Detecting Fake News on Short Video Platforms from the Perspective of Creative Process
Yuyan Bu
Qiang Sheng
Juan Cao
Peng Qi
Danding Wang
Jintao Li
DiffM
35
8
0
23 Jul 2024
HAPFI: History-Aware Planning based on Fused Information
HAPFI: History-Aware Planning based on Fused Information
Sujin Jeon
Suyeon Shin
Byoung-Tak Zhang
29
0
0
23 Jul 2024
Spatiotemporal Graph Guided Multi-modal Network for Livestreaming
  Product Retrieval
Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval
Xiaowan Hu
Yiyi Chen
Yan Li
Minquan Wang
Haoqian Wang
Quan Chen
Han Li
Peng Jiang
AI4TS
19
0
0
23 Jul 2024
Chameleon: Images Are What You Need For Multimodal Learning Robust To
  Missing Modalities
Chameleon: Images Are What You Need For Multimodal Learning Robust To Missing Modalities
Muhammad Irzam Liaqat
Shah Nawaz
Muhammad Zaigham Zaheer
M. S. Saeed
Hassan Sajjad
Tom De Schepper
Karthik Nandakumar
Muhammad Haris Khan
21
1
0
23 Jul 2024
Knowledge Acquisition Disentanglement for Knowledge-based Visual
  Question Answering with Large Language Models
Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models
Wenbin An
Feng Tian
Jiahao Nie
Wenkai Shi
Haonan Lin
Yan Chen
Qianying Wang
Y. Wu
Guang Dai
Ping Chen
VLM
40
4
0
22 Jul 2024
Previous
12345...404142
Next