ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Neural Information Processing Systems (NeurIPS), 2019
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL, VLM

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,232 papers shown
Efficient Multi-Modal Embeddings from Structured Data
A. Vero
Ann A. Copestake
118
4
0
06 Oct 2021
Word Acquisition in Neural Language Models
Tyler A. Chang
Benjamin Bergen
268
46
0
05 Oct 2021
A Survey On Neural Word Embeddings
Erhan Sezerer
Selma Tekir
AI4TS
271
20
0
05 Oct 2021
ProTo: Program-Guided Transformer for Program-Guided Tasks
Zelin Zhao
Karan Samel
Binghong Chen
Le Song
ViT, LM&Ro
260
32
0
02 Oct 2021
Visually Grounded Concept Composition
Bowen Zhang
Hexiang Hu
Linlu Qiu
Peter Shaw
Fei Sha
CoGe
192
7
0
29 Sep 2021
Visually Grounded Reasoning across Languages and Cultures
Fangyu Liu
Emanuele Bugliarello
Edoardo Ponti
Siva Reddy
Nigel Collier
Desmond Elliott
VLM, LRM
483
202
0
28 Sep 2021
Audio-to-Image Cross-Modal Generation
IEEE International Joint Conference on Neural Network (IJCNN), 2021
Maciej Żelaszczyk
Jacek Mańdziuk
DiffM
202
20
0
27 Sep 2021
VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual Question Answering
Conference on Computational Natural Language Learning (CoNLL), 2021
Ekta Sood
Fabian Kögel
Florian Strohm
Prajit Dhar
Andreas Bulling
172
21
0
27 Sep 2021
Why Do We Click: Visual Impression-aware News Recommendation
ACM Multimedia (ACM MM), 2021
Jiahao Xun
Shengyu Zhang
Zhou Zhao
Jieming Zhu
Tao Gui
Jingjie Li
Xiuqiang He
Xiaofei He
Tat-Seng Chua
Leilei Gan
238
39
0
26 Sep 2021
Systematic Generalization on gSCAN: What is Nearly Solved and What is Next?
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Linlu Qiu
Hexiang Hu
Bowen Zhang
Peter Shaw
Fei Sha
172
23
0
25 Sep 2021
MLIM: Vision-and-Language Model Pre-training with Masked Language and Image Modeling
Tarik Arici
M. S. Seyfioglu
T. Neiman
Yi Tian Xu
Son N. Tran
Trishul Chilimbi
Belinda Zeng
Ismail B. Tutar
VLM
121
16
0
24 Sep 2021
CLIPort: What and Where Pathways for Robotic Manipulation
Conference on Robot Learning (CoRL), 2021
Mohit Shridhar
Lucas Manuelli
Dieter Fox
LM&Ro
344
819
0
24 Sep 2021
Detecting Harmful Memes and Their Targets
Findings (Findings), 2021
Shraman Pramanick
Dimitar Dimitrov
Rituparna Mukherjee
Shivam Sharma
Md. Shad Akhtar
Preslav Nakov
Tanmoy Chakraborty
182
151
0
24 Sep 2021
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models
Yuan Yao
Ao Zhang
Zhengyan Zhang
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
MLLM, VPVLM, VLM
594
244
0
24 Sep 2021
Dense Contrastive Visual-Linguistic Pretraining
ACM Multimedia (ACM MM), 2021
Lei Shi
Kai Shuang
Shijie Geng
Shiyang Feng
Zuohui Fu
Gerard de Melo
Yunpeng Chen
Sen Su
VLM, SSL
240
12
0
24 Sep 2021
Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it?
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackBoxNLP), 2021
Tobias Norlund
Lovisa Hagström
Richard Johansson
275
25
0
23 Sep 2021
Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark
ACM Multimedia (ACM MM), 2021
Xun Gao
Yin Zhao
Jie Zhang
Longjun Cai
142
9
0
23 Sep 2021
Cross-Modal Coherence for Text-to-Image Retrieval
AAAI Conference on Artificial Intelligence (AAAI), 2021
Malihe Alikhani
Fangda Han
Hareesh Ravi
Mubbasir Kapadia
Vladimir Pavlovic
Matthew Stone
179
10
0
22 Sep 2021
Caption Enriched Samples for Improving Hateful Memes Detection
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Efrat Blaier
Itzik Malkiel
Lior Wolf
VLM
166
23
0
22 Sep 2021
COVR: A test-bed for Visually Grounded Compositional Generalization with real images
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Ben Bogin
Shivanshu Gupta
Matt Gardner
Jonathan Berant
CoGe
180
30
0
22 Sep 2021
KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation
Yongfei Liu
Chenfei Wu
Shao-Yen Tseng
Vasudev Lal
Xuming He
Nan Duan
CLIP, VLM
283
32
0
22 Sep 2021
Survey: Transformer based Video-Language Pre-training
Ludan Ruan
Qin Jin
VLM, ViT
210
50
0
21 Sep 2021
ActionCLIP: A New Paradigm for Video Action Recognition
Mengmeng Wang
Jiazheng Xing
Yong Liu
VLM
415
467
0
17 Sep 2021
An End-to-End Transformer Model for 3D Object Detection
Ishan Misra
Rohit Girdhar
Armand Joulin
3DPC, ViT
436
574
0
16 Sep 2021
A Survey on Temporal Sentence Grounding in Videos
Xiaohan Lan
Yitian Yuan
Xin Eric Wang
Zhi Wang
Wenwu Zhu
321
57
0
16 Sep 2021
Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering
Ander Salaberria
Gorka Azkune
Oier López de Lacalle
Aitor Soroa Etxabe
Eneko Agirre
301
73
0
15 Sep 2021
What Vision-Language Models 'See' when they See Scenes
Michele Cafagna
Kees van Deemter
Albert Gatt
VLM
264
13
0
15 Sep 2021
Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning
Da Yin
Liunian Harold Li
Ziniu Hu
Nanyun Peng
Kai-Wei Chang
300
65
0
14 Sep 2021
Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering
Jihyung Kil
Cheng Zhang
D. Xuan
Wei-Lun Chao
264
23
0
13 Sep 2021
xGQA: Cross-Lingual Visual Question Answering
Jonas Pfeiffer
Gregor Geigle
Aishwarya Kamath
Jan-Martin O. Steitz
Stefan Roth
Ivan Vulić
Iryna Gurevych
362
80
0
13 Sep 2021
TEASEL: A Transformer-Based Speech-Prefixed Language Model
Mehdi Arjmand
M. Dousti
H. Moradi
147
23
0
12 Sep 2021
COSMic: A Coherence-Aware Generation Metric for Image Descriptions
Mert Inan
P. Sharma
Baber Khalid
Radu Soricut
Matthew Stone
Malihe Alikhani
EGVM
156
14
0
11 Sep 2021
A Survey on Multi-modal Summarization
Anubhav Jangra
Sourajit Mukherjee
Adam Jatowt
S. Saha
M. Hasanuzzaman
206
79
0
11 Sep 2021
MOMENTA: A Multimodal Framework for Detecting Harmful Memes and Their Targets
Shraman Pramanick
Shivam Sharma
Dimitar Dimitrov
Md. Shad Akhtar
Preslav Nakov
Tanmoy Chakraborty
224
169
0
11 Sep 2021
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
AAAI Conference on Artificial Intelligence (AAAI), 2021
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Yumao Lu
Zicheng Liu
Lijuan Wang
611
489
0
10 Sep 2021
Panoptic Narrative Grounding
IEEE International Conference on Computer Vision (ICCV), 2021
Cristina González
Nicolás Ayobi
Isabela Hernández
José Hernández
Jordi Pont-Tuset
Pablo Arbeláez
258
28
0
10 Sep 2021
We went to look for meaning and all we got were these lousy representations: aspects of meaning representation for computational semantics
Simon Dobnik
R. Cooper
Adam Ek
Bill Noble
Staffan Larsson
N. Ilinykh
Vladislav Maraev
Vidya Somashekarappa
138
0
0
10 Sep 2021
Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
H. Khan
D. Gupta
Asif Ekbal
166
17
0
10 Sep 2021
Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Stella Frank
Emanuele Bugliarello
Desmond Elliott
184
94
0
09 Sep 2021
TxT: Crossmodal End-to-End Learning with Transformers
German Conference on Pattern Recognition (DAGM), 2021
Jan-Martin O. Steitz
Jonas Pfeiffer
Iryna Gurevych
Stefan Roth
LRM
130
2
0
09 Sep 2021
M5Product: Self-harmonized Contrastive Learning for E-commercial Multi-modal Pretraining
Computer Vision and Pattern Recognition (CVPR), 2021
Xiao Dong
Xunlin Zhan
Yangxin Wu
Yunchao Wei
Michael C. Kampffmeyer
Xiaoyong Wei
Minlong Lu
Yaowei Wang
Xiaodan Liang
589
46
0
09 Sep 2021
Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense in Text Generation Models
AAAI Conference on Artificial Intelligence (AAAI), 2021
Steven Y. Feng
Kevin Lu
Zhuofu Tao
Malihe Alikhani
Teruko Mitamura
Eduard H. Hovy
Varun Gangal
LRM
226
14
0
08 Sep 2021
Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Chenyu You
Polydoros Giannouris
Yuexian Zou
SSL
217
65
0
08 Sep 2021
Learning grounded word meaning representations on similarity graphs
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Mariella Dimiccoli
H. Wendt
Pau Batlle
158
1
0
07 Sep 2021
CTRL-C: Camera calibration TRansformer with Line-Classification
IEEE International Conference on Computer Vision (ICCV), 2021
Jinwoo Lee
Hyun-Young Go
Hyunjoon Lee
Sunghyun Cho
Minhyuk Sung
Junho Kim
ViT
208
46
0
06 Sep 2021
Learning to Generate Scene Graph from Natural Language Supervision
Yiwu Zhong
Jing Shi
Jianwei Yang
Chenliang Xu
Yin Li
SSL
263
85
0
06 Sep 2021
Data Efficient Masked Language Modeling for Vision and Language
Yonatan Bitton
Gabriel Stanovsky
Michael Elhadad
Roy Schwartz
VLM
235
21
0
05 Sep 2021
LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
Mohammad Abuzar Shaikh
Zhanghexuan Ji
Dana Moukheiber
Yan Shen
S. Srihari
Mingchen Gao
VLM
171
1
0
04 Sep 2021
Weakly Supervised Relative Spatial Reasoning for Visual Question Answering
Pratyay Banerjee
Tejas Gokhale
Yezhou Yang
Chitta Baral
LRM
163
19
0
04 Sep 2021
Supervised Contrastive Learning for Multimodal Unreliable News Detection in COVID-19 Pandemic
Wenjia Zhang
Lin Gui
Yulan He
141
43
0
04 Sep 2021