ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2108.10904
  4. Cited By
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

24 August 2021
Zirui Wang
Jiahui Yu
Adams Wei Yu
Zihang Dai
Yulia Tsvetkov
Yuan Cao
    VLM
    MLLM
ArXivPDFHTML

Papers citing "SimVLM: Simple Visual Language Model Pretraining with Weak Supervision"

50 / 565 papers shown
Title
EventBind: Learning a Unified Representation to Bind Them All for
  Event-based Open-world Understanding
EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding
Jiazhou Zhou
Xueye Zheng
Yuanhuiyi Lyu
Lin Wang
VLM
17
12
0
06 Aug 2023
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu
Zhengyuan Yang
Linjie Li
Jianfeng Wang
Kevin Qinghong Lin
Zicheng Liu
Xinchao Wang
Lijuan Wang
MLLM
16
599
0
04 Aug 2023
The All-Seeing Project: Towards Panoptic Visual Recognition and
  Understanding of the Open World
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Weiyun Wang
Min Shi
Qingyun Li
Wen Wang
Zhenhang Huang
...
Zhiguo Cao
Yushi Chen
Tong Lu
Jifeng Dai
Yu Qiao
LRM
MLLM
33
83
0
03 Aug 2023
Beyond Generic: Enhancing Image Captioning with Real-World Knowledge
  using Vision-Language Pre-Training Model
Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model
Ka Leong Cheng
Wenpo Song
Zheng Ma
Wenhao Zhu
Zi-Yue Zhu
Jianbing Zhang
CLIP
VLM
22
10
0
02 Aug 2023
CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image
  Classification
CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification
R. Abdelfattah
Qing-Wu Guo
Xiaoguang Li
Xiaofeng Wang
Song Wang
VLM
8
22
0
31 Jul 2023
Transferable Decoding with Visual Entities for Zero-Shot Image
  Captioning
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning
Junjie Fei
Teng Wang
Jinrui Zhang
Zhenyu He
Chengjie Wang
Feng Zheng
VLM
19
33
0
31 Jul 2023
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for
  Complex Visual Reasoning Tasks
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
Kousik Rajesh
Mrigank Raman
M. A. Karim
Pranit Chawla
VLM
23
2
0
31 Jul 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe
MLLM
21
42
0
30 Jul 2023
Cross-Modal Concept Learning and Inference for Vision-Language Models
Cross-Modal Concept Learning and Inference for Vision-Language Models
Yi Zhang
Ce Zhang
Yushun Tang
Z. He
VLM
MLLM
CLIP
15
15
0
28 Jul 2023
Exploring Annotation-free Image Captioning with Retrieval-augmented
  Pseudo Sentence Generation
Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation
Zhiyuan Li
Dongnan Liu
Heng Wang
Chaoyi Zhang
Weidong (Tom) Cai
RALM
20
0
0
27 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming Yang
F. Khan
VLM
13
116
0
25 Jul 2023
CLIP-KD: An Empirical Study of CLIP Model Distillation
CLIP-KD: An Empirical Study of CLIP Model Distillation
Chuanguang Yang
Zhulin An
Libo Huang
Junyu Bi
Xinqiang Yu
Hansheng Yang
Boyu Diao
Yongjun Xu
VLM
16
25
0
24 Jul 2023
Robust Visual Question Answering: Datasets, Methods, and Future
  Challenges
Robust Visual Question Answering: Datasets, Methods, and Future Challenges
Jie Ma
Pinghui Wang
Dechen Kong
Zewei Wang
Jun Liu
Hongbin Pei
Junzhou Zhao
OOD
13
18
0
21 Jul 2023
Meta-Transformer: A Unified Framework for Multimodal Learning
Meta-Transformer: A Unified Framework for Multimodal Learning
Yiyuan Zhang
Kaixiong Gong
Kaipeng Zhang
Hongsheng Li
Yu Qiao
Wanli Ouyang
Xiangyu Yue
13
135
0
20 Jul 2023
Improving Multimodal Datasets with Image Captioning
Improving Multimodal Datasets with Image Captioning
Thao Nguyen
S. Gadre
Gabriel Ilharco
Sewoong Oh
Ludwig Schmidt
VLM
17
69
0
19 Jul 2023
BUS:Efficient and Effective Vision-language Pre-training with Bottom-Up
  Patch Summarization
BUS:Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization
Chaoya Jiang
Haiyang Xu
Wei Ye
Qinghao Ye
Chenliang Li
Mingshi Yan
Bin Bi
Shikun Zhang
Fei Huang
Songfang Huang
VLM
13
9
0
17 Jul 2023
Bootstrapping Vision-Language Learning with Decoupled Language
  Pre-training
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Yiren Jian
Chongyang Gao
Soroush Vosoughi
VLM
MLLM
16
25
0
13 Jul 2023
mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
Gregor Geigle
Abhay Jain
Radu Timofte
Goran Glavavs
VLM
MLLM
11
29
0
13 Jul 2023
Emu: Generative Pretraining in Multimodality
Emu: Generative Pretraining in Multimodality
Quan-Sen Sun
Qiying Yu
Yufeng Cui
Fan Zhang
Xiaosong Zhang
Yueze Wang
Hongcheng Gao
Jingjing Liu
Tiejun Huang
Xinlong Wang
MLLM
27
126
0
11 Jul 2023
Self-Adaptive Sampling for Efficient Video Question-Answering on
  Image--Text Models
Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models
Wei Han
Hui Chen
MingSung Kan
Soujanya Poria
18
1
0
09 Jul 2023
Vision Language Transformers: A Survey
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
10
5
0
06 Jul 2023
Distilling Large Vision-Language Model with Out-of-Distribution
  Generalizability
Distilling Large Vision-Language Model with Out-of-Distribution Generalizability
Xuanlin Li
Yunhao Fang
Minghua Liu
Z. Ling
Z. Tu
Haoran Su
VLM
26
21
0
06 Jul 2023
T-MARS: Improving Visual Representations by Circumventing Text Feature
  Learning
T-MARS: Improving Visual Representations by Circumventing Text Feature Learning
Pratyush Maini
Sachin Goyal
Zachary Chase Lipton
J. Zico Kolter
Aditi Raghunathan
VLM
21
33
0
06 Jul 2023
MMNet: Multi-Collaboration and Multi-Supervision Network for Sequential
  Deepfake Detection
MMNet: Multi-Collaboration and Multi-Supervision Network for Sequential Deepfake Detection
Ruiyang Xia
Decheng Liu
Jie Li
Lin Yuan
N. Wang
Xinbo Gao
15
17
0
06 Jul 2023
Human Inspired Progressive Alignment and Comparative Learning for
  Grounded Word Acquisition
Human Inspired Progressive Alignment and Comparative Learning for Grounded Word Acquisition
Yuwei Bao
B. Lattimer
J. Chai
CLL
30
1
0
05 Jul 2023
ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models
ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models
Uddeshya Upadhyay
Shyamgopal Karthik
Massimiliano Mancini
Zeynep Akata
MLLM
VLM
11
3
0
01 Jul 2023
Answer Mining from a Pool of Images: Towards Retrieval-Based Visual
  Question Answering
Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering
A. S. Penamakuri
Manish Gupta
Mithun Das Gupta
Anand Mishra
23
7
0
29 Jun 2023
A Survey on Multimodal Large Language Models
A Survey on Multimodal Large Language Models
Shukang Yin
Chaoyou Fu
Sirui Zhao
Ke Li
Xing Sun
Tong Bill Xu
Enhong Chen
MLLM
LRM
33
551
0
23 Jun 2023
Improving Image Captioning Descriptiveness by Ranking and LLM-based
  Fusion
Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion
Simone Bianco
Luigi Celona
Marco Donzella
Paolo Napoletano
24
18
0
20 Jun 2023
Align, Adapt and Inject: Sound-guided Unified Image Generation
Align, Adapt and Inject: Sound-guided Unified Image Generation
Yue Yang
Kaipeng Zhang
Yuying Ge
Wenqi Shao
Zeyue Xue
Yu Qiao
Ping Luo
DiffM
16
5
0
20 Jun 2023
MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained
  Vision-Language Models
MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models
Yongzhu Miao
Shasha Li
Jintao Tang
Ting Wang
VLM
MLLM
VPVLM
12
2
0
20 Jun 2023
M3PT: A Multi-Modal Model for POI Tagging
M3PT: A Multi-Modal Model for POI Tagging
Jingsong Yang
Guanzhou Han
Deqing Yang
Jingping Liu
Yanghua Xiao
Xiang Xu
Baohua Wu
Shenghua Ni
40
3
0
16 Jun 2023
Text Promptable Surgical Instrument Segmentation with Vision-Language
  Models
Text Promptable Surgical Instrument Segmentation with Vision-Language Models
Zijian Zhou
Oluwatosin O. Alabi
Meng Wei
Tom Kamiel Magda Vercauteren
Miaojing Shi
MedIm
10
22
0
15 Jun 2023
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
Sihan Chen
Xingjian He
Handong Li
Xiaojie Jin
Jiashi Feng
J. Liu
VLM
CLIP
17
5
0
15 Jun 2023
Exploring the Application of Large-scale Pre-trained Models on Adverse
  Weather Removal
Exploring the Application of Large-scale Pre-trained Models on Adverse Weather Removal
Zhentao Tan
Yue-bo Wu
Qiankun Liu
Qi Chu
Le Lu
Jieping Ye
Nenghai Yu
27
11
0
15 Jun 2023
ZeroForge: Feedforward Text-to-Shape Without 3D Supervision
ZeroForge: Feedforward Text-to-Shape Without 3D Supervision
Kelly O. Marshall
Minh Pham
Ameya Joshi
Anushrut Jignasu
Aditya Balu
Adarsh Krishnamurthy
A. Hegde
CLIP
10
2
0
14 Jun 2023
Image Captioners Are Scalable Vision Learners Too
Image Captioners Are Scalable Vision Learners Too
Michael Tschannen
Manoj Kumar
Andreas Steiner
Xiaohua Zhai
N. Houlsby
Lucas Beyer
VLM
CLIP
9
53
0
13 Jun 2023
Visual Language Pretrained Multiple Instance Zero-Shot Transfer for
  Histopathology Images
Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images
Ming Y. Lu
Bowen Chen
Andrew Zhang
Drew F. K. Williamson
Richard J. Chen
Tong Ding
L. Le
Yung-Sung Chuang
Faisal Mahmood
VLM
MedIm
17
67
0
13 Jun 2023
Controlling Text-to-Image Diffusion by Orthogonal Finetuning
Controlling Text-to-Image Diffusion by Orthogonal Finetuning
Zeju Qiu
Wei-yu Liu
Haiwen Feng
Yuxuan Xue
Yao Feng
Zhen Liu
Dan Zhang
Adrian Weller
Bernhard Schölkopf
DiffM
30
134
0
12 Jun 2023
Scalable 3D Captioning with Pretrained Models
Scalable 3D Captioning with Pretrained Models
Tiange Luo
C. Rockwell
Honglak Lee
Justin Johnson
11
151
0
12 Jun 2023
Global and Local Semantic Completion Learning for Vision-Language
  Pre-training
Global and Local Semantic Completion Learning for Vision-Language Pre-training
Rong-Cheng Tu
Yatai Ji
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
10
1
0
12 Jun 2023
A Comprehensive Survey on Applications of Transformers for Deep Learning
  Tasks
A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks
Saidul Islam
Hanae Elmekki
Ahmed Elsebai
Jamal Bentahar
Najat Drawel
Gaith Rjoub
Witold Pedrycz
ViT
MedIm
8
167
0
11 Jun 2023
Embodied Executable Policy Learning with Language-based Scene
  Summarization
Embodied Executable Policy Learning with Language-based Scene Summarization
Jielin Qiu
Mengdi Xu
William Jongwon Han
Seungwhan Moon
Ding Zhao
LM&Ro
19
7
0
09 Jun 2023
Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA
  Tasks? A: Self-Train on Unlabeled Images!
Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
Zaid Khan
B. Vijaykumar
S. Schulter
Xiang Yu
Y. Fu
Manmohan Chandraker
VLM
MLLM
16
17
0
06 Jun 2023
Diversifying Joint Vision-Language Tokenization Learning
Diversifying Joint Vision-Language Tokenization Learning
Vardaan Pahuja
A. Piergiovanni
A. Angelova
16
0
0
06 Jun 2023
MetaVL: Transferring In-Context Learning Ability From Language Models to
  Vision-Language Models
MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models
Masoud Monajatipoor
Liunian Harold Li
Mozhdeh Rouhsedaghat
Lin F. Yang
Kai-Wei Chang
MLLM
LRM
11
9
0
02 Jun 2023
Vocabulary-free Image Classification
Vocabulary-free Image Classification
Alessandro Conti
Enrico Fini
Massimiliano Mancini
Paolo Rota
Yiming Wang
Elisa Ricci
VLM
26
22
0
01 Jun 2023
Adapting Pre-trained Language Models to Vision-Language Tasks via
  Dynamic Visual Prompting
Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting
Shubin Huang
Qiong Wu
Yiyi Zhou
Weijie Chen
Rongsheng Zhang
Xiaoshuai Sun
Rongrong Ji
VLM
VPVLM
LRM
16
0
0
01 Jun 2023
PV2TEA: Patching Visual Modality to Textual-Established Information
  Extraction
PV2TEA: Patching Visual Modality to Textual-Established Information Extraction
Hejie Cui
Rongmei Lin
Nasser Zalmout
Chenwei Zhang
Jingbo Shang
Carl Yang
Xian Li
VLM
8
2
0
01 Jun 2023
ManagerTower: Aggregating the Insights of Uni-Modal Experts for
  Vision-Language Representation Learning
ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
Xiao Xu
Bei Li
Chenfei Wu
Shao-Yen Tseng
Anahita Bhiwandiwalla
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
AIFin
VLM
21
2
0
31 May 2023
Previous
123456...101112
Next