ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2102.03334
  4. Cited By
ViLT: Vision-and-Language Transformer Without Convolution or Region
  Supervision

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

5 February 2021
Wonjae Kim
Bokyung Son
Ildoo Kim
    VLM
    CLIP
ArXivPDFHTML

Papers citing "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"

50 / 254 papers shown
Title
Decoupled Multimodal Prototypes for Visual Recognition with Missing Modalities
Decoupled Multimodal Prototypes for Visual Recognition with Missing Modalities
Jueqing Lu
Yuanyuan Qi
Xiaohao Yang
Shujie Zhou
Lan Du
26
0
0
13 May 2025
PREMISE: Matching-based Prediction for Accurate Review Recommendation
PREMISE: Matching-based Prediction for Accurate Review Recommendation
Wei Han
Hui Chen
Soujanya Poria
29
0
0
02 May 2025
MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation
MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation
Amaan Izhar
Nurul Japar
Norisma Idris
Ting Dang
MoE
64
0
0
29 Apr 2025
SAMBLE: Shape-Specific Point Cloud Sampling for an Optimal Trade-Off Between Local Detail and Global Uniformity
SAMBLE: Shape-Specific Point Cloud Sampling for an Optimal Trade-Off Between Local Detail and Global Uniformity
Chengzhi Wu
Yuxin Wan
Hao Fu
Julius Pfrommer
Zeyun Zhong
Junwei Zheng
Jiaming Zhang
Jürgen Beyerer
3DPC
54
0
0
28 Apr 2025
Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning
Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning
Yassir Benhammou
Alessandro Tiberio
Gabriel Trautmann
Suman Kalyan
MLLM
VLM
36
0
0
21 Apr 2025
VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction
VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction
Zizhi Chen
Minghao Han
Xukun Zhang
Shuwei Ma
Tao Liu
Xing Wei
L. Zhang
44
0
0
25 Mar 2025
NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval
Zengrong Lin
Zheng Wang
Tianwen Qian
Pan Mu
Sixian Chan
Cong Bai
42
0
0
13 Mar 2025
Visual Adaptive Prompting for Compositional Zero-Shot Learning
Visual Adaptive Prompting for Compositional Zero-Shot Learning
Kyle Stein
A. Mahyari
Guillermo A. Francia
Eman El-Sheikh
VLM
CoGe
138
1
0
27 Feb 2025
Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions
Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions
Prajwal Gatti
Kshitij Parikh
Dhriti Prasanna Paul
Manish Gupta
Anand Mishra
110
2
0
12 Feb 2025
Vision-Language Models for Edge Networks: A Comprehensive Survey
Vision-Language Models for Edge Networks: A Comprehensive Survey
Ahmed Sharshar
Latif U. Khan
Waseem Ullah
Mohsen Guizani
VLM
62
3
0
11 Feb 2025
Multi-Modality Transformer for E-Commerce: Inferring User Purchase Intention to Bridge the Query-Product Gap
Srivatsa Mallapragada
Ying Xie
Varsha Rani Chawan
Zeyad Hailat
Yuanbo Wang
36
0
0
28 Jan 2025
SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing
SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing
Varun Biyyala
Bharat Chanderprakash Kathuria
Jialu Li
Youshan Zhang
50
0
0
13 Jan 2025
Multimodal semantic retrieval for product search
Multimodal semantic retrieval for product search
Dong Liu
Esther Lopez Ramos
41
0
0
13 Jan 2025
Where am I? Cross-View Geo-localization with Natural Language Descriptions
Where am I? Cross-View Geo-localization with Natural Language Descriptions
Junyan Ye
Honglin Lin
Leyan Ou
Dairong Chen
Zihao Wang
Conghui He
Weijia Li
Weijia Li
76
0
0
22 Dec 2024
Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization
Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization
Yue Zhang
Liqiang Jing
Vibhav Gogate
116
2
0
19 Dec 2024
Beyond Accuracy: On the Effects of Fine-tuning Towards Vision-Language Model's Prediction Rationality
Beyond Accuracy: On the Effects of Fine-tuning Towards Vision-Language Model's Prediction Rationality
Qitong Wang
Tang Li
Kien X. Nguyen
Xi Peng
70
0
0
17 Dec 2024
Learning to Reason Iteratively and Parallelly for Complex Visual
  Reasoning Scenarios
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
Shantanu Jaiswal
Debaditya Roy
Basura Fernando
Cheston Tan
ReLM
LRM
71
2
0
20 Nov 2024
ViTOC: Vision Transformer and Object-aware Captioner
ViTOC: Vision Transformer and Object-aware Captioner
Feiyang Huang
25
0
0
09 Nov 2024
Multiple Information Prompt Learning for Cloth-Changing Person Re-Identification
Multiple Information Prompt Learning for Cloth-Changing Person Re-Identification
Shengxun Wei
Zan Gao
Yibo Zhao
Weili Guan
Weili Guan
Shengyong Chen
46
1
0
01 Nov 2024
Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map
Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map
Xinyuan Chang
Maixuan Xue
Xinran Liu
Zheng Pan
Xing Wei
40
1
0
31 Oct 2024
ATLAS: Adapter-Based Multi-Modal Continual Learning with a Two-Stage
  Learning Strategy
ATLAS: Adapter-Based Multi-Modal Continual Learning with a Two-Stage Learning Strategy
Hong Li
Zhiquan Tan
Xingyu Li
Weiran Huang
CLL
MoMe
26
1
0
14 Oct 2024
Deep Correlated Prompting for Visual Recognition with Missing Modalities
Deep Correlated Prompting for Visual Recognition with Missing Modalities
Lianyu Hu
Tongkai Shi
Wei Feng
Fanhua Shang
Liang Wan
VLM
29
1
0
09 Oct 2024
Recent Advances of Multimodal Continual Learning: A Comprehensive Survey
Recent Advances of Multimodal Continual Learning: A Comprehensive Survey
Dianzhi Yu
Xinni Zhang
Yankai Chen
Aiwei Liu
Yifei Zhang
Philip S. Yu
Irwin King
VLM
CLL
39
9
0
07 Oct 2024
VISTA: A Visual and Textual Attention Dataset for Interpreting
  Multimodal Models
VISTA: A Visual and Textual Attention Dataset for Interpreting Multimodal Models
Harshit
Tolga Tasdizen
CoGe
VLM
28
1
0
06 Oct 2024
Generalizable Prompt Tuning for Vision-Language Models
Generalizable Prompt Tuning for Vision-Language Models
Qian Zhang
VLM
VPVLM
45
0
0
04 Oct 2024
Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity
Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity
Hanqi Jiang
Xixuan Hao
Yuzhou Huang
Chong Ma
Jiaxun Zhang
Yi Pan
Ruimao Zhang
MedIm
35
0
0
01 Oct 2024
MIO: A Foundation Model on Multimodal Tokens
MIO: A Foundation Model on Multimodal Tokens
Zekun Wang
King Zhu
Chunpu Xu
Wangchunshu Zhou
Jiaheng Liu
...
Yuanxing Zhang
Ge Zhang
Ke Xu
Jie Fu
Wenhao Huang
MLLM
AuLLM
51
11
0
26 Sep 2024
The Roles of Generative Artificial Intelligence in Internet of Electric
  Vehicles
The Roles of Generative Artificial Intelligence in Internet of Electric Vehicles
Hanwen Zhang
Dusit Niyato
Wei Zhang
Changyuan Zhao
Hongyang Du
Abbas Jamalipour
Sumei Sun
Yiyang Pei
AI4CE
37
2
0
24 Sep 2024
Embodiment-Agnostic Action Planning via Object-Part Scene Flow
Embodiment-Agnostic Action Planning via Object-Part Scene Flow
Weiliang Tang
Jia-Hui Pan
Wei Zhan
Jianshu Zhou
Huaxiu Yao
Yun-Hui Liu
M. Tomizuka
Mingyu Ding
Chi-Wing Fu
41
0
0
16 Sep 2024
MaskMol: Knowledge-guided Molecular Image Pre-Training Framework for
  Activity Cliffs
MaskMol: Knowledge-guided Molecular Image Pre-Training Framework for Activity Cliffs
Zhixiang Cheng
Hongxin Xiang
Pengsen Ma
Li Zeng
Xin Jin
...
Yang Deng
Bosheng Song
Xinxin Feng
Changhui Deng
Xiangxiang Zeng
24
0
0
02 Sep 2024
End-to-end Semantic-centric Video-based Multimodal Affective Computing
End-to-end Semantic-centric Video-based Multimodal Affective Computing
Ronghao Lin
Ying Zeng
Sijie Mai
Haifeng Hu
VGen
40
0
0
14 Aug 2024
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Yuxuan Zhang
Tianheng Cheng
Lianghui Zhu
Lei Liu
Heng Liu
Longjin Ran
Xiaoxin Chen
Xiaoxin Chen
Wenyu Liu
Xinggang Wang
VLM
51
24
0
28 Jun 2024
ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for
  Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback
ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback
Ju-Seung Byun
Jiyun Chun
Jihyung Kil
Andrew Perrault
ReLM
LRM
34
1
0
25 Jun 2024
MU-Bench: A Multitask Multimodal Benchmark for Machine Unlearning
MU-Bench: A Multitask Multimodal Benchmark for Machine Unlearning
Jiali Cheng
Hadi Amiri
BDL
33
3
0
21 Jun 2024
Benchmarking Vision-Language Contrastive Methods for Medical
  Representation Learning
Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning
Shuvendu Roy
Yasaman Parhizkar
Franklin Ogidi
Vahid Reza Khazaie
Michael Colacci
Ali Etemad
Elham Dolatabadi
Arash Afkanpour
VLM
40
1
0
11 Jun 2024
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang
Jiawei Kong
Wenbo Yu
Bin Chen
Jiawei Li
Hao Wu
Ke Xu
Ke Xu
AAML
VLM
38
13
0
08 Jun 2024
Enhancing Large Vision Language Models with Self-Training on Image
  Comprehension
Enhancing Large Vision Language Models with Self-Training on Image Comprehension
Yihe Deng
Pan Lu
Fan Yin
Ziniu Hu
Sheng Shen
James Y. Zou
Kai-Wei Chang
Wei Wang
SyDa
VLM
LRM
36
36
0
30 May 2024
Multi-Modal Generative Embedding Model
Multi-Modal Generative Embedding Model
Feipeng Ma
Hongwei Xue
Guangting Wang
Yizhou Zhou
Fengyun Rao
Shilin Yan
Yueyi Zhang
Siying Wu
Mike Zheng Shou
Xiaoyan Sun
VLM
26
3
0
29 May 2024
MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification
MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification
Laura Fieback
Jakob Spiegelberg
Hanno Gottschalk
MLLM
57
5
0
29 May 2024
Accelerating Transformers with Spectrum-Preserving Token Merging
Accelerating Transformers with Spectrum-Preserving Token Merging
Hoai-Chau Tran
D. M. Nguyen
Duy M. Nguyen
Trung Thanh Nguyen
Ngan Le
Pengtao Xie
Daniel Sonntag
James Y. Zou
Binh T. Nguyen
Mathias Niepert
32
8
0
25 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
67
41
0
23 May 2024
Optimizing Universal Lesion Segmentation: State Space Model-Guided
  Hierarchical Networks with Feature Importance Adjustment
Optimizing Universal Lesion Segmentation: State Space Model-Guided Hierarchical Networks with Feature Importance Adjustment
Kazi Shahriar Sanjid
Md. Tanzim Hossain
Md. Shakib Shahariar Junayed
M. M. Uddin
Mamba
35
2
0
26 Apr 2024
Closed Loop Interactive Embodied Reasoning for Robot Manipulation
Closed Loop Interactive Embodied Reasoning for Robot Manipulation
Michal Nazarczuk
Jan Kristof Behrens
Karla Stepanova
Matej Hoffmann
K. Mikolajczyk
LM&Ro
LRM
36
1
0
23 Apr 2024
PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based
  Visual Question Answering
PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering
Yihao Ding
Kaixuan Ren
Jiabin Huang
Siwen Luo
S. Han
35
1
0
19 Apr 2024
Improving Continuous Sign Language Recognition with Adapted Image Models
Improving Continuous Sign Language Recognition with Adapted Image Models
Lianyu Hu
Tongkai Shi
Liqing Gao
Zekang Liu
Wei Feng
VLM
18
5
0
12 Apr 2024
VideoDistill: Language-aware Vision Distillation for Video Question
  Answering
VideoDistill: Language-aware Vision Distillation for Video Question Answering
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
VGen
42
1
0
01 Apr 2024
MIST: Mitigating Intersectional Bias with Disentangled Cross-Attention
  Editing in Text-to-Image Diffusion Models
MIST: Mitigating Intersectional Bias with Disentangled Cross-Attention Editing in Text-to-Image Diffusion Models
Hidir Yesiltepe
Kiymet Akdemir
Pinar Yanardag
29
3
0
28 Mar 2024
Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding
Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding
Jingjing Hu
Dan Guo
Kun Li
Zhan Si
Xun Yang
Xiaojun Chang
Meng Wang
59
3
0
21 Mar 2024
Grounding Spatial Relations in Text-Only Language Models
Grounding Spatial Relations in Text-Only Language Models
Gorka Azkune
Ander Salaberria
Eneko Agirre
34
0
0
20 Mar 2024
MeaCap: Memory-Augmented Zero-shot Image Captioning
MeaCap: Memory-Augmented Zero-shot Image Captioning
Zequn Zeng
Yan Xie
Hao Zhang
Chiyu Chen
Zhengjue Wang
Boli Chen
VLM
25
14
0
06 Mar 2024
123456
Next