ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2107.07651
  4. Cited By
Align before Fuse: Vision and Language Representation Learning with
  Momentum Distillation

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq R. Joty
Caiming Xiong
S. Hoi
    FaML
ArXivPDFHTML

Papers citing "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"

50 / 1,191 papers shown
Title
X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP
X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP
Hanxun Huang
Sarah Monazam Erfani
Yige Li
Xingjun Ma
James Bailey
AAML
34
0
0
08 May 2025
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
Xianhang Li
Y. Liu
Haoqin Tu
Hongru Zhu
Cihang Xie
VLM
55
0
0
07 May 2025
AS3D: 2D-Assisted Cross-Modal Understanding with Semantic-Spatial Scene Graphs for 3D Visual Grounding
AS3D: 2D-Assisted Cross-Modal Understanding with Semantic-Spatial Scene Graphs for 3D Visual Grounding
Feng Xiao
Hongbin Xu
Guocan Zhao
Wenxiong Kang
41
0
0
07 May 2025
Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection
Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection
SungHeon Jeong
Jihong Park
Mohsen Imani
43
0
0
05 May 2025
Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin
Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin
Yuchen Wang
X. Bai
X. Li
Weili Guan
Liqiang Nie
Xinyang Chen
VLM
42
0
0
04 May 2025
Compositional Image-Text Matching and Retrieval by Grounding Entities
Compositional Image-Text Matching and Retrieval by Grounding Entities
Madhukar Reddy Vongala
Saurabh Srivastava
Jana Kosecka
CLIP
CoGe
VLM
34
0
0
04 May 2025
Dual-Forecaster: A Multimodal Time Series Model Integrating Descriptive and Predictive Texts
Dual-Forecaster: A Multimodal Time Series Model Integrating Descriptive and Predictive Texts
Wenfa Wu
Guanyu Zhang
Zheng Tan
Yi Wang
Hongsheng Qi
AI4TS
35
1
0
02 May 2025
Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders
Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders
Andrei-Alexandru Manea
Jindřich Libovický
VLM
52
0
0
30 Apr 2025
Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation
Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation
Shahad Albastaki
Anabia Sohail
I. I. Ganapathi
B. Alawode
Asim Khan
Sajid Javed
N. Werghi
Mohammed Bennamoun
Arif Mahmood
66
0
0
26 Apr 2025
Multimodal graph representation learning for website generation based on visual sketch
Multimodal graph representation learning for website generation based on visual sketch
Tung D. Vu
Chung Hoang
Truong-Son Hy
3DV
48
0
0
25 Apr 2025
Memory Reviving, Continuing Learning and Beyond: Evaluation of Pre-trained Encoders and Decoders for Multimodal Machine Translation
Memory Reviving, Continuing Learning and Beyond: Evaluation of Pre-trained Encoders and Decoders for Multimodal Machine Translation
Zhuang Yu
Shiliang Sun
Jing Zhao
Tengfei Song
Hao-Yu Yang
48
0
0
25 Apr 2025
PhysioSync: Temporal and Cross-Modal Contrastive Learning Inspired by Physiological Synchronization for EEG-Based Emotion Recognition
PhysioSync: Temporal and Cross-Modal Contrastive Learning Inspired by Physiological Synchronization for EEG-Based Emotion Recognition
Kai Cui
J. Li
Y. Liu
Xuesong Zhang
Zhenzhen Hu
M. Wang
41
0
0
24 Apr 2025
Symbolic Representation for Any-to-Any Generative Tasks
Symbolic Representation for Any-to-Any Generative Tasks
J. Chen
Xiaoye Zhu
Y. Wang
Tianyang Liu
Xinhui Chen
...
Yifei Ke
J. Liu
Yiwen Yuan
Julian McAuley
Li Li
DiffM
36
0
0
24 Apr 2025
Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis
Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis
Zichuan Liu
Liming Jiang
Qing Yan
Yumin Jia
Hao Kang
Xin Lu
DiffM
29
0
0
19 Apr 2025
The Mirage of Performance Gains: Why Contrastive Decoding Fails to Address Multimodal Hallucination
The Mirage of Performance Gains: Why Contrastive Decoding Fails to Address Multimodal Hallucination
Hao Yin
Gunagzong Si
Zilei Wang
59
0
0
14 Apr 2025
GFT: Gradient Focal Transformer
GFT: Gradient Focal Transformer
Boris Kriuk
Simranjit Kaur Gill
Shoaib Aslam
Amir Fakhrutdinov
31
0
0
14 Apr 2025
UP-Person: Unified Parameter-Efficient Transfer Learning for Text-based Person Retrieval
UP-Person: Unified Parameter-Efficient Transfer Learning for Text-based Person Retrieval
Yating Liu
Yaowei Li
Xiangyuan Lan
Wenming Yang
Zimo Liu
Q. Liao
24
0
0
14 Apr 2025
Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation
Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation
Yongchao Feng
Yajie Liu
Shuai Yang
Wenrui Cai
J. Zhang
...
Jiahui Lv
Z. Liu
Tengyuan Shi
Qingjie Liu
Y. Wang
MLLM
VLM
55
1
0
13 Apr 2025
PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks
PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks
J. Wu
Hao Yang
Xinhua Zeng
Guibing He
Z. Chen
Z. Li
X. Zhang
Yangyang Ma
Run Fang
Yang Liu
LRM
58
0
0
12 Apr 2025
TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs
TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs
Zijian Zhang
Xuhui Zheng
X. Wu
Chong Peng
Xuezhi Cao
30
0
0
10 Apr 2025
Leveraging LLMs for Multimodal Retrieval-Augmented Radiology Report Generation via Key Phrase Extraction
Leveraging LLMs for Multimodal Retrieval-Augmented Radiology Report Generation via Key Phrase Extraction
Kyoyun Choi
Byungmu Yoon
Soobum Kim
Jonggwon Park
31
0
0
10 Apr 2025
Pose-Aware Weakly-Supervised Action Segmentation
Pose-Aware Weakly-Supervised Action Segmentation
Seth Z. Zhao
Reza Ghoddoosian
Isht Dwivedi
Nakul Agarwal
Behzad Dariush
26
0
0
08 Apr 2025
COST: Contrastive One-Stage Transformer for Vision-Language Small Object Tracking
COST: Contrastive One-Stage Transformer for Vision-Language Small Object Tracking
Chunhui Zhang
Li Liu
Jialin Gao
Xin Sun
Hao Wen
Xi Zhou
Shiming Ge
Y. Wang
33
0
0
02 Apr 2025
Spingarn's Method and Progressive Decoupling Beyond Elicitable Monotonicity
Spingarn's Method and Progressive Decoupling Beyond Elicitable Monotonicity
B. Evens
P. Latafat
Panagiotis Patrinos
46
0
0
01 Apr 2025
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
Jie Ma
Zhitao Gao
Qi Chai
J. Liu
P. Wang
Jing Tao
Zhou Su
45
0
0
01 Apr 2025
Enhancing Image Resolution of Solar Magnetograms: A Latent Diffusion Model Approach
Enhancing Image Resolution of Solar Magnetograms: A Latent Diffusion Model Approach
Francesco P. Ramunno
Paolo Massa
Vitaliy Kinakh
Brandon Panos
A. Csillaghy
S. Voloshynovskiy
DiffM
53
0
0
31 Mar 2025
ViLAaD: Enhancing "Attracting and Dispersing'' Source-Free Domain Adaptation with Vision-and-Language Model
ViLAaD: Enhancing "Attracting and Dispersing'' Source-Free Domain Adaptation with Vision-and-Language Model
Shuhei Tarashima
Xinqi Shu
Norio Tagawa
VLM
46
0
0
30 Mar 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
36
0
0
29 Mar 2025
Model Assembly Learning with Heterogeneous Layer Weight Merging
Model Assembly Learning with Heterogeneous Layer Weight Merging
Yi-Kai Zhang
Jin Wang
Xu-Xiang Zhong
De-Chuan Zhan
Han-Jia Ye
MoMe
47
0
0
27 Mar 2025
Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval
Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval
Haoqiang Lin
Haokun Wen
Xuemeng Song
Meng Liu
Yupeng Hu
Liqiang Nie
52
14
0
25 Mar 2025
MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering
MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering
Shuo Yang
Siwen Luo
S. Han
Eduard Hovy
LRM
34
0
0
24 Mar 2025
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
Gensheng Pei
Tao Chen
Yujia Wang
Xinhao Cai
Xiangbo Shu
Tianfei Zhou
Yazhou Yao
VLM
48
1
0
21 Mar 2025
Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models
Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models
Suho Yoo
Hyunjong Ok
Jaeho Lee
AuLLM
RALM
51
0
0
21 Mar 2025
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding
Zichen Liu
Kunlun Xu
Bing-Huang Su
Xu Zou
Yuxin Peng
Jiahuan Zhou
VLM
AI4TS
65
1
0
20 Mar 2025
Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement
Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement
Yuchen Ren
Zhengyu Zhao
Chenhao Lin
Bo Yang
Lu Zhou
Zhe Liu
Chao Shen
ViT
45
0
0
19 Mar 2025
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
Hao Yin
Guangzong Si
Zilei Wang
46
0
0
17 Mar 2025
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing
Zilun Zhang
Haozhan Shen
Tiancheng Zhao
Bin Chen
Zian Guan
Yuhao Wang
Xu Jia
Yuxiang Cai
Yongheng Shang
Jianwei Yin
52
0
0
16 Mar 2025
Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation
Yifan Xie
Binkai Ou
Fei Ma
Yaohua Liu
42
0
0
14 Mar 2025
The Power of One: A Single Example is All it Takes for Segmentation in VLMs
Mir Rayat Imtiaz Hossain
Mennatullah Siam
Leonid Sigal
James J. Little
MLLM
VLM
66
0
0
13 Mar 2025
ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning
Pengfei Luo
Jingbo Zhou
Tong Bill Xu
Yuan Xia
Linli Xu
Enhong Chen
LRM
62
0
0
13 Mar 2025
Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification
Jiayu Jiang
Changxing Ding
Wentao Tan
Junhong Wang
Jin Tao
Xiangmin Xu
49
1
0
13 Mar 2025
Debiased Prompt Tuning in Vision-Language Model without Annotations
Chaoquan Jiang
Yunfan Yang
Rui Hu
Jitao Sang
VLM
57
0
0
11 Mar 2025
Aligning Text to Image in Diffusion Models is Easier Than You Think
Aligning Text to Image in Diffusion Models is Easier Than You Think
J. Lee
Byunghee Cha
Jeongsol Kim
Jong Chul Ye
52
0
0
11 Mar 2025
Text-RGBT Person Retrieval: Multilevel Global-Local Cross-Modal Alignment and A High-quality Benchmark
Yifei Deng
Zhengyu Chen
Ziheng Xu
Chenglong Li
Jin Tang
37
0
0
11 Mar 2025
Is CLIP ideal? No. Can we fix it? Yes!
Raphi Kang
Yue Song
Georgia Gkioxari
Pietro Perona
VLM
53
0
0
10 Mar 2025
Anatomy-Aware Conditional Image-Text Retrieval
Meng Zheng
Jiajin Zhang
Benjamin Planche
Zhongpai Gao
Terrence Chen
Ziyan Wu
MedIm
52
0
0
10 Mar 2025
Federated Multimodal Learning with Dual Adapters and Selective Pruning for Communication and Computational Efficiency
Duy Phuong Nguyen
J. P. Muñoz
Tanya Roosta
Ali Jannesari
FedML
59
0
0
10 Mar 2025
CPAny: Couple With Any Encoder to Refer Multi-Object Tracking
Weize Li
Yunhao Du
Qixiang Yin
Zhicheng Zhao
Fei Su
Daqi Liu
59
0
0
10 Mar 2025
AA-CLIP: Enhancing Zero-shot Anomaly Detection via Anomaly-Aware CLIP
Wenxin Ma
Xu Zhang
Qingsong Yao
Fenghe Tang
Chenxu Wu
Y. Li
Rui Yan
Zihang Jiang
S. Kevin Zhou
VLM
57
0
0
09 Mar 2025
ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis
ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis
Xukun Zhou
Fengxin Li
Ming Chen
Yan Zhou
Pengfei Wan
Di Zhang
Yeying Jin
Zhaoxin Fan
Hongyan Liu
Jun He
DiffM
VGen
43
0
0
09 Mar 2025
1234...222324
Next