ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2310.01852
  4. Cited By
LanguageBind: Extending Video-Language Pretraining to N-modality by
  Language-based Semantic Alignment

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

3 October 2023
Bin Zhu
Bin Lin
Munan Ning
Yang Yan
Jiaxi Cui
HongFa Wang
Yatian Pang
Wenhao Jiang
Junwu Zhang
Zongwei Li
Wancai Zhang
Zhifeng Li
Wei Liu
Liejie Yuan
    VLM
    MLLM
ArXivPDFHTML

Papers citing "LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment"

50 / 156 papers shown
Title
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Ho Kei Cheng
Masato Ishii
Akio Hayakawa
Takashi Shibuya
A. Schwing
Yuki Mitsufuji
VGen
126
12
0
19 Dec 2024
Do Language Models Understand Time?
Do Language Models Understand Time?
Xi Ding
Lei Wang
167
0
0
18 Dec 2024
Gramian Multimodal Representation Learning and Alignment
Gramian Multimodal Representation Learning and Alignment
Giordano Cicchetti
Eleonora Grassucci
Luigi Sigillo
Danilo Comminiello
78
0
0
16 Dec 2024
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and
  Pruning
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
Yiwu Zhong
Zhuoming Liu
Yin Li
Liwei Wang
82
2
0
04 Dec 2024
Expanding Event Modality Applications through a Robust CLIP-Based Encoder
Expanding Event Modality Applications through a Robust CLIP-Based Encoder
SungHeon Jeong
Hanning Chen
Sanggeon Yun
Suhyeon Cho
Wenjun Huang
Xiangjian Liu
Mohsen Imani
98
1
0
04 Dec 2024
Eyes on the Road: State-of-the-Art Video Question Answering Models
  Assessment for Traffic Monitoring Tasks
Eyes on the Road: State-of-the-Art Video Question Answering Models Assessment for Traffic Monitoring Tasks
Joseph Raj Vishal
Divesh Basina
Aarya Choudhary
Bharatesh Chakravarthi
64
1
0
02 Dec 2024
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Shufan Li
Konstantinos Kallidromitis
Akash Gokul
Zichun Liao
Yusuke Kato
Kazuki Kozuka
Aditya Grover
VGen
90
5
0
02 Dec 2024
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
Yogesh Kulkarni
Pooyan Fazli
VLM
98
2
0
01 Dec 2024
Open-Sora Plan: Open-Source Large Video Generation Model
Bin Lin
Yunyang Ge
Xinhua Cheng
Zongjian Li
Bin Zhu
...
Zhang Pan
Xing Zhou
Shaoling Dong
Yonghong Tian
Li-xin Yuan
VLM
VGen
116
58
0
28 Nov 2024
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding
  with Superior Temporal Localization Ability
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
Shimin Chen
Xiaohan Lan
Yitian Yuan
Zequn Jie
Lin Ma
VLM
MLLM
71
13
0
27 Nov 2024
WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
Zongjian Li
Bin Lin
Yang Ye
Liuhan Chen
Xinhua Cheng
Shenghai Yuan
Li-xin Yuan
VGen
DiffM
104
16
0
26 Nov 2024
Video-Text Dataset Construction from Multi-AI Feedback: Promoting
  Weak-to-Strong Preference Learning for Video Large Language Models
Video-Text Dataset Construction from Multi-AI Feedback: Promoting Weak-to-Strong Preference Learning for Video Large Language Models
Hao Yi
Qingyang Li
Y. Hu
Fuzheng Zhang
Di Zhang
Yong Liu
VGen
67
0
0
25 Nov 2024
A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
Luis Vilaca
Yi Yu
Paula Vinan
70
0
0
24 Nov 2024
freePruner: A Training-free Approach for Large Multimodal Model
  Acceleration
freePruner: A Training-free Approach for Large Multimodal Model Acceleration
Bingxin Xu
Yuzhang Shang
Yunhao Ge
Qian Lou
Yan Yan
97
3
0
23 Nov 2024
ReWind: Understanding Long Videos with Instructed Learnable Memory
ReWind: Understanding Long Videos with Instructed Learnable Memory
Anxhelo Diko
Tinghuai Wang
Wassim Swaileh
Shiyan Sun
Ioannis Patras
KELM
VLM
77
0
0
23 Nov 2024
Tiny-Align: Bridging Automatic Speech Recognition and Large Language
  Model on the Edge
Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge
Ruiyang Qin
Dancheng Liu
Gelei Xu
Zheyu Yan
Chenhui Xu
Yuting Hu
Xiaolin Hu
Jinjun Xiong
Yiyu Shi
AuLLM
104
1
0
21 Nov 2024
Generative Emotion Cause Explanation in Multimodal Conversations
Generative Emotion Cause Explanation in Multimodal Conversations
Lin Wang
Xiaocui Yang
Shi Feng
Daling Wang
Yifei Zhang
34
0
0
01 Nov 2024
OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup
OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup
Xize Cheng
Siqi Zheng
Zehan Wang
Minghui Fang
Ziang Zhang
...
Z. Ma
Shengpeng Ji
Jialong Zuo
Tao Jin
Zhou Zhao
17
1
0
28 Oct 2024
Are Visual-Language Models Effective in Action Recognition? A
  Comparative Study
Are Visual-Language Models Effective in Action Recognition? A Comparative Study
Mahmoud Ali
Di Yang
François Brémond
VLM
49
0
0
22 Oct 2024
VidCompress: Memory-Enhanced Temporal Compression for Video
  Understanding in Large Language Models
VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
Xiaohan Lan
Yitian Yuan
Zequn Jie
Lin Ma
VLM
19
2
0
15 Oct 2024
MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval
MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval
Reno Kriz
Kate Sanders
David Etter
Kenton W. Murray
Cameron Carpenter
...
Alexander Martin
Ronald Colaianni
Nolan King
Eugene Yang
Benjamin Van Durme
VGen
33
2
0
15 Oct 2024
When Does Perceptual Alignment Benefit Vision Representations?
When Does Perceptual Alignment Benefit Vision Representations?
Shobhita Sundaram
Stephanie Fu
Lukas Muttenthaler
Netanel Y. Tamir
Lucy Chai
Simon Kornblith
Trevor Darrell
Phillip Isola
49
6
1
14 Oct 2024
Deep Correlated Prompting for Visual Recognition with Missing Modalities
Deep Correlated Prompting for Visual Recognition with Missing Modalities
Lianyu Hu
Tongkai Shi
Wei Feng
Fanhua Shang
Liang Wan
VLM
29
1
0
09 Oct 2024
Human-in-the-loop Reasoning For Traffic Sign Detection: Collaborative Approach Yolo With Video-llava
Human-in-the-loop Reasoning For Traffic Sign Detection: Collaborative Approach Yolo With Video-llava
Mehdi Azarafza
Fatima Idrees
Ali Ehteshami Bejnordi
Charles Steinmetz
Stefan Henkler
A. Rettberg
23
0
0
07 Oct 2024
Geometric Analysis of Reasoning Trajectories: A Phase Space Approach to Understanding Valid and Invalid Multi-Hop Reasoning in LLMs
Geometric Analysis of Reasoning Trajectories: A Phase Space Approach to Understanding Valid and Invalid Multi-Hop Reasoning in LLMs
Javier Marin
LRM
72
0
0
06 Oct 2024
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short
  Videos
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos
Jianrui Zhang
Mu Cai
Yong Jae Lee
32
6
0
03 Oct 2024
Video Instruction Tuning With Synthetic Data
Video Instruction Tuning With Synthetic Data
Yuanhan Zhang
Jinming Wu
Wei Li
Bo Li
Zejun Ma
Ziwei Liu
Chunyuan Li
SyDa
VGen
39
136
0
03 Oct 2024
Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations
Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations
Minoh Jeong
Min Namgung
Zae Myung Kim
Dongyeop Kang
Yao-Yi Chiang
Alfred Hero
23
0
0
02 Oct 2024
Efficient Backdoor Defense in Multimodal Contrastive Learning: A
  Token-Level Unlearning Method for Mitigating Threats
Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats
Kuanrong Liu
Siyuan Liang
Jiawei Liang
Pengwen Dai
Xiaochun Cao
MU
AAML
31
1
0
29 Sep 2024
From Vision to Audio and Beyond: A Unified Model for Audio-Visual
  Representation and Generation
From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation
Kun Su
Xiulong Liu
Eli Shlizerman
VGen
28
6
0
27 Sep 2024
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
Ye Liu
Zongyang Ma
Zhongang Qi
Yang Wu
Ying Shan
Chang Wen Chen
31
16
0
26 Sep 2024
CLSP: High-Fidelity Contrastive Language-State Pre-training for Agent
  State Representation
CLSP: High-Fidelity Contrastive Language-State Pre-training for Agent State Representation
Fuxian Huang
Qi Zhang
Shaopeng Zhai
Jie Wang
Tianyi Zhang
Haoran Zhang
Ming Zhou
Yu Liu
Yu Qiao
CLIP
AI4TS
34
0
0
24 Sep 2024
Representation Loss Minimization with Randomized Selection Strategy for
  Efficient Environmental Fake Audio Detection
Representation Loss Minimization with Randomized Selection Strategy for Efficient Environmental Fake Audio Detection
Orchid Chetia Phukan
Girish
Mohd Mujtaba Akhtar
Swarup Ranjan Behera
Nitin Choudhury
Arun Balaji Buduru
Rajesh Sharma
S. R Mahadeva Prasanna
18
0
0
24 Sep 2024
Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation
  Models with Optimal Transport for Non-Verbal Emotion Recognition
Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition
Orchid Chetia Phukan
Mohd Mujtaba Akhtar
Girish
Swarup Ranjan Behera
Sishir Kalita
Arun Balaji Buduru
Rajesh Sharma
S. R Mahadeva Prasanna
EgoV
26
0
0
21 Sep 2024
Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free
  Manner
Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner
Yuzhang Shang
Bingxin Xu
Weitai Kang
Mu Cai
Yuheng Li
Zehao Wen
Zhen Dong
Kurt Keutzer
Yong Jae Lee
Yan Yan
33
7
0
19 Sep 2024
Designing Interfaces for Multimodal Vector Search Applications
Designing Interfaces for Multimodal Vector Search Applications
Owen Pendrigh Elliott
Tom Hamer
Jesse Clark
46
0
0
18 Sep 2024
One missing piece in Vision and Language: A Survey on Comics Understanding
One missing piece in Vision and Language: A Survey on Comics Understanding
Emanuele Vivoli
Andrey Barsky
Mohamed Ali Souibgui
Artemis LLabres
Marco Bertini
Dimosthenis Karatzas
34
3
0
14 Sep 2024
Multimodal Fusion with LLMs for Engagement Prediction in Natural
  Conversation
Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation
Cheng Charles Ma
Kevin Hyekang Joo
Alexandria K. Vail
Sunreeta Bhattacharya
Álvaro Fernández García
Kailana Baker-Matsuoka
Sheryl Mathew
Lori L. Holt
Fernando De la Torre
47
3
0
13 Sep 2024
Unleashing the Temporal-Spatial Reasoning Capacity of GPT for
  Training-Free Audio and Language Referenced Video Object Segmentation
Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation
Shaofei Huang
Rui Ling
Hongyu Li
Tianrui Hui
Zongheng Tang
Xiaoming Wei
Jizhong Han
Si Liu
VOS
29
4
0
28 Aug 2024
EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction
  Tuning
EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning
Bohao Xing
Zitong Yu
Xin Liu
Kaishen Yuan
Qilang Ye
Weicheng Xie
Huanjing Yue
Jingyu Yang
H. Kalviainen
48
10
0
21 Aug 2024
VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for
  Speech Processing
VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing
Chunyu Qiang
Wang Geng
Yi Zhao
Ruibo Fu
Tao Wang
...
Chen Zhang
Hao Che
Longbiao Wang
Jianwu Dang
Jianhua Tao
AI4TS
36
0
0
11 Aug 2024
VideoQA in the Era of LLMs: An Empirical Study
VideoQA in the Era of LLMs: An Empirical Study
Junbin Xiao
Nanxin Huang
Hangyu Qin
Dongyang Li
Yicong Li
...
Zhulin Tao
Jianxing Yu
Liang Lin
Tat-Seng Chua
Angela Yao
23
10
0
08 Aug 2024
Leveraging Foundation Models for Zero-Shot IoT Sensing
Leveraging Foundation Models for Zero-Shot IoT Sensing
Dinghao Xue
Xiaoran Fan
Tao Chen
Guohao Lan
Qun Song
AI4CE
33
0
0
29 Jul 2024
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video
  Retrieval
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval
Thomas Hummel
Shyamgopal Karthik
Mariana-Iuliana Georgescu
Zeynep Akata
EgoV
34
4
0
23 Jul 2024
Rethinking Video-Text Understanding: Retrieval from Counterfactually
  Augmented Data
Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data
Wufei Ma
Kai Li
Zhongshi Jiang
Moustafa Meshry
Qihao Liu
Huiyu Wang
Christian Hane
Alan L. Yuille
VGen
24
1
0
18 Jul 2024
MERLIN: Multimodal Embedding Refinement via LLM-based Iterative
  Navigation for Text-Video Retrieval-Rerank Pipeline
MERLIN: Multimodal Embedding Refinement via LLM-based Iterative Navigation for Text-Video Retrieval-Rerank Pipeline
D. Han
Eunhwan Park
Gisang Lee
Adam Lee
Nojun Kwak
35
2
0
17 Jul 2024
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
Zehan Wang
Ziang Zhang
Hang Zhang
Luping Liu
Rongjie Huang
Xize Cheng
Hengshuang Zhao
Zhou Zhao
30
9
0
16 Jul 2024
Learning Modality-agnostic Representation for Semantic Segmentation from
  Any Modalities
Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities
Xueye Zheng
Yuanhuiyi Lyu
Lin Wang
VLM
47
10
0
16 Jul 2024
When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model
  and Benchmark Dataset
When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset
Yi Zhang
Wang Zeng
Sheng Jin
Chao Qian
Ping Luo
Wentao Liu
32
4
0
14 Jul 2024
Video-to-Audio Generation with Hidden Alignment
Video-to-Audio Generation with Hidden Alignment
Manjie Xu
Chenxing Li
Yong Ren
Rilin Chen
Yu Gu
Yu Gu
Dong Yu
Dong Yu
DiffM
VGen
43
11
0
10 Jul 2024
Previous
1234
Next