
arXiv:2310.01852

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

3 October 2023
Bin Zhu
Bin Lin
Munan Ning
Yang Yan
Jiaxi Cui
HongFa Wang
Yatian Pang
Wenhao Jiang
Junwu Zhang
Zongwei Li
Wancai Zhang
Zhifeng Li
Wei Liu
Liejie Yuan
    VLM
    MLLM

Papers citing "LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment"

50 / 156 papers shown
RoboCAS: A Benchmark for Robotic Manipulation in Complex Object Arrangement Scenarios
Liming Zheng
Feng Yan
Fanfan Liu
Chengjian Feng
Zhuoliang Kang
Lin Ma
38
2
0
09 Jul 2024
KeyVideoLLM: Towards Large-scale Video Keyframe Selection
Hao Liang
Jiapeng Li
Tianyi Bai
Xijie Huang
Linzhuang Sun
Zhengren Wang
Conghui He
Bin Cui
Chong Chen
Wentao Zhang
VGen
27
7
0
03 Jul 2024
Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval
Nicola Messina
J. Sedmidubský
Fabrizio Falchi
Tomáš Rebok
33
0
0
02 Jul 2024
ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos
Jr-Jen Chen
Yu-Chien Liao
Hsi-Che Lin
Yu-Chu Yu
Yen-Chun Chen
Yu-Chiang Frank Wang
32
10
0
27 Jun 2024
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
Hao Fei
Shengqiong Wu
Meishan Zhang
M. Zhang
Tat-Seng Chua
Shuicheng Yan
AI4TS
34
37
0
27 Jun 2024
OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer
Lu Zhang
Tiancheng Zhao
Heting Ying
Yibo Ma
Kyusong Lee
LLMAG
24
9
0
24 Jun 2024
Directed Domain Fine-Tuning: Tailoring Separate Modalities for Specific Training Tasks
Daniel Wen
Nafisa Hussain
90
0
0
24 Jun 2024
Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video
Zhengbang Yang
Haotian Xia
Jingxi Li
Zezhi Chen
Zhuangdi Zhu
Weining Shen
ELM
LRM
35
1
0
21 Jun 2024
Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM
Huaxin Zhang
Xiaohao Xu
Xiang Wang
Jialong Zuo
Chuchu Han
Xiaonan Huang
Changxin Gao
Yuehuan Wang
Nong Sang
45
16
0
18 Jun 2024
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
Yujie Lu
Dongfu Jiang
Wenhu Chen
William Yang Wang
Yejin Choi
Bill Yuchen Lin
VLM
43
26
0
16 Jun 2024
Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model
Lu Xu
Sijie Zhu
Chunyuan Li
Chia-Wen Kuo
Fan Chen
Xinyao Wang
Guang Chen
Dawei Du
Ye Yuan
Longyin Wen
30
4
0
15 Jun 2024
AVR: Synergizing Foundation Models for Audio-Visual Humor Detection
Sarthak Sharma
Orchid Chetia Phukan
Drishti Singh
Arun Balaji Buduru
Rajesh Sharma
31
0
0
15 Jun 2024
RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection
Liting Huang
Zhihao Zhang
Yiran Zhang
Xiyue Zhou
Shoujin Wang
NoLa
38
2
0
07 Jun 2024
Touch100k: A Large-Scale Touch-Language-Vision Dataset for Touch-Centric Multimodal Representation
Ning Cheng
Changhao Guan
Jing Gao
Weihao Wang
You Li
Fandong Meng
Jie Zhou
Bin Fang
Jinan Xu
Wenjuan Han
VLM
23
7
0
06 Jun 2024
Wings: Learning Multimodal LLMs without Text-only Forgetting
Yi-Kai Zhang
Shiyin Lu
Yang Li
Yanqing Ma
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
De-Chuan Zhan
Han-Jia Ye
VLM
33
6
0
05 Jun 2024
Artemis: Towards Referential Understanding in Complex Videos
Jihao Qiu
Yuan Zhang
Xi Tang
Lingxi Xie
Tianren Ma
Pengyu Yan
David Doermann
Qixiang Ye
Yunjie Tian
VLM
VGen
37
8
0
01 Jun 2024
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Ling-Hao Chen
Shunlin Lu
Ailing Zeng
Hao Zhang
Benyou Wang
Ruimao Zhang
Lei Zhang
45
34
0
30 May 2024
"Pass the butter": A study on desktop-classic multitasking robotic arm based on advanced YOLOv7 and BERT
Haohua Que
Wenbin Pan
Jie Xu
Hao Luo
Pei Wang
Li Zhang
30
1
0
27 May 2024
Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character
Siyuan Ma
Weidi Luo
Yu Wang
Xiaogeng Liu
33
20
0
25 May 2024
OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All
Yuanhuiyi Lyu
Xueye Zheng
Dahun Kim
Lin Wang
32
10
0
25 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
67
41
0
23 May 2024
Imp: Highly Capable Large Multimodal Models for Mobile Devices
Zhenwei Shao
Zhou Yu
Jun Yu
Xuecheng Ouyang
Lihao Zheng
Zhenbiao Gai
Mingyang Wang
Jiajun Ding
21
10
0
20 May 2024
Efficient Multimodal Large Language Models: A Survey
Yizhang Jin
Jian Li
Yexin Liu
Tianjun Gu
Kai Wu
...
Xin Tan
Zhenye Gan
Yabiao Wang
Chengjie Wang
Lizhuang Ma
LRM
39
45
0
17 May 2024
Natural Language Can Help Bridge the Sim2Real Gap
Albert Yu
Adeline Foote
Raymond J. Mooney
Roberto Martín-Martín
LM&Ro
30
11
0
16 May 2024
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
Zehan Wang
Ziang Zhang
Xize Cheng
Rongjie Huang
Luping Liu
...
Haifeng Huang
Yang Zhao
Tao Jin
Peng Gao
Zhou Zhao
18
8
0
08 May 2024
WorldGPT: Empowering LLM as Multimodal World Model
Zhiqi Ge
Hongzhe Huang
Mingze Zhou
Juncheng Li
Guoming Wang
Siliang Tang
Yueting Zhuang
35
26
0
28 Apr 2024
UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark
Zhaokun Zhou
Qiulin Wang
Bin Lin
Yiwei Su
R. J. Chen
Xin Tao
Amin Zheng
Li-xin Yuan
Pengfei Wan
Di Zhang
19
6
0
15 Apr 2024
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
Shenghai Yuan
Jinfa Huang
Yujun Shi
Yongqi Xu
Ruijie Zhu
Bin Lin
Xinhua Cheng
Li-xin Yuan
Jiebo Luo
VGen
73
33
0
07 Apr 2024
WorDepth: Variational Language Prior for Monocular Depth Estimation
Ziyao Zeng
Daniel Wang
Fengyu Yang
Hyoungseob Park
Yangchao Wu
Stefano Soatto
Byung-Woo Hong
Dong Lao
Alex Wong
MDE
38
26
0
04 Apr 2024
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
Ruohong Zhang
Liangke Gui
Zhiqing Sun
Yihao Feng
Keyang Xu
...
Di Fu
Chunyuan Li
Alexander G. Hauptmann
Yonatan Bisk
Yiming Yang
MLLM
43
57
0
01 Apr 2024
Open-Set Recognition in the Age of Vision-Language Models
Dimity Miller
Niko Sünderhauf
Alex Kenna
Keita Mason
VLM
30
3
0
25 Mar 2024
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Yi Wang
Kunchang Li
Xinhao Li
Jiashuo Yu
Yinan He
...
Hongjie Zhang
Yifei Huang
Yu Qiao
Yali Wang
Limin Wang
27
44
0
22 Mar 2024
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
Yaowei Zheng
Richong Zhang
Junhao Zhang
Yanhan Ye
Zheyan Luo
Zhangchi Feng
Yongqiang Ma
30
360
0
20 Mar 2024
Reconstruct before Query: Continual Missing Modality Learning with Decomposed Prompt Collaboration
Shu Zhao
Xiaohan Zou
Tan Yu
Huijuan Xu
27
1
0
17 Mar 2024
Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset
Ning Cheng
You Li
Jing Gao
Bin Fang
Jinan Xu
Wenjuan Han
39
4
0
14 Mar 2024
MolBind: Multimodal Alignment of Language, Molecules, and Proteins
Teng Xiao
Chao Cui
Huaisheng Zhu
V. Honavar
AI4CE
32
6
0
13 Mar 2024
LLMBind: A Unified Modality-Task Integration Framework
Bin Zhu
Munan Ning
Peng Jin
Bin Lin
Jinfa Huang
...
Junwu Zhang
Zhenyu Tang
Mingjun Pan
Xing Zhou
Li-ming Yuan
MLLM
32
6
0
22 Feb 2024
VideoPrism: A Foundational Visual Encoder for Video Understanding
Long Zhao
N. B. Gundavarapu
Liangzhe Yuan
Hao Zhou
Shen Yan
...
Huisheng Wang
Hartwig Adam
Mikhail Sirotenko
Ting Liu
Boqing Gong
VGen
27
29
0
20 Feb 2024
Model Composition for Multimodal Large Language Models
Chi Chen
Yiyang Du
Zheng Fang
Ziyue Wang
Fuwen Luo
...
Ming Yan
Ji Zhang
Fei Huang
Maosong Sun
Yang Janet Liu
MoMe
24
3
0
20 Feb 2024
RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
Jianhao Yuan
Shuyang Sun
Daniel Omeiza
Bo-Lu Zhao
Paul Newman
Lars Kunze
Matthew Gadd
LRM
16
47
0
16 Feb 2024
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
Xingning Dong
Zipeng Feng
Chunluan Zhou
Xuzheng Yu
Ming Yang
Qingpei Guo
VLM
25
2
0
31 Jan 2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
Duzhen Zhang
Yahan Yu
Jiahua Dong
Chenxing Li
Dan Su
Chenhui Chu
Dong Yu
OffRL
LRM
37
175
0
24 Jan 2024
When Large Language Model Agents Meet 6G Networks: Perception, Grounding, and Alignment
Minrui Xu
Dusit Niyato
Jiawen Kang
Zehui Xiong
Shiwen Mao
Zhu Han
Dong In Kim
K. B. Letaief
LLMAG
31
31
0
15 Jan 2024
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Ping Luo
Jiebo Luo
Chenliang Xu
VLM
50
81
0
29 Dec 2023
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
156
918
0
21 Dec 2023
Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion
Xiao Wang
Jiandong Jin
Chenglong Li
Jin Tang
Cheng Zhang
Wei Wang
VLM
15
13
0
17 Dec 2023
FreestyleRet: Retrieving Images from Style-Diversified Queries
Hao Li
Curise Jia
Peng Jin
Ze-Long Cheng
Kehan Li
Jialu Sui
Chang Liu
Li-ming Yuan
3DH
15
5
0
05 Dec 2023
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin
Yang Ye
Bin Zhu
Jiaxi Cui
Munan Ning
Peng Jin
Li-ming Yuan
VLM
MLLM
194
586
0
16 Nov 2023
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Peng Jin
Ryuichi Takanobu
Caiwan Zhang
Xiaochun Cao
Li-ming Yuan
MLLM
34
222
0
14 Nov 2023
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye
Haiyang Xu
Guohai Xu
Jiabo Ye
Ming Yan
...
Junfeng Tian
Qiang Qi
Ji Zhang
Feiyan Huang
Jingren Zhou
VLM
MLLM
206
899
0
27 Apr 2023