Wav2CLIP: Learning Robust Audio Representations From CLIP


21 October 2021
Ho-Hsiang Wu
Prem Seetharaman
Kundan Kumar
J. P. Bello
    CLIP
    VLM

Papers citing "Wav2CLIP: Learning Robust Audio Representations From CLIP"

50 / 189 papers shown
Leveraging Pretrained Image-text Models for Improving Audio-Visual Learning
Saurabhchand Bhati
Jesús Villalba
Laureano Moro Velázquez
Thomas Thebaud
Najim Dehak
CLIP
25
3
0
08 Sep 2023
Generating Realistic Images from In-the-wild Sounds
Taegyeong Lee
Jeonghun Kang
Hyeonyu Kim
Taehwan Kim
DiffM
24
1
0
05 Sep 2023
Learning Speech Representation From Contrastive Token-Acoustic Pretraining
Chunyu Qiang
Hao Li
Yixin Tian
Ruibo Fu
Tao Wang
Longbiao Wang
J. Dang
15
5
0
01 Sep 2023
General Purpose Audio Effect Removal
Matthew Rice
C. Steinmetz
Georgy Fazekas
Joshua D. Reiss
25
8
0
30 Aug 2023
Emotion-Aligned Contrastive Learning Between Images and Music
Shanti Stewart
Kleanthis Avramidis
Tiantian Feng
Shrikanth Narayanan
19
0
0
24 Aug 2023
Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning
Shansong Liu
Atin Sakkeer Hussain
Chenshuo Sun
Yin Shan
MLLM
24
27
0
22 Aug 2023
V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models
Heng Wang
Jianbo Ma
Santiago Pascual
Richard Cartwright
Weidong (Tom) Cai
VGen
19
37
0
18 Aug 2023
Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries
J. Wilkins
Justin Salamon
Magdalena Fuentes
J. P. Bello
Oriol Nieto
CLIP
14
5
0
17 Aug 2023
AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes
Zhaohui Li
Haitao Wang
Xinghua Jiang
29
1
0
14 Aug 2023
PromptPaint: Steering Text-to-Image Generation Through Paint Medium-like Interactions
John Joon Young Chung
Eytan Adar
DiffM
25
56
0
09 Aug 2023
DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation
Qiaosong Qi
Le Zhuo
Aixi Zhang
Yue Liao
Fei Fang
Si Liu
Shuicheng Yan
11
22
0
05 Aug 2023
MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies
K. Chen
Yusong Wu
Haohe Liu
Marianna Nezhurina
Taylor Berg-Kirkpatrick
Shlomo Dubnov
DiffM
25
74
0
03 Aug 2023
UniBriVL: Robust Universal Representation and Generation of Audio Driven Diffusion Models
Sen Fang
Bowen Gao
Yangjian Wu
T. Teoh
DiffM
18
1
0
29 Jul 2023
SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic Spaces
Iván Vallés-Pérez
Grzegorz Beringer
Piotr Bilinski
G. Cook
Roberto Barra-Chicote
11
1
0
23 Jul 2023
Density-invariant Features for Distant Point Cloud Registration
Quanpan Liu
Hongzi Zhu
Yunsong Zhou
Hongyang Li
Shan Chang
Minyi Guo
3DPC
26
19
0
19 Jul 2023
REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction
Zeyi Liu
Arpit Bahety
Shuran Song
LRM
16
115
0
27 Jun 2023
A Multimodal Prototypical Approach for Unsupervised Sound Classification
Saksham Singh Kushwaha
Magdalena Fuentes
22
8
0
21 Jun 2023
NeuroCLIP: Neuromorphic Data Understanding by CLIP and SNN
Yu-Zhu Guo
Y. Chen
Zhe Ma
VLM
25
5
0
21 Jun 2023
Align, Adapt and Inject: Sound-guided Unified Image Generation
Yue Yang
Kaipeng Zhang
Yuying Ge
Wenqi Shao
Zeyue Xue
Yu Qiao
Ping Luo
DiffM
16
5
0
20 Jun 2023
Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding
Zengjie Song
Zhaoxiang Zhang
19
1
0
19 Jun 2023
CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models
Hao-Wen Dong
Xiaoyu Liu
Jordi Pons
Gautam Bhattacharya
Santiago Pascual
Joan Serra
Taylor Berg-Kirkpatrick
Julian McAuley
DiffM
22
19
0
16 Jun 2023
Language-Guided Music Recommendation for Video via Prompt Analogies
Daniel McKee
Justin Salamon
Josef Sivic
Bryan C. Russell
VGen
23
26
0
15 Jun 2023
Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages
Simon Durand
Daniel Stoller
Sebastian Ewert
26
12
0
13 Jun 2023
Training Transitive and Commutative Multimodal Transformers with LoReTTa
Manuel Tran
Yashin Dicente Cid
Amal Lahiani
Fabian J. Theis
Tingying Peng
Eldad Klaiman
13
2
0
23 May 2023
AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation
Guy Yariv
Itai Gat
Lior Wolf
Yossi Adi
Idan Schwartz
DiffM
20
20
0
22 May 2023
Connecting Multi-modal Contrastive Representations
Zehan Wang
Yang Zhao
Xize Cheng
Haifeng Huang
Jiageng Liu
...
Lin Li
Yongqiang Wang
Aoxiong Yin
Ziang Zhang
Zhou Zhao
17
22
0
22 May 2023
LEAN: Light and Efficient Audio Classification Network
Shwetank Choudhary
C. Karthik
Punuru Sri Lakshmi
Sumit Kumar
AI4TS
28
5
0
22 May 2023
Pengi: An Audio Language Model for Audio Tasks
Soham Deshmukh
Benjamin Elizalde
Rita Singh
Huaming Wang
MLLM
AuLLM
30
156
0
19 May 2023
MIDI-Draw: Sketching to Control Melody Generation
Tashi Namgyal
Peter A. Flach
Raúl Santos-Rodríguez
13
2
0
19 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM
MLLM
ObjD
16
114
0
18 May 2023
Unsupervised Improvement of Audio-Text Cross-Modal Representations
Zhepei Wang
Cem Subakan
Krishna Subramani
Junkai Wu
Tiago Tavares
Fabio Ayres
Paris Smaragdis
SSL
25
3
0
03 May 2023
A Portrait of Emotion: Empowering Self-Expression through AI-Generated Art
Y. Lee
Yongha Park
S. Hahn
6
3
0
26 Apr 2023
A Comparative Study of Pre-trained Speech and Audio Embeddings for Speech Emotion Recognition
Orchid Chetia Phukan
Arun Balaji Buduru
Rajesh Sharma
28
6
0
22 Apr 2023
Improving Speech Translation by Cross-Modal Multi-Grained Contrastive Learning
Hao Zhang
Nianwen Si
Yaqi Chen
Wenlin Zhang
Xukui Yang
Dan Qu
Weiqiang Zhang
25
9
0
20 Apr 2023
Soundini: Sound-Guided Diffusion for Natural Video Editing
Seung Hyun Lee
Si-Yeol Kim
Innfarn Yoo
Feng Yang
Donghyeon Cho
Youngseo Kim
Huiwen Chang
Jinkyu Kim
Sangpil Kim
VGen
DiffM
27
15
0
13 Apr 2023
Pac-HuBERT: Self-Supervised Music Source Separation via Primitive Auditory Clustering and Hidden-Unit BERT
K. Chen
G. Wichern
François G. Germain
Jonathan Le Roux
AI4TS
27
0
0
04 Apr 2023
Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment
Kim Sung-Bin
Arda Senocak
H. Ha
Andrew Owens
Tae-Hyun Oh
DiffM
VGen
25
35
0
30 Mar 2023
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
Xinhao Mei
Chutong Meng
Haohe Liu
Qiuqiang Kong
Tom Ko
Chengqi Zhao
Mark D. Plumbley
Yuexian Zou
Wenwu Wang
43
192
0
30 Mar 2023
Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert
Jiadong Wang
Xinyuan Qian
Malu Zhang
R. Tan
Haizhou Li
EGVM
22
92
0
29 Mar 2023
Language-Guided Audio-Visual Source Separation via Trimodal Consistency
Reuben Tan
Arijit Ray
Andrea Burns
Bryan A. Plummer
Justin Salamon
Oriol Nieto
Bryan C. Russell
Kate Saenko
23
20
0
28 Mar 2023
Audio-Text Models Do Not Yet Leverage Natural Language
Ho-Hsiang Wu
Oriol Nieto
J. P. Bello
Justin Salamon
VLM
11
28
0
19 Mar 2023
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data
Xuenan Xu
Zhiling Zhang
Zelin Zhou
Pingyue Zhang
Zeyu Xie
Mengyue Wu
Ke Zhu
CLIP
58
14
0
14 Mar 2023
Audio Visual Language Maps for Robot Navigation
Chen Huang
Oier Mees
Andy Zeng
Wolfram Burgard
VGen
60
32
0
13 Mar 2023
Accommodating Audio Modality in CLIP for Multimodal Processing
Ludan Ruan
Anwen Hu
Yuqing Song
Liang Zhang
S. Zheng
Qin Jin
VLM
16
10
0
12 Mar 2023
Exploring Efficient-Tuned Learning Audio Representation Method from BriVL
Sen Fang
Yang Wu
Bowen Gao
Jingwen Cai
T. Teoh
DiffM
16
1
0
08 Mar 2023
IPA-CLIP: Integrating Phonetic Priors into Vision and Language Pretraining
Chihaya Matsuhira
Marc A. Kastner
Takahiro Komamizu
Takatsugu Hirayama
Keisuke Doman
Yasutomo Kawanishi
Ichiro Ide
32
6
0
06 Mar 2023
Audio Retrieval for Multimodal Design Documents: A New Dataset and Algorithms
Prachi Singh
Srikrishna Karanam
Sumit Shekhar
VGen
14
0
0
28 Feb 2023
ConceptFusion: Open-set Multimodal 3D Mapping
Krishna Murthy Jatavallabhula
Ali Kuwajerwala
Qiao Gu
Mohd. Omama
Tao Chen
...
Celso Miguel de Melo
Madhava Krishna
Liam Paull
Florian Shkurti
Antonio Torralba
14
230
0
14 Feb 2023
DEVICE: Depth and Visual Concepts Aware Transformer for OCR-based Image Captioning
Dongsheng Xu
Qingbao Huang
Shuang Feng
Yiru Cai
Feng Shuang
Yi Cai
ViT
VLM
12
1
0
03 Feb 2023
MusicLM: Generating Music From Text
A. Agostinelli
Timo I. Denk
Zalan Borsos
Jesse Engel
Mauro Verzetti
...
Adam Roberts
Marco Tagliasacchi
Matthew Sharifi
Neil Zeghidour
Christian Frank
MGen
36
416
0
26 Jan 2023