2406.05629
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
9 June 2024
Mark Hamilton
Andrew Zisserman
John R. Hershey
William T. Freeman
VLM
Papers citing "Separating the 'Chirp' from the 'Chat': Self-supervised Visual Grounding of Sound and Language"
4 / 4 papers shown
Title
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo
Andrew Rouditchenko
Yuan Gong
Saurabhchand Bhati
Samuel Thomas
Brian Kingsbury
Leonid Karlinsky
Rogerio Feris
James Glass
27
0
0
02 May 2025
SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model
Yi-Jen Shih
Hsuan-Fu Wang
Heng-Jui Chang
Layne Berry
Hung-yi Lee
David F. Harwath
VLM
CLIP
38
32
0
03 Oct 2022
A Closer Look at Weakly-Supervised Audio-Visual Source Localization
Shentong Mo
Pedro Morgado
69
64
0
30 Aug 2022
Emerging Properties in Self-Supervised Vision Transformers
Mathilde Caron
Hugo Touvron
Ishan Misra
Hervé Jégou
Julien Mairal
Piotr Bojanowski
Armand Joulin
283
5,723
0
29 Apr 2021