ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1610.09001
  4. Cited By
SoundNet: Learning Sound Representations from Unlabeled Video

SoundNet: Learning Sound Representations from Unlabeled Video

27 October 2016
Y. Aytar
Carl Vondrick
Antonio Torralba
    SSL
ArXivPDFHTML

Papers citing "SoundNet: Learning Sound Representations from Unlabeled Video"

50 / 120 papers shown
Title
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo
Andrew Rouditchenko
Yuan Gong
Saurabhchand Bhati
Samuel Thomas
Brian Kingsbury
Leonid Karlinsky
Rogerio Feris
James Glass
32
0
0
02 May 2025
Improving Sound Source Localization with Joint Slot Attention on Image and Audio
Improving Sound Source Localization with Joint Slot Attention on Image and Audio
Inho Kim
Youngkil Song
Jicheol Park
Won Hwa Kim
Suha Kwak
22
0
0
21 Apr 2025
Audio-Language Datasets of Scenes and Events: A Survey
Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard
Elia Formisano
Michele Esposito
M. Dumontier
79
2
0
10 Jan 2025
Deep Neural Networks and Brain Alignment: Brain Encoding and Decoding (Survey)
Deep Neural Networks and Brain Alignment: Brain Encoding and Decoding (Survey)
S. Oota
Zijiao Chen
Manish Gupta
R. Bapi
G. Jobard
F. Alexandre
X. Hinaut
3DV
AI4CE
44
11
0
31 Dec 2024
Wearable Accelerometer Foundation Models for Health via Knowledge Distillation
Wearable Accelerometer Foundation Models for Health via Knowledge Distillation
Salar Abbaspourazad
Anshuman Mishra
Joseph D. Futoma
Andrew C. Miller
Ian Shapiro
83
0
0
15 Dec 2024
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Shufan Li
Konstantinos Kallidromitis
Akash Gokul
Zichun Liao
Yusuke Kato
Kazuki Kozuka
Aditya Grover
VGen
90
5
0
02 Dec 2024
The Sound of Water: Inferring Physical Properties from Pouring Liquids
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
Andrew Zisserman
40
0
0
18 Nov 2024
MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation
MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation
T. Pham
Tri Ton
Chang D. Yoo
36
3
0
03 Oct 2024
AudioRepInceptionNeXt: A lightweight single-stream architecture for
  efficient audio recognition
AudioRepInceptionNeXt: A lightweight single-stream architecture for efficient audio recognition
Kin Wai Lau
Yasar Abbas Ur Rehman
L. Po
33
1
0
21 Apr 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
37
5
0
28 Mar 2024
Transformer-based Video Saliency Prediction with High Temporal Dimension
  Decoding
Transformer-based Video Saliency Prediction with High Temporal Dimension Decoding
Morteza Moradi
S. Palazzo
C. Spampinato
24
2
0
15 Jan 2024
Formal Verification of Long Short-Term Memory based Audio Classifiers: A
  Star based Approach
Formal Verification of Long Short-Term Memory based Audio Classifiers: A Star based Approach
Neelanjana Pal
Taylor T. Johnson
6
0
0
16 Nov 2023
Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
Jinzheng Zhao
Yong-mei Xu
Xinyuan Qian
Davide Berghi
Peipei Wu
Meng Cui
Jianyuan Sun
Philip J. B. Jackson
Wenwu Wang
BDL
37
7
0
23 Oct 2023
Sound Source Localization is All about Cross-Modal Alignment
Sound Source Localization is All about Cross-Modal Alignment
Arda Senocak
H. Ryu
Junsik Kim
Tae-Hyun Oh
Hanspeter Pfister
Joon Son Chung
19
18
0
19 Sep 2023
NPF-200: A Multi-Modal Eye Fixation Dataset and Method for
  Non-Photorealistic Videos
NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos
Ziyuan Yang
Sucheng Ren
Zongwei Wu
Nanxuan Zhao
Junle Wang
Jing Qin
Shengfeng He
22
2
0
23 Aug 2023
Enhancing the Prediction of Emotional Experience in Movies using Deep
  Neural Networks: The Significance of Audio and Language
Enhancing the Prediction of Emotional Experience in Movies using Deep Neural Networks: The Significance of Audio and Language
Sogand Mohammadi
M. G. Orimi
Hamid R. Rabiee
19
0
0
17 Jun 2023
Video-to-Music Recommendation using Temporal Alignment of Segments
Video-to-Music Recommendation using Temporal Alignment of Segments
Laure Prétet
G. Richard
Clement Souchier
Geoffroy Peeters
AI4TS
23
13
0
12 Jun 2023
Assessing Language Disorders using Artificial Intelligence: a Paradigm
  Shift
Assessing Language Disorders using Artificial Intelligence: a Paradigm Shift
C. Themistocleous
K. Tsapkini
Dimitrios Kokkinakis
8
0
0
31 May 2023
Transavs: End-To-End Audio-Visual Segmentation With Transformer
Transavs: End-To-End Audio-Visual Segmentation With Transformer
Yuhang Ling
Yuxi Li
Zhenye Gan
Jiangning Zhang
M. Chi
Yabiao Wang
VOS
ViT
29
1
0
12 May 2023
Robust Cross-Modal Knowledge Distillation for Unconstrained Videos
Robust Cross-Modal Knowledge Distillation for Unconstrained Videos
Wenke Xia
Xingjian Li
Andong Deng
Haoyi Xiong
Dejing Dou
Di Hu
11
4
0
16 Apr 2023
Looking Similar, Sounding Different: Leveraging Counterfactual
  Cross-Modal Pairs for Audiovisual Representation Learning
Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning
Nikhil Singh
Chih-Wei Wu
Iroro Orife
Mahdi M. Kalayeh
23
2
0
12 Apr 2023
Speaker Recognition in Realistic Scenario Using Multimodal Data
Speaker Recognition in Realistic Scenario Using Multimodal Data
Saqlain Hussain Shah
M. S. Saeed
Shah Nawaz
Muhammad Haroon Yousaf
CVBM
11
8
0
25 Feb 2023
Detection and classification of vocal productions in large scale audio
  recordings
Detection and classification of vocal productions in large scale audio recordings
Guillem Bonafos
Pierre Pudlo
Jean-Marc Freyermuth
T. Legou
J. Fagot
Samuel Tronccon
Arnaud Rey
AI4TS
11
1
0
14 Feb 2023
Motion and Context-Aware Audio-Visual Conditioned Video Prediction
Motion and Context-Aware Audio-Visual Conditioned Video Prediction
Yating Xu
Conghui Hu
G. Lee
VGen
35
0
0
09 Dec 2022
Effective Audio Classification Network Based on Paired Inverse Pyramid
  Structure and Dense MLP Block
Effective Audio Classification Network Based on Paired Inverse Pyramid Structure and Dense MLP Block
Yunhao Chen
Yunjie Zhu
Zihui Yan
Yifan Huang
Zhen Ren
Jianlu Shen
Lifang Chen
20
9
0
05 Nov 2022
Contrastive Audio-Visual Masked Autoencoder
Contrastive Audio-Visual Masked Autoencoder
Yuan Gong
Andrew Rouditchenko
Alexander H. Liu
David F. Harwath
Leonid Karlinsky
Hilde Kuehne
James R. Glass
24
119
0
02 Oct 2022
StyleTime: Style Transfer for Synthetic Time Series Generation
StyleTime: Style Transfer for Synthetic Time Series Generation
Yousef El-Laham
Svitlana Vyetrenko
AI4TS
21
5
0
22 Sep 2022
ImageArg: A Multi-modal Tweet Dataset for Image Persuasiveness Mining
ImageArg: A Multi-modal Tweet Dataset for Image Persuasiveness Mining
Zhexiong Liu
M. Guo
Y. Dai
Diane Litman
16
15
0
14 Sep 2022
A Closer Look at Weakly-Supervised Audio-Visual Source Localization
A Closer Look at Weakly-Supervised Audio-Visual Source Localization
Shentong Mo
Pedro Morgado
79
64
0
30 Aug 2022
Survey: Exploiting Data Redundancy for Optimization of Deep Learning
Survey: Exploiting Data Redundancy for Optimization of Deep Learning
Jou-An Chen
Wei Niu
Bin Ren
Yanzhi Wang
Xipeng Shen
21
24
0
29 Aug 2022
UAVM: Towards Unifying Audio and Visual Models
UAVM: Towards Unifying Audio and Visual Models
Yuan Gong
Alexander H. Liu
Andrew Rouditchenko
James R. Glass
25
20
0
29 Jul 2022
Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset
Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset
Grant Van Horn
Rui Qian
Kimberly Wilber
Hartwig Adam
Oisin Mac Aodha
Serge J. Belongie
19
10
0
21 Jul 2022
Audio-Visual Segmentation
Audio-Visual Segmentation
Jinxing Zhou
Jianyuan Wang
J. Zhang
Weixuan Sun
Jing Zhang
Stan Birchfield
Dan Guo
Lingpeng Kong
Meng Wang
Yiran Zhong
VOS
28
110
0
11 Jul 2022
Visual-Assisted Sound Source Depth Estimation in the Wild
Visual-Assisted Sound Source Depth Estimation in the Wild
Wei Sun
L. Qiu
MDE
11
0
0
07 Jul 2022
Federated Self-supervised Learning for Video Understanding
Federated Self-supervised Learning for Video Understanding
Yasar Abbas Ur Rehman
Yan Gao
Jiajun Shen
Pedro Porto Buarque de Gusmão
Nicholas D. Lane
FedML
15
15
0
05 Jul 2022
Feature Pyramid Attention based Residual Neural Network for
  Environmental Sound Classification
Feature Pyramid Attention based Residual Neural Network for Environmental Sound Classification
Liguang Zhou
Yuhongze Zhou
Xiaonan Qi
Junjie Hu
Tin Lun Lam
Yangsheng Xu
31
5
0
28 May 2022
Urban Rhapsody: Large-scale exploration of urban soundscapes
Urban Rhapsody: Large-scale exploration of urban soundscapes
Joao Rulff
Fabio Miranda
Maryam Hosseini
Marcos Lage
M. Cartwright
Graham Dove
J. P. Bello
Claudio T. Silva
14
7
0
25 May 2022
Weakly-Supervised Action Detection Guided by Audio Narration
Weakly-Supervised Action Detection Guided by Audio Narration
Keren Ye
Adriana Kovashka
22
0
0
12 May 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin
Jie Lei
Mohit Bansal
Gedas Bertasius
31
39
0
06 Apr 2022
Learning Neural Acoustic Fields
Learning Neural Acoustic Fields
Andrew F. Luo
Yilun Du
Michael J. Tarr
J. Tenenbaum
Antonio Torralba
Chuang Gan
AI4CE
20
76
0
04 Apr 2022
1-D CNN based Acoustic Scene Classification via Reducing Layer-wise
  Dimensionality
1-D CNN based Acoustic Scene Classification via Reducing Layer-wise Dimensionality
Arshdeep Singh
17
1
0
31 Mar 2022
Multitask Emotion Recognition Model with Knowledge Distillation and Task
  Discriminator
Multitask Emotion Recognition Model with Knowledge Distillation and Task Discriminator
Euiseok Jeong
Geesung Oh
Sejoon Lim
CVBM
17
7
0
24 Mar 2022
Automated detection of foreground speech with wearable sensing in
  everyday home environments: A transfer learning approach
Automated detection of foreground speech with wearable sensing in everyday home environments: A transfer learning approach
Dawei Liang
Zifan Xu
Yinuo Chen
Rebecca Adaimi
David F. Harwath
Edison Thomaz
40
1
0
21 Mar 2022
Learning Audio Representations with MLPs
Learning Audio Representations with MLPs
Mashrur M. Morshed
Ahmad Omar Ahsan
H. Mahmud
Md. Kamrul Hasan
19
4
0
16 Mar 2022
Audio Self-supervised Learning: A Survey
Audio Self-supervised Learning: A Survey
Shuo Liu
Adria Mallol-Ragolta
Emilia Parada-Cabeleiro
Kun Qian
Xingshuo Jing
Alexander Kathan
Bin Hu
Bjoern W. Schuller
SSL
22
106
0
02 Mar 2022
Real-time Emergency Vehicle Event Detection Using Audio Data
Real-time Emergency Vehicle Event Detection Using Audio Data
Zubayer Islam
Mohamed Abdel-Aty
9
5
0
03 Feb 2022
Keyword localisation in untranscribed speech using visually grounded
  speech models
Keyword localisation in untranscribed speech using visually grounded speech models
Kayode Olaleye
Dan Oneaţă
Herman Kamper
19
7
0
02 Feb 2022
Sound and Visual Representation Learning with Multiple Pretraining Tasks
Sound and Visual Representation Learning with Multiple Pretraining Tasks
A. Vasudevan
Dengxin Dai
Luc Van Gool
SSL
25
6
0
04 Jan 2022
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval
Nina Shvetsova
Brian Chen
Andrew Rouditchenko
Samuel Thomas
Brian Kingsbury
Rogerio Feris
David F. Harwath
James R. Glass
Hilde Kuehne
ViT
23
129
0
08 Dec 2021
Health Monitoring of Industrial machines using Scene-Aware Threshold
  Selection
Health Monitoring of Industrial machines using Scene-Aware Threshold Selection
Arshdeep Singh
R. Arvind
Padmanabhan Rajan
11
1
0
21 Nov 2021
123
Next