
The "something something" video database for learning and evaluating visual common sense

IEEE International Conference on Computer Vision (ICCV), 2017
13 June 2017
Raghav Goyal
Samira Ebrahimi Kahou
Vincent Michalski
Joanna Materzynska
S. Westphal
Heuna Kim
V. Haenel
Ingo Fründ
P. Yianilos
Moritz Mueller-Freitag
F. Hoppe
Christian Thurau
Ingo Bax
Roland Memisevic
    VLM

Papers citing "The "something something" video database for learning and evaluating visual common sense"

50 / 1,013 papers shown
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning
Huanjin Yao
Wenhao Wu
Zhiheng Li
VLM
303
13
0
27 Nov 2023
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
Wenhao Wu
Huanjin Yao
Mengxi Zhang
Yuxin Song
Wanli Ouyang
Jingdong Wang
VLM
359
38
0
27 Nov 2023
Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition
Computer Vision and Pattern Recognition (CVPR), 2023
Yifei Chen
Dapeng Chen
Ruijin Liu
Sai Zhou
Wenyuan Xue
Wei Peng
287
15
0
27 Nov 2023
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
Ruyang Liu
Jingjia Huang
Wei-Nan Gao
Thomas H. Li
Ge Li
VLM
267
4
0
25 Nov 2023
AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering
European Conference on Computer Vision (ECCV), 2023
Xiuyuan Chen
Yuan Lin
Yuchen Zhang
Weiran Huang
ELM MLLM
307
38
0
25 Nov 2023
Input Compression with Positional Consistency for Efficient Training and Inference of Transformer Neural Networks
Amrit Nagarajan
Anand Raghunathan
VLM ViT
64
0
0
22 Nov 2023
GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration
IEEE Robotics and Automation Letters (RA-L), 2023
Naoki Wake
Atsushi Kanehira
Kazuhiro Sasabuchi
Jun Takamatsu
Katsushi Ikeuchi
LM&Ro
329
100
0
20 Nov 2023
VideoCon: Robust Video-Language Alignment via Contrast Captions
Computer Vision and Pattern Recognition (CVPR), 2023
Hritik Bansal
Yonatan Bitton
Idan Szpektor
Kai-Wei Chang
Aditya Grover
137
28
0
15 Nov 2023
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
International Conference on Learning Representations (ICLR), 2023
Ilker Kesen
Andrea Pedrotti
Mustafa Dogan
Michele Cafagna
Emre Can Acikgoz
...
Iacer Calixto
Anette Frank
Albert Gatt
Aykut Erdem
Erkut Erdem
276
20
0
13 Nov 2023
Learning Human Action Recognition Representations Without Real Humans
Neural Information Processing Systems (NeurIPS), 2023
Howard Zhong
Samarth Mishra
Donghyun Kim
SouYoung Jin
Yikang Shen
Hildegard Kuehne
Leonid Karlinsky
Venkatesh Saligrama
Aude Oliva
Rogerio Feris
276
6
0
10 Nov 2023
Semantic-aware Video Representation for Few-shot Action Recognition
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Yutao Tang
Benjamin Bejar
René Vidal
293
14
0
10 Nov 2023
Automated Sperm Assessment Framework and Neural Network Specialized for Sperm Video Recognition
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
T. Fujii
Hayato Nakagawa
T. Takeshima
Y. Yumura
T. Hamagami
124
7
0
10 Nov 2023
OmniVec: Learning robust representations with cross modal sharing
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Siddharth Srivastava
Gaurav Sharma
SSL
288
83
0
07 Nov 2023
Asymmetric Masked Distillation for Pre-Training Small Foundation Models
Computer Vision and Pattern Recognition (CVPR), 2023
Zhiyu Zhao
Bingkun Huang
Sen Xing
Gangshan Wu
Yu Qiao
Limin Wang
203
12
0
06 Nov 2023
What Makes Pre-Trained Visual Representations Successful for Robust Manipulation?
Conference on Robot Learning (CoRL), 2023
Kaylee Burns
Zach Witzel
Jubayer Ibn Hamid
Tianhe Yu
Chelsea Finn
Karol Hausman
OOD SSL
374
34
0
03 Nov 2023
On Hand-Held Grippers and the Morphological Gap in Human Manipulation Demonstration
Kiran Doshi
Yijiang Huang
Stelian Coros
156
7
0
03 Nov 2023
ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab
Neural Information Processing Systems (NeurIPS), 2023
Jieming Cui
Ziren Gong
Baoxiong Jia
Siyuan Huang
Zilong Zheng
Jianzhu Ma
Yixin Zhu
212
4
0
01 Nov 2023
MM-VID: Advancing Video Understanding with GPT-4V(ision)
Kevin Qinghong Lin
Faisal Ahmed
Linjie Li
Chung-Ching Lin
E. Azarnasab
...
Lin Liang
Zicheng Liu
Yumao Lu
Ce Liu
Lijuan Wang
MLLM
232
84
0
30 Oct 2023
Videoprompter: an ensemble of foundational models for zero-shot video understanding
Adeel Yousaf
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
Mubarak Shah
VLM
206
3
0
23 Oct 2023
S3Aug: Segmentation, Sampling, and Shift for Action Recognition
Taiki Sugiura
Toru Tamaki
AI4TS
215
6
0
23 Oct 2023
Frozen Transformers in Language Models Are Effective Visual Encoder Layers
Ziqi Pang
Ziyang Xie
Yunze Man
Yu-Xiong Wang
430
47
0
19 Oct 2023
A Survey on Video Diffusion Models
ACM Computing Surveys (ACM Comput. Surv.), 2023
Zhen Xing
Qijun Feng
Haoran Chen
Jingdong Sun
Hang-Rui Hu
Hang Xu
Zuxuan Wu
Yu-Gang Jiang
EGVM VGen
439
219
0
16 Oct 2023
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
International Conference on Learning Representations (ICLR), 2023
Kevin Black
Mitsuhiko Nakamoto
P. Atreya
Homer Walke
Chelsea Finn
Aviral Kumar
Sergey Levine
DiffM LM&Ro
388
235
0
16 Oct 2023
Few-shot Action Recognition with Captioning Foundation Models
Xiang Wang
Shiwei Zhang
Hangjie Yuan
Yingya Zhang
Changxin Gao
Deli Zhao
Nong Sang
VLM
334
9
0
16 Oct 2023
Watt For What: Rethinking Deep Learning's Energy-Performance Relationship
Shreyank N. Gowda
Xinyue Hao
Gen Li
Laura Sevilla-Lara
Shashank Narayana Gowda
HAI
183
17
0
10 Oct 2023
Learning Interactive Real-World Simulators
International Conference on Learning Representations (ICLR), 2023
Mengjiao Yang
Yilun Du
Kamyar Ghasemipour
Jonathan Tompson
Leslie Kaelbling
Dale Schuurmans
Pieter Abbeel
LM&Ro PINN
345
330
0
09 Oct 2023
DyST: Towards Dynamic Neural Scene Representations on Real-World Videos
International Conference on Learning Representations (ICLR), 2023
Maximilian Seitzer
Sjoerd van Steenkiste
Thomas Kipf
Klaus Greff
Mehdi S. M. Sajjadi
VGen ViT
347
11
0
09 Oct 2023
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Lijun Yu
José Lezama
N. B. Gundavarapu
Luca Versari
Kihyuk Sohn
...
Boqing Gong
Ming-Hsuan Yang
Irfan Essa
David A. Ross
Lu Jiang
435
517
0
09 Oct 2023
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Zuxuan Wu
Zejia Weng
Wujian Peng
Xitong Yang
Ang Li
Larry S. Davis
Yu-Gang Jiang
CLIP VLM
243
29
0
08 Oct 2023
Human-oriented Representation Learning for Robotic Manipulation
Mingxiao Huo
Mingyu Ding
Chenfeng Xu
Thomas Tian
Xinghao Zhu
Yao Mu
Lingfeng Sun
Masayoshi Tomizuka
Wei Zhan
SSL
267
13
0
04 Oct 2023
Multiple Physics Pretraining for Physical Surrogate Models
Michael McCabe
Bruno Régaldo-Saint Blancard
Liam Parker
Ruben Ohana
M. Cranmer
...
Francois Lanusse
Mariel Pettee
Tiberiu Teşileanu
Kyunghyun Cho
Shirley Ho
PINN AI4CE
293
83
0
04 Oct 2023
A Grammatical Compositional Model for Video Action Detection
Zhijun Zhang
Xu Zou
Jiahuan Zhou
Sheng Zhong
Ying Wu
249
0
0
04 Oct 2023
How Physics and Background Attributes Impact Video Transformers in Robotic Manipulation: A Case Study on Planar Pushing
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023
Shutong Jin
Ruiyu Wang
Muhammad Zahid
Florian T. Pokorny
412
2
0
03 Oct 2023
Beyond the Benchmark: Detecting Diverse Anomalies in Videos
Yoav Arad
Michael Werman
174
3
0
03 Oct 2023
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video
European Conference on Computer Vision (ECCV), 2023
Xinhao Li
Yuhan Zhu
Limin Wang
VLM
324
17
0
02 Oct 2023
A Hierarchical Graph-based Approach for Recognition and Description Generation of Bimanual Actions in Videos
Fatemeh Ziaeetabar
Reza Safabakhsh
S. Momtazi
M. Tamosiunaite
Florentin Wörgötter
260
7
0
01 Oct 2023
ConSOR: A Context-Aware Semantic Object Rearrangement Framework for Partially Arranged Scenes
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023
Kartik Ramachandruni
Max Zuo
Sonia Chernova
260
8
0
30 Sep 2023
Egocentric RGB+Depth Action Recognition in Industry-Like Settings
Jyoti Kini
Sarah Fleischer
I. Dave
Mubarak Shah
EgoV
266
5
0
25 Sep 2023
SkeleTR: Towards Skeleton-based Action Recognition in the Wild
Haodong Duan
Mingze Xu
Bing Shuai
Davide Modolo
Zhuowen Tu
Joseph Tighe
Alessandro Bergamo
ViT
245
1
0
20 Sep 2023
Unsupervised Open-Vocabulary Object Localization in Videos
IEEE International Conference on Computer Vision (ICCV), 2023
Ke Fan
Zechen Bai
Tianjun Xiao
Dominik Zietlow
Max Horn
...
Bernt Schiele
Thomas Brox
Zheng Zhang
Yanwei Fu
Tong He
285
13
0
18 Sep 2023
Selective Volume Mixup for Video Action Recognition
Yi Tan
Zhaofan Qiu
Y. Hao
Ting Yao
Xiangnan He
Tao Mei
ViT
212
4
0
18 Sep 2023
FrameRS: A Video Frame Compression Model Composed by Self supervised Video Frame Reconstructor and Key Frame Selector
Qiqian Fu
Guanhong Wang
Gaoang Wang
93
0
0
16 Sep 2023
Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning
IEEE International Conference on Computer Vision (ICCV), 2023
Zhiwu Qing
Shiwei Zhang
Ziyuan Huang
Yingya Zhang
Changxin Gao
Deli Zhao
Nong Sang
216
31
0
14 Sep 2023
STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning
Palaash Agrawal
Haidi Azaman
Cheston Tan
506
3
0
13 Sep 2023
CDFSL-V: Cross-Domain Few-Shot Learning for Videos
IEEE International Conference on Computer Vision (ICCV), 2023
Sarinda Samarasinghe
Mamshad Nayeem Rizve
Navid Kardan
M. Shah
310
13
0
07 Sep 2023
EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding
IEEE International Conference on Computer Vision (ICCV), 2023
Yue Xu
Yong-Lu Li
Zhemin Huang
Michael Xu Liu
Cewu Lu
Yu-Wing Tai
Chi-Keung Tang
EgoV
175
12
0
05 Sep 2023
Hierarchical Masked 3D Diffusion Model for Video Outpainting
ACM Multimedia (ACM MM), 2023
Fanda Fan
Chaoxu Guo
Litong Gong
Biao Wang
Bo Xiao
Yuning Jiang
Chunjie Luo
Jianfeng Zhan
DiffM VGen
256
24
0
05 Sep 2023
Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations
European Conference on Computer Vision (ECCV), 2023
Kilichbek Haydarov
Xiaoqian Shen
Avinash Madasu
Mahmoud Salem
Jia Li
Gamaleldin F. Elsayed
Mohamed Elhoseiny
263
7
0
30 Aug 2023
Motion-Guided Masking for Spatiotemporal Representation Learning
IEEE International Conference on Computer Vision (ICCV), 2023
D. Fan
Jue Wang
Shuai Liao
Yi Zhu
Vimal Bhat
H. Santos-Villalobos
M. Rohith
Xinyu Li
VGen
209
28
0
24 Aug 2023
MOFO: MOtion FOcused Self-Supervision for Video Understanding
Mona Ahmadian
Frank Guerin
Andrew Gilbert
307
4
0
23 Aug 2023