COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis

7 March 2019

Jie Zhou

Papers citing "COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis"

50 / 267 papers shown

Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents

Computer Vision and Pattern Recognition (CVPR), 2025

241

19 May 2025

StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant

612

08 May 2025

TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

...

272

24 Apr 2025

Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding

...

254

10 Apr 2025

VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning

307

10 Apr 2025

Safety Modulation: Enhancing Safety in Reinforcement Learning through Cost-Modulated Rewards

Hanping Zhang

Yuhong Guo

OffRL

288

03 Apr 2025

OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

Computer Vision and Pattern Recognition (CVPR), 2025

331

29 Mar 2025

What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning

676

27 Mar 2025

Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation

435

25 Mar 2025

Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering

395

19 Mar 2025

Stitch-a-Demo: Video Demonstrations from Multistep Descriptions

282

18 Mar 2025

Quantum EigenGame for excited state calculation

David Quiroga

Jason Han

Anastasios Kyrillidis

280

17 Mar 2025

VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining

348

16 Mar 2025

VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary

Computer Vision and Pattern Recognition (CVPR), 2025

Kevin Qinghong Lin

Mike Zheng Shou

VGen

1.0K

12 Mar 2025

HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

Computer Vision and Pattern Recognition (CVPR), 2025

1.1K

11 Mar 2025

CLAD: Constrained Latent Action Diffusion for Vision-Language Procedure Planning

Lei Shi

Andreas Bulling

DiffM

290

09 Mar 2025

StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition

926

08 Mar 2025

Data Augmentation for Instruction Following Policies via Trajectory Segmentation

AAAI Conference on Artificial Intelligence (AAAI), 2025

Niklas Höpner

Ilaria Tiddi

H. V. Hoof

229

25 Feb 2025

Leveraging Procedural Knowledge and Task Hierarchies for Efficient Instructional Video Pre-training

Karan Samel

Nitish Sontakke

Irfan Essa

243

24 Feb 2025

Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Mohammad Mahdi Abootorabi

Amirhosein Zobeiri

Mahdi Dehghani

Mohammadali Mohammadkhani

695

12 Feb 2025

Multimodal Fusion and Coherence Modeling for Video Topic Segmentation

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

430

31 Dec 2024

SUTrack: Towards Simple and Unified Single Object Tracking

282

26 Dec 2024

Do Language Models Understand Time?

The Web Conference (WWW), 2024

Xi Ding

Lei Wang

912

18 Dec 2024

VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting

AAAI Conference on Artificial Intelligence (AAAI), 2024

Muhammet Furkan Ilaslan

280

16 Dec 2024

GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024

287

10 Dec 2024

Video LLMs for Temporal Reasoning in Long Videos

651

04 Dec 2024

Progress-Aware Video Frame Captioning

Computer Vision and Pattern Recognition (CVPR), 2024

608

03 Dec 2024

VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

268

27 Nov 2024

Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding

Computer Vision and Pattern Recognition (CVPR), 2024

344

25 Nov 2024

ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024

284

23 Nov 2024

IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos

Neural Information Processing Systems (NeurIPS), 2024

270

18 Nov 2024

Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos

Computer Vision and Pattern Recognition (CVPR), 2024

417

13 Nov 2024

Don't Look Twice: Faster Video Transformers with Run-Length Tokenization

Neural Information Processing Systems (NeurIPS), 2024

245

07 Nov 2024

TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos

Leonardo Plini

Luca Scofano

Edoardo De Matteis

Guido Maria D'Amely di Melendugno

366

04 Nov 2024

Video Token Merging for Long-form Video Understanding

290

31 Oct 2024

ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

Kimihiro Hasegawa

Wiradee Imrattanatrai

249

29 Oct 2024

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

International Conference on Learning Representations (ICLR), 2024

...

Yali Wang

286

25 Oct 2024

Human Action Anticipation: A Survey

295

17 Oct 2024

OMCAT: Omni Context Aware Transformer

228

15 Oct 2024

When Does Perceptual Alignment Benefit Vision Representations?

Neural Information Processing Systems (NeurIPS), 2024

275

14 Oct 2024

Human Stone Toolmaking Action Grammar (HSTAG): A Challenging Benchmark for Fine-grained Motor Behavior Recognition

International Conference on Data Science and Advanced Analytics (DSAA), 2024

130

10 Oct 2024

Exploring Efficient Foundational Multi-modal Models for Video Summarization

Karan Samel

Apoorva Beedu

Nitish Sontakke

Irfan Essa

132

09 Oct 2024

Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond

Shanshan Han

595

09 Oct 2024

TRACE: Temporal Grounding Video LLM via Causal Event Modeling

International Conference on Learning Representations (ICLR), 2024

Jingyu Liu

Xiaoying Tang

281

08 Oct 2024

EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos referring to Procedural Texts

273

07 Oct 2024

Optimising for the Unknown: Domain Alignment for Cephalometric Landmark Detection

Julian Wyatt

Irina Voiculescu

189

06 Oct 2024

VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning

International Conference on Learning Representations (ICLR), 2024

297

04 Oct 2024

AirLetters: An Open Video Dataset of Characters Drawn in the Air

188

03 Oct 2024

Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents

Chen Zhang

Enhong Chen

244

01 Oct 2024

Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos

European Conference on Computer Vision (ECCV), 2024

Md. Mohaiminul Islam

Tushar Nagarajan

Huiyu Wang

235

30 Sep 2024