arXiv: 2401.00849

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

1 January 2024

Alex Jinpeng Wang

Kevin Qinghong Lin

Kevin Lin

Mike Zheng Shou

Papers citing "COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training"

12 / 12 papers shown

Cambrian-S: Towards Spatial Supersensing in Video

...

231

51

0

06 Nov 2025

EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

288

11

0

27 Oct 2025

Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning

International Conference on Learning Representations (ICLR), 2025

...

321

15

0

02 Mar 2025

Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs

499

2

0

24 Feb 2025

Unhackable Temporal Rewarding for Scalable Video MLLMs

...

326

28

0

17 Feb 2025

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

...

1.0K

152

0

31 Dec 2024

Do Language Models Understand Time?

The Web Conference (WWW), 2024

1.0K

14

0

18 Dec 2024

Learning Video Context as Interleaved Multimodal Sequences

Pengchuan Zhang

Mike Zheng Shou

292

15

0

31 Jul 2024

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

Alex Jinpeng Wang

Mike Zheng Shou

336

14

0

04 Jun 2024

Fact: Teaching MLLMs with Faithful, Concise and Transferable Rationales

Liang Pang

175

11

0

17 Apr 2024

Contextual AD Narration with Interleaved Multimodal Sequence

Computer Vision and Pattern Recognition (CVPR), 2024

566

11

0

19 Mar 2024

Video Understanding with Large Language Models: A Survey

...

923

225

0

29 Dec 2023
