SAM 2: Segment Anything in Images and Videos

International Conference on Learning Representations (ICLR), 2025

1 August 2024

Valentin Gabeur

Chaitanya K. Ryali

Roman Rädle

Laura Gustafson

Kalyan Vasudev Alwala

Ross B. Girshick

Piotr Dollár

Christoph Feichtenhofer

Papers citing "SAM 2: Segment Anything in Images and Videos"

50 / 860 papers shown

Object-Centric 3D Gaussian Splatting for Strawberry Plant Reconstruction and Phenotyping

84

0

0

04 Nov 2025

Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models

49

0

0

04 Nov 2025

UniChange: Unifying Change Detection with Multimodal Large Language Model

348

0

0

04 Nov 2025

PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model

202

2

0

03 Nov 2025

RefVTON: person-to-person Try on with Additional Unpaired Visual Reference

Liuzhuozheng Li

351

0

0

02 Nov 2025

Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing

145

0

0

31 Oct 2025

AD-SAM: Fine-Tuning the Segment Anything Vision Foundation Model for Autonomous Driving Perception

Evangelos E. Papalexakis

Mohamadhossein Noruzoliaee

169

0

0

30 Oct 2025

LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation

Fabien Despinoy

Danda Pani Paudel

330

0

0

29 Oct 2025

Octopus-like Reaching Motion: A Perspective Inspired by Whipping

93

0

0

29 Oct 2025

Generative AI for Healthcare: Fundamentals, Challenges, and Perspectives

289

0

0

28 Oct 2025

Vanish into Thin Air: Cross-prompt Universal Adversarial Attacks for SAM2

186

6

0

28 Oct 2025

Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

...

350

2

0

28 Oct 2025

World Simulation with Video Foundation Models for Physical AI

...

462

21

0

28 Oct 2025

Localising under the drape: proprioception in the era of distributed surgical robotic system

Christopher E. Mower

...

Emmanuel Vander Poorten

Philipp Fürnstahl

Sebastien Ourselin

Christos Bergeles

Tom Vercauteren

125

0

0

27 Oct 2025

PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

366

1

0

27 Oct 2025

Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling

Igor Gilitschenski

195

0

0

27 Oct 2025

Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views

144

0

0

26 Oct 2025

Generalizable Hierarchical Skill Learning via Object-Centric Representation

...

147

0

0

24 Oct 2025

FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning

144

0

0

24 Oct 2025

S3OD: Towards Generalizable Salient Object Detection with Synthetic Data

Hirokatsu Kataoka

Christian Rupprecht

126

1

0

24 Oct 2025

Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos

...

129

8

0

24 Oct 2025

MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence

...

128

0

0

24 Oct 2025

Thermal Polarimetric Multi-view Stereo

Takahiro Kushida

Kenichiro Tanaka

141

0

0

23 Oct 2025

COS3D: Collaborative Open-Vocabulary 3D Segmentation

161

1

0

23 Oct 2025

HRT1: One-Shot Human-to-Robot Trajectory Transfer for Mobile Manipulation

Sai Haneesh Allu

Jishnu Jaykumar P

Ninad Khargonkar

112

0

0

23 Oct 2025

PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching

111

0

0

23 Oct 2025

GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs

...

161

0

0

23 Oct 2025

GigaBrain-0: A World Model-Powered Vision-Language-Action Model

...

447

7

0

22 Oct 2025

Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation

241

0

0

22 Oct 2025

Hierarchical DLO Routing with Reinforcement Learning and In-Context Vision-language Models

129

0

0

22 Oct 2025

SAM 2++: Tracking Anything at Any Granularity

218

0

0

21 Oct 2025

Automated urban waterlogging assessment and early warning through a mixture of foundation models

152

0

0

21 Oct 2025

EMA-SAM: Exponential Moving-average for SAM-based PTMC Segmentation

Maryam Dialameh

Hossein Rajabzadeh

149

0

0

21 Oct 2025

CaMiT: A Time-Aware Car Model Dataset for Classification and Generation

Biruk Abere Ambaw

Adrian Daniel Popescu

Romaric Audigier

Hervé Le Borgne

284

0

0

20 Oct 2025

Botany-Bot: Digital Twin Monitoring of Occluded and Underleaf Plant Structures with Gaussian Splats

Jose Luis Susa Rincon

193

0

0

20 Oct 2025

Expose Camouflage in the Water: Underwater Camouflaged Instance Segmentation and Dataset

104

0

0

20 Oct 2025

World-in-World: World Models in a Closed-Loop World

...

234

6

0

20 Oct 2025

Segmentation as A Plug-and-Play Capability for Frozen Multimodal LLMs

163

2

0

19 Oct 2025

Safe Payload Transfer with Ship-Mounted Cranes: A Robust Model Predictive Control Approach

William A. Welch

Patrick Spieler

...

66

0

0

19 Oct 2025

Pursuing Minimal Sufficiency in Spatial Reasoning

Ming-Hsuan Yang

100

0

0

19 Oct 2025

How Universal Are SAM2 Features?

Masoud Khairi Atani

132

0

0

19 Oct 2025

Cataract-LMM: Large-Scale, Multi-Source, Multi-Task Benchmark for Deep Learning in Surgical Video Analysis

Mohammad Javad Ahmadi

Seyed-Farzad Mohammadi

Amirhossein Taslimi

Mehdi Khodaparast

109

0

0

18 Oct 2025

Promptable Fire Segmentation: Unleashing SAM2's Potential for Real-Time Mobile Deployment with Strategic Bounding Box Guidance

Emmanuel U. Ugwu

113

0

0

18 Oct 2025

TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement

Jiangning Zhang

116

0

0

18 Oct 2025

Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt

129

0

0

17 Oct 2025

Uncertainty-Aware Extreme Point Tracing for Weakly Supervised Ultrasound Image Segmentation

105

0

0

17 Oct 2025

Proactive Scene Decomposition and Reconstruction

104

0

0

17 Oct 2025

VLA^2: Empowering Vision-Language-Action Models with an Agentic Framework for Unseen Concept Manipulation

98

2

0

16 Oct 2025

3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation

Takuya Narihira

126

1

0

16 Oct 2025

Generalist vs Specialist Time Series Foundation Models: Investigating Potential Emergent Behaviors in Assessing Human Health Using PPG Signals

Saurabh Kataria

Hyunjung Gloria Kwak

...

Sivasubramanium V Bhavani

AI4TS AI4MH LM&MA

200

0

0

16 Oct 2025
