SAM 2: Segment Anything in Images and Videos

International Conference on Learning Representations (ICLR), 2024
1 August 2024
Nikhila Ravi
Valentin Gabeur
Yuan-Ting Hu
Ronghang Hu
Chaitanya K. Ryali
Tengyu Ma
Haitham Khedr
Roman Rädle
Chloe Rolland
Laura Gustafson
Eric Mintun
Junting Pan
Kalyan Vasudev Alwala
Nicolas Carion
Chao-Yuan Wu
Ross B. Girshick
Piotr Dollár
Christoph Feichtenhofer
    VLM, MLLM
ArXiv (abs) | PDF | HTML | HuggingFace (116 upvotes)

Papers citing "SAM 2: Segment Anything in Images and Videos"

50 / 863 papers shown
RemoteReasoner: Towards Unifying Geospatial Reasoning Workflow
Liang Yao
Fan Liu
Hongbo Lu
Chuanyi Zhang
Rui Min
Shengxiang Xu
Shimin Di
Pai Peng
LRM
250
8
0
24 Dec 2025
Prompt2Craft: Generating Functional Craft Assemblies with LLMs
V. H. Isume
Takuya Kiyokawa
N. Yamanobe
Y. Domae
Weiwei Wan
Kensuke Harada
126
0
0
04 Dec 2025
Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment
Kai-Po Chang
Wei-Yuan Cheng
Chi-Pin Huang
Fu-En Yang
Yu-Jie Wang
275
1
0
04 Dec 2025
Refaçade: Editing Object with Given Reference Texture
Youze Huang
Penghui Ruan
Bojia Zi
Xianbiao Qi
Jianan Wang
Rong Xiao
DiffM
185
0
0
04 Dec 2025
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
Siyi Chen
Mikaela Angelina Uy
Chan Hee Song
Faisal Ladhak
Adithyavairavan Murali
Qing Qu
Stan Birchfield
Valts Blukis
Jonathan Tremblay
OffRL, LRM
162
0
0
03 Dec 2025
ViDiC: Video Difference Captioning
J. Wu
S. Li
Zhaozhou Bian
J. Chen
Runzhe Wen
An Ping
Yiwen He
Jiakai Wang
Yuanxing Zhang
Jiaheng Liu
174
0
0
03 Dec 2025
OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation
Zhishan Zhou
Siyuan Wei
Zengran Wang
Chunjie Wang
Xiaosheng Yan
Xiao Liu
71
0
0
03 Dec 2025
DirectDrag: High-Fidelity, Mask-Free, Prompt-Free Drag-based Image Editing via Readout-Guided Feature Alignment
Sheng-Hao Liao
Shang-Fu Chen
Tai-Ming Huang
Wen-Huang Cheng
Kai-Lung Hua
DiffM
135
0
0
03 Dec 2025
CRAFT-E: A Neuro-Symbolic Framework for Embodied Affordance Grounding
Zhou Chen
Joe Lin
Carson Bulgin
Sathyanarayanan N. Aakur
39
0
0
03 Dec 2025
DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling
Kairun Wen
Yuzhi Huang
Runyu Chen
Hui Zheng
Yunlong Lin
...
Justin Theiss
Yue Huang
Xinghao Ding
Rakesh Ranjan
Zhiwen Fan
VGen
415
0
0
02 Dec 2025
ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning
Y. Li
Yingda Yin
Lingting Zhu
Weikai Chen
Shengju Qian
Xin Wang
Yanwei Fu
VOS, LRM
396
0
0
02 Dec 2025
SAM2Grasp: Resolve Multi-modal Grasping via Prompt-conditioned Temporal Action Prediction
Shengkai Wu
Jinrong Yang
Wenqiu Luo
Linfeng Gao
Chaohui Shang
Meiyu Zhi
Mingshan Sun
Fangping Yang
Liangliang Ren
Yong Zhao
131
0
0
02 Dec 2025
Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision
Chenshuang Zhang
Kang Zhang
Joon Son Chung
In So Kweon
Junmo Kim
Chengzhi Mao
DiffM
238
0
0
02 Dec 2025
Experimental Characterization of Fingertip Trajectory following for a 3-DoF Series-Parallel Hybrid Robotic Finger
Nicholas Baiata
Nilanjan Chakraborty
170
0
0
02 Dec 2025
Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
Junwon Lee
Juhan Nam
Jiyoung Lee
DiffM, VGen
113
0
0
02 Dec 2025
Generative Video Motion Editing with 3D Point Tracks
Yao-Chih Lee
Zhoutong Zhang
Jiahui Huang
Jui-Hsien Wang
Joon-Young Lee
Jia-Bin Huang
Eli Shechtman
Zhengqi Li
DiffM, VGen, 3DPC
273
0
0
01 Dec 2025
AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation
Yexin Liu
Wen-Jie Shu
Zile Huang
Haoze Zheng
Yueze Wang
Manyuan Zhang
Ser-Nam Lim
Harry Yang
DiffM, VGen
90
0
0
01 Dec 2025
Learning Visual Affordance from Audio
Lidong Lu
Guo Chen
Zhu Wei
Yicheng Liu
Tong Lu
153
0
0
01 Dec 2025
SpriteHand: Real-Time Versatile Hand-Object Interaction with Autoregressive Video Generation
Zisu Li
Hengye Lyu
Jiaxin Shi
Yufeng Zeng
Mingming Fan
Hanwang Zhang
Chen Liang
VGen
192
0
0
01 Dec 2025
UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits
Keming Ye
Z. Huang
Canmiao Fu
Qingyang Liu
Jiani Cai
Zheqi Lv
Chen Li
Jing Lyu
Zhou Zhao
Shengyu Zhang
77
0
0
01 Dec 2025
VSRD++: Autolabeling for 3D Object Detection via Instance-Aware Volumetric Silhouette Rendering
Zihua Liu
Hiroki Sakuma
Masatoshi Okutomi
3DPC
112
0
0
01 Dec 2025
SAM3-UNet: Simplified Adaptation of Segment Anything Model 3
Xinyu Xiong
Zihuang Wu
Lei Lu
Yufa Xia
174
0
0
01 Dec 2025
PAI-Bench: A Comprehensive Benchmark For Physical AI
Fengzhe Zhou
Jiannan Huang
Jialuo Li
Deva Ramanan
Humphrey Shi
VGen
169
3
0
01 Dec 2025
Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction
Jiazhen Liu
Mingkuan Feng
Long Chen
96
0
0
29 Nov 2025
What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards
Minh-Quan Le
Yuanzhi Zhu
Vicky Kalogeiton
Dimitris Samaras
EGVM, VGen
91
1
0
29 Nov 2025
UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes
Shuo Ni
Di Wang
He Chen
Haonan Guo
Ning Zhang
Jing Zhang
AI4TS, VLM
225
1
0
28 Nov 2025
InstanceV: Instance-Level Video Generation
Yuheng Chen
Teng Hu
Jiangning Zhang
Zhucun Xue
Ran Yi
Lizhuang Ma
DiffM, VGen
127
0
0
28 Nov 2025
Object-Centric Data Synthesis for Category-level Object Detection
Vikhyat Agarwal
Jiayi Cora Guo
Declan Hoban
Sissi Zhang
Nicholas Moran
Peter Cho
Srilakshmi Pattabiraman
Shantanu Joshi
3DPC
229
0
0
28 Nov 2025
Video-CoM: Interactive Video Reasoning via Chain of Manipulations
H. Rasheed
Mohammed Zumri
Muhammad Maaz
Ming-Hsuan Yang
Fahad Shahbaz Khan
Salman Khan
AI4TS, LRM
168
0
0
28 Nov 2025
MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
Yuta Oshima
Daiki Miyake
Kohsei Matsutani
Yusuke Iwasawa
Masahiro Suzuki
Yutaka Matsuo
Hiroki Furuta
67
0
0
28 Nov 2025
Optimizing Multimodal Language Models through Attention-based Interpretability
Alexander Sergeev
Evgeny Kotelnikov
200
0
0
28 Nov 2025
DiffStyle360: Diffusion-Based 360° Head Stylization via Style Fusion Attention
Furkan Guzelant
Arda Goktogan
Tarık Kaya
Aysegül Dündar
78
0
0
27 Nov 2025
Improving Robotic Manipulation Robustness via NICE Scene Surgery
Sajjad Pakdamansavoji
Mozhgan Pourkeshavarz
Adam Sigal
Zhiyuan Li
Rui Heng Yang
Amir Rasouli
84
0
0
27 Nov 2025
Geometrically-Constrained Agent for Spatial Reasoning
Zeren Chen
Xiaoya Lu
Zhijie Zheng
Pengrui Li
Lehan He
Yijin Zhou
Jing Shao
Bohan Zhuang
Lu Sheng
LRM
121
0
0
27 Nov 2025
Comparing SAM 2 and SAM 3 for Zero-Shot Segmentation of 3D Medical Data
Satrajit Chakrabarty
Ravi Soni
MedIm, VLM
196
0
0
26 Nov 2025
AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs
Shuhan Xia
Peipei Li
Xuannan Liu
Dongsen Zhang
Xinyu Guo
Zekun Li
AAML
243
0
0
26 Nov 2025
ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
M. Naseer Subhani
224
1
0
26 Nov 2025
CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation
Shizhe Sun
Wataru Ohyama
224
0
0
26 Nov 2025
CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion
Dianbing Xi
Jiepeng Wang
Yuanzhi Liang
Xi Qiu
Jialun Liu
...
Yuchi Huo
Rui Wang
H. Huang
Chi Zhang
Xuelong Li
DiffM, VGen
213
0
0
26 Nov 2025
Zoo3D: Zero-Shot 3D Object Detection at Scene Level
Andrey Lemeshko
Bulat Gabdullin
Nikita Drozdov
Anton Konushin
D. Rukhovich
Maksim Kolodiazhnyi
3DPC, ObjD, VLM
441
0
0
25 Nov 2025
SAM-MI: A Mask-Injected Framework for Enhancing Open-Vocabulary Semantic Segmentation with SAM
Lin Chen
Yingjian Zhu
Qi Yang
Xin Niu
Kun Ding
Shiming Xiang
VLM
149
0
0
25 Nov 2025
Motion Marionette: Rethinking Rigid Motion Transfer via Prior Guidance
Haoxuan Wang
Jiachen Tao
Junyi Wu
Gaowen Liu
Ramana Rao Kompella
Yan Yan
VGen
197
0
0
25 Nov 2025
GigaWorld-0: World Models as Data Engine to Empower Embodied AI
GigaWorld Team
Angen Ye
Boyuan Wang
Chaojun Ni
Guan Huang
...
Yang Wang
Yukun Zhou
Z. Zhang
Z. Dong
Zheng Zhu
VGen, LM&Ro
391
2
0
25 Nov 2025
The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
Weijia Mao
Hao Chen
Zhenheng Yang
Mike Zheng Shou
EGVM
278
0
0
25 Nov 2025
MedSAM3: Delving into Segment Anything with Medical Concepts
Anglin Liu
Rundong Xue
Xu Cao
Yifan Shen
Yi Lu
Xiang Li
Qianqian Chen
Jintai Chen
MedIm, VLM
484
0
0
24 Nov 2025
Ref-SAM3D: Bridging SAM3D with Text for Reference 3D Reconstruction
Yun Zhou
Yaoting Wang
Guangquan Jie
Jinyu Liu
Henghui Ding
75
0
0
24 Nov 2025
LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models
Shuai Wang
D. Zhang
Tianyi Bai
Shitong Shao
Jiebo Luo
Jiaheng Wei
VLM
175
1
0
24 Nov 2025
RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models
Omar Alama
Darshil Jariwala
A. Bhattacharya
Seungchan Kim
Wenshan Wang
Sebastian A. Scherer
VLM
183
0
0
24 Nov 2025
CataractCompDetect: Intraoperative Complication Detection in Cataract Surgery
Bhuvan Sachdeva
Sneha Kumari
Rudransh Agarwal
Shalaka Kumaraswamy
Niharika Singri Prasad
...
Raphael Lechtenboehmer
M. Wintergerst
T. Schultz
K. Murali
Mohit Jain
99
0
0
24 Nov 2025
IDSplat: Instance-Decomposed 3D Gaussian Splatting for Driving Scenes
Carl Lindström
Mahan Rafidashti
M. Fatemi
Lars Hammarstrand
Martin R. Oswald
Lennart Svensson
3DGS
191
1
0
24 Nov 2025