
Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction

Boran Wen
Ye Lu
Sirui Wang
Keyan Wan
Jiahong Zhou
Junxuan Liang
Xinpeng Liu
Bang Xiao
Ruiyang Liu
Yong-Lu Li
Main: 14 pages, 21 figures, 7 tables; Appendix: 3 pages
Abstract

Generalist robots must learn from diverse, large-scale human-object interactions (HOI) to operate robustly in the real world. Monocular internet videos offer a nearly limitless and readily available source of such data, capturing an unparalleled diversity of human activities, objects, and environments. However, accurately and scalably extracting 4D interaction data from these in-the-wild videos remains a significant unsolved challenge. To overcome the annotation bottleneck, we introduce an efficient sparse contact annotation paradigm. To scale this process, we develop InterPoint, a multi-modal predictor that drives a human-in-the-loop data engine. Building upon these efficiently acquired annotations, we introduce 4DHOISolver, a novel optimization framework that constrains the ill-posed 4D HOI reconstruction problem while maintaining high spatio-temporal coherence and physical plausibility. Leveraging this framework, we introduce Open4DHOI, a new large-scale 4D HOI dataset featuring a diverse catalog of 135 object types and 133 actions. Furthermore, we demonstrate the effectiveness of our reconstructions by enabling an RL-based agent to imitate the recovered motions. Data and code will be publicly available at this https URL.
