Understanding objects through multiple sensory modalities is fundamental to human perception, enabling cross-sensory integration and richer comprehension. For AI and robotic systems to replicate this ability, access to diverse, high-quality multi-sensory data is critical. Existing datasets are often limited by their focus on controlled environments, simulated objects, or restricted modality pairings. We introduce X-Capture, an open-source, portable, and cost-effective device for real-world multi-sensory data collection, capable of capturing correlated RGBD images, tactile readings, and impact audio. With a build cost under $1,000, X-Capture democratizes the creation of multi-sensory datasets, requiring only consumer-grade tools for assembly. Using X-Capture, we curate a sample dataset of 3,000 total points on 500 everyday objects from diverse, real-world environments, offering both richness and variety. Our experiments demonstrate the value of both the quantity and the sensory breadth of our data for both pretraining and fine-tuning multi-modal representations for object-centric tasks such as cross-sensory retrieval and reconstruction. X-Capture lays the groundwork for advancing human-like sensory representations in AI, emphasizing scalability, accessibility, and real-world applicability.
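Since each data point pairs a correlated RGBD image, tactile reading, and impact-audio clip for the same object, a minimal Python sketch of how such a point might be organized is shown below. This is purely illustrative and does not reflect the released dataset's actual file format; the class and field names (CapturePoint, impact_audio, etc.) are hypothetical.

```python
# Hypothetical sketch of one correlated multi-sensory capture point.
# Field names and shapes are assumptions for illustration only.
from dataclasses import dataclass
import numpy as np


@dataclass
class CapturePoint:
    object_id: str             # identifier of the everyday object captured
    rgb: np.ndarray            # H x W x 3 color image
    depth: np.ndarray          # H x W depth map aligned with the color image
    tactile: np.ndarray        # tactile sensor reading at the contact point
    impact_audio: np.ndarray   # mono waveform recorded at object impact
    audio_sample_rate: int     # samples per second, e.g. 44100


def as_training_example(point: CapturePoint) -> dict:
    """Collect the aligned modalities of one capture point into a dict,
    as might be fed to a multi-modal representation model."""
    return {
        "rgbd": np.dstack([point.rgb, point.depth[..., None]]),  # H x W x 4
        "touch": point.tactile,
        "audio": point.impact_audio,
    }
```

A structure along these lines makes the cross-modal correspondence explicit, which is what cross-sensory retrieval and reconstruction tasks rely on.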
@article{clarke2025_2504.02318,
  title={X-Capture: An Open-Source Portable Device for Multi-Sensory Learning},
  author={Samuel Clarke and Suzannah Wistreich and Yanjie Ze and Jiajun Wu},
  journal={arXiv preprint arXiv:2504.02318},
  year={2025}
}