ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K

Kaixuan Wang1*, Tianxing Chen1,2*, Jiawei Liu10*, Honghao Su10*, Shaolong Zhu2*, Mingxuan Wang10, Zixuan Li10,
Yue Chen8, Huan-ang Gao9, Yusen Qin4, Jiawei Wang3,6, Qixuan Zhang3,5, Lan Xu5, Jingyi Yu5, Yao Mu7†, Ping Luo1†


1The University of Hong Kong, 2Xspark AI, 3Deemos Tech, 4D-Robotics, 5ShanghaiTech University, 6University of California, San Diego, 7Shanghai Jiao Tong University, 8Peking University, 9Tsinghua University, 10Shenzhen University
* Equal contribution, † Corresponding author


Overview

Abstract. Learning in simulation provides a scalable foundation for robotic manipulation, but this paradigm is often limited by the scarcity of data-generation-ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data-generation-ready digital object twins. Our pipeline transforms a single image into a simulation-ready, semantically annotated 3D asset, enabling large-scale robotic manipulation data generation. Using this pipeline, we construct ManiTwin-100K, a dataset of 100K high-quality annotated 3D assets. Each asset is equipped with physical properties, language descriptions, functional annotations, and verified manipulation proposals. Experiments demonstrate that ManiTwin provides an efficient asset synthesis and annotation workflow, and that ManiTwin-100K offers high-quality, diverse assets for manipulation data generation, random scene synthesis, and VQA data generation, establishing a strong foundation for scalable simulation data synthesis and policy learning.
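To make the per-asset annotations concrete, the following is a minimal Python sketch of what one record in such a dataset could look like. All field names and types here are illustrative assumptions, not the actual ManiTwin-100K schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of one ManiTwin-100K asset record; field names
# are assumptions for illustration, not the dataset's real schema.
@dataclass
class AssetTwin:
    asset_id: str
    mesh_path: str                  # simulation-ready mesh (e.g., .obj / .urdf)
    mass_kg: float                  # VLM-estimated physical property
    friction: float                 # VLM-estimated physical property
    description: str                # language description
    functional_points: list[tuple[float, float, float]] = field(default_factory=list)
    grasp_poses: list[list[float]] = field(default_factory=list)  # verified poses
```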

Our Pipeline


ManiTwin Pipeline. Our pipeline consists of three stages: (I) Asset Generation transforms input images into simulation-ready 3D meshes with VLM-estimated physical properties; (II) Asset Annotation combines farthest point sampling (FPS) based candidate sampling, VLM-driven functional and grasp point selection, and learning-based grasp proposal generation; (III) Verification validates annotations through physics simulation and human review, producing fully annotated digital twins ready for robotic manipulation research.
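As a reference for the candidate-sampling step in Stage II, here is a minimal sketch of greedy farthest point sampling over an object's surface point cloud; the exact FPS variant, seeding strategy, and parameters used in our pipeline are not specified here.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int) -> np.ndarray:
    """Greedy FPS over an (N, 3) point cloud; returns indices of k candidates.

    A minimal sketch of FPS-based candidate sampling, assuming a plain
    greedy variant with a random seed point.
    """
    n = points.shape[0]
    selected = np.zeros(k, dtype=np.int64)
    # Squared distance from every point to its nearest selected point.
    dist = np.full(n, np.inf)
    selected[0] = np.random.randint(n)  # arbitrary seed point
    for i in range(1, k):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = int(np.argmax(dist))  # farthest remaining point
    return selected
```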

Annotation Visualization


Application 1: Manipulation Data Generation Across Diverse Embodiments


ManiTwin Data Generation. (Left) Cross-embodiment manipulation trajectories across multiple end-effectors using shared object annotations. (Right) Grasping data generation.
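One way shared object annotations can drive cross-embodiment generation is by composing an object-frame grasp pose with a per-gripper tool offset, so the same annotation yields a target pose for any end-effector. The sketch below illustrates this with hypothetical transform names; the actual retargeting procedure may differ.

```python
import numpy as np

def retarget_grasp(T_world_obj: np.ndarray,
                   T_obj_grasp: np.ndarray,
                   T_grasp_tcp: np.ndarray) -> np.ndarray:
    """Compose a world-frame tool-center-point target for one gripper.

    T_world_obj:  object pose in the world (4x4).
    T_obj_grasp:  verified grasp pose in the object frame (from the asset).
    T_grasp_tcp:  hypothetical per-gripper offset (e.g., finger depth);
                  this name is an assumption for illustration.
    """
    return T_world_obj @ T_obj_grasp @ T_grasp_tcp
```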

Application 2: Layout Generation


Layout Generation. Using placement and collision radius annotations, we generate diverse multi-object scene layouts that are collision-free and physically plausible.
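A simple way to realize collision-radius-based layout generation is rejection sampling in the workspace plane, as sketched below. The workspace size, retry budget, and the sampler actually used by ManiTwin are assumptions for illustration.

```python
import math
import random

def sample_layout(radii: list[float], extent: float = 0.6,
                  max_tries: int = 1000) -> list[tuple[float, float]]:
    """Place objects with the given collision radii in a square workspace
    of side `extent` (meters) so that no two footprints overlap.
    A minimal rejection-sampling sketch, not the paper's actual sampler.
    """
    placed: list[tuple[float, float]] = []
    for r in radii:
        for _ in range(max_tries):
            x = random.uniform(r, extent - r)
            y = random.uniform(r, extent - r)
            # Accept only if the new footprint clears every placed one.
            if all(math.hypot(x - px, y - py) >= r + radii[j]
                   for j, (px, py) in enumerate(placed)):
                placed.append((x, y))
                break
        else:
            raise RuntimeError("workspace too crowded for requested objects")
    return placed
```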

Application 3: VQA Data Generation


VQA Data Generation. Each pair links manipulation-relevant questions to grounded scene understanding, covering language grounding, functional planning, scene understanding, task planning, and object detection.
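Such QA pairs can be derived mechanically from the asset annotations. Below is a hypothetical template-based sketch; the question templates, answer fields, and category labels are illustrative assumptions, not the actual ManiTwin generation rules.

```python
# Hypothetical template-based VQA generation from one annotated asset.
# The dict keys ("description", "functional_annotation", ...) are
# assumptions for illustration.
def make_vqa_pairs(obj: dict) -> list[dict]:
    name = obj["description"]
    return [
        {"q": f"Which object in the scene is {name}?",
         "a": obj["asset_id"],
         "type": "language grounding"},
        {"q": f"How would you use the {name}?",
         "a": obj["functional_annotation"],
         "type": "functional planning"},
        {"q": f"Where should the gripper contact the {name}?",
         "a": str(obj["grasp_point"]),
         "type": "task planning"},
    ]
```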

Distribution


Assets Example


ManiTwin-100K Dataset Examples. Each row shows one object. From left to right: input image, generated 3D asset, mesh visualization, and samples of simulation-verified grasp poses.

BibTeX

If you have any questions, please contact us at kaixuan.wang@connect.hku.hk.