Real-time Semantic 6D Pose Estimation

Automatically selects a CLIP-guided semantic mask and runs NVIDIA FoundationPose for real-time 6D pose tracking

Skills

3D Computer Vision, Vision Foundation Models, Real-time deployment, RGB-D, Camera Calibration
Frameworks: PyTorch, OpenCV, Open3D, TensorRT, ONNX

Summary

Deployed a real-time 6D pose estimation system using RGB-D input from an Intel RealSense D435i camera. On the first frame, FastSAM generates candidate segmentation masks and CLIP selects the mask that best matches a text prompt describing the target object. NVIDIA's FoundationPose model then consumes the selected mask as semantic guidance and produces single-shot 6D pose predictions given a pre-existing mesh of the target object, in this case a tote or an Xbox controller. The mesh was captured with a LiDAR-enabled iPhone through an AR scanning application and refined manually in Blender to ensure accuracy.
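The CLIP-guided mask selection step can be sketched as follows. This is a minimal, self-contained illustration: in the actual pipeline the text and per-mask image embeddings come from CLIP's encoders and the candidate masks from FastSAM, but here they are stubbed with random vectors so only the selection logic (cosine similarity, argmax) is shown.

```python
import numpy as np

def select_mask(text_emb: np.ndarray, mask_embs: np.ndarray) -> int:
    """Return the index of the candidate mask whose embedding best matches the prompt.

    text_emb:  (d,)   embedding of the text prompt (e.g. from CLIP's text encoder)
    mask_embs: (n, d) embeddings of each masked image crop (e.g. from CLIP's image encoder)
    """
    # L2-normalize so the dot product equals cosine similarity
    t = text_emb / np.linalg.norm(text_emb)
    m = mask_embs / np.linalg.norm(mask_embs, axis=1, keepdims=True)
    scores = m @ t
    return int(np.argmax(scores))

# Stub embeddings standing in for real CLIP outputs (hypothetical shapes):
rng = np.random.default_rng(0)
text_emb = rng.normal(size=512)       # one prompt embedding
mask_embs = rng.normal(size=(8, 512)) # eight candidate FastSAM masks
best = select_mask(text_emb, mask_embs)
print(best)  # index of the best-matching mask, in 0..7
```

In the deployed system, the winning mask (rather than its index) is passed to FoundationPose as the first-frame semantic guidance.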

GitHub
CLIP-guided semantic mask generated by FastSAM for the tote object. Text prompt: "Black dark box container placed on top of table."
3D mesh of the tote object used for 6D pose estimation, captured with a LiDAR-enabled iPhone AR application and refined in Blender.
Real-time 6D pose estimation system in action, tracking the target object's pose frame by frame.