OpenVox: Real-time Instance-level Open-vocabulary Probabilistic Voxel Representation

IROS 2025 Under Review

Beijing Institute of Technology
*Equal contribution



We introduce OpenVox, a framework for real-time, instance-level, open-vocabulary probabilistic voxel representation.

Abstract

In recent years, vision-language models (VLMs) have advanced open-vocabulary mapping, enabling mobile robots to reconstruct their environment and understand it at a high semantic level simultaneously. While integrated object cognition helps mitigate semantic ambiguity in point-wise feature maps, efficiently obtaining rich semantic understanding and robust incremental reconstruction at the instance level remains challenging. To address these challenges, we introduce OpenVox, a real-time incremental open-vocabulary probabilistic instance voxel representation. In the front-end, we design an efficient instance segmentation and comprehension pipeline that enhances language reasoning through caption encoding. In the back-end, we implement probabilistic instance voxels and formalize the cross-frame incremental fusion process as two subtasks, instance association and live map evolution, ensuring robustness to sensor and segmentation noise. Extensive evaluations across multiple datasets demonstrate that OpenVox achieves state-of-the-art performance in zero-shot instance segmentation, semantic segmentation, and open-vocabulary retrieval. Furthermore, real-world robotics experiments validate OpenVox's capability for stable, real-time operation.

Framework

In the front-end, the Instance Segmentation & Understanding module implements an efficient pipeline for instance-level semantic understanding, powered by caption encoding. It processes RGB image frames to generate 2D instance segmentation masks and their corresponding semantic annotations. In the back-end, we project the 2D masks onto a 3D map and update the map probabilistically. This process is modeled as two subtasks: instance association and map updating. The first subtask associates the instances in the observed masks with the instances already in the map by solving a maximum likelihood estimation (MLE) problem, while the second subtask updates each voxel's instance vector by solving a maximum a posteriori (MAP) problem.
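The two back-end subtasks can be illustrated with a minimal sketch. This is not the OpenVox implementation: the page does not specify the probabilistic model, so the sketch assumes each voxel holds a categorical distribution over instance IDs with a symmetric Dirichlet prior. All names (`ALPHA`, `map_probs`, `associate`, `update`) are hypothetical, voxels are plain integer indices, and the projection of 2D masks into 3D is assumed to have already produced the list of voxels each mask covers.

```python
import numpy as np

ALPHA = 2.0  # assumed symmetric Dirichlet concentration; must be > 1 for a mode


def map_probs(counts, alpha=ALPHA):
    """MAP (posterior-mode) instance distribution for each voxel.

    counts has shape (num_voxels, num_instances). A zero-count column is
    appended to stand for a hypothetical, not-yet-seen instance.
    """
    padded = np.hstack([counts, np.zeros((counts.shape[0], 1))])
    smoothed = padded + (alpha - 1.0)
    return smoothed / smoothed.sum(axis=1, keepdims=True)


def associate(counts, voxel_ids, margin=1e-9):
    """MLE association: return the map instance with the highest mean
    log-likelihood over the observed voxels, or -1 if the unseen-instance
    hypothesis explains the mask at least as well."""
    if counts.shape[1] == 0:
        return -1
    mean_ll = np.log(map_probs(counts[voxel_ids])).mean(axis=0)
    best = int(np.argmax(mean_ll[:-1]))
    return best if mean_ll[best] > mean_ll[-1] + margin else -1


def update(counts, voxel_ids, instance):
    """Add one observation of `instance` to every observed voxel.

    instance == -1 first appends a column for a new instance.
    """
    if instance == -1:
        counts = np.hstack([counts, np.zeros((counts.shape[0], 1))])
        instance = counts.shape[1] - 1
    else:
        counts = counts.copy()
    counts[voxel_ids, instance] += 1.0
    return counts, instance


# Toy map with 6 voxels and no instances yet.
counts = np.zeros((6, 0))
# Voxels covered by each incoming mask; the third mask spills onto voxel 3.
frames = [[0, 1, 2], [3, 4, 5], [0, 1, 2, 3], [3, 4, 5]]
assigned = []
for mask_voxels in frames:
    counts, inst = update(counts, mask_voxels, associate(counts, mask_voxels))
    assigned.append(inst)

# Most probable instance per voxel, ignoring the unseen-instance column.
labels = map_probs(counts)[:, :-1].argmax(axis=1)
print(assigned, labels.tolist())
```

In this toy run the third mask leaks onto voxel 3, which belongs to the other object. Because each voxel accumulates evidence rather than storing only the latest label, a later consistent observation outvotes the leak, which is the kind of robustness to segmentation noise that a probabilistic voxel update is meant to provide.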


Results

3D Zero-shot Instance Segmentation

Visualisation of scenes obtained by different methods.



Panels: ConceptGraphs, Ours, ground-truth (GT) segmentation, RGB.


3D Zero-shot Semantic Segmentation

Visualisation of scenes obtained by different methods.



Panels: ConceptFusion, ConceptGraphs, Open-Fusion, Ours, ground-truth (GT) segmentation, RGB.


Open-vocabulary Instance Retrieval



Real-World Onboard Experiment

OpenVox can run online in the real world.
