- Chat-Scene provides all the prepared data on HuggingFace. Simply download the files and place them in the `annotations/` directory; you'll then be ready to run and test the code.
- We provide preprocessed VL-SAT features for semantic relations between objects, as well as additional text annotations, on Yandex Disk.
- We provide VL-SAT features for fully connected graphs with semantic relations between objects on Yandex Disk (`output_vlsat.zip`).
- Download the ScanNet dataset by following the ScanNet instructions.
- Extract object masks using a pretrained 3D detector:
  - Use Mask3D for instance segmentation. We used the checkpoint pretrained on ScanNet200.
  - The complete predicted results (especially the masks) for the train/validation sets are too large to share (~40 GB), so we share the post-processed results instead:
    - Unzip the `mask3d_inst_seg.tar.gz` file.
    - Each file under `mask3d_inst_seg` contains the predicted results for a single scene: a list of segmented instances with their labels and segmented indices.
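The exact on-disk format of the per-scene files is not documented here, but a typical consumer turns each instance's label and point indices into a per-point instance-id array. The sketch below assumes a hypothetical `instances` structure (a list of dicts with `label` and `indices` keys), purely for illustration:

```python
import numpy as np

def instances_to_point_labels(num_points, instances):
    """Build a per-point instance-id array (-1 = unassigned) from a list of
    segmented instances. Each instance here is a dict with a 'label' and the
    'indices' of the scene points it covers -- a hypothetical layout, not the
    actual mask3d_inst_seg file format."""
    point_inst = np.full(num_points, -1, dtype=np.int64)
    labels = []
    for inst_id, inst in enumerate(instances):
        point_inst[np.asarray(inst["indices"], dtype=np.int64)] = inst_id
        labels.append(inst["label"])
    return point_inst, labels

# toy scene with 6 points and two instances
instances = [
    {"label": "chair", "indices": [0, 1, 2]},
    {"label": "table", "indices": [4, 5]},
]
point_inst, labels = instances_to_point_labels(6, instances)
print(point_inst.tolist())  # [0, 0, 0, -1, 1, 1]
print(labels)               # ['chair', 'table']
```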
- Process object masks and prepare annotations:
  - If you use Mask3D for instance segmentation, set `segment_result_dir` in `run_prepare.sh` to the output directory of Mask3D.
  - If you use the downloaded `mask3d_inst_seg` directly, set `segment_result_dir` to `None` and set `inst_seg_dir` to the path of `mask3d_inst_seg`.
  - Run: `bash preprocess/run_prepare.sh`
- Extract 3D features using a pretrained 3D encoder:
  - Follow Uni3D to extract 3D features for each instance. We used the pretrained model uni3d-g.
  - We also provide modified code for feature extraction in this forked repository. Set the `data_dir` here to the path of `${processed_data_dir}/pcd_all` (`processed_data_dir` is an intermediate directory set in `run_prepare.sh`). After preparing the environment, run `bash scripts/inference.sh`.
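Uni3D-style encoders typically expect each instance point cloud centered at the origin and scaled to a unit sphere. The snippet below is a generic sketch of that common normalization step, not the forked repository's exact preprocessing:

```python
import numpy as np

def normalize_instance_pcd(xyz):
    """Center an instance point cloud at the origin and scale it to fit inside
    a unit sphere -- a common preprocessing step before a 3D encoder such as
    Uni3D (generic sketch, not the repository's exact implementation)."""
    xyz = np.asarray(xyz, dtype=np.float32)
    centered = xyz - xyz.mean(axis=0)
    scale = np.linalg.norm(centered, axis=1).max()
    return centered / max(float(scale), 1e-8)

pcd = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
norm = normalize_instance_pcd(pcd)
print(round(float(np.linalg.norm(norm, axis=1).max()), 4))  # 1.0
```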
- Extract 2D features using a pretrained 2D encoder:
  - We followed OpenScene's code to compute the mapping between 3D points and 2D image pixels, which lets each object be projected onto multi-view images. Based on the projected masks on the images, we extract and merge DINOv2 features from the multi-view images for each object.
  - [TODO] The detailed implementation will be released.
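Since the official implementation is still pending, here is an illustrative sketch of one common way to merge per-view features into a single object descriptor: mask-average each view's feature map over the object's projected pixels, then average across views weighted by visible-pixel count. The array shapes and weighting scheme are assumptions, not the released pipeline:

```python
import numpy as np

def merge_multiview_features(view_feats, view_masks):
    """Merge per-view feature maps into one object feature vector.
    view_feats: list of (H, W, C) feature maps (e.g. upsampled DINOv2 patch
    features); view_masks: list of (H, W) boolean masks of the object's
    projected pixels in each view. Each view contributes its mask-averaged
    feature, weighted by how many of the object's pixels are visible.
    Hypothetical shapes and weighting -- the official code is unreleased."""
    feats, weights = [], []
    for fmap, mask in zip(view_feats, view_masks):
        n = int(mask.sum())
        if n == 0:  # object not visible in this view, skip it
            continue
        feats.append(fmap[mask].mean(axis=0))
        weights.append(n)
    w = np.asarray(weights, dtype=np.float32)
    return (np.stack(feats) * w[:, None]).sum(axis=0) / w.sum()

# toy example: 2 views, 2x2 "images", 3-dim features
f1 = np.ones((2, 2, 3)); f2 = 3.0 * np.ones((2, 2, 3))
m1 = np.array([[True, True], [False, False]])   # 2 visible pixels
m2 = np.array([[True, False], [False, False]])  # 1 visible pixel
merged = merge_multiview_features([f1, f2], [m1, m2])
print(merged)  # weighted mean (2*1 + 1*3)/3, i.e. ~1.667 per channel
```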
- Obtain connections based on the N nearest neighbors of each object, filtering the fully connected graphs with VL-SAT features, for Mask3D segmentation. To do this, run the `prepare_filtered_mask3d_gnn_data.py` script after updating the paths to the directories containing the fully connected graphs for each scene, the object attributes, and the ScanNet splits. The number of nearest neighbors can be adjusted via the `KNN` parameter at the beginning of the script.
- Do the same for GT segmentation by running the `prepare_gnn_data.py` script, again after updating the paths to the fully connected graphs for each scene, the object attributes, and the ScanNet splits; the number of nearest neighbors is likewise set by the `KNN` parameter at the beginning of the script.
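The filtering performed by both scripts can be sketched as follows: given object centroids, keep only the edges of the fully connected graph that link each object to its `KNN` nearest neighbors. This is a minimal NumPy sketch with made-up data, not the scripts themselves:

```python
import numpy as np

KNN = 2  # number of nearest neighbors to keep per object

def knn_edge_filter(centroids, knn=KNN):
    """Return the set of directed edges (i, j) that keep, for each object i,
    only its knn nearest neighbors by centroid distance. The fully connected
    graph (all i != j pairs) is thereby filtered down to N * knn edges."""
    c = np.asarray(centroids, dtype=np.float32)
    d = np.linalg.norm(c[:, None] - c[None, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)           # exclude self-loops
    nn = np.argsort(d, axis=1)[:, :knn]   # indices of the knn closest objects
    return {(i, int(j)) for i in range(len(c)) for j in nn[i]}

# four objects on a line; the last one is far from the rest
centroids = [[0, 0, 0], [1, 0, 0], [2, 0, 0], [10, 0, 0]]
edges = knn_edge_filter(centroids)
print(sorted(edges))  # 4 objects * KNN neighbors = 8 directed edges
```

Edges are kept per source object, so the result is generally asymmetric (the far-away object still connects to its two nearest neighbors even though they do not connect back to it).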