Jiading Fang*, Xiangshan Tan*, Shengjie Lin*, Igor Vasiljevic, Vitor Guizilini, Hongyuan Mei, Rares Ambrus Gregory Shakhnarovich Matthew Walter
Transcrib3d_real_robot_demo_compressed.mp4
Transcrib3D reasons about and acts according to complex 3D referring expressions with real robots.
For evaluation, a small number of packages are required, include numpy, openai and tenacity.
pip install numpy openai tenacity
Additional packages are needed for data preprocessing:
pip install plyfile scikit-learn scipy pandas
Set up your OpenAI API key as an environment variable OPENAI_API_KEY
:
export OPENAI_API_KEY=xxx
Since the ReferIt3D dataset, which includes sr3d and nr3d, and the ScanRefer dataset depend on ScanNet, we first preproces the ScanNet data.
To make things easier, we provide the bounding boxes for each scene at data/scannet_object_info
. Currently, it only includes ground-truth bounding boxes (which is the setting for NR3D and SR3D from the Referit3D benchmark). Detected bounding boxes will be provided later. There is no need to prepare the original ScanNet scene data for the sole purpose of testing (original scene data are still useful for debugging and visualization).
You could jump to Evaluation to get a quick start.
If you want to generate the bounding boxes from the original ScanNet data, follow the steps below.
Follow the official instructions to download the data. This involves filling out a form and emailing the ScanNet authors. Then, you will receive a response email with detailed instructions and a Python script download-scannet.py
for downloading the data. Run the script to download certain types of data:
python download-scannet.py -o [directory in which to download] --type [file suffix]
Since the original 1.3TB ScanNet data contains many types of data files, some of which are not necessary for this project (e.g., the RGBD stream .sens
type), you could use the optional argument --type
to download only the necessary types:
_vh_clean_2.ply _vh_clean_2.labels.ply _vh_clean_2.0.010000.segs.json _vh_clean.segs.json .aggregation.json _vh_clean.aggregation.json .txt
Run the following shell script/CMD instruction to download them (to avoid any key-pressing during download, comment the code key = input('')
at line 147 and 225):
# bash
download_dir="your_scannet_download_directory"
suffixes=(
"_vh_clean_2.ply"
"_vh_clean_2.labels.ply"
"_vh_clean_2.0.010000.segs.json"
"_vh_clean.segs.json"
".aggregation.json"
"_vh_clean.aggregation.json"
".txt"
)
for suffix in "${suffixes[@]}"; do
python download-scannet.py -o "$download_dir" --type "$suffix"
done
CMD
set download_dir="your_scannet_download_directory"
set suffixes=_vh_clean_2.ply;_vh_clean_2.labels.ply;_vh_clean_2.0.010000.segs.json;_vh_clean.segs.json;.aggregation.json;_vh_clean.aggregation.json;.txt
for %s in (%suffixes%) do (
python download-scannet.py -o %download_dir% --type %s
)
After downloading, your directory structure should look like:
your_scannet_download_directory/
|-- scans/
| |-- scene0000_00/
| | |-- scene0000_00_vh_clean_2.ply
| | |-- scene0000_00_vh_clean_2.labels.ply
| | |-- scene0000_00_vh_clean_2.0.010000.segs.json
| | |-- scene0000_00_vh_clean.segs.json
| | |-- scene0000_00.aggregation.json
| | |-- scene0000_00_vh_clean.aggregation.json
| | |-- scene0000_00.txt
| |-- scenexxxx_xx/
| | |-- ...
|-- scans_test/
| |-- scene0707_00/
| |-- ...
|-- scannetv2-labels.combined.tsv
Next, use the axis align matrices (recorded in scenexxxx_xx.txt
) to transform the coordinates of vertices:
python preprocessing/align_scannet_mesh.py --scannet_download_path [your_scannet_download_directory]
Follow the ReferIt3D official guide to download nr3d.csv
, sr3d.csv
, sr3d_train.csv
, sr3d_test.csv
and save them in the data/referit3d
folder.
Follow the ScanRefer official guide to download the dataset and place them within the data/scanrefer
folder.
In this step, we process the ScanNet data to extract the quantitative and semantic information of the objects in each scene.
For object instance segmentation, we use either ground-truth (ScanNet official) data or an off-the-shelf segmentation tool (Mask3d).
To use ground-truth segmentation data, run:
python preprocessing/gen_obj_list.py --scannet_download_path [your_scannet_download_directory] --bbox_type gt
You can find the results in scannet_download_path/scans/objects_info/
and scannet_download_path/scans_test/objects_info/
.
To use Mask3D segmentation data, first follow the Mask3D official guide to produce the instance segmentation results, then run:
python preprocessing/gen_obj_list.py --scannet_download_path [your_scannet_download_directory] \
--bbox_type mask3d \
--mask3d_result_path [your_mask3d_result_directory]
# Note: mask3d_result_path should look like xxx/Mask3D/eval_output/instance_evaluation_mask3d_export_scannet200_0/val/
You can find the results in scannet_download_path/scans/objects_info_mask3d_200c/
.
Run the first 50 data records of nr3d_test_sampled1000.csv with config index 1:
python main.py --workspace_path /path/to/Transcribe3D/project/folder --scannet_data_root /path/to/ScanNet/Data/ --mode eval --dataset_type nr3d --conf_idx 1 --range 2 52
Remember to replace the paths.
Note that scannet_data_root
can be set to /path/to/Transcribe3D/project/folder/data/scannet_object_info
as we already provide the ground-truth ScanNet bounding boxes. If you preprocess the data by yourself, it can be set to scannet_download_path/scans/objects_info/
.
-
To run our model on different refering datasets, simply modify the
--dataset_type
setting to [sr3d/nr3d/scanrefer]. -
To select the evaluation range of the dataset, modify the
--range
setting. For Sr3D and Nr3D, which use .csv files, the minimum number is 2. For ScanRefer, which uses .json files, the minimum number is 0. -
For convenience, more configurations are placed in
config/config.py
. There are 3 dictionaries inside: confs_nr3d, confs_sr3d and confs_scanrefer. Each of them contains several configurations of that dataset. The meaning of different configurations can be understood from the variable names. Modify the--conf_idx
setting to select a configuration. You can also add your own configurations. -
More information can be found by running
python main.py -h
.
After running the evaluation with specific configurations, a folder will be created that contains configuration infomation with a name that starts with eval_results_
under the results
folder. Under this folder, there will be subfolders named after the start time of the experiment.
You might run one or more experiments of a evaluation configuration, and get some subfolders named according to the formatted time. The time(s) are used to analyze the results. An example timestamp looks like 2023-10-26-15-48-12
.
Specify the formatted time(s) after the --ft
setting:
python main.py --workspace_path /path/to/Transcribe3D/project/folder/ --scannet_data_root /path/to/ScanNet/Data/ --mode result --dataset_type nr3d --conf_idx 1 --ft time1 time2
Check how many cases are provided with detected boxes that has 0.5 or higher IOU with the ground-truth box, which indicates the upper bound of performance on ScanRefer.
python main.py --workspace_path /path/to/Transcribe3D/project/folder/ --scannet_data_root /path/to/ScanNet/Data/ --mode check_scanrefer --dataset_type scanrefer --conf_idx 1
We provide scripts for finetuning on open-source LLMs (e.g., codeLlama, Llama2) within the finetune
directory.
The script uses the Huggingface trl
(https://github.com/huggingface/trl) library to perform finetuning jobs. Main dependencies include Huggingface accelerate
, transformers
, datasets
, peft
, trl
.
We provide processed finetuning data following the OpenAI finetune file protocal in the finetune/finetune_files
directory. It contains many different settings aligned as described in our paper. The original processing script is finetune/prepare_finetuning_data.py
, which processes results from the main script.
We provide two example shell scripts to run the finetuning jobs, one with codellama
model (finetune/trl_finetune_codellama_instruct.sh
) and the other with llama2_chat
model (finetune/trl_finetune_llama2_chat.sh
). You can also customize finetuning job using finetune/trl_finetune.py
.
- The finetuned open-source models (e.g., codeLlama, Llama2) still under-performs the finetuned closed-source model (gpt-3.5-turbo) as of September 2023. We expect the situation might change dramatically in the coming future with quickly improving open-source models.
- The resource required for finetuning is roughly 24GB+ GPU memory for 7B models and 36GB+ GPU memory for 13B models.
If you find our paper useful, and use it in a publication, we would appreciate it if you cite it as:
@misc{fang2024transcrib3d3dreferringexpression,
title={Transcrib3D: 3D Referring Expression Resolution through Large Language Models},
author={Jiading Fang and Xiangshan Tan and Shengjie Lin and Igor Vasiljevic and Vitor Guizilini and Hongyuan Mei and Rares Ambrus and Gregory Shakhnarovich and Matthew R Walter},
year={2024},
eprint={2404.19221},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2404.19221},
}