This repository generates per-pixel features using the pretrained Segment-Anything (SAM) and CLIP models. The pixel-aligned features are useful for downstream tasks such as visual grounding and VQA. First, SAM generates segmentation masks. Then, the cropped image regions are fed into CLIP to extract semantic features. Finally, each pixel is assigned the semantic features of its associated masks.
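For illustration, here is a minimal sketch of that pipeline. The actual implementation lives in `feature_autogenerator.py`; the model choices, cropping strategy, and averaging over overlapping masks below are assumptions for the example, not the repository's exact code.

```python
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Generate class-agnostic masks with SAM.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
mask_generator = SamAutomaticMaskGenerator(sam)
image = np.array(Image.open("example.jpg").convert("RGB"))
masks = mask_generator.generate(image)  # list of dicts with "segmentation" and "bbox"

# 2. Encode a crop around each mask with CLIP (ViT-B/32 -> 512-dim embeddings).
model, preprocess = clip.load("ViT-B/32", device=device)
h, w = image.shape[:2]
features = torch.zeros(h, w, 512)  # per-pixel feature map
counts = torch.zeros(h, w, 1)
for m in masks:
    x, y, bw, bh = [int(v) for v in m["bbox"]]
    crop = Image.fromarray(image[y:y + bh, x:x + bw])
    with torch.no_grad():
        emb = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
    emb = emb.squeeze(0).float().cpu()
    # 3. Assign the crop's CLIP feature to every pixel inside the mask.
    seg = torch.from_numpy(m["segmentation"])
    features[seg] += emb
    counts[seg] += 1

features = features / counts.clamp(min=1)  # average over overlapping masks
```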
Here, we show open-vocabulary segmentation without any training or finetuning.
Input Image | Segmentation Result
---|---
- You may need to install Segment-Anything and CLIP (or OpenCLIP).
- Download one of the SAM checkpoints from the SAM repository (example commands below).
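For example (assuming a Python environment with PyTorch; the ViT-H checkpoint is shown here, but any SAM checkpoint works):

pip install git+https://github.com/facebookresearch/segment-anything.git

pip install git+https://github.com/openai/CLIP.git

wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth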
You can generate the per-pixel features of an image with:
python feature_autogenerator.py --image_path {image_path} --output_path {output_path} --output_name {feature_file_name} --checkpoint_dir {checkpoint_dir}
Or directly generate segmentation results from a given config file:
python segment.py --config_path {config_path}
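As a sketch of how the per-pixel features enable open-vocabulary segmentation: compare each pixel's feature against CLIP text embeddings of the label set and take the best-matching label. The feature file name, its (H, W, 512) shape, and the label list below are hypothetical and not the repository's config format.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical per-pixel feature map of shape (H, W, 512), e.g. saved by the step above.
features = torch.load("features.pt")
labels = ["a dog", "a cat", "grass", "sky"]  # any open-vocabulary label set

with torch.no_grad():
    text = model.encode_text(clip.tokenize(labels).to(device)).float().cpu()

# Cosine similarity between every pixel feature and every label embedding.
pix = torch.nn.functional.normalize(features.reshape(-1, features.shape[-1]), dim=-1)
txt = torch.nn.functional.normalize(text, dim=-1)
pred = (pix @ txt.T).argmax(dim=-1).reshape(features.shape[:2])  # (H, W) label indices
```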
If you find this work useful for your research, please consider citing this repo:
@misc{mingfengli_seganyclip,
  title={Per-pixel Features: Mating Segment-Anything with CLIP},
  author={Li, Ming-Feng},
  url={https://github.com/justin871030/Segment-Anything-CLIP},
  year={2023}
}