
DesignCLIP: Multimodal Learning with CLIP for Design Patent Understanding (EMNLP 2025)

We introduce DesignCLIP, a multimodal model trained on large-scale design patent data, covering all patents from 2007 to 2022 in the USPTO Bulk Data Storage System (BDSS).

✒️ To address the unique characteristics of patent data, we incorporate class-aware classification and contrastive learning, generate detailed captions for patent images, and employ multi-view image learning.

*Main figure: overview of DesignCLIP (main_fig).*

Dataset

📗 We will release the full dataset soon.

  • Sample images from the most recent 5 years can be viewed and downloaded here.

  • Sample generated captions for patent images from the most recent 5 years can be viewed and downloaded here.

DesignCLIP

🔥 DesignCLIP is based on CLIP; we build on the open-source open_clip implementation and incorporate class-aware classification and contrastive learning.

🤗 PatentCLIP-ViT-B [checkpoint]
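For intuition, the sketch below shows one generic way to make the contrastive objective class-aware: image-text pairs that share a patent class are treated as additional positives rather than using only the diagonal targets of vanilla CLIP. This is an illustrative formulation, not necessarily the paper's exact loss, and `class_aware_contrastive_loss` is a hypothetical helper.

```python
import torch.nn.functional as F

def class_aware_contrastive_loss(img_emb, txt_emb, class_ids, temperature=0.07):
    """Illustrative class-aware InfoNCE.

    img_emb, txt_emb: (N, D) L2-normalized embeddings of paired images/texts.
    class_ids: (N,) patent class label for each pair.
    """
    logits = img_emb @ txt_emb.t() / temperature                # (N, N) similarities
    # Entry (i, j) is a positive whenever pairs i and j share a patent class;
    # the diagonal (the true pair) is always included.
    pos = (class_ids.unsqueeze(0) == class_ids.unsqueeze(1)).float()
    targets = pos / pos.sum(dim=1, keepdim=True)                # soft targets over positives
    loss_i2t = F.cross_entropy(logits, targets)                 # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)             # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```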

Usage

Load a DesignCLIP model:

```python
import torch
import open_clip

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:patentclip/PatentCLIP_Vit_B', device=device)
tokenizer = open_clip.get_tokenizer('hf-hub:patentclip/PatentCLIP_Vit_B')
```
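Continuing the snippet above, an image can then be scored against candidate captions with the standard open_clip encoders. The file name and caption strings below are placeholders.

```python
import torch
from PIL import Image

# 'patent_fig.png' and the captions are placeholders for your own data.
image = preprocess(Image.open('patent_fig.png')).unsqueeze(0).to(device)
texts = tokenizer(['a design patent drawing of a chair',
                   'a design patent drawing of a lamp']).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(probs)  # similarity distribution over the candidate captions
```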

1. Multimodal Retrieval Results

Multimodal retrieval results for image-to-text and text-to-image retrieval, using both CLIP and DesignCLIP models.

| Model | Backbone | Text→Image R@5 | Text→Image R@10 | Image→Text R@5 | Image→Text R@10 |
|-------|----------|---------------:|----------------:|---------------:|----------------:|
| CLIP | RN50 | 5.47 | 8.51 | 5.24 | 7.72 |
| CLIP | RN101 | 7.60 | 11.17 | 6.10 | 9.35 |
| CLIP | ViT-B | 7.49 | 10.60 | 6.90 | 10.34 |
| CLIP | ViT-L | 13.26 | 18.29 | 12.07 | 17.17 |
| DesignCLIP | RN50 | 25.17 | 34.50 | 23.49 | 32.70 |
| DesignCLIP | RN101 | 26.71 | 36.51 | 25.37 | 34.84 |
| DesignCLIP | ViT-B | 29.75 | 39.91 | 28.39 | 38.26 |
| DesignCLIP | ViT-L | 41.72 | 52.55 | 39.59 | 50.44 |
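For reference, the sketch below shows one way to compute R@K from a held-out set of paired embeddings. `recall_at_k` is an illustrative helper (assuming a one-to-one image-text correspondence), not the repository's evaluation code.

```python
import torch

def recall_at_k(img_emb, txt_emb, ks=(5, 10)):
    """Text->image recall; row i of img_emb is the match for row i of txt_emb."""
    sims = txt_emb @ img_emb.t()                  # (N, N) similarity matrix
    ranks = sims.argsort(dim=1, descending=True)  # ranked image indices per query
    gt = torch.arange(sims.size(0)).unsqueeze(1)  # ground-truth index per row
    hits = ranks == gt                            # True where the match appears
    return {f'R@{k}': hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```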

2. Patent Classification

```bash
python classification.py
```

Classification results (accuracy, %) for both CLIP and DesignCLIP in zero-shot and fine-tuned settings. The dataset used here is from the year 2023.

| Model | Backbone | Zero-shot | Fine-tuned |
|-------|----------|----------:|-----------:|
| CLIP | RN101 | 11.91 | 15.47 |
| CLIP | ViT-B | 10.88 | 38.99 |
| DesignCLIP | RN101 | 11.93 | 29.92 |
| DesignCLIP | ViT-B | 14.70 | 41.34 |
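Continuing from the Usage snippet, a zero-shot prediction scores an image against one prompt per patent class; the class names and prompt template below are placeholders, not the label set used in the paper.

```python
import torch

class_names = ['furniture', 'lighting', 'packaging']   # placeholder classes
prompts = tokenizer([f'a design patent drawing of {c}' for c in class_names]).to(device)

with torch.no_grad():
    txt_feat = model.encode_text(prompts)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    img_feat = model.encode_image(image)                # image from the Usage example
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    pred = (img_feat @ txt_feat.T).argmax(dim=-1)

print(class_names[pred.item()])
```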

3. Patent Image Retrieval

  • Download the DeepPatent dataset for image retrieval

  • Training DesignCLIP + ArcFace on DeepPatent:

```bash
python ir_main.py
```
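For intuition, the sketch below shows a minimal ArcFace margin head of the kind paired with an image encoder for retrieval training. It is illustrative only, not the implementation in ir_main.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Adds an angular margin m to the target-class logit, scaled by s."""
    def __init__(self, emb_dim, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, emb_dim))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # Cosine similarity between normalized embeddings and class centers.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        # Apply the margin only to the ground-truth class, then rescale.
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * logits, labels)
```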

Citations

```bibtex
@inproceedings{wang2025designclip,
  title={Design{CLIP}: Multimodal Learning with {CLIP} for Design Patent Understanding},
  author={Zhu Wang and Homaira Huda Shomee and Sathya N. Ravi and Sourav Medya},
  booktitle={The 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025},
  url={https://openreview.net/forum?id=pTumSzkDLC}
}
```

Acknowledgement

The implementation of DesignCLIP relies on resources from open_clip, LLaVA, and SWIN + ArcFace. We thank the original authors for open-sourcing their work.
