This repository contains research work on finding prior art for a given patent. The approach is to find the documents most similar to a given patent application by comparing them with similarity measures computed on the documents' full texts. For further details on the experiments, please refer to the paper:
@article{helmers2019automating,
title={Automating the search for a patent's prior art with a full text similarity search},
author={Helmers, Lea and Horn, Franziska and Biegler, Franziska and Oppermann, Tim and M{\"u}ller, Klaus-Robert},
journal={{PLoS ONE}},
volume={14},
number={3},
pages={e0212103},
year={2019},
publisher={Public Library of Science}
}
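To make the idea of a full-text similarity search concrete, here is a minimal sketch that ranks candidate documents by the cosine similarity of their tf-idf vectors. It is only an illustration under stated assumptions: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_by_similarity`), the smoothed idf weighting, and the three toy texts are made up for this example and are not taken from the scripts in this repository.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Map each document id to a sparse tf-idf vector (dict: word -> weight)."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for c in counts.values() for word in c)
    vectors = {}
    for doc_id, c in counts.items():
        # Smoothed idf (an assumption of this sketch), so that words that
        # occur in every document still keep a small positive weight.
        vectors[doc_id] = {
            word: tf * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, tf in c.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(target_id, docs):
    """Return (doc_id, score) pairs for all other documents, best match first."""
    vectors = tfidf_vectors(docs)
    target = vectors[target_id]
    scores = [(doc_id, cosine(target, vec))
              for doc_id, vec in vectors.items() if doc_id != target_id]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy texts, invented for illustration only.
docs = {
    "application": "Rotor blade of a wind turbine with an integrated heating element",
    "candidate_a": "Wind turbine rotor blade comprising an electric heating element",
    "candidate_b": "A pharmaceutical composition for treating inflammation",
}
ranking = rank_by_similarity("application", docs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In this toy corpus, `candidate_a` shares most of its vocabulary with the application and is therefore ranked above `candidate_b`. On real patents the full texts are much longer, so the choice of preprocessing and weighting matters considerably more than it does here.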
All data sets needed to reproduce the analyses are available on figshare (https://figshare.com) and can be downloaded in compressed format after sign-up:
- SQLite database file:
https://figshare.com/articles/Patent_Database/7264733
- Patent scoring by an expert and corpus subsample:
https://figshare.com/articles/human_eval_tar_gz/7257215
- Entire corpus:
https://figshare.com/articles/corpus_tar_gz/7257194
- Adapt the seed patents in the main functions in patentcollector.py, then run:
python patentcollector.py
- Save your patent files as .csv files with the following metadata as columns: ['id', 'title', 'category', 'pub_number', 'app_number', 'pub_date', 'abstract', 'description', 'claims', 'cited_patents', 'pub_dates']
- Adapt the path in the main function of make_patent_db.py to point to the directory containing your patent files, then run:
python make_patent_db.py
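As a rough illustration of the two steps above, the following sketch writes one patent record to a .csv file with the listed columns and then loads that file into an SQLite table. It is not the code in make_patent_db.py: the table name `patents`, the helper names, the all-text column types, and the placeholder values (including how multi-valued fields such as `cited_patents` are encoded) are assumptions made for this example.

```python
import csv
import os
import sqlite3
import tempfile

# Column names as listed in the instructions above.
COLUMNS = ['id', 'title', 'category', 'pub_number', 'app_number', 'pub_date',
           'abstract', 'description', 'claims', 'cited_patents', 'pub_dates']


def write_patent_csv(path, patents):
    """Write a list of patent dicts to a .csv file with the expected header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(patents)


def load_csv_into_sqlite(csv_path, connection, table="patents"):
    """Copy all rows of the .csv file into an SQLite table; return the row count.

    The table name and the all-TEXT schema are hypothetical choices.
    """
    column_defs = ", ".join(f'"{c}" TEXT' for c in COLUMNS)
    connection.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_defs})')
    placeholders = ", ".join("?" for _ in COLUMNS)
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = [[row[c] for c in COLUMNS] for row in csv.DictReader(f)]
    connection.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    connection.commit()
    return len(rows)


# Placeholder record, invented for illustration only.
example_patent = dict.fromkeys(COLUMNS, "")
example_patent.update({"id": "EXAMPLE-1", "title": "Example title",
                       "category": "EXAMPLE"})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example_patents.csv")
    write_patent_csv(csv_path, [example_patent])
    connection = sqlite3.connect(":memory:")  # in-memory database for the demo
    n_rows = load_csv_into_sqlite(csv_path, connection)
    stored_title = connection.execute("SELECT title FROM patents").fetchone()[0]
    print(n_rows, stored_title)
```

Using the `csv` module rather than splitting lines by hand matters here, because the `description` and `claims` fields typically contain commas and line breaks that need proper quoting.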
Check the category distributions in your corpus:
python compare_cats.py
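For orientation, a category distribution can be computed from the `category` column as relative frequencies. The sketch below shows one simple way to do this; the function name and the category labels are made-up placeholders, and compare_cats.py may well compute or present the distributions differently.

```python
from collections import Counter


def category_distribution(categories):
    """Return the relative frequency of each category, most frequent first."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.most_common()}


# Placeholder category labels, invented for illustration only.
example_categories = ["A61K", "F03D", "F03D", "H01L"]
distribution = category_distribution(example_categories)
for category, share in distribution.items():
    print(f"{category}: {share:.2f}")
```

Comparing such distributions between a subsample and the full corpus is a quick check of whether the subsample is skewed towards particular categories.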
python idf_regression.py
python kpca.py
python lat_sem_ana.py
python word2vec_app.py
python doc2vec.py
You are free to use the content of this repository under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.