Finding a patent's prior art using text similarity

This repository contains research work on finding prior art for a given patent. The approach retrieves the documents most similar to a given patent application by comparing it with candidate documents using similarity measures computed on the documents' full texts. For further details on the experiments, please refer to the paper:

@article{helmers2019automating,
  title={Automating the search for a patent's prior art with a full text similarity search},
  author={Helmers, Lea and Horn, Franziska and Biegler, Franziska and Oppermann, Tim and M{\"u}ller, Klaus-Robert},
  journal={{PLoS ONE}},
  volume={14},
  number={3},
  pages={e0212103},
  year={2019},
  publisher={Public Library of Science}
}

All data sets needed to reproduce the analyses are available on figshare (https://figshare.com) and can be downloaded in compressed form after signing up:

SQLite database file: https://figshare.com/articles/Patent_Database/7264733
Patent scoring by expert and corpus subsample: https://figshare.com/articles/human_eval_tar_gz/7257215
Entire corpus: https://figshare.com/articles/corpus_tar_gz/7257194

Compile the dataset and load it into an SQLite database

Crawling patent files from Google Patents

Adapt the seed patents in the main functions in patentcollector.py

python patentcollector.py

Create SQLite DB

Save your patent files as .csv files with the following metadata as columns: ['id', 'title', 'category', 'pub_number', 'app_number', 'pub_date', 'abstract', 'description', 'claims', 'cited_patents', 'pub_dates']
Adapt the path in the main function of make_patent_db.py to point to the directory containing your patent files.

python make_patent_db.py
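
For orientation, the step above can be pictured roughly as follows. This is a minimal sketch, not the code in make_patent_db.py: the column names come from the list above, while the table name `patents`, the function name `load_csvs_into_db`, and the assumption that every .csv file has a header row are made up for illustration.

```python
import csv
import sqlite3
from pathlib import Path

# Column names as listed in this README; the table name "patents" is an assumption.
COLUMNS = ['id', 'title', 'category', 'pub_number', 'app_number', 'pub_date',
           'abstract', 'description', 'claims', 'cited_patents', 'pub_dates']


def load_csvs_into_db(csv_dir, db_path):
    """Insert every row of every .csv file in csv_dir into one SQLite table."""
    # Patent descriptions can be longer than the csv module's default field limit.
    csv.field_size_limit(10**9)
    con = sqlite3.connect(db_path)
    col_defs = ", ".join(f'"{c}" TEXT' for c in COLUMNS)
    con.execute(f"CREATE TABLE IF NOT EXISTS patents ({col_defs})")
    placeholders = ", ".join("?" for _ in COLUMNS)
    n_rows = 0
    for csv_file in sorted(Path(csv_dir).glob("*.csv")):
        # Assumes each file starts with a header row naming the columns.
        with open(csv_file, newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                con.execute(f"INSERT INTO patents VALUES ({placeholders})",
                            [row.get(c, "") for c in COLUMNS])
                n_rows += 1
    con.commit()
    con.close()
    return n_rows
```

Calling `load_csvs_into_db("path/to/csv_dir", "patents.db")` returns the number of inserted rows. All columns are stored as text here to keep the sketch short; the actual database may use a different schema.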

Exploratory data analysis

Evaluate corpus statistics

Check out the category distributions in your corpus

python compare_cats.py
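
As a rough idea of what a category distribution looks like, the sketch below counts the values of the `category` column and converts them to relative frequencies. It is an illustration only and not the logic of compare_cats.py; the table name `patents` and the function name `category_distribution` are assumptions.

```python
import sqlite3
from collections import Counter


def category_distribution(db_path, table="patents"):
    """Return {category: share of patents}, most frequent category first."""
    con = sqlite3.connect(db_path)
    # The table name is an assumption; adjust it to your database.
    rows = con.execute(f'SELECT category FROM "{table}"').fetchall()
    con.close()
    counts = Counter(cat for (cat,) in rows)
    total = sum(counts.values())
    # An empty table yields an empty dict, so no division by zero occurs.
    return {cat: n / total for cat, n in counts.most_common()}
```

A strongly skewed distribution is worth knowing about before running the similarity search, since documents from a dominant category will also dominate the candidate set.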

Run similarity search

The different feature extraction methods:

Bag-of-words with tf-idf

python idf_regression.py
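
To make the bag-of-words idea concrete, here is a minimal, standard-library-only sketch of tf-idf vectors compared with cosine similarity. It is not the implementation in idf_regression.py: the tokenizer, the idf formula (one common variant, log(N/df) + 1), and all function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one length-normalized sparse tf-idf vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    # Assumed idf variant; other weightings are equally possible.
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in df}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two length-normalized sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def rank_by_similarity(query_idx, vectors):
    """Rank all other documents by similarity to the document at query_idx."""
    scores = [(i, cosine(vectors[query_idx], v))
              for i, v in enumerate(vectors) if i != query_idx]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

With the vectors for a corpus in hand, `rank_by_similarity(0, vectors)` lists the remaining documents from most to least similar to the first one, which is the basic shape of a prior art search over full texts.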

Kernel-PCA

python kpca.py

Latent semantic analysis (LSA)

python lat_sem_ana.py

Word2vec

python word2vec_app.py
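
One common way to turn word embeddings into a document representation is to average the vectors of the words in the document and compare the averages with cosine similarity. The sketch below shows that idea with plain Python lists. It is an assumption about the general technique, not a description of word2vec_app.py, and the `embeddings` dictionary stands in for a trained word2vec model.

```python
import math


def document_vector(tokens, embeddings):
    """Average the embeddings of all tokens that have a vector; None if none do."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

A plain average treats every word equally; weighting each word vector, for example by its idf value, is a frequent refinement that keeps very common words from dominating the document vector.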

Doc2vec

python doc2vec.py

LICENSE

You are free to use the content of this repository under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


helmersl/patent_similarity_search
