Home

DoConA pipeline overview

A Python script that measures the overlap (agreement) between text similarity and citation/reference relationships within an input set of documents.

Required input data:

A set of text documents in .txt format. Each file should be named with the unique identifier of the document it contains, for example 10345.txt.
A CSV file called citations.csv with two columns. The first column contains the unique ID of a citing (or referring) document. The second column contains the unique ID of the document that is cited (or referred to) by the document in the first column.
A CSV file called sample.csv with one column containing the unique IDs of a sample drawn from the full set of documents. DoConA will be run on these sampled documents.
Optional: A CSV file called stopwords.csv listing the words that should be removed from each text document during the preprocessing phase of the DoConA pipeline. The file should contain exactly one column with no header, with each word on its own line.
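
To make the input layout above concrete, here is a minimal sketch of how these files could be read using only the Python standard library. It is not DoConA's own loading code. The function names (read_column, load_citations, load_documents) are hypothetical, and the sketch assumes that citations.csv and sample.csv have no header row, which this page does not state.

```python
import csv
from pathlib import Path


def read_column(path, column=0):
    """Return the non-empty values of one column of a CSV file.

    Suitable for single-column files such as sample.csv or stopwords.csv
    (assumed here to have no header row).
    """
    values = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if len(row) > column and row[column].strip():
                values.append(row[column].strip())
    return values


def load_citations(path):
    """Return a set of (citing_id, cited_id) pairs from a two-column CSV."""
    pairs = set()
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if len(row) >= 2:
                pairs.add((row[0].strip(), row[1].strip()))
    return pairs


def load_documents(folder, sample_ids):
    """Map each sampled ID to the text of <ID>.txt, skipping missing files."""
    texts = {}
    for doc_id in sample_ids:
        path = Path(folder) / f"{doc_id}.txt"
        if path.is_file():
            texts[doc_id] = path.read_text(encoding="utf-8")
    return texts
```

Storing the citations as a set of ordered pairs keeps the direction of each citation and makes later lookups of "does A cite B?" a constant-time check.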

Folder structure:

The arrangement of the files in your folder should look like the following image:

Text similarity algorithms used by DoConA:

TF-IDF
Jaccard distance
N-grams
Pre-trained and custom word2vec word embeddings and doc2vec document embeddings
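
As an illustration of one of the simpler measures in this list, the sketch below computes a Jaccard score over sets of word n-grams. It is a generic textbook formulation, not DoConA's implementation. The whitespace tokenisation, the lowercasing, and the function names are assumptions made for this example. The score returned is the Jaccard similarity (1 minus the Jaccard distance), so that higher values mean more similar documents.

```python
def word_ngrams(text, n=1):
    """Return the set of word n-grams (as tuples) in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_similarity(text_a, text_b, n=1):
    """Jaccard similarity between the n-gram sets of two texts.

    Defined as |A & B| / |A | B|, a value between 0 and 1.
    Returns 0.0 when neither text yields any n-grams.
    """
    grams_a = word_ngrams(text_a, n)
    grams_b = word_ngrams(text_b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


# Two short texts sharing two of four distinct words.
score = jaccard_similarity("the court ruled", "the court decided")
print(score)  # 0.5
```

With n=2 the same two texts share only the bigram ("the", "court") out of three distinct bigrams, so the score drops to one third. Larger n rewards shared phrasing rather than shared vocabulary.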

Output data:

DoConA will generate a results.csv file which looks like this:

The file contains the following five columns:

source_document: a unique document identifier
similar_document: another unique document identifier, representing a document that is similar to source_document
similarity_score: a floating point number between 0 and 1 representing the degree to which source_document and similar_document are similar according to a particular text similarity measure
method: the name of the text similarity measure used to generate the number in similarity_score
citation_link: a boolean value specifying whether source_document also cites similar_document (or vice versa) in the citation network of the input documents
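
The relationship between these columns can be sketched as follows. The example derives a citation_link value from a set of (citing, cited) pairs and then summarises, per method, the fraction of similar pairs that are also linked by a citation. This summary is only one possible way to express agreement and is an assumption of this example, not necessarily the statistic DoConA reports. The function names and the in-memory row format (dictionaries keyed by the column names above) are likewise hypothetical.

```python
from collections import defaultdict


def has_citation_link(source, similar, citations):
    """True if either document cites the other in the citation pairs."""
    return (source, similar) in citations or (similar, source) in citations


def linked_fraction_by_method(rows, citations):
    """For each method, return the share of similar pairs that are cited.

    rows: iterable of dicts with the keys source_document,
    similar_document and method.
    citations: set of (citing_id, cited_id) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # method -> [linked, total]
    for row in rows:
        linked = has_citation_link(
            row["source_document"], row["similar_document"], citations
        )
        counts[row["method"]][0] += int(linked)
        counts[row["method"]][1] += 1
    return {method: linked / total for method, (linked, total) in counts.items()}


# Illustrative data only; these IDs and rows are made up for the example.
citations = {("10345", "20001")}
rows = [
    {"source_document": "10345", "similar_document": "20001", "method": "tfidf"},
    {"source_document": "20001", "similar_document": "30002", "method": "tfidf"},
    {"source_document": "20001", "similar_document": "10345", "method": "jaccard"},
]
print(linked_fraction_by_method(rows, citations))
# {'tfidf': 0.5, 'jaccard': 1.0}
```

Because the link is checked in both directions, the third row counts as linked even though the citation runs from 10345 to 20001 and not the other way round.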

Please see the main README of this repository for instructions on how to run the pipeline.
