Kody Moodley edited this page Apr 13, 2021 · 1 revision

DoConA pipeline overview

A Python script for measuring the overlap, or agreement, between text similarity and citation / reference relationships within an input set of documents.

Required input data:
  • A set of textual documents in .txt format. Each file should be named with a unique identifier representing that document, e.g. 10345.txt.
  • A CSV file called citations.csv with two columns: the first column contains the unique ID of a citing or referring document; the second column contains the unique ID of the document that is cited or referred to by the document in the first column.
  • A CSV file called sample.csv with one column containing the unique IDs of a sample drawn from the full set of documents. DoConA will be run on these sampled documents.
  • Optional: a CSV file called stopwords.csv containing a list of words to be removed from each text document during the preprocessing phase of the DoConA pipeline. The file should contain exactly one column with no header, with each word on its own line.
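
As an illustration of the input layout described above, here is a minimal sketch of how these files could be read with the Python standard library. It is not DoConA's own loading code: the function name load_inputs and its two parameters are hypothetical, and the sketch assumes that citations.csv and sample.csv have no header row, which this page does not specify.

```python
import csv
from pathlib import Path


def load_inputs(docs_dir, data_dir):
    """Hypothetical helper: read the input files described above.

    docs_dir: folder holding the .txt documents (assumed location).
    data_dir: folder holding citations.csv, sample.csv and, optionally,
              stopwords.csv (assumed location).
    """
    docs_dir, data_dir = Path(docs_dir), Path(data_dir)

    # Map each document ID (the file name without ".txt") to its text.
    documents = {
        path.stem: path.read_text(encoding="utf-8")
        for path in docs_dir.glob("*.txt")
    }

    # Each row is (citing ID, cited ID). Assumption: no header row.
    with open(data_dir / "citations.csv", newline="", encoding="utf-8") as f:
        citations = {
            (row[0].strip(), row[1].strip())
            for row in csv.reader(f)
            if len(row) >= 2
        }

    # One document ID per row. Assumption: no header row.
    with open(data_dir / "sample.csv", newline="", encoding="utf-8") as f:
        sample = [row[0].strip() for row in csv.reader(f) if row]

    # stopwords.csv is optional: one word per line, no header.
    stopwords = set()
    stopwords_path = data_dir / "stopwords.csv"
    if stopwords_path.exists():
        with open(stopwords_path, newline="", encoding="utf-8") as f:
            stopwords = {row[0].strip() for row in csv.reader(f) if row}

    return documents, citations, sample, stopwords
```

A quick consistency check such as `[doc_id for doc_id in sample if doc_id not in documents]` can then reveal sample IDs that have no matching .txt file.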

Folder structure:

The arrangement of the files in your folder should look like the following image:

Text similarity algorithms used by DoConA:

Output data:

DoConA will generate a results.csv file which looks like this:

The file contains the following five columns:

  • source_document: a unique document identifier
  • similar_document: another unique document identifier, representing a document which is similar to source_document
  • similarity_score: a floating point number between 0 and 1 representing the degree to which source_document and similar_document are similar according to a particular text similarity measure
  • method: a name for the text similarity measure used to generate the number in similarity_score
  • citation_link: a boolean value which specifies whether source_document also cites similar_document (or vice versa) in the citation network of the input documents
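
To show how these columns can be used together, the following is a minimal sketch that computes, for each similarity method, the fraction of similar document pairs that also have a citation link. It is an illustrative post-processing step, not part of DoConA itself. The function name citation_agreement is hypothetical, and the sketch assumes that results.csv has a header row with the five column names above and that citation_link is written as text such as True or False.

```python
import csv
from collections import defaultdict


def citation_agreement(results_path):
    """Hypothetical helper: per-method share of pairs with a citation link."""
    linked = defaultdict(int)  # pairs per method where citation_link is true
    total = defaultdict(int)   # all pairs per method

    with open(results_path, newline="", encoding="utf-8") as f:
        # Assumption: the first row holds the five column names.
        for row in csv.DictReader(f):
            method = row["method"]
            total[method] += 1
            # Assumption: booleans are serialised as "True"/"False" (or 1/0).
            if row["citation_link"].strip().lower() in {"true", "1"}:
                linked[method] += 1

    return {method: linked[method] / total[method] for method in total}
```

A higher value for a method means that the documents it ranks as similar more often coincide with actual citation or reference relationships.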

Please see the main README of this repository for instructions on how to run the pipeline.