
Data directory

Hk-tang edited this page Jan 21, 2021 · 11 revisions

The data directory of this repository contains seven files.

  • answer_comments.png and answer_edits.png

    These two PNG files plot how many answers have a given number of comments or edits. We used them to choose the thresholds applied when sampling answers for the ground truth.

  • example_result.json

    This file contains the four comment-edit pairs of a single answer in JSON format. The JSON object has the same keys as the columns of the results.csv that the program generates. It serves only as an example of the JSON format.

  • general_precision.json

    This JSON file contains the 1,910 comment-edit pairs that were used in the paper. These pairs were manually evaluated by the two authors, which is why the file contains values that the program does not produce automatically (e.g., tangled, useful, etc.). It is a JSON packaging of the pairs that can be used in subsequent programs. The pairs were randomly sampled from the results of the program on each language tag: after running the program individually on the Java, JavaScript, Android, Python, and PHP tags, we randomly selected 382 comment-edit pairs from each result set.

  • ground_truth.csv

    • This CSV file contains the 100 answers (20 each from Java, JavaScript, Android, PHP, and Python) used in the ground truth of the paper. It has less information than the files in the ground_truth directory of the results zip, because only these columns are used by the eval command line option of the program. Note that the eval command line option evaluates the results.csv based only on the answer ids and comment ids in the ground_truth.csv.

    • The ground_truth.csv contains four columns: AnswerIds, CommentIds, EditIds, and Useful. AnswerIds are the post ids of the answers to evaluate. CommentIds are the ids of the comments on each answer. EditIds are the manual evaluations of which comments and edits are paired. Useful contains the manual evaluations of whether each comment-edit pair is useful; a useful pair is denoted by the word yes.

  • pull_requests.csv

    This CSV file contains information regarding the 15 pull requests described in the paper.

  • threshold_comparison.csv

    This CSV file compares different fuzzywuzzy thresholds. The threshold is used when determining whether a comment and an edit share code, which is how we connect comments and edits. In the paper and in the config.ini this threshold is set to 90%. The CSV shows the results for thresholds from 50% to 100% in 10% increments.
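
To make the role of the matching threshold more concrete, here is a minimal sketch of sweeping it from 50% to 100% in 10% increments. It is not the program's code: it uses the standard library's difflib as a stand-in for a fuzzywuzzy similarity score, and the function names and the two code snippets are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score.

    Assumption: difflib is used here as a stand-in for fuzzywuzzy,
    so scores may differ from what the program computes.
    """
    return round(100 * SequenceMatcher(None, a, b).ratio())


def shares_code(comment_code: str, edit_code: str, threshold: int = 90) -> bool:
    """Treat a comment and an edit as sharing code when the score meets the threshold."""
    return similarity(comment_code, edit_code) >= threshold


# Hypothetical snippets: code mentioned in a comment and code added by an edit.
comment_code = "list.add(item)"
edit_code = "list.add(item);"

# Sweep the same range that threshold_comparison.csv covers.
for threshold in range(50, 101, 10):
    matched = shares_code(comment_code, edit_code, threshold)
    print(f"threshold={threshold}% matched={matched}")
```

A lower threshold connects more comments and edits at the cost of looser matches, while 100% accepts only identical code.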

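The four columns of ground_truth.csv can be read with the standard csv module. The sketch below is only illustrative: the sample rows and ids are made up, and it assumes one comment per row with an empty EditIds cell when no edit is paired, which may not match the real file layout. The helper name useful_pairs is hypothetical.

```python
import csv
import io

# Hypothetical rows for illustration only; the ids are invented and the
# one-comment-per-row layout is an assumption about the real file.
SAMPLE = """AnswerIds,CommentIds,EditIds,Useful
101,5001,9001,yes
101,5002,,
102,5003,9002,
"""


def useful_pairs(csv_file):
    """Return (answer id, comment id, edit id) for rows marked useful with 'yes'."""
    reader = csv.DictReader(csv_file)
    return [
        (row["AnswerIds"], row["CommentIds"], row["EditIds"])
        for row in reader
        if row["Useful"].strip().lower() == "yes"
    ]


pairs = useful_pairs(io.StringIO(SAMPLE))
print(pairs)
```

To run it against the real file, pass an open file handle instead of the in-memory sample, for example `open("ground_truth.csv", newline="")`.
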