This repository has been archived by the owner on Jan 14, 2020. It is now read-only.

Issue and Project Recommendation System for a GitHub Newcomer

The project is a content-based filtering approach for suggesting tasks and projects to GitHub newcomers.


2019.12.03 Update - Preliminary Results

  • Update cosine similarity (6.1 and 8.5) to calculate recommendation scores, validated against known commit authors.
  • The commits used for validation seem off. TODO: recalculate.
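
One way to validate recommendations against known commit authors is a top-k hit rate. The sketch below is only an illustration: the function name `hit_rate_at_k`, the data layout, and the toy values are assumptions, not the project's actual validation code.

```python
def hit_rate_at_k(recommendations, known_authors, k=3):
    """Fraction of issues that appear in their known author's top-k list.

    recommendations: user -> list of issue ids, best match first (hypothetical layout)
    known_authors:   issue id -> user who actually committed the fix
    """
    if not known_authors:
        return 0.0
    hits = 0
    for issue_id, author in known_authors.items():
        # Users without any recommendations simply count as a miss.
        top_k = recommendations.get(author, [])[:k]
        if issue_id in top_k:
            hits += 1
    return hits / len(known_authors)


# Hypothetical toy data, not taken from the Symfony dataset.
recommendations = {
    "alice": ["issue-1", "issue-4", "issue-2"],
    "bob": ["issue-3", "issue-1", "issue-5"],
}
known_authors = {"issue-1": "alice", "issue-5": "bob", "issue-2": "carol"}

rate_top3 = hit_rate_at_k(recommendations, known_authors, k=3)
rate_top1 = hit_rate_at_k(recommendations, known_authors, k=1)
```

A low hit rate can mean either weak recommendations or a faulty set of validation commits, so it is worth checking the ground truth before tuning the scores.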

2019.12.02 Update

  • Add 6.1. Cosine Similarity (including VSM) to recommend issues for users.
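
As a rough illustration of cosine similarity in a vector space model, the sketch below ranks issues by their similarity to a user's text history. It uses plain term counts and whitespace tokenization. The helper names (`vectorize`, `recommend`) and the sample texts are assumptions made for this example, not the contents of 6.1.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-count vector for a whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def recommend(user_text, issues, top_n=3):
    """Rank issues (id -> text) by similarity to the user's text history."""
    user_vec = vectorize(user_text)
    scored = [
        (cosine_similarity(user_vec, vectorize(text)), issue_id)
        for issue_id, text in issues.items()
    ]
    # Highest score first; ties are broken by issue id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(issue_id, score) for score, issue_id in scored[:top_n]]


# Hypothetical toy data.
user_history = "fix cache bug in http kernel"
issues = {
    "issue-1": "cache bug in http kernel response",
    "issue-2": "update translation for validator messages",
    "issue-3": "form type documentation typo",
}
ranking = recommend(user_history, issues)
```

In practice the count vectors would likely be replaced by TF-IDF weights, so that very common terms contribute less to the score.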

2019.11.26 Update

  • Add 1.4. User_Textual_Data_Extraction.py to extract users' textual history records.
  • Add 5.3. User_TF-IDF.py to apply TF-IDF for users. See results
  • Add 5.4.1 and 5.4.2 to build profiles for users and issues. See results

2019.11.24 Update

  • Fix preprocessing for issues_text; escape characters were being removed; fixed in commit
  • TF-IDF results for title, body, and title-body. See results. These were calculated separately for weighting purposes.
  • TODO: incorporate commit documents.
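
Computing TF-IDF separately for titles and bodies makes it possible to weight the two fields differently when they are merged. The sketch below shows one way this could work. The weights (2.0 for titles, 1.0 for bodies), the function names, and the sample tokens are all assumptions, and the TF-IDF variant used (relative term frequency times log(N / df)) may differ from the one in the project scripts.

```python
import math
from collections import Counter


def tf_idf(docs):
    """TF-IDF vectors for a list of token lists.

    Uses relative term frequency and idf = log(N / df), so a term that
    occurs in every document gets a weight of 0.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def combine(title_vec, body_vec, title_weight=2.0, body_weight=1.0):
    """Merge per-field vectors into one, scaling each field by its weight."""
    combined = Counter()
    for term, value in title_vec.items():
        combined[term] += title_weight * value
    for term, value in body_vec.items():
        combined[term] += body_weight * value
    return dict(combined)


# Hypothetical, already-tokenized issues.
titles = [["cache", "bug"], ["form", "typo"], ["cache", "warmup"]]
bodies = [
    ["kernel", "cache", "fails"],
    ["typo", "in", "form", "docs"],
    ["warmup", "is", "slow"],
]

title_vecs = tf_idf(titles)
body_vecs = tf_idf(bodies)
issue_vecs = [combine(t, b) for t, b in zip(title_vecs, body_vecs)]
```

Keeping the fields apart until the last step means the weights can be tuned without recomputing the TF-IDF scores.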

2019.11.18 Update

  • Simplified the code.
  • For 1.1., we now also collect "referenced commits" with issues.
  • Update K-Means, Decision Tree and Random Forest.
  • Add TF-IDF analysis for issues.

Data Description

  • all_issues_REPO-NAME.json: list of repository issues (including their PRs and commits) that are bug fixes and/or "easy pick"
  • users_REPO-NAME.json: list of users from issues that are bug fixes and/or "easy pick" (THIS FILE MAY BE TOO LARGE TO OPEN)
  • users_REPO-NAME_filtered.json: filtered user JSON file; includes the same user at different ages
  • data_users_REPO-NAME_ready_to_analysis.csv: CSV version of users_REPO-NAME_filtered.json
  • data_users_cluster_with_results.csv: K-Means result table
  • issue_text_REPO-NAME.json: textual content of each issue

2019.11.17 Update

  • Simplified the code.
  • For 1.1., now only consider users who submitted PRs and commits related to issues with the "Easy Pick" label.
  • For 1.1., now collect all issue data and the related PR and commit data to save time in future use.
  • For 1.2., now collect each user's complete data to save time in future use.
  • For 1.3., simplified the user data extraction process.
  • For 2.1. and 2.1.2., modified the code to fit the latest version of the data files.

2019.11.12 Update

  • For 1.1., modify the process logic in order to reduce the time needed.
  • For 2.1., add column "newcomer" in order to verify the newcomer.
  • For 3.1., modify KMeans in order to get a more accurate clustering result.
  • For 3.2., add "Silhouette Analysis" to determine the number of clusters.
  • For 4.1., move Decision Tree to this file.
  • For 4.2., move Random Forest to this file.
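
Silhouette analysis picks the number of clusters by comparing the mean silhouette coefficient of candidate clusterings. The sketch below is a minimal illustration: the labelings are hard-coded stand-ins for what K-Means might return at k = 2, 3, and 4, and the points and function names are assumptions rather than data or code from 3.2.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def mean_silhouette(points, labels):
    """Mean silhouette coefficient of a clustering (higher is better)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # by convention a singleton cluster scores 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(euclidean(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical 2-D points forming three tight pairs.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]

# Hard-coded stand-ins for K-Means output at each candidate k (assumption).
candidate_labelings = {
    2: [0, 0, 1, 1, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
    4: [0, 0, 1, 1, 2, 3],
}

scores = {k: mean_silhouette(points, labels)
          for k, labels in candidate_labelings.items()}
best_k = max(scores, key=scores.get)
```

On this toy data the labeling with three clusters scores highest, which matches the three visible groups. With real user features the curve is usually less clear-cut and is best read alongside the per-cluster silhouette plots.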

Next:

  • Finalise how many clusters we need to use.
  • Start issue classification.

2019.11.05 Update

2019.11.04 Update

  • Rewrite data extraction and user extraction in order to get more data and increase prediction precision.
  • Added "User Classification" file to predict newcomer.
  • Saved "User Decision Tree Model" and "User Random Forest Model" files for future usage.
  • Dataset: Symfony, MSR 2014 (https://github.com/symfony/symfony)

2019.10.31 Update

  • Rewrite data extraction (in order to get more data)

Next:

  • Get user data and train a user model to determine which characteristics newcomers should have.

2019.10.25 Update

  • Create a Python 3.7 environment for data analysis and processing.
  • Filter the features which may be useful.
  • Dataset: MSR 2014
  • The IDE I use: PyCharm

License

No license. All rights reserved by Jonathan Lam (@jonlamca) and Lance Wang (@ycpss91303).