using multiple repositories #9

tommens · 2020-09-01T16:41:13Z

Would it be possible to run the tool on multiple repositories at once? I do not mean running the tool on each of these repositories separately and getting a separate output for each of them. What I mean is to treat the set of repositories as a kind of single virtual repository, by considering all accounts that have contributed to at least one of these repositories. If an account has contributed to more than one repository, then all PR comments and all issue comments associated with this account will be considered by the tool, regardless of which repository they come from. This way, even if an account was less active in some repository, it will be easier to reach the minimum threshold (number of comments) required to classify an account. In addition, given that the number of comments for each account will likely be higher, it might further improve the accuracy of the analysis.
I think that this kind of "use case" is relevant for software projects that tend to break down their development into multiple separate repositories, but that still may wish to do an analysis for the software project as a whole, considering all its repositories together.
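To make the "single virtual repository" idea concrete, here is a minimal sketch of pooling comments per account across repositories before applying the minimum-comment threshold. The function name `pool_comments`, the input shape, and the `min_comments` default are assumptions made for illustration, not part of the tool.

```python
from collections import defaultdict


def pool_comments(comments_by_repo, min_comments=10):
    """Merge comments from several repositories into one pool per account.

    comments_by_repo maps a repository name to a list of (account, text)
    pairs. The default threshold of 10 is a placeholder, not the tool's
    actual value.
    """
    pooled = defaultdict(list)
    for repo, comments in comments_by_repo.items():
        for account, text in comments:
            # The repository of origin is deliberately ignored here.
            pooled[account].append(text)
    # Keep only accounts with enough pooled comments to be classified.
    return {
        account: texts
        for account, texts in pooled.items()
        if len(texts) >= min_comments
    }


if __name__ == "__main__":
    data = {
        "org/core": [("alice", "LGTM"), ("ci-bot", "Build passed")],
        "org/docs": [("alice", "Typo fixed"), ("ci-bot", "Build passed")],
    }
    # With a threshold of 2, both accounts qualify only once pooled.
    print(pool_comments(data, min_comments=2))
```

An account with one comment in each of two repositories would fall below a threshold of 2 in either repository alone, but passes once the comments are pooled.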

AlexandreDecan · 2020-09-01T16:45:08Z

In that case, I suggest that bodega supports both --repo (the default positional argument right now) and --accounts. If only --repo is specified, then all accounts from the given repository are considered. If only --accounts is provided, then all the specified accounts are considered (i.e. their comments are downloaded, regardless of the repositories where they were made). If both are provided, then only the specified accounts active in the given repository should be considered, and only the comments within that repository are considered.

Additionally, --repo could be a list of repositories instead of a single repository. In that case, all accounts from given repositories are considered (but only comments within these repositories are downloaded and processed).
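The proposed semantics can be sketched as follows. This is only an illustration of the selection rules described above, assuming both options accept one or more values; `build_parser`, `select_comments`, and the `(repo, account, text)` tuple shape are hypothetical and do not reflect the tool's actual code.

```python
import argparse


def build_parser():
    # Hypothetical parser: both options accept one or more values.
    parser = argparse.ArgumentParser(prog="bodega")
    parser.add_argument("--repo", nargs="+", default=[], metavar="OWNER/NAME")
    parser.add_argument("--accounts", nargs="+", default=[], metavar="LOGIN")
    return parser


def select_comments(comments, repos, accounts):
    """Filter (repo, account, text) tuples according to the proposal.

    - only repos: every account, but only comments in those repositories
    - only accounts: those accounts, comments from any repository
    - both: those accounts, only comments in those repositories
    """
    repos, accounts = set(repos), set(accounts)
    selected = []
    for repo, account, text in comments:
        if repos and repo not in repos:
            continue
        if accounts and account not in accounts:
            continue
        selected.append((repo, account, text))
    return selected


if __name__ == "__main__":
    args = build_parser().parse_args()
    if not args.repo and not args.accounts:
        # Assumption: at least one of the two options is required.
        raise SystemExit("provide --repo and/or --accounts")
```

An empty list means "no restriction" for that dimension, which is what lets a single filter cover all three cases.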

tommens · 2020-09-01T16:51:41Z

Yes, I think we should have a --repo that can take either a single repository or a list of repositories, just like --accounts can be a single account or multiple accounts. I also very much like the other ideas suggested by @AlexandreDecan above.

mehdigolzadeh · 2020-09-01T16:54:21Z

But our model was trained on repository-user pairs. Is it correct to predict an account based on comments from several repositories?

AlexandreDecan · 2020-09-01T17:02:08Z

You can check for this. I don't see any major reason why it would fail. Human comments will be more varied, and bot comments are likely to be more similar.

mehdigolzadeh · 2020-09-01T17:03:31Z

But increasing the number of repositories could increase the number of patterns for bots.

AlexandreDecan · 2020-09-01T17:58:21Z

But also their number of considered comments. The best you can do is to check for this based on the ground truth dataset: download extra comments for some accounts and see if the model is still somewhat reliable ;-)
