using multiple repositories #9

Open
tommens opened this issue Sep 1, 2020 · 6 comments

Comments

@tommens
Contributor

tommens commented Sep 1, 2020

Would it be possible to run the tool on multiple repositories at once? I do not mean running the tool on each of these repositories separately and getting a separate output for each of them. What I mean is treating the set of repositories as a single virtual repository, by considering all accounts that have contributed to at least one of them. If an account has contributed to more than one repository, then all PR comments and all issue comments associated with this account would be considered by the tool, regardless of which repository they come from. This way, even if an account was less active in some repository, it would be easier to reach the minimum threshold (number of comments) required to classify the account. In addition, given that the number of comments for each account will likely be higher, it might further improve the accuracy of the analysis.
I think that this kind of "use case" is relevant for software projects that tend to break down their development into multiple separate repositories, but that still may wish to do an analysis for the software project as a whole, considering all its repositories together.
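
To make the request concrete, here is a minimal sketch of the pooling idea, not the tool's actual code. The function name `pool_comments`, the `(repository, account, text)` tuple layout, and the default threshold value are all assumptions made for illustration.

```python
from collections import defaultdict


def pool_comments(comments, min_comments=10):
    """Pool comments per account across all repositories.

    `comments` is an iterable of (repository, account, text) tuples.
    `min_comments` is a hypothetical threshold, not the tool's real value.
    Returns a dict mapping each account that reaches the threshold
    to the list of its comment texts, in input order.
    """
    pooled = defaultdict(list)
    for _repository, account, text in comments:
        # The repository is deliberately ignored: the set of repositories
        # is treated as a single virtual repository.
        pooled[account].append(text)
    return {
        account: texts
        for account, texts in pooled.items()
        if len(texts) >= min_comments
    }
```

With a threshold of 3, an account with two comments in one repository and one comment in another would be kept here, whereas it would be dropped if each repository were analysed on its own.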

@AlexandreDecan
Collaborator

AlexandreDecan commented Sep 1, 2020

In that case, I suggest that bodega supports both --repo (the default positional argument right now) and --accounts. If --repo is specified, then all accounts from the given repository are considered. If --accounts is provided, then all the given accounts are considered (i.e. their comments are downloaded, regardless of the repositories where they were made). If both are provided, then only the specified accounts that are active in the given repository are considered, and only their comments within that repository.

Additionally, --repo could be a list of repositories instead of a single repository. In that case, all accounts from the given repositories are considered (but only comments within these repositories are downloaded and processed).
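
The proposal above could be sketched with `argparse` roughly as follows. This is an illustration of the suggested semantics only, not the tool's implementation: `build_parser`, `resolve_scope`, and the shape of the returned dictionary are hypothetical names chosen for this example.

```python
import argparse


def build_parser():
    """Parser accepting one or more repositories and/or accounts."""
    parser = argparse.ArgumentParser(prog="bodega")
    parser.add_argument("--repo", nargs="+", default=[], metavar="OWNER/NAME")
    parser.add_argument("--accounts", nargs="+", default=[], metavar="LOGIN")
    return parser


def resolve_scope(args):
    """Translate parsed arguments into the scope described above.

    - only --repo: every account active in these repositories,
      restricted to comments made within them;
    - only --accounts: the given accounts, comments from any repository;
    - both: the given accounts, restricted to the given repositories.
    A value of None means "no restriction" (hypothetical convention).
    """
    if not args.repo and not args.accounts:
        raise ValueError("provide --repo, --accounts, or both")
    return {
        "repositories": args.repo or None,
        "accounts": args.accounts or None,
    }
```

For instance, `resolve_scope(build_parser().parse_args(["--repo", "org/a", "org/b"]))` yields a scope with two repositories and no account restriction.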

@tommens
Contributor Author

tommens commented Sep 1, 2020

Yes, I think we should have a --repo that can take either a single repository or a list of repositories, just like --accounts can be a single account or multiple accounts. I also very much like the other ideas suggested by @AlexandreDecan above.

@mehdigolzadeh
Owner

But our model was trained on repository-user pairs. Is it valid to classify an account based on comments from several repositories?

@AlexandreDecan
Collaborator

You can check for this. I don't see any major reason why it would fail. Human comments will be more varied, and bot comments are likely to remain more similar to each other.

@mehdigolzadeh
Owner

But increasing the number of repositories could increase the number of patterns for bots.

@AlexandreDecan
Collaborator

But it would also increase the number of comments considered for them. The best you can do is to check for this based on the ground truth dataset: download extra comments for some accounts and see if the model is still somewhat reliable ;-)


3 participants