using multiple repositories #9
Comments
In that case, I suggest that bodega supports both Additionally,
Yes, I think we should have a --repo option that can take either a single repository or a list of repositories, just like --accounts can take a single account or multiple accounts. I also very much like the other ideas suggested by @AlexandreDecan above.
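As a rough illustration of the suggestion above, here is a minimal sketch of an option that accepts one or several values, using Python's standard argparse. The program name `bodega-sketch`, the `OWNER/NAME` format, and the help texts are assumptions made for this example; this is not the tool's actual command-line code.

```python
import argparse


def build_parser():
    """Build a hypothetical parser; not the real bodega command line."""
    parser = argparse.ArgumentParser(prog="bodega-sketch")
    # nargs="+" accepts one or more values, so a single repository
    # and a list of repositories are both handled by the same option.
    parser.add_argument(
        "--repo",
        nargs="+",
        required=True,
        metavar="OWNER/NAME",
        help="one or more repositories to analyse together",
    )
    # nargs="*" accepts zero or more values; an empty list is taken
    # here to mean "all accounts" (an assumption of this sketch).
    parser.add_argument(
        "--accounts",
        nargs="*",
        default=[],
        metavar="LOGIN",
        help="optional accounts to restrict the analysis to",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.repo, args.accounts)
```

With this shape, `--repo org/core` and `--repo org/core org/plugins` both yield a list, so the rest of the code does not need to distinguish the two cases.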
But our model was trained on repository-user pairs. Is it correct to predict an account based on comments from several repositories?
You can check for this. I don't see any major reason why it would fail. Human comments will be more varied, and bot comments are likely to be more similar.
But increasing the number of repositories could increase the number of patterns for bots.
But it would also increase the number of comments considered for them. The best you can do is to check this against the ground truth dataset: download extra comments for some accounts and see if the model is still somewhat reliable ;-)
Would it be possible to run the tool on multiple repositories at once? I do not mean running the tool on each of these repositories separately and getting a separate output for each of them. What I mean is to treat the set of repositories as a kind of single virtual repository, by considering all accounts that have contributed to at least one of these repositories. If an account has contributed to more than one repository, then all PR comments and all issue comments associated with this account will be considered by the tool, regardless of which repository they come from. This way, even if an account was less active in some repository, it will be easier to reach the minimum threshold (number of comments) required to classify an account. In addition, given that the number of comments for each account will likely be higher, it might further improve the accuracy of the analysis.
I think that this kind of "use case" is relevant for software projects that tend to split their development across multiple separate repositories, but that may still wish to analyse the software project as a whole, considering all its repositories together.
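To make the "single virtual repository" idea concrete, here is a minimal sketch of the pooling step. It assumes comments are already available as `(repository, account, text)` tuples; the function names, the data layout, and the threshold value in the usage note are hypothetical and are not taken from the tool.

```python
from collections import defaultdict


def pool_comments(comments, repositories):
    """Group comment texts by account across the selected repositories.

    `comments` is an iterable of (repository, account, text) tuples
    (a layout assumed for this sketch). The result maps each account to
    all of its PR and issue comments, whichever repository they came from.
    """
    wanted = set(repositories)
    pooled = defaultdict(list)
    for repository, account, text in comments:
        if repository in wanted:
            pooled[account].append(text)
    return dict(pooled)


def classifiable_accounts(pooled, min_comments):
    """Keep only accounts that reach the minimum number of comments."""
    return {
        account: texts
        for account, texts in pooled.items()
        if len(texts) >= min_comments
    }
```

For instance, with a hypothetical threshold of 3, an account with two comments in `org/core` and one in `org/plugins` would be dropped if each repository were analysed separately, but kept once `pool_comments` merges both repositories before `classifiable_accounts` is applied.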