misclassified cases #5
Comments
I didn't understand the second option, and the third option can only improve the result for that specific user. But the first one seems feasible. Could you please elaborate on this idea a bit more?
I think it's better to ask users (or to have a semi-automatic way to do it) to report those misclassified cases, so we can add them to the training set and release a new version of the tool with an improved model. From a reusability point of view, it's better to improve the model rather than having a list of "edge cases" whose target class overrides the one predicted by the model.
We already mention in the README that users can report misclassified cases to us if they find any. To have a semi-automatic way, would it be feasible to add support in the tool itself to report misclassified cases to us? I do not immediately see how.
Since an API key has to be provided anyway, we could add a subcommand to report invalid cases (e.g. enter usernames that are misclassified in a given repository) that automatically opens an issue in this repository with them. That said, I'm not convinced we need something like this, since we can simply ask/expect/hope users to report misclassified cases "manually".
It is probably too optimistic to think that people will report misclassified cases manually just because it is mentioned in our README.
Any built-in possibility to report misclassified cases as GitHub issues will require a second execution of the tool (since it is not interactive, and it won't be, given that we want to keep it as a reusable CLI). Why is a second execution needed? Because we should be able to reproduce the example, so we need the exact set of comments that were considered by the model (or, at least, the exact set of features that were considered for that specific case). One "easy" possibility would be to add an extra "--report" flag accepting a list of accounts that are misclassified, e.g., if the tool was run with [...]

Btw, doing all of this manually could be very time-consuming for us, but if that's the case, we can still try to implement all these steps as part of a CI (e.g., let's dream of a bot we would develop that downloads the comments, computes the features and the prediction, and posts all of this in the corresponding issue, so that one of us can "confirm" the misclassified case by putting a "confirmed" label on it; the CI then rebuilds the model and pushes it to the repository, with an incremented version of bodega and a tag for the new release). But honestly, given the work all of this represents, I think it's too much for a "research tool" ;)
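(For illustration only, a minimal sketch of what such a "--report" flow could look like, assuming bodega is a Python CLI that already holds the user's GitHub API key. The repository slug, the `report_misclassified` function, and the way predictions and features are obtained are hypothetical placeholders; the only real interface used is GitHub's `POST /repos/{owner}/{repo}/issues` REST endpoint, and it assumes the provided token is allowed to open issues.)

```python
import json
import requests

# Hypothetical slug of the repository where reports would be filed.
BODEGA_ISSUE_REPO = "some-org/bodega"  # placeholder, not the actual repository

def report_misclassified(accounts, predictions, features, api_key):
    """Open a GitHub issue listing misclassified accounts together with the
    exact features the model saw, so each case can be reproduced later.

    `predictions` and `features` are assumed to be dicts keyed by account name,
    produced by a (hypothetical) earlier run of the tool.
    """
    lines = []
    for account in accounts:
        lines.append(f"### {account}")
        lines.append(f"Predicted class: {predictions[account]}")
        lines.append("Features considered by the model:")
        lines.append("```json")
        lines.append(json.dumps(features[account], indent=2))
        lines.append("```")
    body = "\n".join(lines)

    # Standard GitHub REST API call to open an issue, authenticated with the
    # same token the user already passed to the tool.
    response = requests.post(
        f"https://api.github.com/repos/{BODEGA_ISSUE_REPO}/issues",
        headers={
            "Authorization": f"token {api_key}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "title": f"Misclassified account(s): {', '.join(accounts)}",
            "body": body,
        },
    )
    response.raise_for_status()
    return response.json()["html_url"]
```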
Notice we can ask a student to do this (e.g., as an M1 project).
Yes, this looks like an interesting master student project to pursue. Let's try that. If you want, you can close this issue for now (or leave it open until we have a working implementation, but this can take quite a while).
In those rare cases where bodega misclassifies a human as a bot, or a bot as a human, it would be nice to have a way to make bodega aware of this, to avoid having the tool report this issue over and over again. I do not know what the best way to achieve this would be, but I can see different possibilities: whenever a misclassification is found, it could be marked as such (in some file with a specific format and filename), and when the tool is run, it checks that file for the misclassification. It would then be up to the user of bodega to decide whether to include the misclassified accounts when re-running bodega.
Where should such a file be stored? Different solutions can be envisioned:
(1) On the bodega GitHub repository itself, we could have a file containing all known misclassifications (i.e. all cases that have been reported to us, and verified by us, of accounts that were misclassified when running bodega). When running bodega, this file can then be consulted to report the correct classification of the account.
(2) On the GitHub repository that is being analysed by bodega. Again, when running bodega, this file can then be consulted to report the correct classification of the account.
(3) In the directory of the user that is actually using bodega to run the analysis (e.g., if that user does not have write access to use solution (2), or does not want to share the misclassification for whatever reason).
If we want to combine these multiple solutions, we should probably set a precedence order.
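(To make the combination more concrete, a minimal sketch assuming a JSON override file mapping account names to their correct class, and assuming the precedence "user directory overrides analysed repository overrides bodega's own list". The file name, format, and precedence order are assumptions for illustration, not decisions made in this issue.)

```python
import json
from pathlib import Path

# Hypothetical file name; the actual format and filename are still to be decided.
OVERRIDE_FILENAME = "bodega-overrides.json"

def load_overrides(bodega_install_dir, analysed_repo_dir, user_dir):
    """Merge misclassification overrides from the three proposed locations.

    Later sources take precedence: (1) the global list shipped with bodega,
    then (2) the analysed repository, then (3) the user's own directory.
    Each file is expected to map account names to "bot" or "human".
    """
    merged = {}
    for directory in (bodega_install_dir, analysed_repo_dir, user_dir):
        path = Path(directory) / OVERRIDE_FILENAME
        if path.is_file():
            merged.update(json.loads(path.read_text()))
    return merged

def apply_overrides(predictions, overrides):
    """Replace model predictions with known corrections where available."""
    return {account: overrides.get(account, predicted)
            for account, predicted in predictions.items()}
```

Giving the most local file the last word is only one possible choice; the opposite order (the globally verified list winning over local edits) could equally be argued for, which is exactly why a precedence order needs to be agreed upon.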