Number of Triplets #76
Comments
Hey, just looking at the trainer actually! You shouldn't be getting more than 8M triplets, as the defaults are to mine 10 hard negative examples per query (which you're doing), and also to ensure that each query has a maximum of 20 triplets 🤔. Could you share your code? It's possible that the …
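The "8M" upper bound and the expected count follow directly from the stated defaults. A quick sketch of the arithmetic (variable names here are illustrative, not RAGatouille's actual parameters):

```python
# Each (query, positive) pair yields one triplet per mined hard negative,
# capped at a per-query maximum. With ~400K pairs:
pairs = 400_000
negatives_per_query = 10   # default number of mined hard negatives
max_triplets_per_query = 20  # default per-query cap

# Hard upper bound implied by the per-query cap:
upper_bound = pairs * max_triplets_per_query
print(upper_bound)  # 8000000 -> the "8M" ceiling mentioned above

# Expected count with the default of 10 negatives per query:
expected = pairs * min(negatives_per_query, max_triplets_per_query)
print(expected)  # 4000000
```

Anything far above these numbers points at duplicated entries rather than legitimate mining output.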
Sure here is the code:
Thank you, I'll try to figure out the exact issue soon! In the meantime, I see that you're doing this for French. If useful, you might want to check out https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR, which is a ColBERT also initialised from CamemBERT-base and trained on the MMARCO French split. It was trained with the upstream ColBERT codebase, so it should be plug & play with RAGatouille!
When I print the generated triplets, there appear to be duplicates 🤔
Thank you! It's not immediately obvious what the issue is, but this helps with diagnosis a lot... I have a few potential ideas of what the issue could be, but need to look into it deeper... Although this has already allowed me to spot a related bug which could cause duplicates to appear, but shouldn't cause the total number of entries to go up 🤔 (when …
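For reference, a duplicate-detection pass like the one discussed can be sketched generically; this is an illustration of the kind of fix involved, not the actual code from the PR:

```python
# Deduplicate (query, positive, negative) triplets while preserving order.
def dedupe_triplets(triplets):
    seen = set()
    unique = []
    for triplet in triplets:
        key = tuple(triplet)  # hashable key for the seen-set
        if key not in seen:
            seen.add(key)
            unique.append(triplet)
    return unique

triplets = [
    ("q1", "pos1", "neg1"),
    ("q1", "pos1", "neg1"),  # exact duplicate, should be dropped
    ("q1", "pos1", "neg2"),
]
print(dedupe_triplets(triplets))  # keeps only the two unique triplets
```

Counting `len(dedupe_triplets(...))` against the raw length is a quick way to check whether inflated totals come from duplication.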
Progress will be in #78 |
Hey, so #78 should resolve (at least partially!) the duplicates issue. As for the order-of-magnitude issue, I have loaded the data using your code above, and there are 39,780,811 pairs, so it'd make sense that you end up with roughly ~40M triplets? Or do you run some extra processing on the pairs to reduce them to 400k?
Thank you for your responsiveness :) |
I assume the issue is somewhere else; I just reviewed prepare_training_data and you do remove duplicates 🤔 I am trying your fix :)
We are good, it works perfectly now! We end up with 4M triplets (I previously said we should end up with 45M, but in fact it is 4.5M) and no duplicates! Thank you for the fix Benjamin :)
No worries, glad your issue is fixed and thanks for the debugging assistance! I'll release the PR on PyPI later today so it also includes the main-branch fix for proper shuffling pre-training (right now in the branch there are some cases where triplets aren't shuffled properly, which makes training less efficient because of in-batch negatives).
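The shuffling point matters because in-batch negatives are only informative when a batch mixes triplets from different queries; if triplets from the same query sit adjacently, batches are full of near-duplicate passages. A minimal sketch of a seeded global shuffle (not the actual fix from the PR):

```python
import random

# Globally shuffle triplets before batching so that consecutive triplets
# come from different queries, keeping in-batch negatives diverse.
def shuffle_triplets(triplets, seed=42):
    rng = random.Random(seed)  # fixed seed for reproducible ordering
    shuffled = list(triplets)  # copy so the original list is untouched
    rng.shuffle(shuffled)
    return shuffled
```

A global shuffle is the simplest remedy; fancier schemes interleave queries explicitly, but a seeded shuffle already breaks up same-query runs in expectation.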
Merged in 0.0.6b0! With the shuffling & duplicate fixes 😄 |
Hello Benjamin,
Again thank you for this amazing work! :)
There is something I do not understand: I have 400K (query, positive) pairs from MS MARCO, but when I create the training dataset with hard mining set to 10, I get 40M triplets. I do not understand how; do you have an explanation?
Thank you!