Number of Triplets #76

Closed · nico2rdj opened this issue Jan 27, 2024 · 13 comments
Labels: ongoing (Feature is currently being worked on), question (Further information is requested)

Comments

@nico2rdj

Hello Benjamin,

Again, thank you for this amazing work! :)
There is something I do not understand: I have 400K (query, positive) pairs from MS MARCO, but when I create the training dataset with hard negative mining set to 10, I get 40M triplets. How is that possible? Do you have an explanation?

Thank you!

@bclavie
Collaborator

bclavie commented Jan 27, 2024

Hey,

Just looking at the trainer now, actually! You shouldn't be getting more than 8M triplets: the defaults mine 10 hard negative examples per query (which you're doing) and also cap each query at a maximum of 20 triplets 🤔.

Could you share your code? It's possible that the pairs pathway could accidentally be generating too many negatives!
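As a rough sanity check of that ceiling (a sketch; the 20-triplet cap is the default mentioned above, not a named library parameter):

    # Expected ceiling on triplet count under the defaults described above:
    # each query contributes at most 20 triplets, however many negatives are mined.
    num_pairs = 400_000          # unique (query, positive) pairs
    max_triplets_per_query = 20  # per-query cap mentioned above
    print(f"{num_pairs * max_triplets_per_query:,}")  # 8,000,000 -- well below ~40M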

@bclavie bclavie added the question Further information is requested label Jan 27, 2024
@nico2rdj
Author

Sure, here is the code:

    from datasets import load_dataset
    from tqdm import tqdm
    from ragatouille import RAGTrainer

    def run():
        print("load dataset")
        dataset = load_dataset('unicamp-dl/mmarco', 'french')

        # Collect (query, positive) pairs from the train split
        pairs = []
        for data in tqdm(dataset['train']):
            query = data['query']
            doc = data['positive']
            pairs.append((query, doc))

        trainer = RAGTrainer(model_name="colBERT", pretrained_model_name="almanach/camembert-base", language_code="fr")

        trainer.prepare_training_data(raw_data=pairs, data_out_path="./data", all_documents=None, num_new_negatives=10, mine_hard_negatives=True, hard_negative_model_size="base")

@bclavie
Collaborator

bclavie commented Jan 27, 2024

Thank you, I'll try to figure out the exact issue soon!

In the meantime, I see that you're doing this for French. If useful, you might want to check out https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR, which is a ColBERT model also initialised from CamemBERT-base and trained on the mMARCO French split. It was trained with the upstream ColBERT codebase, so it should be plug-and-play with RAGatouille!
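For reference, a minimal sketch of loading that checkpoint through RAGatouille (the documents and query here are purely illustrative, and the index/search calls assume the standard RAGPretrainedModel API):

    from ragatouille import RAGPretrainedModel

    # Load the pretrained French ColBERT checkpoint from the Hugging Face Hub.
    model = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv1-camembert-base-mmarcoFR")

    # Illustrative documents -- any list of strings works.
    documents = [
        "Paris est la capitale de la France.",
        "Le Mont Blanc est le plus haut sommet des Alpes.",
    ]

    # Build an index over the documents, then run a query against it.
    model.index(collection=documents, index_name="demo_fr")
    results = model.search(query="Quelle est la capitale de la France ?", k=2)
    print(results)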

@nico2rdj
Author

When I print the generated triplets, there appear to be duplicates 🤔

@nico2rdj
Author

[screenshot: the printed triplets, showing duplicated entries]

@bclavie
Collaborator

bclavie commented Jan 27, 2024

Thank you! It's not immediately obvious what the issue is, but this helps a lot with diagnosing it... I have a few potential ideas of what the issue could be, but I need to look into it more deeply...

This has already allowed me to spot a related bug which could cause duplicates to appear, though it shouldn't cause the total number of entries to go up 🤔 (when extra_triplets_needed > 0, there was no check to ensure the new triplets were unique).
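For illustration, a minimal sketch of the kind of uniqueness guard that fixes this (a hypothetical helper, not the actual RAGatouille code; only the name extra_triplets_needed comes from the comment above):

    import random

    def top_up_triplets(existing, extra_triplets_needed, candidate_pool):
        """Add triplets from candidate_pool until the quota is met, skipping duplicates."""
        seen = set(existing)  # triplets as hashable (query, positive, negative) tuples
        added = []
        for candidate in random.sample(candidate_pool, len(candidate_pool)):
            if len(added) >= extra_triplets_needed:
                break
            if candidate not in seen:  # the check that was missing
                seen.add(candidate)
                added.append(candidate)
        return existing + added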

@bclavie
Collaborator

bclavie commented Jan 27, 2024

Progress will be in #78

@bclavie bclavie added the ongoing Feature is currently being worked on label Jan 27, 2024
@bclavie
Collaborator

bclavie commented Jan 27, 2024

Hey, so #78 should resolve (at least partially!) the duplicates issue.

As for the order-of-magnitude issue: I have loaded the data using your code above, and there are 39,780,811 pairs, so it'd make sense that you end up with roughly ~40M triplets? Or do you run some extra processing on the pairs to reduce them to 400K?

@nico2rdj
Author

Thank you for your responsiveness :)
It's clearer now. We have 39,780,811 pairs before using the prepare_training_data function, and we end up with approximately the same number, around 40 million triplets. After removing duplicates, we have 457,040 pairs before the prepare_training_data call, so logically we should end up with about 45 million unique triplets. I assume that the prepare_training_data function also removes duplicates, which is why we end up with around 40 million. However, there is (was) a peculiar issue regarding the number of duplicate triplets. I am currently running this function again, but removing the duplicates beforehand.
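For anyone following along, a sketch of that pre-deduplication step (dict.fromkeys keeps the first occurrence of each pair and preserves order; the pairs are tuples, so they are hashable, and the counts in the comment above suggest this takes ~39.8M raw rows down to ~457K unique pairs):

    # Collapse repeated (query, positive) pairs before mining negatives.
    unique_pairs = list(dict.fromkeys(pairs))
    print(f"{len(pairs):,} raw pairs -> {len(unique_pairs):,} unique pairs")

    trainer.prepare_training_data(raw_data=unique_pairs, data_out_path="./data", all_documents=None, num_new_negatives=10, mine_hard_negatives=True, hard_negative_model_size="base")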

@nico2rdj
Author

I assume the issue is somewhere else: I just reviewed prepare_training_data, and you do remove duplicates 🤔 I am trying your fix :)

@nico2rdj
Author

We are good, it works perfectly now: we end up with 4M triplets (I previously said we should end up with 45M, but in fact it is 4.5M) and no duplicates! Thank you for the fix, Benjamin :)

@bclavie
Collaborator

bclavie commented Jan 28, 2024 via email

@bclavie
Collaborator

bclavie commented Jan 28, 2024

Merged in 0.0.6b0! With the shuffling & duplicate fixes 😄

@bclavie bclavie closed this as completed Jan 28, 2024