Number of Triplets #76

Closed · nico2rdj opened this issue Jan 27, 2024 · 13 comments
Labels: ongoing (Feature is currently being worked on), question (Further information is requested)

Comments

@nico2rdj

Hello Benjamin,

Again, thank you for this amazing work! :)
There is something I do not understand: I have 400K (query, positive) pairs from MS MARCO, but when I create the training dataset with hard negative mining set to 10, I get 40M triplets. How is that possible? Do you have an explanation?

Thank you!

@bclavie
Collaborator

bclavie commented Jan 27, 2024

Hey,

Just looking at the trainer now, actually! You shouldn't be getting more than 8M triplets: the defaults mine 10 hard negative examples per query (which you're doing) and also cap each query at a maximum of 20 triplets 🤔.

Could you share your code? It's possible that the pairs pathway could accidentally be generating too many negatives!
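As a rough sanity check of that ceiling (a sketch; the 20-triplet cap is the default mentioned above, not a named library parameter):

    # Expected ceiling on triplet count under the defaults described above:
    # each query contributes at most 20 triplets, however many negatives are mined.
    num_pairs = 400_000          # unique (query, positive) pairs
    max_triplets_per_query = 20  # per-query cap mentioned above
    print(f"{num_pairs * max_triplets_per_query:,}")  # 8,000,000 -- well below ~40M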

@bclavie bclavie added the question Further information is requested label Jan 27, 2024
@nico2rdj
Author

Sure, here is the code:

    from datasets import load_dataset
    from tqdm import tqdm
    from ragatouille import RAGTrainer

    def run():
        print("load dataset")
        dataset = load_dataset('unicamp-dl/mmarco', 'french')

        # Collect (query, positive) pairs from the train split
        pairs = []
        for data in tqdm(dataset['train']):
            query = data['query']
            doc = data['positive']
            pairs.append((query, doc))

        trainer = RAGTrainer(model_name="colBERT", pretrained_model_name="almanach/camembert-base", language_code="fr")

        trainer.prepare_training_data(raw_data=pairs, data_out_path="./data", all_documents=None, num_new_negatives=10, mine_hard_negatives=True, hard_negative_model_size="base")

@bclavie
Collaborator

bclavie commented Jan 27, 2024

Thank you, I'll try to figure out the exact issue soon!

In the meantime, I see that you're doing this for French. If useful, you might want to check out https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR, which is a ColBERT model also initialised from CamemBERT-base and trained on the mMARCO French split. It was trained with the upstream ColBERT codebase, so it should be plug-and-play with RAGatouille!
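For reference, a minimal sketch of loading that checkpoint through RAGatouille (the documents and query here are purely illustrative, and the index/search calls assume the standard RAGPretrainedModel API):

    from ragatouille import RAGPretrainedModel

    # Load the pretrained French ColBERT checkpoint from the Hugging Face Hub.
    model = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv1-camembert-base-mmarcoFR")

    # Illustrative documents -- any list of strings works.
    documents = [
        "Paris est la capitale de la France.",
        "Le Mont Blanc est le plus haut sommet des Alpes.",
    ]

    # Build an index over the documents, then run a query against it.
    model.index(collection=documents, index_name="demo_fr")
    results = model.search(query="Quelle est la capitale de la France ?", k=2)
    print(results)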

@nico2rdj
Author

When I print the generated triplets, there appear to be duplicates 🤔

@nico2rdj
Author

[screenshot: the printed triplets, showing duplicated entries]

@bclavie
Collaborator

bclavie commented Jan 27, 2024

Thank you! It's not immediately obvious what the issue is, but this helps a lot with diagnosing it... I have a few potential ideas of what the issue could be, but I need to look into it more deeply...

This has already allowed me to spot a related bug which could cause duplicates to appear, though it shouldn't cause the total number of entries to go up 🤔 (when extra_triplets_needed > 0, there was no check to ensure the new triplets were unique).
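For illustration, a minimal sketch of the kind of uniqueness guard that fixes this (a hypothetical helper, not the actual RAGatouille code; only the name extra_triplets_needed comes from the comment above):

    import random

    def top_up_triplets(existing, extra_triplets_needed, candidate_pool):
        """Add triplets from candidate_pool until the quota is met, skipping duplicates."""
        seen = set(existing)  # triplets as hashable (query, positive, negative) tuples
        added = []
        for candidate in random.sample(candidate_pool, len(candidate_pool)):
            if len(added) >= extra_triplets_needed:
                break
            if candidate not in seen:  # the check that was missing
                seen.add(candidate)
                added.append(candidate)
        return existing + added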

@bclavie
Collaborator

bclavie commented Jan 27, 2024

Progress will be in #78

@bclavie bclavie added the ongoing Feature is currently being worked on label Jan 27, 2024
@bclavie
Collaborator

bclavie commented Jan 27, 2024

Hey, so #78 should resolve (at least partially!) the duplicates issue.

As for the order-of-magnitude issue: I have loaded the data using your code above, and there are 39,780,811 pairs, so it'd make sense that you end up with roughly ~40M triplets? Or do you run some extra processing on the pairs to reduce them to 400K?

@nico2rdj
Author

Thank you for your responsiveness :)
It's clearer now. We have 39,780,811 pairs before using the prepare_training_data function, and we end up with approximately the same number, around 40 million triplets. After removing duplicates, we have 457,040 pairs before the prepare_training_data call, so logically we should end up with about 45 million unique triplets. I assume that the prepare_training_data function also removes duplicates, which is why we end up with around 40 million. However, there is (was) a peculiar issue regarding the number of duplicate triplets. I am currently running this function again, but removing the duplicates beforehand.
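For anyone following along, a sketch of that pre-deduplication step (dict.fromkeys keeps the first occurrence of each pair and preserves order; the pairs are tuples, so they are hashable, and the counts in the comment above suggest this takes ~39.8M raw rows down to ~457K unique pairs):

    # Collapse repeated (query, positive) pairs before mining negatives.
    unique_pairs = list(dict.fromkeys(pairs))
    print(f"{len(pairs):,} raw pairs -> {len(unique_pairs):,} unique pairs")

    trainer.prepare_training_data(raw_data=unique_pairs, data_out_path="./data", all_documents=None, num_new_negatives=10, mine_hard_negatives=True, hard_negative_model_size="base")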

@nico2rdj
Author

I assume the issue is somewhere else: I just reviewed prepare_training_data, and you do remove duplicates 🤔 I am trying your fix :)

@nico2rdj
Author

We are good, it works perfectly now: we end up with 4M triplets (I previously said we should end up with 45M, but in fact it is 4.5M) and no duplicates! Thank you for the fix, Benjamin :)

@bclavie
Collaborator

bclavie commented Jan 28, 2024 via email

@bclavie
Collaborator

bclavie commented Jan 28, 2024

Merged in 0.0.6b0! With the shuffling & duplicate fixes 😄

@bclavie bclavie closed this as completed Jan 28, 2024