Updates for regenerating text #83

MinaAlmasi opened this issue Dec 18, 2024 · 0 comments

I would suggest the following steps for when we want to rerun the text generation with new models:

Do before re-generating

  • Update vLLM and the other necessary packages, so we can also bump the Python version.

    Everything currently runs in the Coder Python 1.87.2 app on UCloud, which ships Python 3.10. There have been 9 updates to the UCloud app since then.

  • Look into whether vLLM has added a "min_tokens" parameter.

    Currently, I compute string lengths and re-generate in a for loop (up to n = 20 times) to avoid generations shorter than the desired token count for each task. This hacky workaround becomes unnecessary if a built-in option now exists.
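For reference, the retry workaround and its built-in replacement could look roughly like this (a sketch, not the repo's actual code: `generate` stands in for a single vLLM call returning one completion string, and word count stands in for token count; recent vLLM releases expose a `min_tokens` field on `SamplingParams`, but check the installed version before relying on it):

```python
def generate_with_retries(generate, prompt, min_len, max_tries=20):
    """Current hacky approach: re-sample until the completion is long enough."""
    completion = ""
    for _ in range(max_tries):
        completion = generate(prompt)
        if len(completion.split()) >= min_len:
            break
    return completion

# With a recent vLLM, the whole loop collapses into a sampling parameter:
#
#   from vllm import SamplingParams
#   params = SamplingParams(min_tokens=min_len, max_tokens=1024)
#   outputs = llm.generate(prompts, params)
```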

  • Consider using full model names instead of the current short-hands.

    E.g., use stabilityai/StableBeluga-7B (or StableBeluga-7B) instead of beluga7b.
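A hypothetical helper for the transition: map the repo's current short-hands to full Hugging Face IDs and accept either form. Only the beluga7b → stabilityai/StableBeluga-7B pair is taken from this issue; any other entries would need to be filled in, and the function name is made up.

```python
# Assumed mapping; extend with the repo's remaining short-hands.
SHORTHAND_TO_HF_ID = {
    "beluga7b": "stabilityai/StableBeluga-7B",
}

def resolve_model_name(name: str) -> str:
    """Return the full HF ID for a known short-hand, or pass a full ID through."""
    return SHORTHAND_TO_HF_ID.get(name, name)
```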

  • Remove model names as prefixes on the completions column.

    Back when I started the project, I somehow thought it was a good idea to prefix the column name with the model (e.g. "beluga7b_completions"), a prefix I ultimately strip in the make_dataset folder to standardise formats across models. The column should just be called completions.
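The standardisation step described above could be sketched like this (a hypothetical function, not the repo's make_dataset code; the "<model>_completions" pattern is from this issue):

```python
import pandas as pd

def standardise_completions_column(df: pd.DataFrame, model_name: str) -> pd.DataFrame:
    """Rename e.g. 'beluga7b_completions' to a plain 'completions' column."""
    prefixed = f"{model_name}_completions"
    if prefixed in df.columns:
        df = df.rename(columns={prefixed: "completions"})
    return df
```

If the prefixes are dropped at generation time instead, this cleanup step disappears entirely.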

After re-generating

  • Remove the HF pipeline.

    At the time of coding, I also added the option of generating via the HF interface. For simplicity, I think we should remove it: it is not needed for the scope of the project (especially if we want to split the repos at some point).

  • Run embeddings with a smaller model.

    I used nvidia/NV-Embed-v2 because it scored highest on MTEB, but it is a heavy model - is it overkill for a baseline? I switched from FP32 to FP16 precision to make it less memory-hungry, and could run it with a batch size of 16 on the new NVIDIA L40 GPUs.
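As a back-of-the-envelope check on the FP32 → FP16 switch (numpy stands in for the actual model tensors here, and the 4096 embedding dimension is an assumption about NV-Embed-v2's output size):

```python
import numpy as np

# One batch of embeddings at the two precisions; FP16 uses 2 bytes per
# value instead of 4, halving the memory footprint.
batch, dim = 16, 4096
fp32_batch = np.zeros((batch, dim), dtype=np.float32)
fp16_batch = fp32_batch.astype(np.float16)
print(fp32_batch.nbytes // fp16_batch.nbytes)  # → 2
```

A genuinely smaller model would of course save far more than the 2x from halving precision, which is the argument for swapping NV-Embed-v2 out for a lighter baseline.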
