Conversation
|
Hey, great work! I always wanted to translate this dataset to German or Esperanto. The main problem is that the Alpaca license isn't usable for open-source LLMs, because the ChatGPT terms do not allow its output to be used to train other models. Because of that, it cannot be used for Open Assistant or for any commercial project. However, having this dataset is surely useful for training experimental systems and for science projects. BTW, do you know about the Alpaca Data Cleaned project? It fixed a lot of the errors in the dataset, like wrong calculations: https://github.com/gururise/AlpacaDataCleaned |
Hi, thanks for your comment. Yes, I have used the cleaned version. Sadly, I didn't know about the license restrictions. The dataset itself (Alpaca) is published under Apache 2.0, and I have also published my dataset under Apache 2.0. Isn't that good enough? |
|
Unfortunately not, see https://github.com/gururise/AlpacaDataCleaned#license. Maybe you can translate the English and Spanish Open Assistant datasets instead? Both are quite big. |
Hi,
Over the last two days, I have been working on translating Alpaca into Persian (Farsi), and this is the result. I have reviewed the translations, and in my opinion they are pretty good.
Also, the dataset is still being translated on Kaggle, and that will be finished in a couple of days. I will update the datasets accordingly when the translation is complete.
I have added two datasets: one is instruction-based and the other is an Orca-style dataset. I knew how to add the first one, but I don't know how to add the Orca-style dataset to your datasets.
Thank you for your attention.