Dataset Preparation Script #29

Open
mmderakhshani opened this issue Sep 10, 2024 · 3 comments

Comments

@mmderakhshani

Hi, thanks for this excellent work.

Would you mind releasing the dataset preparation code, such as the scripts for downloading the dataset and recaptioning it with ShareGPT4V? I could not find them in the repo.

Thanks.

@Sierkinhane
Collaborator

Sierkinhane commented Sep 11, 2024

Hi, unfortunately, we did not prepare the datasets ourselves. You can try this download tool to download large-scale datasets as tars, which our code can load directly.
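
For readers unfamiliar with datasets stored "as tars", here is a minimal sketch of the webdataset-style layout, in which files that share a basename inside a tar shard form one sample. It uses only the Python standard library. The file names, captions, and the helpers `make_demo_shard` and `read_samples` are hypothetical and written for illustration; this is not the loader used in this repo.

```python
import io
import tarfile
from collections import defaultdict


def make_demo_shard() -> bytes:
    """Build a tiny in-memory tar laid out like a webdataset shard (made-up contents)."""
    files = {
        "000000.jpg": b"<jpeg bytes>",
        "000000.txt": b"a caption for image 0",
        "000001.jpg": b"<jpeg bytes>",
        "000001.txt": b"a caption for image 1",
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_samples(shard_bytes: bytes) -> dict:
    """Group tar members by key so each key maps to one {extension: bytes} sample."""
    samples = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split at the first dot; assumes no dots in directory names.
            key, _, ext = member.name.partition(".")
            samples[key][ext] = tar.extractfile(member).read()
    return dict(samples)


samples = read_samples(make_demo_shard())
for key, fields in samples.items():
    print(key, sorted(fields), fields["txt"].decode())
```

In a real shard the `.jpg` entry would hold the image bytes and the `.txt` entry the (re)caption; a shard on disk can be opened the same way by passing its path to `tarfile.open`.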

We do use ShareGPT4V to recaption the data, and the prompt is: “Analyze the image in a comprehensive and detailed manner.”

@mmderakhshani
Author

Thanks for the reply! I know that repo, since it provides datasets in webdataset format. However, when checking the project config file, I noticed that ‘laion-aesthetics-12m’ is used during pretraining. I could not find this dataset by googling, which is why I asked: without this data we cannot reproduce your experiments. Thanks!

@Sierkinhane
Collaborator

Sierkinhane commented Sep 11, 2024

Hi, laion-aesthetics-12m is available here: https://huggingface.co/datasets/dclure/laion-aesthetics-12m-umap
