
Conversation

phalem
Contributor

@phalem phalem commented Jun 29, 2023

Add all Papyrus dataset entries that have a pChEMBL value, from: https://doi.org/10.4121/16896406.v3. To understand what the columns mean in detail, see the README at the reference link. The data was cleaned and uploaded to Hugging Face, as the original data is difficult to handle on a normal computer. Link: https://huggingface.co/datasets/phalem/awesome_chem_clean_data/resolve/main/pchembl_papyrus.csv.gz (size: 105 MB).
Please note: I filled NA values in the fields other than pchembl with "unknown". Please review the other fields if possible and revise the columns as well.
Examples include:
What is the <...> mentioned at <...>?
What is the <...> of the <...> on <...>?
What <activity_type> of the <...> was reported on <...>? Ka, for example.

Please, if possible, this needs some enhancement.
@MicPie, can you help me with this?
The data is large; Hugging Face raises an error when loading it with load_dataset.
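One possible workaround, as a sketch: stream the gzipped CSV in chunks with pandas instead of materialising the whole file via `load_dataset`. The URL is the one given above; the function name and chunk size are my own choices, not part of the dataset.

```python
# Sketch: read the ~105 MB pchembl_papyrus.csv.gz in chunks so the
# whole table never has to fit in memory at once.
import pandas as pd

URL = ("https://huggingface.co/datasets/phalem/awesome_chem_clean_data/"
       "resolve/main/pchembl_papyrus.csv.gz")

def iter_chunks(path=URL, chunksize=100_000):
    # pandas decompresses .csv.gz transparently and yields
    # DataFrame chunks of at most `chunksize` rows each
    for chunk in pd.read_csv(path, compression="gzip", chunksize=chunksize):
        yield chunk
```

Each chunk can then be filtered or aggregated independently, which also sidesteps the load_dataset memory problem.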

For the 60 million datapoints, we will need to check whether each compound is active or not, as I found that compounds without a pchembl value are inactive. However, I haven't checked all of the data yet; I will find a way to do that.
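As a sketch of that active/inactive check, assuming a pChEMBL cutoff (6.5 is a common convention, but the right threshold for Papyrus would need to be confirmed) and treating missing pChEMBL values as inactive per the observation above:

```python
# Sketch: label a compound from its pChEMBL value.
# Threshold 6.5 and the "missing means inactive" rule are assumptions.
import math

def label_activity(pchembl_value, threshold=6.5):
    # rows without a pChEMBL value are treated as inactive,
    # matching the observation in the comment above
    if pchembl_value is None or (
        isinstance(pchembl_value, float) and math.isnan(pchembl_value)
    ):
        return "inactive"
    return "active" if pchembl_value >= threshold else "inactive"
```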

Thank you.

@phalem phalem mentioned this pull request Jun 30, 2023
@MicPie MicPie self-requested a review July 26, 2023 13:13
@MicPie
Contributor

MicPie commented Jul 26, 2023

Hi @phalem thank you for looking into the Papyrus data, this looks very interesting!

For this dataset you used the data from https://data.4tu.nl/file/ca10bf7d-f508-4d54-9c9a-5a9e9c1adef9/36feebfc-4703-4290-90f2-f3e41261f0c4 right?
If so, we don't have to go via the HF Hub route at all, or maybe I'm missing something?

PS: I just merged with the latest main and applied the pre-commit hooks.

@MicPie
Contributor

MicPie commented Jul 26, 2023

Ok, I'm currently trying to get the data from the direct source, but the data is very big and the transform.py script needs a lot of RAM. Let's see how this works out; depending on that, we can discuss how best to approach it.
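A memory-light approach could process and write the file chunk by chunk instead of loading it whole; `transform_chunk` below is a hypothetical placeholder, not the actual transform.py logic:

```python
# Sketch: chunked transform that never holds the full dataset in RAM.
import pandas as pd

def transform_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # placeholder transform: drop rows with no pchembl value
    # (the real per-row logic lives in transform.py)
    return chunk.dropna(subset=["pchembl_value"])

def transform_file(src, dst, chunksize=100_000):
    first = True
    for chunk in pd.read_csv(src, chunksize=chunksize):
        out = transform_chunk(chunk)
        # append chunks so only one chunk is in memory at a time;
        # write the header only with the first chunk
        out.to_csv(dst, mode="w" if first else "a", header=first, index=False)
        first = False
```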
But this seems to be a great and big dataset! :-)
