some problems for RxRx3 #9

Open
liujunhongznn opened this issue Jul 17, 2023 · 0 comments

@liujunhongznn

Thanks for your great work on RxRx3! It is an excellent dataset for HCS (high-content screening) tasks. I'm downloading the RxRx3 dataset and have done some analysis on the data. Here are my questions:

Questions about the COMPOUND data

  1. For rows where the "SMILES" column is NaN, the "treatment" column is either CRISPR_control or EMPTY_control. What do CRISPR_control and EMPTY_control mean, and how do they differ?
  2. What does the string after the SMILES mean? For example, the string "|c:11,13,23,29,32,t:1,9,26|" in CC1=C(C(CC(=O)N1)C1=CC=C(C=C1)C(F)(F)F)C(=O)NC1=C(F)C=C2NN=CC2=C1 |c:11,13,23,29,32,t:1,9,26|
  3. How should the label be defined if I want to use this dataset for supervised pre-training or a classification task? Should I use SMILES/treatment directly as the label, or treat SMILES/treatment at each concentration as a separate label?
  4. There appears to be an error in compound003/Plate36.tar: some images (PNG files) that are listed in the metadata CSV are missing from the archive.
  5. The metadata CSV for the compound data contains 95,701 wells (well_id), while the embedding parquet files contain 220,800 wells (well_id) in total. How should I interpret the 125,099 extra wells (220,800 - 95,701) in the embedding parquet files?
  6. Which model was used to extract the embeddings? Can the model be made publicly available? Which dataset was used to train it, and how large was the training set?
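
To make questions 2 and 5 concrete, here is a minimal Python sketch of the two checks involved. It is not the official RxRx3 loader. The function names `split_smiles_annotation` and `compare_well_ids` are hypothetical, and the code assumes only that the trailing block is separated from the SMILES by a space and enclosed in `|...|`, as in the example above. It treats that block as opaque text and does not interpret it.

```python
def split_smiles_annotation(value):
    """Split 'SMILES |annotation|' into (smiles, annotation).

    Returns (smiles, None) when there is no trailing |...| block.
    Assumption: the block is separated from the SMILES by one space.
    """
    text = value.strip()
    if text.endswith("|") and " |" in text:
        smiles, _, rest = text.partition(" |")
        return smiles.strip(), rest[:-1]  # drop the closing pipe
    return text, None


def compare_well_ids(metadata_ids, embedding_ids):
    """Report which well IDs appear in only one of the two sources."""
    meta = set(metadata_ids)
    emb = set(embedding_ids)
    return {
        "only_in_metadata": sorted(meta - emb),
        "only_in_embeddings": sorted(emb - meta),
        "shared": len(meta & emb),
    }


if __name__ == "__main__":
    raw = ("CC1=C(C(CC(=O)N1)C1=CC=C(C=C1)C(F)(F)F)C(=O)NC1=C(F)C=C2NN=CC2=C1 "
           "|c:11,13,23,29,32,t:1,9,26|")
    print(split_smiles_annotation(raw))
    # Toy well IDs for illustration only, not real RxRx3 identifiers.
    print(compare_well_ids(["w1", "w2"], ["w1", "w2", "w3"]))
```

In practice the two ID lists would come from the `well_id` column of the metadata CSV and of the embedding parquet files. Listing the IDs under `only_in_embeddings` would show which experiments or plates the extra wells belong to.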

Questions about the CRISPR data

  1. How should the label be defined if I want to use this dataset for supervised pre-training or a classification task? Should I use the gene directly as the label, or the treatment?
  2. What does "EMPTY_control" mean in the gene and treatment columns?
  3. The treatment notation is "gene name_guide number". What does "guide number" mean? (e.g. RXRX3-79420_guide_10)
  4. Plate2.tar, Plate3.tar, and Plate5.tar in the gene017 experiment fail to download: the download is interrupted partway through.
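
To illustrate the two labelling options in question 1, here is a minimal Python sketch that splits a treatment string into its gene part and guide number. The function name `parse_treatment` is hypothetical, and the `<gene>_guide_<number>` pattern is inferred only from the example RXRX3-79420_guide_10. Values that do not match, such as EMPTY_control, are returned unchanged with no guide number.

```python
import re

# Assumed pattern, inferred from the single example "RXRX3-79420_guide_10".
_TREATMENT_PATTERN = re.compile(r"^(?P<gene>.+)_guide_(?P<guide>\d+)$")


def parse_treatment(treatment):
    """Return (gene, guide_number), or (treatment, None) if no guide suffix."""
    match = _TREATMENT_PATTERN.match(treatment)
    if match is None:
        return treatment, None
    return match.group("gene"), int(match.group("guide"))


if __name__ == "__main__":
    for value in ["RXRX3-79420_guide_10", "EMPTY_control"]:
        print(value, "->", parse_treatment(value))
```

Using the first element as the class gives one label per gene, while using the full treatment string gives one label per gene and guide combination.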