I know this may not be the right forum for this question. I tried posting it on Biostars and the mailing list but haven't heard back, so I'm putting it here as well.
I'm hoping to run a segmentation using quite a few tracks derived from Hi-C across the human genome. Half of these will be continuous, with values ranging from 0 to 0.15, and the other half will be binary, with values 0 and 1. Both sets of tracks are very sparse, with many regions that are all 0. I have a few questions on the best way to approach this.
Will the discrete and continuous tracks both be treated exactly the same, or can I specify within Segway which tracks are continuous and which are discrete?
Do I need to normalize the values in some way (and if so, how best) to keep the binary 0/1 tracks from drowning out the lower-scored 0-0.15 tracks?
Should I be training on specific regions of the genome that I know have a signal? I'm worried that including only 5% or less of the genome in minibatch training may not pick up all the variation in the tracks, given how sparse the data is.
Thanks for any guidance.
I've not seen any correspondence on the mailing list, nor in my spam so this is something we may need to look into!
Here are the answers I can give to the best of my ability:
Discrete and continuous data are both modeled with a continuous probability distribution. In theory this really should not be an issue.
No normalization should be necessary. The input is modeled with a probability distribution, and each track gets its own distribution. Tracks are evenly weighted by default. Also note that, by default, all input is normalized with an arcsinh transformation within a given track.
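As a rough illustration of what the arcsinh transformation does to values in the two ranges mentioned in the question, here is a minimal sketch in plain Python. It is not Segway's own code; the sample values and the `arcsinh_transform` helper are made up for the example.

```python
import math

# Hypothetical sample values matching the ranges described in the question.
continuous = [0.0, 0.01, 0.05, 0.15]
binary = [0.0, 1.0]


def arcsinh_transform(values):
    """Apply the inverse hyperbolic sine to each value in a track."""
    return [math.asinh(v) for v in values]


for name, values in (("continuous", continuous), ("binary", binary)):
    for raw, out in zip(values, arcsinh_transform(values)):
        print(f"{name}: {raw:.2f} -> {out:.4f}")
```

For values this small, arcsinh is close to the identity, so the transformation by itself does not rescale a 0-0.15 track. The difference in scale between tracks is absorbed by each track having its own distribution, as described above.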
Selecting input regions is more of an art than a science. Here I would suggest picking regions where you know there is signal, especially if the data is truly sparse and you're concerned that random region selection will simply "miss" most of the time. Since it's Hi-C data, you could, for example, start with some candidate promoter regions in your genome.
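One possible way to build such a region list directly from a sparse track is sketched below. This is only an illustration under assumptions: the `signal_windows` and `to_bed` helpers, the window size, and the toy track are all hypothetical, and the track is assumed to be available as a per-base list of values for one chromosome.

```python
def signal_windows(values, chrom, window=1000, min_nonzero=1):
    """Return BED-style (chrom, start, end) tuples for windows with signal.

    `values` is a per-base list of track values for one chromosome.
    A window qualifies if it has at least `min_nonzero` nonzero positions.
    Adjacent qualifying windows are merged into a single interval.
    """
    regions = []
    for start in range(0, len(values), window):
        end = min(start + window, len(values))
        nonzero = sum(1 for v in values[start:end] if v != 0)
        if nonzero >= min_nonzero:
            if regions and regions[-1][2] == start:
                # Extend the previous interval instead of adding a new one.
                regions[-1] = (chrom, regions[-1][1], end)
            else:
                regions.append((chrom, start, end))
    return regions


def to_bed(regions):
    """Format regions as tab-separated BED lines."""
    return "\n".join(f"{c}\t{s}\t{e}" for c, s, e in regions)


# Toy track: mostly zeros, with two short stretches of signal.
track = [0.0] * 10 + [0.1] * 3 + [0.0] * 17 + [1.0] + [0.0] * 9
print(to_bed(signal_windows(track, "chr1", window=10)))
```

The resulting BED file could then be used to restrict training to those regions. I believe Segway takes such a file through `--include-coords`, but check the documentation for your version.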
@corinsexton I am going to mark this as closed unless I hear otherwise. Additionally, I cannot find any mention of this issue in the mailing list archives. Let me know if you used the address in the link, because if you did, it is something we need to remedy right away.