Using continuous and discrete tracks #161

Open
corinsexton opened this issue Feb 24, 2022 · 2 comments

Comments

@corinsexton

I know this may not be the right forum for this question. I posted it on Biostars and the mailing list but haven't heard back, so I'm putting it here as well.

I'm hoping to run a segmentation using quite a few tracks derived from Hi-C across the human genome. Half of these will be continuous, with values ranging from 0 to 0.15, and the other half will be binary discrete, with values 0 and 1. Both sets of tracks are very sparse, with many regions of 0. I have a few questions on the best way to approach this.

  • Will the discrete and continuous tracks both be treated exactly the same, or can I specify within Segway which tracks are continuous and which are discrete?
  • Do I need to (or how could I best) normalize the values in some way to keep the binary 0/1 tracks from drowning out the lower-valued 0-0.15 tracks?
  • Should I be training on certain regions of the genome that I know have a signal? I'm worried that including only 5% or less of the genome in minibatch training may not pick up all the variation in the tracks, given the sparsity of the data.

Thanks for any guidance.

@EricR86
Member

EricR86 commented Feb 24, 2022

I've not seen any correspondence on the mailing list, nor in my spam, so this is something we may need to look into!

Here are the answers I can give to the best of my ability:

  • Discrete and continuous data are both modeled with a continuous probability distribution. In theory, this should not be an issue.
  • No normalization should be necessary. The input is modeled with a probability distribution, and each track gets its own distribution. Tracks are evenly weighted by default. Also note that, by default, all input is normalized with an arcsinh transformation within a given track.
  • Data input selection is more of an art than a science. Here I would suggest picking regions that you know have data associated with them, especially if the data is truly sparse and you're concerned that random region selection will simply "miss" most of the time. If it's Hi-C data, you could, for example, find some candidate promoter regions for your genome to start with.
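
To illustrate the second point, here is a minimal sketch (not Segway's own code) of why the difference in scale between the two kinds of tracks need not matter. It assumes each track is summarized by its own Gaussian after the arcsinh transformation; the Gaussian choice, the toy values, and the helper names `asinh_transform` and `fit_gaussian` are all hypothetical and only for illustration.

```python
import math
from statistics import mean, pstdev


def asinh_transform(values):
    """Apply the inverse hyperbolic sine (arcsinh) to every value in a track."""
    return [math.asinh(v) for v in values]


def fit_gaussian(values):
    """Return the (mean, standard deviation) of a single track.

    Assumption: one Gaussian per track, fitted independently of other tracks.
    """
    return mean(values), pstdev(values)


# Hypothetical toy tracks: sparse and mostly zero, as described in the question.
continuous_track = [0.0, 0.0, 0.15, 0.0, 0.05, 0.0, 0.0, 0.1]
binary_track = [0, 0, 1, 0, 0, 0, 1, 0]

for name, track in [("continuous", continuous_track), ("binary", binary_track)]:
    transformed = asinh_transform(track)
    mu, sigma = fit_gaussian(transformed)
    # Each track ends up with its own location and scale parameters,
    # so the absolute magnitude of one track is not compared with another's.
    print(f"{name}: max={max(transformed):.4f} mean={mu:.4f} sd={sigma:.4f}")
```

For values this small, arcsinh is close to the identity, so the transformation mostly leaves both tracks as they are. The per-track parameters are what absorb the difference in scale.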

Hope this helps!

@EricR86
Member

EricR86 commented Mar 9, 2022

@corinsexton I am going to mark this as closed unless I hear otherwise. Additionally, I cannot find any mention of this issue in the mailing list archives. Let me know if you used the address in the link, because if you did, it is something we need to remedy right away.

Thanks!
