Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

About trainRef #13

Open
sneha1201 opened this issue Jun 6, 2022 · 8 comments
Open

About trainRef #13

sneha1201 opened this issue Jun 6, 2022 · 8 comments

Comments

@sneha1201
Copy link

Hello,

Can anyone help me with these terms with more explanation like from where this Ref set come from and how to do it.

trainNucl trainRef

Thanks in advance .

@rstraver
Copy link
Collaborator

rstraver commented Jun 7, 2022

trainNucl refers to the file produced at the end of the paragraph just above in the README:

The final result should contain the samplename in the first column, then followed by only read start counts per position in every column after that.

I admit my description could have been clearer.

I believe trainRef is a space separated file with at least these columns, created by getRefFF.py:
sampleID fetalFraction gender
Plus another 2 optional value columns that should indicate median coverage on autosomal chromosomes and median coverage on chromosome X (per bin).

sampleID: Path to a sample file where the sample id is part of the filename.
fetalFraction: Estimated fraction of fetal DNA based on X or Y chromosomal DNA fragments.
gender: Either Male, Female or BAD (last option is a warning output from getRefFF.py to ignore the sample here).

That said, I haven't touched this repository (or work) in years (and I don't have any data for this anymore) so if you run into some odd error I'm unlikely able to reproduce it. Hence, be warned that you probably need some decent understanding of the algorithm and scripting experience to get this to run properly on your end.

@sneha1201
Copy link
Author

Thanks for the reply.

Could you please advice me with getRefFF.py , how to use it over my reference set ?

@rstraver
Copy link
Collaborator

rstraver commented Jun 7, 2022

Look like you can adopt getRefFF.sh to run it on the data that came out of ./prepSamples.sh.
Probably needs to store the stdout to a file using > in the shell command.

@sneha1201
Copy link
Author

In getRefFF.sh what will be the argument use for the input file .

i have given the file generated from ./prepSamples.sh with script , but its not taking .

could you suggest me how to give my input files argument ?

and if i'm using python script using python 2.7
File "getRefFF.py", line 53, in
print sys.argv[1],sys.argv[1].split("/")[-1],((medA-medX)/medA*2)*100,medA,medX
ZeroDivisionError: float division by zero

i'm getting this error.

could you please guide me with the same.

@sneha1201
Copy link
Author

after resolving all the issues and running getReFF.sh script . i got the below result .

./GM01_3C.bam GM01_3C.bam 200.0 23.0 0.0
./GM01_7C.bam GM01_7C.bam 200.0 72.0 0.0
./GM01_16C.bam GM01_16C.bam 200.0 59.0 0.0
./GM01_19C.bam GM01_19C.bam 200.0 45.0 0.0
./GM01_1C.bam GM01_1C.bam 200.0 37.0 0.0
./GM01_14C.bam GM01_14C.bam 200.0 33.0 0.0
./GM01_18C.bam GM01_18C.bam 200.0 76.0 0.0
./GM01_12C.bam GM01_12C.bam 200.0 49.0 0.0
./GM01_6C.bam GM01_6C.bam 200.0 53.0 0.0
./GM01_17C.bam GM01_17C.bam 200.0 44.0 0.0
./GM01_13C.bam GM01_13C.bam 200.0 40.0 0.0
./GM01_9C.bam GM01_9C.bam 200.0 33.0 0.0
./GM01_2C.bam GM01_2C.bam 200.0 29.0 0.0
./GM01_15C.bam GM01_15C.bam 200.0 30.0 0.0
./GM01_4C.bam GM01_4C.bam 200.0 128.0 0.0
./GM01_21C.bam GM01_21C.bam 200.0 114.0 0.0
./GM01_8C.bam GM01_8C.bam 200.0 64.0 0.0
./GM01_5C.bam GM01_5C.bam 200.0 35.0 0.0
./GM01_23C.bam GM01_23C.bam 200.0 40.0 0.0
./GM01_20C.bam GM01_20C.bam 200.0 130.0 0.0
./GM01_22C.bam GM01_22C.bam 200.0 52.0 0.0
./GM01_10C.bam GM01_10C.bam 200.0 63.0 0.0

will you please help me in understanding this result.
what are these values for.

@rstraver
Copy link
Collaborator

It's produced by this line in the script:

print sys.argv[1],sys.argv[1].split("/")[-1],((medA-medX)/medA*2)*100,medA,medX

so:

  1. full path to input file
  2. input filename (easier to match sample IDs sometimes)
  3. fetal fraction (as calculated from the next 2)
  4. median reads per bin on autosomal chromosomes
  5. median reads per bin on chromosome X

The calculation basically compares chrX coverage with autosomal chromosome coverages and the difference would indicate the fetal fraction. However, that only works if the training samples were all pregnancies with a male fetus (can't spot a difference on chrX coverage otherwise).
In your case is appears it didn't find any coverage on chrX at all, suggesting the script didn't find any data mapped to chrX possibly due to chrX not existing in your reference genome or having a different name, often due to my script assuming a chrX or just X format where others went with the other option.

Perhaps you can share the header of the .bam file so we can check what reference sequence names you have?

If this step is resolved, odds are you need to make some edits on the table before feeding it into the next script but first this "missing X" issue should be resolved.

@sneha1201
Copy link
Author

after resolving all the issues and running getReFF.sh script . i got the below result .

./GM01_3C.bam GM01_3C.bam 200.0 23.0 0.0 ./GM01_7C.bam GM01_7C.bam 200.0 72.0 0.0 ./GM01_16C.bam GM01_16C.bam 200.0 59.0 0.0 ./GM01_19C.bam GM01_19C.bam 200.0 45.0 0.0 ./GM01_1C.bam GM01_1C.bam 200.0 37.0 0.0 ./GM01_14C.bam GM01_14C.bam 200.0 33.0 0.0 ./GM01_18C.bam GM01_18C.bam 200.0 76.0 0.0 ./GM01_12C.bam GM01_12C.bam 200.0 49.0 0.0 ./GM01_6C.bam GM01_6C.bam 200.0 53.0 0.0 ./GM01_17C.bam GM01_17C.bam 200.0 44.0 0.0 ./GM01_13C.bam GM01_13C.bam 200.0 40.0 0.0 ./GM01_9C.bam GM01_9C.bam 200.0 33.0 0.0 ./GM01_2C.bam GM01_2C.bam 200.0 29.0 0.0 ./GM01_15C.bam GM01_15C.bam 200.0 30.0 0.0 ./GM01_4C.bam GM01_4C.bam 200.0 128.0 0.0 ./GM01_21C.bam GM01_21C.bam 200.0 114.0 0.0 ./GM01_8C.bam GM01_8C.bam 200.0 64.0 0.0 ./GM01_5C.bam GM01_5C.bam 200.0 35.0 0.0 ./GM01_23C.bam GM01_23C.bam 200.0 40.0 0.0 ./GM01_20C.bam GM01_20C.bam 200.0 130.0 0.0 ./GM01_22C.bam GM01_22C.bam 200.0 52.0 0.0 ./GM01_10C.bam GM01_10C.bam 200.0 63.0 0.0

will you please help me in understanding this result. what are these values for.

but i got this result after running getREFF script

@rstraver
Copy link
Collaborator

I believe you, but the values in the output suggests something went wrong and my first guess is a mismatch between the refseq names assumed by the script and what's actually in your data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants