Skip to content

samtools pileup hell

Isaac Overcast edited this page Dec 9, 2015 · 5 revisions

Just a place to take notes on this fucked up problem. Samtools will output a 'pileup' format that looks like this:

    1       3008421 A       3       ^W.^/,^/,       B;8
    1       3008422 G       4       ..,,    ?F;8
    1       3008423 A       4       ..,,    AF;8
    1       3008424 A       11      ..,,^J.^W.^W.^W.^W.^W.^J.       DE;8EEEEEEE
    1       3008425 G       11      ..,,.......     HF;8HHHHHHH
    ....
    1       3008463 T       7       .,,.... AGB8A8>
    1       3008464 G       6       .,,.$..$        3GB8<8
    1       3008465 T       4       ,,..    :@8>
    1       3008466 T       4       ,$,$.$.$        7<85

Where '^' indicates the start of a new sequence and '$' indicates the termination of a sequence. Samtools Docs have more info if you want to know what the rest means. Anyway, I'm having this problem where stacks of reads aren't all getting terminating symbols generated for them in the pileup, which makes it REALLY hard to translate a pileup back into a stack of sequences.

Here's a line from the sam file. If you see the sixth field is the 'cigar string' which describes how the sequence aligned. 46M6S means 46 matches and 6 soft-clipped bases (non-matches). 40M means 40 matches, so perfect match.

    551_2028_941_F3Lane2    0       1       3008420 29      1S35M   *       0       0       AGAGAAAGTTCCATGGGGTGCTGAGAAGAAGGTATA    GHGDHGHHHHHHGGGGHFFFDDBBB@@@<<<;;;;8    NM:i:1  AS:i:32
    337_814_398_F3Lane3     0       1       3008421 54      2S46M8S *       0       0       AAAGAAGGTTCCATGGGGTGCTGAGAAGAAGGTATATTCTTTTGTGTTAGTTTTGG        HHHHHHHHHHHHHHHHHGGGGHHHHHHGEGFF@@@;BBBDDFDBDAEA>A@@@>>>        NM:i:0  AS:i:46
    238_462_1097_F3Lane4    0       1       3008421 54      40M     *       0       0       AGAAGCTTCCATGGGGTGCTGAGAAGAAGGTATATTCTTT        BFFEFFHHHHHHGGGHHEEEBABBBBGGGGGGEAEEB>>>        NM:i:1  AS:i:37
    100_1166_812_F3Lane3    16      1       3008421 14      46M6S   *       0       0       AGAAGGTTCCATGGGGTGCTGAGAAGAAGGTATATTCTTTTGTGTTAGGATG    ;;;;;;8888BBEEEHAEEEFFFFFHFGGGHFGGGHFGGGHGGGGHFGGGHE    NM:i:0  AS:i:46

Now, if you look at the transition in the pileup from the 4th to the 3rd to last lines:

    1       3008463 T       7       .,,.... AGB8A8>
    1       3008464 G       6       .,,.$..$        3GB8<8

You see something conspicuous. The number of aligned reads goes from 7 to 6, but there is no terminating $ in the sequence string on the top line. WTF?!?!?

My hunch is that mpileup doesn't know what to do if a read terminates in a soft-clipped region, so it behaves non-deterministically.

Fixed. samtools mpileup -Q 0, sites with base quality lower than 13 are filtered by default. If you get unlucky and your start/end token are filtered then that's your tough luck. Set to zero to see all bases.

Clone this wiki locally