buildJunctions in mapping pipeline produces invalid file #157

Open
IanSudbery opened this issue Mar 23, 2022 · 1 comment

Comments

@IanSudbery
Contributor

The buildJunctions function in the mapping pipeline builds a bed4 file of junctions:
contig, start, end, strand

to pass to various mappers. It does this by taking a bundle of gffs from a transcript iterator, sorting them into start order, and then walking along, creating intervals from the end of one entry to the start of the next.

    for gffs in GTF.transcript_iterator(
            GTF.iterator(iotools.open_file(infile, "r"))):
        gffs.sort(key=lambda x: x.start)
        end = gffs[0].end
        for gff in gffs[1:]:
            # subtract one: these are not open/closed coordinates but
            # the 0-based coordinates
            # of first and last residue that are to be kept (i.e., within the
            # exon).
            outf.write("%s\t%i\t%i\t%s\n" %
                       (gff.contig, end - 1, gff.start, gff.strand))
            end = gff.end
            njunctions += 1
    outf.close()

Unfortunately, this ignores the fact that the gffs list contains more than exons: it also includes CDS and transcript entries. In particular, a transcript entry will often sort first by start, yet its end lies after the end of every other entry. The result is invalid entries whose start is at a higher coordinate than their end, and which do not describe a junction at all.
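
To make the failure concrete, here is a minimal, self-contained sketch that reproduces it and shows one possible fix (keeping only exon entries before sorting). It does not use the cgat `GTF` module: `Entry` and `junctions` are hypothetical stand-ins, and the coordinates are assumed to be 0-based, half-open as in the loop above. This is an illustration only, not necessarily what the PR does.

```python
from collections import namedtuple

# Hypothetical stand-in for a GTF entry; the real pipeline uses cgat GTF objects.
Entry = namedtuple("Entry", "contig feature start end strand")


def junctions(entries, exons_only=True):
    """Return (contig, last_exon_base, next_exon_start, strand) tuples.

    Assumes 0-based, half-open coordinates, so ``end - 1`` is the last
    base of the upstream entry, mirroring the original loop.
    """
    if exons_only:
        # Possible fix: drop transcript, CDS and any other non-exon features.
        entries = [e for e in entries if e.feature == "exon"]
    entries = sorted(entries, key=lambda e: e.start)
    result = []
    for upstream, downstream in zip(entries, entries[1:]):
        result.append((downstream.contig, upstream.end - 1,
                       downstream.start, downstream.strand))
    return result


# One made-up transcript with three exons, one CDS and a transcript entry.
transcript = [
    Entry("chr1", "transcript", 100, 500, "+"),
    Entry("chr1", "exon", 100, 200, "+"),
    Entry("chr1", "CDS", 150, 200, "+"),
    Entry("chr1", "exon", 300, 400, "+"),
    Entry("chr1", "exon", 450, 500, "+"),
]

unfiltered = junctions(transcript, exons_only=False)  # current behaviour
filtered = junctions(transcript, exons_only=True)     # exons only
```

With the made-up transcript above, the unfiltered run starts with an interval from 499 to 100, because the transcript entry sorts first and its end is used as the junction start. The filtered run yields only the two real exon-exon junctions.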

This file is used by mapReadsWithTopHat2 and mapReadsWithHisat. One assumes no one has used TopHat2 for years. However, it leads Hisat2 to produce an invalid BAM file. See:

DaehwanKimLab/hisat2#365
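
For anyone wanting to check whether an existing junctions file is affected, a quick scan for rows whose start is not below their end could look like the sketch below. The function name `invalid_junction_rows` is hypothetical, and the file is assumed to be the four-column, tab-separated layout described above.

```python
import io


def invalid_junction_rows(handle):
    """Yield (line_number, line) for rows whose start is not below end."""
    for number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        contig, start, end, strand = line.split("\t")[:4]
        if int(start) >= int(end):
            yield number, line


# Made-up example: the first row is inverted, the second is fine.
example = io.StringIO("chr1\t499\t100\t+\nchr1\t199\t300\t+\n")
bad = list(invalid_junction_rows(example))
```

On a real file, pass an open file handle instead of the `io.StringIO` example; any row reported here is an inverted or empty interval of the kind described in this issue.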

PR incoming.

@jscaber
Contributor

jscaber commented Mar 23, 2022

Thanks Ian!

On a related, but unaffected note:
This reminds me to make a note that the 2 pass STAR method is currently only partly implemented in the mapping pipeline.
The pipeline currently uses the junction file of each individual run. I have now corrected this so that the second pass uses a joint junctions file built from all junctions found in the first pass (with some minor filtering of low-confidence junctions built in). I will make a PR very soon.
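
As a rough illustration of the idea (not the code that will be in the PR), pooling first-pass junctions across samples might look like the sketch below. It assumes STAR's tab-separated `SJ.out.tab` layout, where the first four columns are contig, intron start, intron end and strand, and column 7 is the number of uniquely mapped reads crossing the junction. The function name `pool_junctions` and the `min_unique_reads` threshold are hypothetical; the actual filtering criteria may differ.

```python
import io
from collections import defaultdict


def pool_junctions(handles, min_unique_reads=2):
    """Merge junctions from several SJ.out.tab-style files.

    Keeps a junction when its uniquely mapped read count, summed over
    all samples, reaches ``min_unique_reads`` (an assumed threshold).
    """
    totals = defaultdict(int)
    for handle in handles:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9:
                continue  # skip malformed or truncated rows
            key = (fields[0], int(fields[1]), int(fields[2]), fields[3])
            totals[key] += int(fields[6])  # column 7: unique read count
    return sorted(key for key, count in totals.items()
                  if count >= min_unique_reads)


# Made-up first-pass output for two samples.
sample_a = io.StringIO(
    "chr1\t201\t300\t1\t1\t0\t1\t0\t20\n"
    "chr1\t401\t450\t1\t1\t0\t5\t0\t30\n"
)
sample_b = io.StringIO(
    "chr1\t201\t300\t1\t1\t0\t1\t0\t25\n"
    "chr2\t1000\t2000\t2\t2\t0\t1\t3\t12\n"
)

joint = pool_junctions([sample_a, sample_b], min_unique_reads=2)
```

In this made-up example the chr1 junction at 201-300 has only one unique read per sample but survives because the counts are summed, while the chr2 junction is dropped with a single unique read overall.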
