PLEASE NOTE: Snaptron by Wilks et al. is a query tool for making sense of splicing across thousands of RNA-seq samples. It subsumes intropolis v1. If you are looking for the raw data behind Snaptron, see http://snaptron.cs.jhu.edu/data/. In particular, the SQLite files comprising an "intropolis v2" compilation spanning ~50,000 RNA-seq samples on SRA are available at http://snaptron.cs.jhu.edu/data/srav2/.
intropolis is a list of exon-exon junctions found across 21,504 human RNA-seq samples on the Sequence Read Archive (SRA), obtained by spliced read alignment to hg19 with Rail-RNA. Five files are provided:
A. intropolis.v1.hg19.tsv.gz : a 6.6-GB gzipped TSV (18.3 GB uncompressed) with fields
- chromosome
- intron start position (1-based; inclusive)
- intron end position (1-based; inclusive)
- strand (+ or -)
- donor dinucleotide (e.g., GT)
- acceptor dinucleotide (e.g., AG)
- comma-separated list of indexes of samples in which junction was found
- comma-separated list of corresponding numbers of reads mapping across junction in samples from field 7
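The field layout above can be read with a few lines of Python. The following is a minimal sketch, not part of the intropolis distribution: it assumes the file has no header line and that sample indexes are integers, and the names `Junction`, `parse_junction`, and `iter_junctions` are made up for illustration. The example line is invented, not taken from the real file.

```python
import gzip
from typing import Iterator, List, NamedTuple


class Junction(NamedTuple):
    chrom: str
    start: int                 # intron start, 1-based inclusive
    end: int                   # intron end, 1-based inclusive
    strand: str
    donor: str
    acceptor: str
    sample_indexes: List[int]  # field 7 (assumed to be integers)
    read_counts: List[int]     # field 8, parallel to sample_indexes


def parse_junction(line: str) -> Junction:
    """Parse one tab-separated line into a Junction."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, strand, donor, acceptor, samples, counts = fields[:8]
    sample_indexes = [int(s) for s in samples.split(",")]
    read_counts = [int(c) for c in counts.split(",")]
    if len(sample_indexes) != len(read_counts):
        raise ValueError("fields 7 and 8 differ in length")
    return Junction(chrom, int(start), int(end), strand, donor, acceptor,
                    sample_indexes, read_counts)


def iter_junctions(path: str) -> Iterator[Junction]:
    """Stream junctions from the gzipped TSV without decompressing to disk."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            yield parse_junction(line)


# Invented example line, for illustration only.
example = "chr1\t1000\t1999\t+\tGT\tAG\t3,17,42\t5,1,12"
junction = parse_junction(example)
total_reads = sum(junction.read_counts)
```

Streaming with `iter_junctions("intropolis.v1.hg19.tsv.gz")` avoids holding the 18.3 GB of uncompressed text in memory.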
B. intropolis.idmap.v1.hg19.tsv : a small TSV with fields
- sample index used in field 7 of intropolis.v1.hg19.tsv.gz
- SRA project accession number
- SRA sample accession number
- SRA experiment accession number
- SRA run accession number
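To translate the sample indexes in field 7 back to SRA accessions, the ID map can be loaded into a dictionary. This is a sketch under assumptions: it treats the file as headerless with integer indexes, and `SampleAccessions` and `load_idmap` are hypothetical names. The accessions in the example are placeholders, not real records.

```python
import csv
import io
from typing import Dict, Iterable, NamedTuple


class SampleAccessions(NamedTuple):
    project: str
    sample: str
    experiment: str
    run: str


def load_idmap(lines: Iterable[str]) -> Dict[int, SampleAccessions]:
    """Map each sample index to its four SRA accessions."""
    idmap = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # tolerate blank lines
        index, project, sample, experiment, run = row[:5]
        idmap[int(index)] = SampleAccessions(project, sample, experiment, run)
    return idmap


# Placeholder rows standing in for intropolis.idmap.v1.hg19.tsv.
fake_idmap = io.StringIO(
    "0\tSRP000001\tSRS000001\tSRX000001\tSRR000001\n"
    "1\tSRP000001\tSRS000002\tSRX000002\tSRR000002\n"
)
idmap = load_idmap(fake_idmap)

# Sample indexes as they might appear in field 7 of a junction line.
sample_indexes = [1, 0]
runs = [idmap[i].run for i in sample_indexes]
```

With the real file, pass an open handle instead, e.g. `load_idmap(open("intropolis.idmap.v1.hg19.tsv"))`.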
C. intropolis.v1.hg19.bed.gz : a gzipped BED-formatted version of intropolis.v1.hg19.tsv.gz with fields
- chromosome
- intron start position (0-based; inclusive)
- intron end position (0-based; exclusive)
- name (junction_[line number])
- score (always 1000)
- strand (+ or -)
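The TSV and BED files describe the same introns in two coordinate conventions: 1-based with both ends inclusive versus 0-based with an exclusive end. The sketch below shows the standard conversion between the two. It is not confirmed to be how the BED file was generated, the function names are hypothetical, and whether the line number in the name field counts from 0 or 1 is not stated here, so it is passed in as a parameter.

```python
from typing import Tuple


def tsv_to_bed_interval(start: int, end: int) -> Tuple[int, int]:
    """Convert 1-based inclusive [start, end] to 0-based half-open [start, end)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def bed_to_tsv_interval(bed_start: int, bed_end: int) -> Tuple[int, int]:
    """Convert 0-based half-open [start, end) to 1-based inclusive [start, end]."""
    if bed_start < 0 or bed_end <= bed_start:
        raise ValueError("expected 0 <= start < end")
    return bed_start + 1, bed_end


def bed_line(chrom: str, start: int, end: int, strand: str,
             line_number: int) -> str:
    """Build a six-field BED line from TSV-style (1-based inclusive) coordinates."""
    bed_start, bed_end = tsv_to_bed_interval(start, end)
    return "\t".join([chrom, str(bed_start), str(bed_end),
                      f"junction_{line_number}", "1000", strand])


# Invented intron covering 1-based positions 1000 through 1999 (1000 bases).
line = bed_line("chr1", 1000, 1999, "+", 1)
intron_length = 1999 - 1000 + 1
```

Only the start coordinate shifts: the end value is numerically identical in both conventions, and the intron length is `end - start + 1` in the TSV but `end - start` in BED.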
D. intropolis.v1.hg19.bb : a bigBed version of intropolis.v1.hg19.bed.gz
E. intropolis.v1.hg19_with_liftover_to_hg38.tsv.gz : a 6.87-GB gzipped TSV (18.3 GB uncompressed) with fields
- chromosome
- intron start position (1-based; inclusive)
- intron end position (1-based; inclusive)
- strand (+ or -)
- donor dinucleotide (e.g., GT)
- acceptor dinucleotide (e.g., AG)
- comma-separated list of indexes of samples in which junction was found
- comma-separated list of corresponding numbers of reads mapping across junction in samples from field 7
- chromosome from liftover to hg38 or NA if unavailable
- start position in liftover to hg38 or NA if unavailable
- end position in liftover to hg38 or NA if unavailable
- strand in liftover to hg38 or NA if unavailable
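The four extra hg38 fields can be read with explicit handling of unavailable liftovers. The sketch below assumes the literal string `NA` marks a missing value and treats a junction as unlifted if any of the four fields is `NA`; `LiftedJunction` and `parse_liftover` are hypothetical names, and the example lines are invented.

```python
from typing import NamedTuple, Optional, Sequence


class LiftedJunction(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str


def parse_liftover(fields: Sequence[str]) -> Optional[LiftedJunction]:
    """Return the hg38 coordinates from fields 9-12, or None if unavailable."""
    chrom, start, end, strand = fields[8:12]
    if "NA" in (chrom, start, end, strand):
        return None
    return LiftedJunction(chrom, int(start), int(end), strand)


# Invented 12-field lines: one with a liftover, one without.
mapped = "chr1\t1000\t1999\t+\tGT\tAG\t3,17\t5,1\tchr1\t1100\t2099\t+".split("\t")
unmapped = "chr1\t1000\t1999\t+\tGT\tAG\t3,17\t5,1\tNA\tNA\tNA\tNA".split("\t")

lifted = parse_liftover(mapped)
missing = parse_liftover(unmapped)
```

Returning `None` rather than a placeholder tuple makes it harder to mistake an unlifted junction for a real hg38 location downstream.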
(If the links above don't work for you, check out the backup on Figshare.)
Liftover of junctions to hg38 in intropolis.v1.hg19_with_liftover_to_hg38.tsv.gz was performed with the UCSC liftOver executable with command-line parameters -ends=2 -minMatch=1.0 and may be reproduced using this script together with intropolis.v1.hg19.tsv.gz.
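For orientation, a liftOver call with these parameters might be assembled as follows. This is only a sketch and may differ from the actual script: it assumes the usual liftOver argument order (input file, chain file, output file, unmapped file), and all four file names are placeholders. The command is built but not executed.

```python
import shlex
from typing import List


def liftover_command(input_bed: str, chain_file: str,
                     output_bed: str, unmapped_bed: str) -> List[str]:
    """Build a liftOver argument list using the parameters stated above."""
    return [
        "liftOver",
        "-ends=2",        # parameter quoted in the text above
        "-minMatch=1.0",  # parameter quoted in the text above
        input_bed,
        chain_file,
        output_bed,
        unmapped_bed,
    ]


# Placeholder file names, chosen for illustration only.
command = liftover_command(
    "junctions.hg19.bed",
    "hg19ToHg38.over.chain.gz",
    "junctions.hg38.bed",
    "junctions.unmapped.bed",
)
printable = shlex.join(command)
```

Passing `command` to `subprocess.run(command, check=True)` would run it, provided the liftOver executable is on the `PATH` and the input has been written in BED format.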
Metadata on SRA specifying, e.g., tissue and cell type is incomplete and does not use a controlled vocabulary. Some is available in this file derived from the fantastic SRAdb R package by Jack Zhu and Sean Davis. Still more metadata taken from Biosample is available in this file. But probably the best effort to infer metadata for SRA RNA-seq (with a controlled vocabulary for tissues!) is SHARQ, by Darya Filippova while in Carl Kingsford's group.
Expect new versions of intropolis spanning more samples as they are added to SRA. If you use intropolis, cite "Human splicing diversity across the Sequence Read Archive", by
- Abhi Nellore
- Andrew E. Jaffe
- Jean-Philippe Fortin
- José Alquicira-Hernández
- Leonardo Collado-Torres
- Siruo Wang
- Robert A. Phillips III
- Nishika Karbhari
- Kasper D. Hansen
- Ben Langmead
- Jeffrey T. Leek