LongFUSE is a toolkit to identify fusion gene with single-cell Nanopore long-read RNA-sequencing data. LongFUSE can also detect potential isoforms of the two fusion transcript fragments. LongFUSE breaks down chimeric reads to prevent false positives results, computes fusion genes with speedy exclusive OR (XOR) logic gate computing, identifies break points, and detects isoforms of fusion transcript.
Fusion gene, Isoform, Single cell, Nanopore, RNA sequencing, Long read
The LongFUSE pipeline is built with python3. We recommend users to use anaconda/miniconda virtual environment to install it. Refer to Anaconda turorial for environment building.
- Example codes for creating and activating python3 environment on Linux-based OS:
conda create -n LongFUSE python=3 numpy scipy source activate LongFUSE
-
The LongFUSE requires the following dependencies to work:
- biopython 1.80 - h5py 3.8.0 - pandas 2.1.4 - pyranges 0.1.1 - pysam 0.19.0 -
Example codes for obtaining LongFUSE from GitHub and installation of dependencies:
git clone https://github.com/gaolabtools/LongFUSE/ cd LongFUSE pip3 install -r requirements.txt
-
Reference genome
Users can obtain reference genome from NCBI, Ensembl, or any other autoritieswget https://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -
Gene annotation (GTF/GFF)
Users can obtain reference gene annotations from NCBI, Ensembl, or any other autoritieswget https://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.gtf.gz- Note: Many tools cannot use compressed GTF file. Please try to gunzip compress GTF.gz beforehand.
-
Index reference genome for minimap2
Prepare indexed genome for minimap2 to boost mapping. Refer to the Minimap2 instruction.- Example code:
minimap2 -x map-ont -d example/GRCh38_chr22.mmi example/GRCh38_chr22.fa.gz
- Example code:
This script is used to extract high-softclipping reads from BAM file(s). The result high-softclipping BAM file is essential input for LongFUSE. You can do first round read mapping with the prepared genome mentioned above to generate BAM file(s).
- Manual of extract_hs.py
Usage: extract_hs.py [options]
Options:
-h, --help show this help message and exit
-i I_DIR Folder name for input bam file(s). Default: input_dir
-o O_DIR Output directory name. Default: longFuse_res
-t NCORES Number of cores for computing. Default: 1
--log=LOG_F_NAME Log file name. Default: extract_hs.log.txt
--inc_bed=INC_BED Include specific regions (BED) in file. Default: None
--exc_bed=EXC_BED Exclude specific regions (BED) in file. Default: None
--softclipping_thr=SOFTCLIPPING_THR
Threshold for softclipping. Default: 0.8
--samtools=SAMTOOLS Path to samtools. Default: samtools
--bam_suf=BAM_SUF Suffix of bam files. Default: .bam
--hs_bam_suf=HS_BAM_SUF
Suffix of bam files containing high-softclipping
reads. Default: .high_softclipping.bam
The main script longFuse.py is used to handle everything.
- Manual of longFuse.py
Usage: longFuse.py [options]
Options:
-h, --help show this help message and exit
-i I_DIR Temporary folder name. Default: fq_source
-o O_DIR Output directory name. Default: longFuse_res
-t NCORES Number of cores for program running. Default: 1
--ref_genome=REF_GENOME
* Required ! File for reference genome.
--idx_genome=IDX_GENOME
Path to the Minimap2 genome index. Program will use
reference genome if no Minimap2 genome index given.
Default: None
--gtf=GTF * Required ! GTF file for expression calling.
--gig=GIG List of gene-inside-gene. Default: None
Parameters for filtering:
Caution! Change parameters here could give different result.
--xor_cov_thr=XOR_COV_THR
Threshold for alignment coverage of XOR logic gate.
Default: 0.7
--bp_dis_thr=BP_DIS_THR
Threshold for range of breakpoint. Default: 30
--min_exon_cov=MIN_EXON_COV
Minimal exon coverage for exon calling. Default: 0.9
--min_cell_no=MIN_CELL_NO
Minimal cell number to support fusion events. Set to 0
to turn off this parameter. Default: 3
--allow_nonchr=ALLOW_NONCHR
Report any fusion gene from contig or scaffold.
Default: None
--allow_chrM=ALLOW_CHRM
Report any fusion gene from chrM. Default: None
--allow_nc=ALLOW_NC
Report fusion events from non-coding genes. Default:
None
--softclipping_thr=SOFTCLIPPING_THR
Threshold for softclipping. Default: 0.8
--m_conf_score=M_CONF_SCORE
Confident score threshold for multi-alignments.
Default: 0.5
--cell_freq=CELL_FREQ
Minimal cellular frequency to support fusion events.
Set to 0 to turn off this parameter. Default: None
--adapt_gp_no=ADAPT_GP_NO
Adaptive cell number gating by gene pair number.
Default: 500
--strand=STRAND Strandedness of reads. (first / second / all) Default:
first
--a5=ADAPTOR_FIVE_P
Sequence of 5'-adaptor. Default:
AAGCAGTGGTATCAACGCAGAGTACAT
--a3=ADAPTOR_THREE_P
Sequence of 3'-adaptor. Default:
CTACACGACGCTCTTCCGATCT
--pT=POLYT Reverse-complement sequence of polyA. Default:
TTTTTTTTTTTT
--min_read_length=MIN_READ_LENGTH
Minimal read length. Default: 200
--penalty_matching=DP_MA
Dynamic programming matching penalty. Default: 2
--penalty_mismatching=DP_MI
Dynamic programming mismatching penalty. Default: -3
--penalty_gap_opening=DP_GO
Dynamic programming gap opening penalty. Default: -5
--penalty_gap_extention=DP_GE
Dynamic programming gap extention penalty. Default: -2
--alignment_threshold=ALN_THR
Threshold for aligning adaptor. Default: 0.4
Miscellaneous parameters:
Parameters for file path, prefixes or suffixes.
--o_name=O_NAME Read count with break points table name. Default:
matrix_fusion.tsv.gz
--iso_name=ISO_NAME
Read count with isoform table name. Default:
matrix_fusion.isoform.tsv.gz
--gp_name=GP_NAME Gene pairs counting table name. Default:
matrix_fusion.gene_pairs.tsv.gz
--mr_name=MR_NAME Table for fusion gene supporting reads. Default:
fusion_reads.tsv.gz
--fusion_b=FUSION_B
Suffix of longFuse input BAM. Default:
.high_softclipping.bam
--single_suf=SINGLE_SUF
Suffix of non-chimeric singleton read files. Will
automatically append fastq sam bam during processing.
Default: .singleton
--fusion_o=FUSION_O
Suffix of longFuse output table. Default:
.fusion.tsv.gz
--fusion_l=FUSION_L
Suffix of longFuse log file. Default: .fusion.log
--read_o=READ_O FastA output of fusion reads. Default: .fusion.fasta
--CB_file=CB_FILE File name for filtered barcode list. Default:
filtered_barcode_list.txt
--CB_list=CB_LIST File name for merged cell barcodes. Default:
CB_merged_list.tsv.gz
--CB_count=CB_COUNT
Cell barcode counting file name. Default:
CB_counting.tsv.gz
--CB_count_rows=CB_COUNT_ROWS
Read first rows of cell barcode counting file.
Default: 20000
--log=LOG_F_NAME Log file name. Default: reporter_fusion.log.txt
--count_log=COUNT_LOG
Counting log file name. Default:
reporter_fusion.counting_log.txt
--minimap2=MINIMAP2
Path to minimap2. Default: minimap2
--samtools=SAMTOOLS
Path to samtools. Default: samtools
--bam_suf=BAM_SUF Suffix of bam files. Default: .bam
--hs_bam_suf=HS_BAM_SUF
Suffix of bam files containing high-softclipping
reads. Default: .high_softclipping.bam
Please refer to the example script "run_longFuse.sh" for your convenience. In the example script, the longFuse directory is located at ~/data/LongFUSE, and the reference genome is located at ~/genome/refdata-gex-GRCh38-2024-A. The script asks for 20 cores to run the task, and leaves all the rest parameters by default.
P_DIR="~/data/LongFUSE"
REF_GENOME="~/genome/refdata-gex-GRCh38-2024-A/fasta/genome.fa"
IDX_GENOME="~/genome/refdata-gex-GRCh38-2024-A/fasta/refdata-gex-GRCh38-2024-A.mmi"
GTF="~/genome/refdata-gex-GRCh38-2024-A/genes/genes.gtf"
GIG="~/genome/exc_gene_list.tsv"
ncores=20
python3 ${P_DIR}/extract_hs.py -t $ncores -i tmp &> run_extract_hs.py
python3 ${P_DIR}/longFuse.py -t $ncores -i tmp --ref_genome ${REF_GENOME} --idx_genome ${IDX_GENOME} --gtf ${GTF} --gig ${GIG} &> run_longFuse.log.txt
Make sure to update the path inside the script with any text editor you like, then run the script with the following command.
sh run_longFuse.sh
The LongFUSE generate two single cell matrices and one tabular file as results.
- Single cell gene expression matrix: matrix_fusion.gene_pairs.tsv.gz
This file documents single cell gene expression matrix. Each row contains one pairs of fusion gene. The first 4 columns contains gene ID 1, gene ID 2, gene name 1, and gene name 2. The cell barcodes are started from the fifth column all the way to the end. Here's an example result:
gene1 gene2 gene_name_1 gene_name_2 TCTTCCTTCATAGAGA CCACCATAGACATAAC CATCGCTAGAAGCTCG CCTCCAAGTTCCCAAA
ENSG00000167034 ENSG00000167751 NKX3-1 KLK2 2 0 0 0
ENSG00000112561 ENSG00000166897 TFEB ELFN2 0 1 0 0
ENSG00000169710 ENSG00000167751 FASN KLK2 0 0 1 0
ENSG00000184012 ENSG00000157554 TMPRSS2 ERG 0 0 0 2
- Single cell isoform matrix
This file documents single cell isoform matrix. This table exhaustively lists all the potential isoform combinations. Each row contains one pairs of fusion gene under one potential isoform. The first 4 columns contain gene ID 1, gene ID 2, gene name 1, and gene name 2. The fifth and sixth columns contain potential transcript ID 1 and transcript ID 2. The seventh columns documents all the covered exon IDs. Any exon partially covered by the read is labeled with a suffix "p". The cell barcodes are started from the eighth column all the way to the end. Here's an example result:
gene1 gene2 gene_name_1 gene_name_2 tx1 tx2 exons TCTTCCTTCATAGAGA CCACCATAGACATAAC CATCGCTAGAAGCTCG CCTCCAAGTTCCCAAA
ENSG00000184012 ENSG00000157554 TMPRSS2 ERG ENST00000332149 ENST00000288319 ENSE00003508916_ENSE00003536380_ENSE00003712731_ENSE00001296629p 0 0 0 2
ENSG00000184012 ENSG00000157554 TMPRSS2 ERG ENST00000677680 ENST00000288319 ENSE00003508916_ENSE00003536380_ENSE00003712731_ENSE00001296629p 0 0 0 2
ENSG00000184012 ENSG00000157554 TMPRSS2 ERG ENST00000678171 ENST00000288319 ENSE00003508916_ENSE00003536380_ENSE00003712731_ENSE00001296629p 0 0 0 2
ENSG00000184012 ENSG00000157554 TMPRSS2 ERG ENST00000678743 ENST00000288319 ENSE00003508916_ENSE00003536380_ENSE00003712731_ENSE00001296629p 0 0 0 2
- Transcript level tabular form This table records all the information regarding to the read. Each one row contains one read. The columns contain information about cell barcode, gene ID 1, gene name 1, fusion break point 1, gene ID 2, gene name 2, fusion break point 2, and read ID. Here's an example result:
CCTCCAAGTTCCCAAA ENSG00000143797 MBOAT2 chr2:8858402 ENSG00000164506 STXBP5 chr6:147373631 c03961f4-1878-416e-a569-0c4af71a6930_GCTTGTAGCAAT
CCTCCAAGTTCCCAAA ENSG00000157554 ERG chr21:38445624 ENSG00000184012 TMPRSS2 chr21:41507948 6a3936f6-41ac-4f83-9fb0-e217e366b45e_CAGGGTTCTACC
CCTCCAAGTTCCCAAA ENSG00000157554 ERG chr21:38445624 ENSG00000184012 TMPRSS2 chr21:41507948 b2f920ae-ad5c-444f-b093-6c3c6c7bedbc_CAGGGTTCTACC
CCTCCAAGTTCCCAAA ENSG00000160216 AGPAT3 chr21:43984388 ENSG00000168237 GLYCTK chr3:52287827 aaa0292d-00fd-4ddb-9fa7-84589c4b77f2_TGGGCCTTCCAT
- Status counting table This counting table documents the read number after step-by-step filters. Here's an example result:
CB Raw_HS_reads Non_chimeric_reads Chimeric_reads_bi_segments Chimeric_reads_tri_segments Chimeric_reads_other Rescued_reads Reads_with_true_CB Reads_from_ambient Singleton_reads HS_reads Reads_pass_prelim_filtering Reads_concordant Reads_pass_XOR Reads_aftering_cellular_filtering
CATACTTCACGTTGGC 5866 2738 2079 237 812 2553 6545 1062 6452 3076 3075 2169 33 2
CCGTAGGAGTATTGCC 1860 835 626 50 349 726 1942 295 1918 947 947 631 25 0
AGTCATGCACCCAACG 3160 1486 1151 128 395 1407 3583 589 3508 1667 1667 1080 28 1
TGTTACTAGTAGACCG 1899 849 687 96 267 879 2156 355 2053 906 903 677 28 0
Explanation of the columns:
- CB: cell barcode
- Raw_HS_reads: Raw high-softclipping read number
- Non_chimeric_reads: The number of non-chimeric reads
- Chimeric_reads_bi_segments: The number of chimeric reads containing two singleton reads
- Chimeric_reads_tri_segments: The number of chimeric reads containing three singleton reads
- Chimeric_reads_other: The number of chimeric reads containing discordant read architecture
- Rescued_reads: The number of singleton reads
- Reads_with_true_CB: The number of singleton reads having true cell barcode
- Reads_from_ambient: The number of singleton reads not having true cell barcode
- Singleton_reads: The number of singleton reads having no high-softclipping alignment
- HS_reads: The number of singleton reads having high-softclipping alignment
- Reads_pass_prelim_filtering: The number of singleton high-softclipping reads passed pre-filtering
- Reads_concordant: The number of singleton high-softclipping reads having concordant alignment
- Reads_pass_XOR: The number of singleton high-softclipping reads passed XOR, i.e. fusion transcript candidates
- Reads_aftering_cellular_filtering: The number of fusion transcript candidates passed cellular filtering