Skip to content

gaolabtools/LongFUSE

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LongFUSE

LongFUSE is a toolkit to identify fusion gene with single-cell Nanopore long-read RNA-sequencing data. LongFUSE can also detect potential isoforms of the two fusion transcript fragments. LongFUSE breaks down chimeric reads to prevent false positives results, computes fusion genes with speedy exclusive OR (XOR) logic gate computing, identifies break points, and detects isoforms of fusion transcript.

Keywords

Fusion gene, Isoform, Single cell, Nanopore, RNA sequencing, Long read

Installation

The LongFUSE pipeline is built with python3. We recommend users to use anaconda/miniconda virtual environment to install it. Refer to Anaconda turorial for environment building.

Build python3 virtual environment

  • Example codes for creating and activating python3 environment on Linux-based OS:
    conda create -n LongFUSE python=3 numpy scipy
    source activate LongFUSE
    

Install LongFUSE and dependencies

  • The LongFUSE requires the following dependencies to work:

    - biopython 1.80
    - h5py 3.8.0
    - pandas 2.1.4
    - pyranges 0.1.1
    - pysam 0.19.0
    
  • Example codes for obtaining LongFUSE from GitHub and installation of dependencies:

    git clone https://github.com/gaolabtools/LongFUSE/
    cd LongFUSE
    pip3 install -r requirements.txt
    

Prepare reference genome and annotations

  • Reference genome
    Users can obtain reference genome from NCBI, Ensembl, or any other autorities

    wget https://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
    
  • Gene annotation (GTF/GFF)
    Users can obtain reference gene annotations from NCBI, Ensembl, or any other autorities

    wget https://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.gtf.gz
    
    • Note: Many tools cannot use compressed GTF file. Please try to gunzip compress GTF.gz beforehand.
  • Index reference genome for minimap2
    Prepare indexed genome for minimap2 to boost mapping. Refer to the Minimap2 instruction.

    • Example code:
      minimap2 -x map-ont -d example/GRCh38_chr22.mmi example/GRCh38_chr22.fa.gz
      

Extract high-softclipping result if needed

This script is used to extract high-softclipping reads from BAM file(s). The result high-softclipping BAM file is essential input for LongFUSE. You can do first round read mapping with the prepared genome mentioned above to generate BAM file(s).

  • Manual of extract_hs.py
Usage: extract_hs.py [options]

Options:
-h, --help            show this help message and exit
-i I_DIR              Folder name for input bam file(s). Default: input_dir
-o O_DIR              Output directory name. Default: longFuse_res
-t NCORES             Number of cores for computing. Default: 1
--log=LOG_F_NAME      Log file name. Default: extract_hs.log.txt
--inc_bed=INC_BED     Include specific regions (BED) in file. Default: None
--exc_bed=EXC_BED     Exclude specific regions (BED) in file. Default: None
--softclipping_thr=SOFTCLIPPING_THR
                      Threshold for softclipping. Default: 0.8
--samtools=SAMTOOLS   Path to samtools. Default: samtools
--bam_suf=BAM_SUF     Suffix of bam files. Default: .bam
--hs_bam_suf=HS_BAM_SUF
                      Suffix of bam files containing high-softclipping
                      reads. Default: .high_softclipping.bam

Using LongFUSE

The main script longFuse.py is used to handle everything.

  • Manual of longFuse.py
Usage: longFuse.py [options]

Options:
-h, --help            show this help message and exit
-i I_DIR              Temporary folder name. Default: fq_source
-o O_DIR              Output directory name. Default: longFuse_res
-t NCORES             Number of cores for program running. Default: 1
--ref_genome=REF_GENOME
                      * Required ! File for reference genome.
--idx_genome=IDX_GENOME
                      Path to the Minimap2 genome index. Program will use
                      reference genome if no Minimap2 genome index given.
                      Default: None
--gtf=GTF             * Required ! GTF file for expression calling.
--gig=GIG             List of gene-inside-gene. Default: None

Parameters for filtering:
  Caution! Change parameters here could give different result.

  --xor_cov_thr=XOR_COV_THR
                      Threshold for alignment coverage of XOR logic gate.
                      Default: 0.7
  --bp_dis_thr=BP_DIS_THR
                      Threshold for range of breakpoint. Default: 30
  --min_exon_cov=MIN_EXON_COV
                      Minimal exon coverage for exon calling. Default: 0.9   
  --min_cell_no=MIN_CELL_NO
                      Minimal cell number to support fusion events. Set to 0 
                      to turn off this parameter. Default: 3
  --allow_nonchr=ALLOW_NONCHR
                      Report any fusion gene from contig or scaffold.
                      Default: None
  --allow_chrM=ALLOW_CHRM
                      Report any fusion gene from chrM. Default: None
  --allow_nc=ALLOW_NC
                      Report fusion events from non-coding genes. Default:   
                      None
  --softclipping_thr=SOFTCLIPPING_THR
                      Threshold for softclipping. Default: 0.8
  --m_conf_score=M_CONF_SCORE
                      Confident score threshold for multi-alignments.
                      Default: 0.5
  --cell_freq=CELL_FREQ
                      Minimal cellular frequency to support fusion events.   
                      Set to 0 to turn off this parameter. Default: None
  --adapt_gp_no=ADAPT_GP_NO
                      Adaptive cell number gating by gene pair number.
                      Default: 500
  --strand=STRAND     Strandedness of reads. (first / second / all) Default: 
                      first
  --a5=ADAPTOR_FIVE_P
                      Sequence of 5'-adaptor. Default:
                      AAGCAGTGGTATCAACGCAGAGTACAT
  --a3=ADAPTOR_THREE_P
                      Sequence of 3'-adaptor. Default:
                      CTACACGACGCTCTTCCGATCT
  --pT=POLYT          Reverse-complement sequence of polyA. Default:
                      TTTTTTTTTTTT
  --min_read_length=MIN_READ_LENGTH
                      Minimal read length. Default: 200
  --penalty_matching=DP_MA
                      Dynamic programming matching penalty. Default: 2
  --penalty_mismatching=DP_MI
                      Dynamic programming mismatching penalty. Default: -3   
  --penalty_gap_opening=DP_GO
                      Dynamic programming gap opening penalty. Default: -5   
  --penalty_gap_extention=DP_GE
                      Dynamic programming gap extention penalty. Default: -2 
  --alignment_threshold=ALN_THR
                      Threshold for aligning adaptor. Default: 0.4

Miscellaneous parameters:
  Parameters for file path, prefixes or suffixes.

  --o_name=O_NAME     Read count with break points table name. Default:
                      matrix_fusion.tsv.gz
  --iso_name=ISO_NAME
                      Read count with isoform table name. Default:
                      matrix_fusion.isoform.tsv.gz
  --gp_name=GP_NAME   Gene pairs counting table name. Default:
                      matrix_fusion.gene_pairs.tsv.gz
  --mr_name=MR_NAME   Table for fusion gene supporting reads. Default:
                      fusion_reads.tsv.gz
  --fusion_b=FUSION_B
                      Suffix of longFuse input BAM. Default:
                      .high_softclipping.bam
  --single_suf=SINGLE_SUF
                      Suffix of non-chimeric singleton read files. Will
                      automatically append fastq sam bam during processing.  
                      Default: .singleton
  --fusion_o=FUSION_O
                      Suffix of longFuse output table. Default:
                      .fusion.tsv.gz
  --fusion_l=FUSION_L
                      Suffix of longFuse log file. Default: .fusion.log
  --read_o=READ_O     FastA output of fusion reads. Default: .fusion.fasta   
  --CB_file=CB_FILE   File name for filtered barcode list. Default:
                      filtered_barcode_list.txt
  --CB_list=CB_LIST   File name for merged cell barcodes. Default:
                      CB_merged_list.tsv.gz
  --CB_count=CB_COUNT
                      Cell barcode counting file name. Default:
                      CB_counting.tsv.gz
  --CB_count_rows=CB_COUNT_ROWS
                      Read first rows of cell barcode counting file.
                      Default: 20000
  --log=LOG_F_NAME    Log file name. Default: reporter_fusion.log.txt
  --count_log=COUNT_LOG
                      Counting log file name. Default:
                      reporter_fusion.counting_log.txt
  --minimap2=MINIMAP2
                      Path to minimap2. Default: minimap2
  --samtools=SAMTOOLS
                      Path to samtools. Default: samtools
  --bam_suf=BAM_SUF   Suffix of bam files. Default: .bam
  --hs_bam_suf=HS_BAM_SUF
                      Suffix of bam files containing high-softclipping
                      reads. Default: .high_softclipping.bam

Please refer to the example script "run_longFuse.sh" for your convenience. In the example script, the longFuse directory is located at ~/data/LongFUSE, and the reference genome is located at ~/genome/refdata-gex-GRCh38-2024-A. The script asks for 20 cores to run the task, and leaves all the rest parameters by default.

P_DIR="~/data/LongFUSE"
REF_GENOME="~/genome/refdata-gex-GRCh38-2024-A/fasta/genome.fa"
IDX_GENOME="~/genome/refdata-gex-GRCh38-2024-A/fasta/refdata-gex-GRCh38-2024-A.mmi"
GTF="~/genome/refdata-gex-GRCh38-2024-A/genes/genes.gtf"
GIG="~/genome/exc_gene_list.tsv"
ncores=20

python3 ${P_DIR}/extract_hs.py -t $ncores -i tmp &> run_extract_hs.py
python3 ${P_DIR}/longFuse.py -t $ncores -i tmp --ref_genome ${REF_GENOME} --idx_genome ${IDX_GENOME} --gtf ${GTF} --gig ${GIG} &> run_longFuse.log.txt

Make sure to update the path inside the script with any text editor you like, then run the script with the following command.

sh run_longFuse.sh

Results of LongFUSE:

The LongFUSE generate two single cell matrices and one tabular file as results.

  1. Single cell gene expression matrix: matrix_fusion.gene_pairs.tsv.gz
    This file documents single cell gene expression matrix. Each row contains one pairs of fusion gene. The first 4 columns contains gene ID 1, gene ID 2, gene name 1, and gene name 2. The cell barcodes are started from the fifth column all the way to the end. Here's an example result:
gene1	gene2	gene_name_1	gene_name_2	TCTTCCTTCATAGAGA	CCACCATAGACATAAC	CATCGCTAGAAGCTCG	CCTCCAAGTTCCCAAA
ENSG00000167034	ENSG00000167751	NKX3-1	KLK2	2	0	0	0
ENSG00000112561	ENSG00000166897	TFEB	ELFN2	0	1	0	0
ENSG00000169710	ENSG00000167751	FASN	KLK2	0	0	1	0
ENSG00000184012	ENSG00000157554	TMPRSS2	ERG	0	0	0	2
  1. Single cell isoform matrix
    This file documents single cell isoform matrix. This table exhaustively lists all the potential isoform combinations. Each row contains one pairs of fusion gene under one potential isoform. The first 4 columns contain gene ID 1, gene ID 2, gene name 1, and gene name 2. The fifth and sixth columns contain potential transcript ID 1 and transcript ID 2. The seventh columns documents all the covered exon IDs. Any exon partially covered by the read is labeled with a suffix "p". The cell barcodes are started from the eighth column all the way to the end. Here's an example result:
gene1	gene2	gene_name_1	gene_name_2	tx1	tx2	exons	TCTTCCTTCATAGAGA	CCACCATAGACATAAC	CATCGCTAGAAGCTCG	CCTCCAAGTTCCCAAA
ENSG00000184012	ENSG00000157554	TMPRSS2	ERG	ENST00000332149	ENST00000288319	ENSE00003508916_ENSE00003536380_ENSE00003712731_ENSE00001296629p	0	0	0	2
ENSG00000184012	ENSG00000157554	TMPRSS2	ERG	ENST00000677680	ENST00000288319	ENSE00003508916_ENSE00003536380_ENSE00003712731_ENSE00001296629p	0	0	0	2
ENSG00000184012	ENSG00000157554	TMPRSS2	ERG	ENST00000678171	ENST00000288319	ENSE00003508916_ENSE00003536380_ENSE00003712731_ENSE00001296629p	0	0	0	2
ENSG00000184012	ENSG00000157554	TMPRSS2	ERG	ENST00000678743	ENST00000288319	ENSE00003508916_ENSE00003536380_ENSE00003712731_ENSE00001296629p	0	0	0	2
  1. Transcript level tabular form This table records all the information regarding to the read. Each one row contains one read. The columns contain information about cell barcode, gene ID 1, gene name 1, fusion break point 1, gene ID 2, gene name 2, fusion break point 2, and read ID. Here's an example result:
CCTCCAAGTTCCCAAA	ENSG00000143797	MBOAT2	chr2:8858402	ENSG00000164506	STXBP5	chr6:147373631	c03961f4-1878-416e-a569-0c4af71a6930_GCTTGTAGCAAT
CCTCCAAGTTCCCAAA	ENSG00000157554	ERG	chr21:38445624	ENSG00000184012	TMPRSS2	chr21:41507948	6a3936f6-41ac-4f83-9fb0-e217e366b45e_CAGGGTTCTACC
CCTCCAAGTTCCCAAA	ENSG00000157554	ERG	chr21:38445624	ENSG00000184012	TMPRSS2	chr21:41507948	b2f920ae-ad5c-444f-b093-6c3c6c7bedbc_CAGGGTTCTACC
CCTCCAAGTTCCCAAA	ENSG00000160216	AGPAT3	chr21:43984388	ENSG00000168237	GLYCTK	chr3:52287827	aaa0292d-00fd-4ddb-9fa7-84589c4b77f2_TGGGCCTTCCAT
  1. Status counting table This counting table documents the read number after step-by-step filters. Here's an example result:
CB	Raw_HS_reads	Non_chimeric_reads	Chimeric_reads_bi_segments	Chimeric_reads_tri_segments	Chimeric_reads_other	Rescued_reads	Reads_with_true_CB	Reads_from_ambient	Singleton_reads	HS_reads	Reads_pass_prelim_filtering	Reads_concordant	Reads_pass_XOR	Reads_aftering_cellular_filtering
CATACTTCACGTTGGC	5866	2738	2079	237	812	2553	6545	1062	6452	3076	3075	2169	33	2
CCGTAGGAGTATTGCC	1860	835	626	50	349	726	1942	295	1918	947	947	631	25	0
AGTCATGCACCCAACG	3160	1486	1151	128	395	1407	3583	589	3508	1667	1667	1080	28	1
TGTTACTAGTAGACCG	1899	849	687	96	267	879	2156	355	2053	906	903	677	28	0

Explanation of the columns:

  • CB: cell barcode
  • Raw_HS_reads: Raw high-softclipping read number
  • Non_chimeric_reads: The number of non-chimeric reads
  • Chimeric_reads_bi_segments: The number of chimeric reads containing two singleton reads
  • Chimeric_reads_tri_segments: The number of chimeric reads containing three singleton reads
  • Chimeric_reads_other: The number of chimeric reads containing discordant read architecture
  • Rescued_reads: The number of singleton reads
  • Reads_with_true_CB: The number of singleton reads having true cell barcode
  • Reads_from_ambient: The number of singleton reads not having true cell barcode
  • Singleton_reads: The number of singleton reads having no high-softclipping alignment
  • HS_reads: The number of singleton reads having high-softclipping alignment
  • Reads_pass_prelim_filtering: The number of singleton high-softclipping reads passed pre-filtering
  • Reads_concordant: The number of singleton high-softclipping reads having concordant alignment
  • Reads_pass_XOR: The number of singleton high-softclipping reads passed XOR, i.e. fusion transcript candidates
  • Reads_aftering_cellular_filtering: The number of fusion transcript candidates passed cellular filtering

About

LongFUSE

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •