Name		Name	Last commit message	Last commit date
parent directory ..
README.md		README.md
annotate_cluster.py		annotate_cluster.py
call_mutation.py		call_mutation.py
ccc.py		ccc.py
check_chimera.py		check_chimera.py
cluster_umi.sh		cluster_umi.sh
count_base.sh		count_base.sh
map_ccs_by_minimap2.sh		map_ccs_by_minimap2.sh
match_ccs.py		match_ccs.py
parse_feature.py		parse_feature.py
pick_control_seq.py		pick_control_seq.py
run_annotate_umi.sh		run_annotate_umi.sh
select_cluster.py		select_cluster.py
utils.py		utils.py

README.md

Single Molecule Sequecing Pipeline (smrt)

Adaptors Trimming

Polymerase reads of PacBio Sequel platform were trimmed by bazbam tool according to the adapter sequence.

baz2bam version: 6.0
adapter sequence: ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT

Circular Consensus Sequence Generation

CCS reads were generated by ccs tool in stranded mode:

ccs --min-passes=1 --min-length=1000 --max-length=10000 --by-strand --min-rq=0.66 -j=30 \
  --report-file=stranded_ccs.report stranded_ccs.bam

ccs version: 4.0.0
outputfile: stranded_ccs.bam

Align to Reference Sequence

Use minimap2 tool to align CCS reads with reference sequence (3kb readout with barcode and UMI sequence).

Prepare index file first

# generate index file
minimap2 -d ref.mmi ref.fa

Convert bam format into fastq format, then run minimap2 tool with optimized arguments,

minimap2 -O 2,12 -E 2,1 --score-N 0 --end-bonus 100 -a --MD -x map-pb ref.mmi output.bam

or take the map_ccs_by_minimap2.sh script in this repo.

Annotate Sequence Features

After alignment, features of each CCS reads were extracted based on their relative locations.

As the diagram below, a fully aligned sequence contains

a pairs of libary barcodes on both ends,
a pairs sample barcodes on the inner side of the library barcode,
an (Unique molecular indentifier) UMI on the 5' side,
and a biological sequence in the center of the read.

biological sequence is opposite to technical sequence, which is introduced by PCR or ligation reaction.

And the relative position of each feature is shown in the table below.

Feature	Start	End	Direction
libary barcode	1	16	=>
sample barcode	22	38	=>
UMI	58	71	=>
BioSeq	96	3038	=>
sample barcode	3063	3078	=>
libary barcode	3084	3099	<=

You can also take the parse_feature.py script in this repo to annotate sequence features.

Link Sample Barcode Sequence

BARCODE TABLE (<- click to show more)

Individual	Organ	Barcode_ID	Barcode_Sequence
Fly-1	Ey	BC02	ACACACGCGAGACAGA
Fly-1	Br	BC03	ACGCGCTATCTCAGAG
Fly-1	Vc	BC04	CACACGCGCGCTATAT
Fly-1	L1	BC05	ACACACTCTATCAGAT
Fly-1	L2	BC06	CTATACATAGTGATGT
Fly-1	L3	BC07	TATATATGTCTATAGA
Fly-1	Wg	BC08	GAGACTAGAGATAGTG
Fly-1	Hl	BC09	ATGTGTATATAGATAT
Fly-1	Mg	BC10	GCGCGCGCACTCTCTG
Fly-1	Hg	BC11	GTGTGTCTCGATGCGC
Fly-1	Fb	BC12	TGTCATATGAGAGTGT
Fly-1	Mp	BC13	TCTCGCGCGTGCACGC
Fly-1	Gd	BC14	GATATATCGAGTATAT
Fly-1	Ge	BC15	GTGTGCACTCACACTC
Fly-1	Tr	BC16	CGTGTCTAGCGCGCGC
Fly-1	Fg	BC17	GTGTGAGATATATATC
Fly-1	Ck	BC24	GTGACACACAGAGCAC
Fly-2	Sg	BC01	CACATATCAGAGTGCG
Fly-2	Ey	BC02	ACACACGCGAGACAGA
Fly-2	Br	BC03	ACGCGCTATCTCAGAG
Fly-2	Vc	BC04	CACACGCGCGCTATAT
Fly-2	L1	BC05	ACACACTCTATCAGAT
Fly-2	L2	BC06	CTATACATAGTGATGT
Fly-2	L3	BC07	TATATATGTCTATAGA
Fly-2	Wg	BC08	GAGACTAGAGATAGTG
Fly-2	Hl	BC09	ATGTGTATATAGATAT
Fly-2	Mg	BC10	GCGCGCGCACTCTCTG
Fly-2	Hg	BC11	GTGTGTCTCGATGCGC
Fly-2	Fb	BC12	TGTCATATGAGAGTGT
Fly-2	Mp	BC13	TCTCGCGCGTGCACGC
Fly-2	Gd	BC14	GATATATCGAGTATAT
Fly-2	Ge	BC15	GTGTGCACTCACACTC
Fly-2	Tr	BC16	CGTGTCTAGCGCGCGC
Fly-2	Fg	BC17	GTGTGAGATATATATC
Fly-2	Mo	BC18	GCGCACGCACTACAGA

For an annotated read, both sample barcodes (16 bp each) on 5' and 3' end were compared to the input barcode sequence.

The most similar barcode link the read to a putative organ.
Both sample barcodes should link to a identical organ.
The maxium edit distance of the sample barcodes to the input barcode, should be less than 10%, which means no more than 1 mismatch on the sample barcode.

Refer parse_feature.py script for more details.

Cluster Molecular Indentifier (UMI) Sequence

UMI sequences (14 bp) of each sample were clustered by usearch tool, with a a variant of the UCLUST algorithm. After a clustering, identical / similar UMI sequences can be linked to an unique ID.

The parameters for usearch tool are -id 0.95 -gapopen 3.0I/2.0E -gapext 1.0I/0.5E -match +2.0 -mismatch -20.0.

Refer cluster_umi.sh script for more details.

Group Reads by Sample Barcode and Molecular Indentifier

After the sample barcodes and the molecular indentifier have been annotated, the biological sequence of each stranded ccs read can be linked to a specific lineage barcode in an organ. Several biological sequence might link to one unique lineage barcode. We can utilize this information to generate consensus sequence, and eliminate sequence errors.

Refer run_annotate_umi.sh script for more details.

Generate Consensus Sequence for Lineage Barcode

The consensus sequence mentioned in this section is different from circular consensus sequnce (CCS). They are from different sequencing reads, but from identical lineage barcode.

Refer ccc.py script for more details.

Call Mutation

Refer call_mutation.py script for more details.

Post Quality Filter

Refer select_cluster.py script for more details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

smrt

smrt

README.md

Single Molecule Sequecing Pipeline (smrt)

Adaptors Trimming

Circular Consensus Sequence Generation

Align to Reference Sequence

Annotate Sequence Features

Link Sample Barcode Sequence

Cluster Molecular Indentifier (UMI) Sequence

Group Reads by Sample Barcode and Molecular Indentifier

Generate Consensus Sequence for Lineage Barcode

Call Mutation

Post Quality Filter

Files

smrt

Directory actions

More options

Directory actions

More options

Latest commit

History

smrt

Folders and files

parent directory

README.md

Single Molecule Sequecing Pipeline (smrt)

Adaptors Trimming

Circular Consensus Sequence Generation

Align to Reference Sequence

Annotate Sequence Features

Link Sample Barcode Sequence

Cluster Molecular Indentifier (UMI) Sequence

Group Reads by Sample Barcode and Molecular Indentifier

Generate Consensus Sequence for Lineage Barcode

Call Mutation

Post Quality Filter