scarecrow

A toolkit for preprocessing single cell sequencing data.

Todo

Error handling implementing to capture missing or incorrect parameters, and unexpected file content
Peaks in between barcodes need further investigation
Plot generated by harvest currently will not handle > 1 barcode peak per whitelist
Benchmark different assays (SPLiTseq, Parse, 10X) and methods (split-pipe, scarecrow, UMI tools)
- barcode recovery
- alignment (STAR and kallisto)
Test alignment with kallisto and STAR
- may need to alter sequence header formatting depending on what is retained in BAM file

Environment

mamba create --name scarecrow python=3.12
mamba activate scarecrow
mamba install git pip
pip install git+https://github.com/pachterlab/seqspec.git
pip install git+https://github.com/MorganResearchLab/scarecrow.git
mamba install numpy, scipy, pandas, seaborn

Workflow

There are currently three tools: seed, harvest, reap, each of which feeds into the next. Using a small subset of paired-end sequencing reads (~10K ), the seed tool identifies barcode seeds within reads and records the counts to a CSV file. This should be run on each barcode whitelist if the library structure is uncertain.

# Find barcode seeds
scarecrow seed --fastqs <paired_fastq_R1> <paired_fastq_R2> --strands pos neg --out <BC1_counts.csv> \
    --barcodes BC1:v1:<v1_whitelist.txt> 
scarecrow seed --fastqs <paired_fastq_R1> <paired_fastq_R2> --strands pos neg --out <BC2_counts.csv> \
    --barcodes BC2:n198:<n198_whitelist.txt>
scarecrow seed --fastqs <paired_fastq_R1> <paired_fastq_R2> --strands pos neg --out <BC3_counts.csv> \
    --barcodes BC3:v3:<v3_whitelist.txt>

The resulting CSV files are then processed with harvest to identify barcode alignment peaks. The minimum distance between barcodes can be provided, and the number of peaks (--barcode_count) to return. Optionally, a conserved sequences TSV file generated by seed can be provided to exclude barcodes matching within these regions. The results are recorded to a CSV file and a histogram of the peaks is plotted to a PNG file. The CSV table should be edited if necessary to ensure that the barcode_whitelist values are unique.

# Harvest barcode positions 
scarecrow harvest <BC1_counts.csv> <BC2_counts.csv> <BC3_counts.csv> \
    --barcode_count 3 --min_distance 10 --conserved <BC1_conserved.tsv> \
    --out <barcode_positions.csv>

# Edit <barcode_positions.csv> barcode_whitelist so that each barcode (BC) has a unique name, e.g.:
#     read  start  end orientation  barcode_whitelist  read_count  read_fraction
# 0  read1     10   18     forward    [('BC1', 'v1')]        9485           0.95
# 1  read1     48   56     forward    [('BC2', 'v1')]        8393           0.84
# 2  read1     79   87     forward  [('BC3', 'n198')]        6002           0.60

Finally, the reap tool extracts a target sequence (i.e. cDNA) and its associated quality values from either read 1 or read2. This sequence data is written to a new fastq file, and the combination of cell barcodes identified near the positions previously estimated are appened to the sequence header. The jitter value is the flanking distance to extend the barcode search for from the start and end positions of the barcode peaks. mismatch is the maximum number of mismatches between an expected and observed barcode sequence.

# Reap target sequence from fastqs (TBC)
scarecrow reap --fastqs <paired_fastq_R1> <paired_fastq_R2> --barcode_positions <barcode_positions.csv> \
    --barcodes BC1:v1:<v1_whitelist.txt> BC2:v1:<v1_whitelist.txt> BC3:n198:<n198_whitelist.txt> \
    --jitter 2 --mismatches 2 --base_quality 30 --extract 1:1-64 --out cDNA.fastq

Parse Evercode WTv3 Example

OUTDIR=/uoa/home/s14dw4/scarecrow_test/MorganLab

# seed
R1=/uoa/scratch/shared/Morgan_Lab/Parse/LV6000778329-SO8229-University-of-Aberdeen-Morgan-Lab-Lib1-2024-12-10_S1_L001_R1_001.fastq.gz
R2=/uoa/scratch/shared/Morgan_Lab/Parse/LV6000778329-SO8229-University-of-Aberdeen-Morgan-Lab-Lib1-2024-12-10_S1_L001_R2_001.fastq.gz

# barcodes
BARCODES=(
  "BC1:R1_v3:${OUTDIR}/barcodes/bc_data_n123_R1_v3_5.csv"
  "BC2:v1:${OUTDIR}/barcodes/bc_data_v1.csv"
  "BC3:R3_v3:${OUTDIR}/barcodes/bc_data_R3_v3.csv"
  )

# Exract 100K reads
zcat -c ${R1} | head --l 400000 > ${OUTDIR}/fastq/100K_1.fastq 
zcat -c ${R2} | head --l 400000 > ${OUTDIR}/fastq/100K_2.fastq 

# Identify barcode seeds
for BARCODE in ${BARCODES[@]}
do
    scarecrow seed --fastqs ${R1} ${R2} --strands pos neg \
      -o ${OUTDIR}/results/barcodes_${BARCODE%:*:*}.csv --barcodes ${BARCODE}
done    

# Harvest barcode peaks
FILES=(${OUTDIR}/results/barcodes_*csv)
scarecrow harvest ${FILES[@]} --barcode_count 3 --min_distance 11 \
    --out ${OUTDIR}/barcode_positions.csv

# Reap cDNA
scarecrow reap --fastqs ${R1} ${R2} -p ${OUTDIR}/barcode_positions.csv \
    -j 5 -m 1 --barcodes ${BARCODES[@]} --read1 0-64 --out ./cDNA.fq.gz

Testing on laptop

R1=100K_1.fastq
R2=100K_2.fastq
BARCODES=(BC1:R1_v3:/Users/s14dw4/Documents/scarecrow_test/barcodes/bc_data_n123_R1_v3_5.barcodes
          BC2:v1:/Users/s14dw4/Documents/scarecrow_test/barcodes/bc_data_v1.barcodes
          BC3:R3_v3:/Users/s14dw4/Documents/scarecrow_test/barcodes/bc_data_R3_v3.barcodes)
for BARCODE in ${BARCODES[@]}
do
    scarecrow seed --fastqs ${R1} ${R2} --strands pos neg \
      -o ./results/barcodes_${BARCODE%:*:*}.csv --barcodes ${BARCODE}
done

FILES=(./results/barcodes_BC*csv)
scarecrow harvest ${FILES[@]} --barcode_count 3 --min_distance 11 \
    --conserved ./results/barcodes_BC1_conserved.tsv --out barcode_positions.csv

time scarecrow reap --fastqs ${R1} ${R2} -p ./barcode_positions.csv --barcode_reverse_order \
    -j 2 -m 2 -q 30 --barcodes ${BARCODES[@]} --extract 1:1-64 --umi 2:1-10 --out ./cDNA.fq --threads 4

scarecrow tally -f ./cDNA.fq

BARCODES=(BC1:n99_v5:./barcodes/bc_data_n99_v5.txt BC2:v1:./barcodes/bc_data_v1.txt BC3:v1:./barcodes/bc_data_v1.txt) FILES=(./barcode_profiles/Parse/barcodes.BC*csv) scarecrow harvest ${FILES[@]} --barcode_count 3 --min_distance 11
--conserved ./barcode_profiles/Parse/barcodes.BC1_conserved.tsv --out ./test.csv

Details on RG tags: https://github.com/samtools/hts-specs https://github.com/samtools/hts-specs/blob/master/SAMtags.pdf https://www.biostars.org/p/9593008/

import pysam

def add_tags_to_sam(input_sam, output_sam):
    with pysam.AlignmentFile(input_sam, "r") as infile, pysam.AlignmentFile(output_sam, "w", header=infile.header) as outfile:
        for read in infile:
            # Extract the sequence header
            fastq_header = read.query_name
            
            # Parse the barcodes and UMI from the FASTQ header
            # Example header: "@LH00509:177:22W5HTLT3:1:1101:46251:1000 1:N:0:CAGATCAC+ATGTGAAG barcodes=TCTGATCC_GAACAGGC_ATCCTGTA positions=51_31_11 mismatches=0_0_0 UMI=NGAACTGAGT"
            try:
                details = fastq_header.split(" ")
                attributes = {item.split("=")[0]: item.split("=")[1] for item in details if "=" in item}
                
                # Add the CB and ZU tags if available
                if "barcodes" in attributes:
                    read.set_tag("CB", attributes["barcodes"], value_type="Z")
                if "UMI" in attributes:
                    read.set_tag("ZU", attributes["UMI"], value_type="Z")
            except Exception as e:
                print(f"Failed to parse header for read {fastq_header}: {e}")

            # Write the updated read to the output SAM file
            outfile.write(read)

# Input and output SAM file paths
input_sam = "input.sam"
output_sam = "output.sam"

add_tags_to_sam(input_sam, output_sam)

More efficient approach, which may require a lot of RAM

import pysam

def preprocess_fastq_headers(fastq_file):
    """Preprocess FASTQ headers to create a mapping of read names to tags."""
    read_tags = {}
    with open(fastq_file, "r") as fq:
        for line in fq:
            if line.startswith("@"):
                # Extract the read name and tags
                fastq_header = line.strip()
                read_name = fastq_header.split(" ")[0][1:]  # Remove "@" and split to get the read name
                attributes = {item.split("=")[0]: item.split("=")[1] for item in fastq_header.split() if "=" in item}
                
                # Store barcodes and UMI if they exist
                barcodes = attributes.get("barcodes", None)
                umi = attributes.get("UMI", None)
                read_tags[read_name] = {"CB": barcodes, "ZU": umi}
    return read_tags

def add_tags_to_sam(input_sam, output_sam, read_tags):
    """Add tags to SAM file based on the preprocessed FASTQ headers."""
    with pysam.AlignmentFile(input_sam, "r") as infile, pysam.AlignmentFile(output_sam, "w", header=infile.header) as outfile:
        for read in infile:
            if read.query_name in read_tags:
                tags = read_tags[read.query_name]
                if tags["CB"]:
                    read.set_tag("CB", tags["CB"], value_type="Z")
                if tags["ZU"]:
                    read.set_tag("ZU", tags["ZU"], value_type="Z")
            outfile.write(read)

# Paths to the input files
fastq_file = "input.fastq"
input_sam = "input.sam"
output_sam = "output.sam"

# Preprocess FASTQ headers and add tags
read_tags = preprocess_fastq_headers(fastq_file)
add_tags_to_sam(input_sam, output_sam, read_tags)

Name		Name	Last commit message	Last commit date
Latest commit History 127 Commits
barcodes		barcodes
img		img
specs		specs
src		src
.DS_Store		.DS_Store
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

scarecrow

Todo

Environment

Workflow

Parse Evercode WTv3 Example

Testing on laptop

About

Releases

Packages

Contributors 2

Languages

License

MorganResearchLab/scarecrow

Folders and files

Latest commit

History

Repository files navigation

scarecrow

Todo

Environment

Workflow

Parse Evercode WTv3 Example

Testing on laptop

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages