|
| 1 | +--- |
| 2 | +label: Downsample |
| 3 | +description: Downsample data by barcode |
| 4 | +icon: fold-down |
| 5 | +order: 10 |
| 6 | +--- |
| 7 | + |
| 8 | +# :icon-fold-down: Downsample data by barcode |
| 9 | + |
| 10 | +=== :icon-checklist: You will need one of the following |
| 11 | +- one alignment file [!badge variant="success" text=".bam"] [!badge variant="success" text=".sam"] [!badge variant="secondary" text="case insensitive"] |
| 12 | +- one set of paired-end reads in FASTQ format [!badge variant="success" text=".fq"] [!badge variant="success" text=".fastq"] [!badge variant="secondary" text="gzip recommended"] [!badge variant="secondary" text="case insensitive"] |
| 13 | +=== |
| 14 | + |
| 15 | +While downsampling (subsampling) FASTQ and BAM files is relatively simple with tools such as `awk`, `samtools`, `seqtk`, and `seqkit`, |
| 16 | +[!badge corners="pill" text="downsample"] lets you downsample a BAM file (or a pair of paired-end FASTQ files) _by barcode_. That means you |
| 17 | +keep all the reads associated with `d` barcodes. The `--invalid` proportion determines what proportion of the invalid barcodes is added to the barcode |
| 18 | +pool that gets subsampled, where `0` is none, `1` is all invalid barcodes, and a number in between is that proportion, e.g. `0.5` is half. |
| 19 | +Bear in mind that the barcode pool still gets subsampled, so the `--invalid` proportion doesn't necessarily reflect how many invalid barcodes end up |
| 20 | +in the output, but rather what proportion of them is considered for sampling. |
| 21 | + |
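The two-stage sampling described above (build a pool, then subsample it) can be sketched in a few lines of shell. This is only an illustration of the idea, not how the module is implemented: the helper name `sample_pool` and its barcode-list input files are hypothetical, and drawing the admitted invalid barcodes at random is an assumption.

```bash
# Hypothetical helper illustrating the pool-then-sample logic.
# $1 = file with one valid barcode per line
# $2 = file with one invalid barcode per line
# $3 = proportion of invalid barcodes admitted to the pool (0 to 1)
# $4 = number of barcodes to draw from the pool (the `d` value)
sample_pool() {
    local n_invalid keep
    n_invalid=$(wc -l < "$2")
    # Number of invalid barcodes admitted to the pool, rounded down
    keep=$(awk -v n="$n_invalid" -v p="$3" 'BEGIN { printf "%d", n * p }')
    # Pool = all valid barcodes + a random subset of invalid ones,
    # then draw d barcodes from that pool without replacement
    { cat "$1"; shuf -n "$keep" "$2"; } | shuf -n "$4"
}
```

With a proportion of `0.5`, half of the invalid barcodes enter the pool, but how many of them survive the final draw depends on chance and on how large the pool is relative to `d`.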
| 22 | +!!! Barcode tag |
| 23 | +Barcodes must be in the `BX:Z` SAM tag for both BAM and FASTQ inputs. See [Section 1 of the SAM Spec here](https://samtools.github.io/hts-specs/SAMtags.pdf). |
| 24 | +!!! |
| 25 | + |
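Before choosing a value for `-d`, it can help to know how many distinct barcodes a file contains. The sketch below lists the unique `BX:Z` values found in SAM-formatted text on standard input; the function name `list_barcodes` is hypothetical and it only handles SAM text, not FASTQ headers.

```bash
# Print the unique BX:Z barcodes found in SAM-formatted text on stdin.
# Header lines (starting with @) are skipped; optional tags start at column 12.
list_barcodes() {
    awk -F'\t' '!/^@/ {
        for (i = 12; i <= NF; i++)
            if ($i ~ /^BX:Z:/) print substr($i, 6)
    }' | sort -u
}
```

Assuming `samtools` is installed, `samtools view sample1.bam | list_barcodes | wc -l` would report the number of distinct barcodes in `sample1.bam`.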
| 26 | +```bash usage |
| 27 | +harpy downsample OPTIONS... INPUT(S)... |
| 28 | +``` |
| 29 | + |
| 30 | +```bash example |
| 31 | +# BAM file |
| 32 | +harpy downsample -d 1000 -i 0.3 -p sample1.sub1000 sample1.bam |
| 33 | + |
| 34 | +# FASTQ file |
| 35 | +harpy downsample -d 1000 -i 0 -p sample1.sub1000 sample1.F.fq.gz sample1.R.fq.gz |
| 36 | +``` |
| 37 | + |
| 38 | +## :icon-terminal: Running Options |
| 39 | +In addition to the [!badge variant="info" corners="pill" text="common runtime options"](/commonoptions.md), the [!badge corners="pill" text="downsample"] |
| 40 | +module is configured using the command-line arguments below. |
| 41 | + |
| 42 | +{.compact} |
| 43 | +| argument | short name | default | description | |
| 44 | +| :-------------- | :--------: | :-----------: | :-------------------------------------------------------------------------------------------------------------------------------- | |
| 45 | +| `INPUT(S)` | | | [!badge variant="info" text="required"] One BAM file or both read files from a paired-end FASTQ pair | |
| 46 | +| `--downsample` | `-d` | | [!badge variant="info" text="required"] Number of barcodes to downsample to | |
| 47 | +| `--invalid` | `-i` | `1` | Proportion of invalid barcodes to include in the barcode pool that gets sampled (`0` to `1`) | |
| 48 | +| `--prefix` | `-p` | `downsampled` | Prefix for output files | |
| 49 | +| `--random-seed` | | | Random seed for sampling [!badge variant="secondary" text="optional"] | |
| 50 | + |
| 51 | +---- |
| 52 | +## :icon-git-pull-request: Downsample Workflow |
| 53 | +```mermaid |
| 54 | +graph LR |
| 55 | + subgraph fastq |
| 56 | + R1([read 1]):::clean---R2([read 2]):::clean |
| 57 | + end |
| 58 | + subgraph bam |
| 59 | + bamfile([bam]):::clean |
| 60 | + end |
| 61 | + fastq-->|bam conversion|bam |
| 62 | + bam-->sub([extract and\n subsample barcodes]):::clean |
| 63 | + sub-->exreads([extract reads]):::clean |
| 64 | + bam-->exreads |
| 65 | + fastq-->exreads |
| 66 | + style fastq fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px |
| 67 | + style bam fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px |
| 68 | + classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px |
| 69 | +``` |