-
Notifications
You must be signed in to change notification settings - Fork 3
/
GUIDE.Rmd
470 lines (386 loc) · 23.1 KB
/
GUIDE.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
---
# This is the source code for the CHIIMP user guide. Rendering this document
# with knitr will produce a finished PDF file with R examples inline.
title: "CHIIMP User Guide"
author: "Jesse Connell"
date: "2022/04/25"
output:
pdf_document:
toc: true
toc_depth: 3
urlcolor: blue
linkcolor: blue
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
devtools::load_all(quiet = TRUE)
```
## Introduction
**CHIIMP** (Computational, High-throughput Individual Identification through
Microsatellite Profiling) is a program to analyze microsatellite (short tandem
repeat) DNA sequence data, producing genotypes from raw data and automating some
typical analysis tasks.
CHIIMP runs as a standalone tool, but is built as an [R] language package. All
functionality of the standalone program can be accessed from functions within R,
and the reporting and visualization functions are designed to integrate well
with [RStudio] and [R Markdown].
This document mostly focuses on CHIIMP as a standalone tool. For more
information on the use of specific functions within R, also see the built-in
package documentation.
## Installation
First install [R] and [RStudio], which will supply most software dependencies
for CHIIMP. Once these are installed, follow the specific instructions below
for your operating system. In all three cases CHIIMP performs an analysis when
a configuration file is dragged and dropped onto the desktop icon; there is no
interactive interface via the icon, though the R package can be used
interactively. See the Usage section for more information.
### Windows
On Windows, double-click the `install_windows.cmd` script. This will install
the package and R dependencies, and create a desktop shortcut.
### Mac OS
On Mac OS, right-click (control+click) the `install_mac.command` shell script,
select "Open," and also click "Open" in the window that appears to confirm that
really do want to open it. (Apple has specific instructions about these
security precautions [here](https://support.apple.com/kb/PH25088?locale=en_US).)
This will automatically install the package along with R dependencies and
create a desktop alias.
If a window appears recommending installation of the Mac OS command-line
developer tools, go ahead and install them. After that you'll probably need to
re-run the CHIIMP installer again to finish the install.
### Linux
On Linux, run the `install_linux.sh` shell script to automatically install the
package along with R dependencies. An icon for the program is created at
`$HOME/Desktop/CHIIMP.desktop`. Specific usage of the desktop icon will depend
on the desktop environment in use. (The `CHIIMP.desktop` text file references
the installed chiimp executable, and supplies the config file as a command-line
argument when dragged and dropped onto the icon.)
## Input Data Organization
The information CHIIMP uses during analysis is:
* Sequence files containing complete microsatellite sequences (FASTA or FASTQ,
plain text or gzip-compressed)
* Spreadsheet of dataset sample attributes
* Spreadsheet of locus attributes
* Spreadsheet of known individuals (optional)
* Spreadsheet of named alleles (optional)
The spreadsheets are in comma-separated (CSV) format. Column names are
important but not column order. Extra columns are imported as-is but are
otherwise ignored.
### Sequence Files
The sequence files must contain sequences that span complete microsatellites. No
assembly is performed to handle fragments of microsatellites, and the lengths of
sequences identified as alleles are reported as-is. (An implicit assumption
throughout the analysis is that any candidate allele sequence begins and ends
with conserved regions corresponding to the PCR primers used, and the forward
primer sequence is one of the filtering criteria during analysis.)
### Dataset Sample Attributes
The description of the samples to be analyzed can be provided in a spreadsheet,
or automatically loaded from the data file names. An example spreadsheet:
| `Filename` | `Replicate` | `Sample` | `Locus` |
|:------------------:|:-----------:|:--------:|:-------:|
| `100-1-A.fastq.gz` | `1` | `100` | `A` |
| `100-2-A.fastq.gz` | `2` | `100` | `A` |
| `100-1-B.fastq.gz` | `1` | `100` | `B` |
| `100-2-B.fastq.gz` | `2` | `100` | `B` |
| `100-1-1.fastq.gz` | `1` | `100` | `1` |
| `100-2-1.fastq.gz` | `2` | `100` | `1` |
| `100-1-2.fastq.gz` | `1` | `100` | `2` |
| `100-2-2.fastq.gz` | `2` | `100` | `2` |
| `101-1-A.fastq.gz` | `1` | `101` | `A` |
| `101-2-A.fastq.gz` | `2` | `101` | `A` |
| `101-1-B.fastq.gz` | `1` | `101` | `B` |
| `101-2-B.fastq.gz` | `2` | `101` | `B` |
| `101-1-1.fastq.gz` | `1` | `101` | `1` |
| `101-2-1.fastq.gz` | `2` | `101` | `1` |
| `101-1-2.fastq.gz` | `1` | `101` | `2` |
| `101-2-2.fastq.gz` | `2` | `101` | `2` |
These columns are required for each entry:
* Filename: The name of the sequence file to analyze. Note that if samples
were multiplexed on the sequencer by pooling PCR products for multiple loci,
filenames can be repeated here; just vary the text in the Locus column across
rows. The analysis will use the forward PCR primer (see Locus Attributes
below) to select just the sequences matching each locus as needed.
* Replicate: An identifier for a repeated case of the same biological sample.
If not applicable, the column may be left blank, but is still required.
* Sample: An identifier for a particular biological sample.
* Locus: The identifier for the locus being genotyped with. These must match
the identifiers in Locus Attributes below.
For simple cases that have a one-to-one match between sequence files and
sample/locus combinations, and with descriptive filenames following a consistent
pattern, the dataset table can be created automatically at run-time. See the
Usage section for more information.
### Locus Attributes
The description of the loci should be given in a spreadsheet with loci on rows
and attributes on columns. For example:
| `Locus` | `LengthMin` | `LengthMax` | `LengthBuffer` | `Motif` | `Primer` | `ReversePrimer` |
|:-------:| -----------:| -----------:| --------------:|:-------:|:-----------------:|:----------------:|
| `A` | `131` | `179` | `20` | `TAGA` | `TATCACTGGTGT...` | `CACAGTTGTGTG...`|
| `B` | `194` | `235` | `20` | `TAGA` | `AGTCTCTCTTTC...` | `TAGGAGCCTGTG...`|
| `1` | `232` | `270` | `20` | `TATC` | `ACAGTCAAGAAT...` | `CTGTGGCTCAAA...`|
| `2` | `218` | `337` | `20` | `TCCA` | `TTGTCTCCCCAG...` | `TCTGTCATAAAC...`|
These columns are required:
* Locus: A short unique identifier.
* LengthMin: The minimum expected sequence length in bases.
* LengthMax: The maximum expected sequence length in bases.
* LengthBuffer: An additional length below LengthMin or above LengthMax to
accept for candidate allele sequences. (If the length range of alleles for a
given locus is uncertain or unknown, this may be set very high to effectively
disable the length range requirement.)
* Motif: The short sequence repeating in tandem.
* Primer: The forward PCR primer used in preparing the sequencing library.
This is used as one of the checks for candidate allele sequences.
* ReversePrimer: The reverse PCR primer used in preparing the sequencing
library. This is not currently used unless `use_reverse_primers` is enabled
in the configuration.
### Known Individuals (Optional)
If a spreadsheet of genotypes for known individuals is supplied, the analysis
can attempt to match samples with the known genotypes automatically. For
example:
| `Name` | `Locus` | `Allele1Seq` | `Allele2Seq` |
|:---------:|:-------:|:----------------:|:----------------:|
| `CH001` | `A` | `ATTATCACTGG...` | `ATTATCACTGG...` |
| `CH001` | `B` | `TCAGTCTCTCT...` | |
| `CH001` | `1` | `AGACAGTCAAG...` | `AGACAGTCAAG...` |
| `CH001` | `2` | `CTTTGTCTCCC...` | `CTTTGTCTCCC...` |
| `CH002` | `A` | `ATTATCACTGG...` | `ATTATCACTGG...` |
| `CH002` | `B` | `TCAGTCTCTCT...` | `TCAGTCTCTCT...` |
| `CH002` | `1` | `AGACAGTCAAG...` | |
| `CH002` | `2` | `CTTTGTCTCCC...` | `CTTTGTCTCCC...` |
The order of the alleles given is not important, and homozygous individuals may
have Allele2Seq either left blank or set to a copy of Allele1Seq. The sequences
should contain any conserved region before and after the repeats including that
used for the PCR primers described above.
### Named Alleles (Optional)
If a spreadsheet of allele names and sequences is supplied, the analysis
will use those names in summary tables in the output report. For example:
| `Locus` | `Name` | `Seq` |
|:-------:|:-----------:|:----------------:|
| `A` | `200-a` | `ATTATCACTGG...` |
| `A` | `180-a` | `ATTATCACTGG...` |
| `A` | `180-b` | `ATTATCACTGG...` |
| `B` | `300-a` | `ATTATCACTGG...` |
| `B` | `305-a` | `ATTATCACTGG...` |
| `B` | `290-a` | `ATTATCACTGG...` |
The software will automatically create short allele names for any identified
allele not listed in the allele spreadsheet (or for all alleles if no
spreadsheet is given).
The automatic names are the sequence length and a sequence-specific suffix
separated by a hyphen, for example, "180-fdd1c6" for a 180 bp sequence with no
assigned name and particular sequence content. Any other 180 bp sequence would
receive a different suffix when the name is assigned.
## Algorithm
CHIIMP breaks the genotyping process into two parts. First a sample file is
de-replicated and a table of unique sequences is created, with no filtering yet
applied. Second the table is filtered to just candidate allele sequences, and
up to two sequences are reported as the genotype. Both the per-sequence table
and the final genotypes are saved in the final output, as spreadsheets in the
`processed-files` directory and as the `summary.csv` spreadsheet.
### Sample Processing
The table of unique sequences includes basic information for each case: sequence
content, length, and read counts observed. These are the Seq, Count, and Length
columns. The sequences are ordered by count with the most abundant first.
Additional columns associate each sequence with a particular locus, using the
locus attributes described above. First each locus' forward primer (and
optionally reverse as well) is compared with the sequence and the matching
locus name is stored in a MatchingLocus column. The sequence is then checked
for several tandem repeats of the motif for that locus, and compared to the
length range expected for that locus. TRUE/FALSE values for these are stored
in MotifMatch and LengthMatch columns respectively. The Ambiguous column marks
any sequences containing bases outside of A, C, T, and G (such as N).
The primer matching recognizes IUPAC ambiguity codes in the primer sequences
(e.g. an "N" signifies any of A, C, T, or G) but not in the read sequences.
By default no mismatches are allowed, but a maximum number of mismatches per
comparison can be defined in the configuration. The default behavior will not
modify the reads based on matched primers, but several read-modifying actions
are available via the `primer_action` configuration option:
* `none`: no change (the default). This is reasonable if adapter sequences
have already been trimmed, but otherwise one of the below should be used.
* `keep`: trim each reads to the primer but keep the primer region as-is.
Note that this would not work will with primer sequences containing IUPAC
ambiguity codes; in that situation use `replace` or `remove` to ensure that
reads with equivalent primer regions are grouped together.
* `replace`: trim each read *and* replace the matched primer region with the
sequence from the locus attributes table.
* `remove`: trim reads to exclude the primer regions.
These can also be specified for forward and reverse primers only with
`primer_action_fwd` and `primer_action_rev`, respectively.
PCR artifacts can obscure real allele sequences with incorrect sequences. There
are extra filters to attempt to remove these if possible or highlight cases that
may require further attention.
The sample data tables include "Stutter" and "Artifact" columns to mark entries
that look like possible polymerase stutter or other artifacts of another
sequence present at higher counts. For cases of potential polymerase stutter,
the higher-count sequence is one motif repeat longer. For
insertion/deletion/substitution artifacts the higher-count sequence is within 1
bp of the same length. In both cases the supposed artifact sequence will be
marked if the read counts are 1/3 or lower than the higher-count sequence.
(This represents a trade-off in sensitivity and specificity since genuine allele
sequences may differ in length by one or even zero repeats, and read counts for
pairs of alleles in a given sample can vary considerably.) Both of these
columns store row numbers for the higher-count sequence that an artifact may
have originated in, if found. Note that relative sequence lengths and counts
determine the outcome here, since sequence content for the artifacts is largely
indistinguishable from real allele sequences.
Lastly, the ratio of read counts for each sequence to the total reads in the
sample and the reads with the same MatchingLocus value is stored in
FractionOfTotal and FractionOfLocus columns respectively.
This is the `analyze_seqs` function in the R package.
### Genotype Calling
In the previous stage every single unique sequence for each data file was
described in a table, but no filtering or genotyping occurred. Now just one or
two candidate allele sequences are extracted from each table and reported as the
genotype.
First, the table rows are restricted to just those matching the expected locus'
primer, motif, and length range (using the MatchingLocus, MotifMatch, and
LengthMatch columns). If the resulting total read count is below a minimum
value (by default `r cfg("min_locus_reads")`, customizable via the
`min_locus_reads` setting) no genotyping will be attempted. Next only those
sequences accounting for at least a minimum fraction of the remaining reads are
considered. (The default value is `r cfg("min_allele_abundance")`. This can
be changed via the `min_allele_abundance` setting.) Sequences that are marked
as potential stutter or other artifacts (via the Stutter and Artifact columns
of the table) or contain ambiguous sequence content (via the Ambiguous column)
are excluded next.
After these filters are applied, the top one or two remaining sequences are
labeled as the alleles. (If only one sequence remains, the sample is labeled
homozygous; if two or more, heterozygous.) The final details kept for each
sample are:
* the sequence content, length, and counts for the one or two alleles
* the zygosity of the sample
* whether the ambiguous-sequence filter removed a potential allele
* whether the stutter and/or artifact filter removed a potential allele
* The read counts of the entire sample before any filtering
* The read counts of just those sequences matching the locus primer, motif, and
length range
These tasks (the filtering and categorizing of each sequence in the table and
the short genotype summary) are the `analyze_sample` and `summarize_sample`
functions in the R package.
### Summary and Reporting
The genotype and details identified in the previous step for each sample are
aggregated into a spreadsheet with a row for each sample. This summary
spreadsheet and the more detailed per-file and per-sample tables are all saved
in the final output.
For inter-sample comparisons, the alleles identified across samples for each
locus are aligned to one another. The genotypes for each sample are clustered
by number of matching alleles, showing similarity between samples. If a
spreadsheet of known genotypes was given, the sample genotypes are also compared
to the known genotypes, with any close matches reported. If a Name column was
provided with the sample definition table as well as a known genotypes
spreadsheet, the known-correct genotypes will be paired with applicable samples
and a column tracking the result of the genotyping (Correct, Incorrect, Blank,
or Dropped Allele) will be added. A single report document summarizes the
genotyping and these other details. See the Output Data Organization section
below for more information on the output.
These steps are handled by the `full_analysis` function in the R package.
## Usage
CHIIMP takes a configuration file as input and saves all output to a
folder. The configuration file points to all of the input data described above,
and specifies options for the analysis and output. All options have defaults,
so the file may be very brief or even empty. The file format a simple
two-column CSV spreadsheet, with a "Key" column storing the names of
configuration options and a "Value" column storing each associated value.
(The [YAML](http://yaml.org/) file used in previous CHIIMP versions is still
supported, but the new CSV format should be easier to to manage for those
unfamiliar with YAML.)
For example, a configuration file might have just two entries, showing the
spreadsheets to use for the samples and loci to analyze:
Key,Value
dataset,samples.csv
locus_attrs,locus_attrs.csv
The configuration file can be dragged and dropped onto the desktop icon
created during installation.
For command-line usage, the configuration file can be given as the first
argument to the R script installed with the package. (The location of the
script can be shown in R with `system.file("bin", "chiimp", package="chiimp")`.)
To run the same analysis within R, pass a list of configuration options to the
`chiimp::full_analysis()` function.
### Example Configuration File
The text in the example configuration file included here shows a slightly more
complex case:
```{r, echo=FALSE}
# Show example configuration file text
fp <- "inst/example_config.csv"
config_example <- load_config(fp)
kableExtra::kable_styling(knitr::kable(config_example, booktabs = TRUE, linesep = ""))
```
### Common Options
Below is a list of a few commonly-customized options. (See also the end of
this document for a full list with all default settings.) For more information
on the format of the spreadsheets listed here, see the "Input Data
Organization" section above.
* `dataset`: file path to table of sample attributes to use (rather than
detecting sample attributes based on filenames)
* `locus_attrs`: file path to locus attributes CSV file
* `allele_names`: file path to known alleles CSV file
* `genotypes_known`: file path to known genotypes CSV file
* `output_path`: directory path for saving output data
* `min_motif_repeats`: minimum number of perfect tandem motif repeats to
accept a read as a match for an STR locus. For example, if the `Motif`
specified for a locus (in the `locus_attrs` file) is `ACTC`, a setting of 3
here will require `ACTCACTCACTC` somewhere in each read.
* `max_stutter_ratio`: when considering a second-most-abundant sequence as
a possible allele, if it is one repeat shorter than the most abundant
sequence, it must exceed this fraction of the primer sequence abundance
* `max_artifact_ratio`: like `max_stutter_ratio`, but for secondary sequences
with lengths within 1 bp of the primary sequence.
* `use_reverse_primers`: should reverse primers be used for matching reads to
loci, or just forward?
* `reverse_primer_r1`: if using reverse primers, are the sequences given in
the locus attributes table oriented in the direction of the forward reads?
* `primer_action`: a key word for how to modify each read based on the
matched primer. (See Sample Processing section above.)
* `min_allele_abundance`: minimum fraction of locus-matched reads for a
sequence to be considered a candidate allele
* `min_locus_reads`: minimum number of locus-matched reads to attempt to call
a genotype
## Output Data Organization
A the end of an analysis CHIIMP creates a directory of files with all results.
* `summary.csv`: spreadsheet of the called genotypes and additional attributes
for each sample. Each sample is on a separate row, and each column corresponds
to a separate attribute in the results. This includes all columns in the input
dataset spreadsheet including locus, replicate, and sample identifiers, the
sequences, sequence lengths, and counts of the identified allele(s), and
several additional attributes.
* `processed-files`: directory of spreadsheets for each input data file. Each
spreadsheet contains one unique sequence per row with attributes on columns.
At this stage no filtering for sample/locus-specific attributes has been
applied. (This is particularly relevant for sequencer-multiplexed samples as
one input data file may contain data for multiple samples.)
* `processed-samples`: directory of spreadsheets for each sample. As for
`processed-files`, each spreadsheet contains one unique sequence per row with
attributes on columns. These represent the intermediate sample-specific data
CHIIMP uses to call a genotype for each sample, and each spreadsheet here
corresponds to a single row in the `summary.csv` file.
* `histograms`: directory of counts-versus-length histograms for each sample.
Counts are tallied on a by-sequence basis rather than by-length for alleles, so
the bars for called alleles (in red) are generally shorter than the bars for
unfiltered sequences (in black) or the matching-locus sequences (in pink).
* `allele-sequences`: directory of FASTA files for each sample, giving just the
sequence content also shown in `summary.csv`. (This is a convenience feature
to make the called alleles easily usable in a standard format, but the same
information is available in `summary.csv`.)
* `alignments`: directory of FASTA files for each locus, giving a multiple
alignment of all identified alleles per locus.
* `alignment-images`: directory of visualization images of the per-locus
alignments. These are also included in the report document.
* `report.html`: report document summarizing the genotyping results,
inter-sample comparisons, and (if known genotypes were provided), a comparison
of samples with known individuals.
An additional file will be created if `fp_rds` is defined in the `output`
setting of the configuration. This file contains all analysis results in a
single R object using R's native data serialization format for easy
post-analysis in R if desired.
These directory and file names are customizable in the `output` section of the
configuration.
## Full Configuration Options List
Configuration options and their defaults for CHIIMP version
`r devtools::as.package(".")$version`:
```{r echo=FALSE}
cfg_defaults <- subset(CFG_DEFAULTS, select = -c(Parser, OldName))
colnames(cfg_defaults)[colnames(cfg_defaults) == "Value"] <- "Default Value"
kableExtra::landscape(kableExtra::kable_styling(
knitr::kable(cfg_defaults,
booktabs = TRUE, linesep = ""), latex_options = "scale_down"))
```
[R]: https://www.r-project.org/
[RStudio]: https://www.rstudio.com/
[R Markdown]: https://rmarkdown.rstudio.com/