diff --git a/README.md b/README.md
index 1bb17db..27c1f6f 100644
--- a/README.md
+++ b/README.md
@@ -9,73 +9,121 @@
 # Mutation Motif
+`mutation_motif` provides capabilities for analysis of point mutation counts data. It includes commands for preparing sequence data, log-linear analyses of the resulting counts, and sequence logo style visualisations. Two different analysis approaches are supported:
 
-This library provides capabilities for analysis of mutation properties.
-Two different analysis approaches are supported: (1) log-linear analysis
-of neighbourhood base influences on mutation coupled with a sequence
-logo like representation of influences (illustrated above); (2)
-log-linear analysis of mutation spectra, the relative proportions of
-different mutation directions from a starting base. A logo-like
-visualisation of the latter is also supported.
+1. log-linear analysis of neighbourhood base influences on mutation, coupled with a sequence logo like representation of influences (illustrated above)
+2. log-linear analysis of mutation spectra, the relative proportions of different mutation directions from a starting base. A logo-like visualisation of the latter is also supported.
 
-The models and applications of them are described in [Zhu, Neeman, Yap
-and Huttley 2017 Statistical methods for identifying sequence motifs
-affecting point
-mutations](https://www.ncbi.nlm.nih.gov/pubmed/27974498).
+The models and their applications are described in [Zhu, Neeman, Yap and Huttley 2017 Statistical methods for identifying sequence motifs affecting point mutations](https://www.ncbi.nlm.nih.gov/pubmed/27974498).
 ## Installation
-We recommend installation of dependencies via conda since you also need to have the python bindings to [R
-installed](https://rpy2.readthedocs.io/en/latest/overview.html#installation).
-Follow the [miniconda install
-instructions](https://docs.conda.io/en/latest/miniconda.html) for your
-platform.
-
-Having installed miniconda, the following command creates a new conda
-environment `myenv` into which we install the essential requirements
-using conda, then use pip to install `mutation_motif`.
+Install with pip:
 ```
-$ conda env create -n myenv -c python=3.11
-$ conda activate myenv
-$ conda install -c conda-forge rpy2
-$ python -m pip install "mutation_motif @ git+https://github.com/HuttleyLab/MutationMotif.git@develop"
+$ pip install mutation_motif
 ```
-**Note:** The above installs the developer version. To use the release, change
-`develop` to `master`.
+## The commands
 
-## Usage
+The primary tool is installed as a command line executable, `mm`.
 
-The primary tool is installed as a command line executable,
-`mutation_analysis`. It requires a counts table where the table contains
-counts for a specified flank size (maximum of 2 bases, presumed to be
-either side of the mutated base). It assumes the counts all reflect a
-specific mutation direction (e.g. A to G) and that counts from a control
-distribution are also included. Two subcommands are available: `nbr` and
-`spectra`. The first examines the influence of neighbouring bases up to
-fourth order interactions. The latter contrasts the mutations from
-specified starting bases between groups.
+### Preparing data for analyses
 
-Data processing command line tools are `aln_to_counts` and `all_counts`.
-The first converts a fasta formatted alignment of equal length sequences
-to the required counts table format. The latter combines the separate
-counts tables into a larger table suitable for spectra analyses.
+#### The input sequence file format
 
-Visualisation of mutation motifs, or mutation spectra, in a grid is
-provided by `mutation_draw` with `nbr_grid` and `spectra_grid`
-subcommands.
+At present, the code reads in a fasta formatted file in which every sequence has the same length. The length is an odd number and the mutation occurred at the middle base. `mm` assumes each sequence file contains sequences that experienced the same point mutation at this central position, e.g. `seqs-CtoT.fasta` contains only sequences that have a C to T mutation at the central position. The sequence flanking the mutated base is used to derive a paired "unmutated" reference. The details of this sampling are in Zhu et al.
+
+Two data preparatory subcommands are available: `prep-nbr` and `prep-spectra`.
+
+
+prep-nbr: converts aligned sequences to counts
+
+`prep-nbr` converts a fasta formatted alignment of equal length sequences to the required counts table format.
+
+
+```
+Usage: mm prep-nbr [OPTIONS]
+
+  Export tab delimited counts table from alignment centred on SNP position.
+
+  Output file is written to the same path with just the file suffix changed from
+  fasta to txt.
+
+Options:
+  -a, --align_path TEXT           fasta aligned file centred on mutated
+                                  position.  [required]
+  -o, --output_path TEXT          Path to write data.  [required]
+  -f, --flank_size INTEGER        Number of bases per side to include.
+                                  [required]
+  --direction [AtoC|AtoG|AtoT|CtoA|CtoG|CtoT|GtoA|GtoC|GtoT|TtoA|TtoC|TtoG]
+                                  Mutation direction.  [required]
+  -S, --seed TEXT                 Seed for random number generator (e.g. 17, or
+                                  2015-02-13). Defaults to system time.
+  -R, --randomise                 Randomises the observed data, observed and
+                                  reference counts distributions should match.
+  --step [1|2|3]                  Specifies a "frame" for selecting the random
+                                  base.  [default: 1]
+  -D, --dry_run                   Do a dry run of the analysis without writing
+                                  output.
+  -F, --force_overwrite           Overwrite existing files.
+  --help                          Show this message and exit.
 
-To see the options for the above commands do, for example:
 ```
-$ mutation_analysis --help
-$ aln_to_counts --help
+
+
+
+
+prep-spectra: combining mutation counts from multiple files
+
+This command combines the separate counts tables produced by `prep-nbr` into a larger table suitable for analysis by `ll-spectra`.
+
+
 ```
+Usage: mm prep-spectra [OPTIONS]
+
+  export tab delimited combined counts table by appending the 12 mutation
+  direction tables, adding a new column ``direction``.
+
+Options:
+  -c, --counts_pattern TEXT  glob pattern uniquely identifying all 12 mutation
+                             counts files.
+  -o, --output_path TEXT     Path to write combined_counts data.
+  -s, --strand_symmetric     produces table suitable for strand symmetry test.
+  -p, --split_dir TEXT       path to write individual direction strand symmetric
+                             tables.
+  -D, --dry_run              Do a dry run of the analysis without writing
+                             output.
+  -F, --force_overwrite      Overwrite existing files.
+  --help                     Show this message and exit.
 
-### Counts table format
+```
+
+
 
-The counts table format has a simple structure, illustrated by the
-following:
+#### The output counts table format
+
+The counts table format has a simple structure, illustrated by the following:
 | count | pos0 | pos1 | pos2 | pos3 | mut |
 |--------| ------| ------| ------| ------| ----- |
@@ -87,82 +135,127 @@ following:
 | 6932 | A | G | T | G | R |
 | 10550 | A | A | A | A | R |
-The mutation status **must** be indicated by `R` (reference) and `M`
-(mutated). In this instance, the flank size is 2 and mutation was
-between `pos1` and `pos2`. Tables with this format are generated by
-`aln_to_counts`.
+The mutation status **must** be indicated by `R` (reference) and `M` (mutated). In this instance, the flank size is 2 and the mutation was between `pos1` and `pos2`. Tables with this format are generated by `mm prep-nbr`.
 
-### Sequence file format
+### Statistical analyses of mutations
 
-At present, the code reads in a fasta formatted file where each sequence
-has identical length. The length is an odd number and the mutation
-occurred at the middle base. The application assumes each sequence file
-contains sequences that experienced the same point mutation at this
-central position.
+The log-linear analyses require a counts table from the prep steps. The table contains counts for a specified flank size (maximum of 2 bases, assumed to be on either side of the mutated base). The analyses assume the counts all reflect a specific mutation direction (e.g. AtoG) and that counts from a control distribution are also included.
 
-## Evaluating the effect of neighbours on mutation
+Two subcommands are available: `ll-nbr` and `ll-spectra`.
+
+
+ll-nbr: for detecting the influence of neighbouring bases on mutation
+
+Examines the influence of neighbouring bases, up to fourth order interactions.
+
+
+```
+Usage: mm ll-nbr [OPTIONS]
+
+  log-linear analysis of neighbouring base influence on point mutation
 
-Sample data files are included as `tests/data/counts-CtoT.txt` and
-`tests/data/counts-CtoT-ss.txt` with the latter being appropriate for
-analysis of the occurrence of strand asymmetric neighbour effects.
+  Writes estimated statistics, figures and a run log to the specified directory
+  outpath.
+  See documentation for count table format requirements.
+
+Options:
+  -1, --countsfile TEXT   tab delimited file of counts.
+  -o, --outpath TEXT      Directory path to write data.
+  -2, --countsfile2 TEXT  second group motif counts file.
+  --first_order           Consider only first order effects. Defaults to
+                          considering up to 4th order interactions.
+  -s, --strand_symmetry   single counts file but second group is strand.
+  -g, --group_label TEXT  second group label.
+  -r, --group_ref TEXT    reference group value for results presentation.
+  -v, --verbose           Display more output.
+  -D, --dry_run           Do a dry run of the analysis without writing output.
+  --help                  Show this message and exit.
+
+```
+
+
+
+
+ll-spectra: detect differences in mutation spectra between groups
+
+Contrasts the mutations from specified starting bases between groups.
+
+
+```
+Usage: mm ll-spectra [OPTIONS]
+
+  log-linear analysis of mutation spectra between groups
+
+Options:
+  -1, --countsfile TEXT   tab delimited file of counts.
+  -o, --outpath TEXT      Directory path to write data.
+  -2, --countsfile2 TEXT  second group motif counts file.
+  -s, --strand_symmetry   single counts file but second group is strand.
+  -F, --force_overwrite   Overwrite existing files.
+  -D, --dry_run           Do a dry run of the analysis without writing output.
+  -v, --verbose           Display more output.
+  --help                  Show this message and exit.
+
+```
+
+
+
+Visualisation of mutation motifs, or mutation spectra, in a grid is provided by the `draw-`
+subcommands.
+
+## Evaluating the effect of neighbours on mutation
+
+Sample data files are included as `tests/data/counts-CtoT.txt` and `tests/data/counts-CtoT-ss.txt`; the latter is suitable for analysing strand asymmetric neighbour effects.
 The simple analysis is invoked as:
 ```
-$ mutation_analysis nbr -1 path/to/tests/data/counts-CtoT.txt -o path/for/results/
+$ mm ll-nbr -1 path/to/tests/data/counts-CtoT.txt -o path/for/results/
 ```
-This will write 11 files into the results directory. Files such as
-`1.pdf` and `2.pdf` are the mutation motifs for the first and second
-order effects from the log-linear models. Files ending in `.json`
-contain the raw data used to produce these figures and may be used for
-subsequent analyses, such as generating grids of mutation motifs. The
-summary files summarises the full log-linear modelling hierarchy. The
-`.log` files track the command used to generate these files, including
+This will write 11 files into the results directory. Files such as `1.pdf` and `2.pdf` show the mutation motifs for the first and second order effects from the log-linear models. Files ending in `.json` contain the raw data used to produce these figures and may be used for subsequent analyses, such as generating grids of mutation motifs. The summary files include the full log-linear modelling hierarchy. The `.log` files track the command used to generate these files, including
 the input files and the settings used.
 Testing for strand symmetry (or asymmetry) is done as:
 ```
-$ mutation_analysis nbr -1 path/to/tests/data/counts-CtoT.txt -o path/for/results/ --strand_symmetry
+$ mm ll-nbr -1 path/to/tests/data/counts-CtoT.txt -o path/for/results/ --strand_symmetry
 ```
-Similar output to the above is generated. The difference here is that
-the reference group for display are bases on the `+` strand.
+Similar output to the above is generated. The difference is that the reference group for display consists of bases on the `+` strand.
 
-If comparing between groups, such as chromosomal regions, then there are
-two separate counts files and the second count file is indicated using a
-`-2` command line option.
+When comparing groups, such as chromosomal regions, there are two separate counts files and the second counts file is given using the `-2` command line option.
 ## Testing Full Spectra
-Testing for strand symmetry requires the combined counts file, produced
-using the provided `all_counts` script. A sample such file is included
-as `tests/data/counts-combined.txt`. In this instance, a test of
-consistency in mutation spectra between strands is specified.
+Testing for strand symmetry requires the combined counts file, produced by `mm prep-spectra`. A sample of such a file is included as `tests/data/counts-combined.txt`. In this instance, a test of consistency in mutation spectra between strands is specified.
 This analysis is run as:
 ```
-$ mutation_analysis spectra -1 path/to/tests/data/counts-combined.txt -o another/path/for/results/ --strand_symmetry
+$ mm ll-spectra -1 path/to/tests/data/counts-combined.txt -o another/path/for/results/ --strand_symmetry
 ```
 ## Drawing
-The `mutation_draw` command provides support for drawing either spectra
-or neighbour mutation motif logos. The subcommands are:
-
-- `grid`: draws an arbitrary shaped grid of mutation motifs based on a
-  config file
-- `nbr`: makes motifs for independent or higher order interactions
-- `nbr-matrix`: draws square matrix of sequence logo\'s from neighbour
-  analysis
-- `spectra-grid`: draws logo from mutation spectra analysis
-- `mi`: draws conventional sequence logo, using MI
-- `export-cfg`: exports the sample config files to the nominated path
-
-## Interpreting logo\'s
-
-If the plot is derived from a group comparison, the relative entropy
-terms (which specify the stack height, letter size and orientation) are
-taken from the mutated class belonging to group 1 (which is the counts
-file path assigned to the `-1` option). For example, if you specified
-`-1 file_a.txt -2 file_b.txt`, then large upright letters in the display
-indicate an excess in the mutated class from `file_a.txt` relative to
-`file_b.txt`.
+`mm` provides support for drawing either spectra or neighbour mutation motif logos.
+
+### Interpreting logos
+
+If the plot is derived from a group comparison, the relative entropy terms (which specify the stack height, letter size and orientation) are taken from the mutated class belonging to group 1 (which is the counts file path assigned to the `-1` option). For example, if you specified `-1 file_a.txt -2 file_b.txt`, then large upright letters in the display indicate an excess in the mutated class from `file_a.txt` relative to `file_b.txt`.
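The counts table layout documented in this patch (tab delimited, with a `count` column, `pos0`...`posN` columns and a `mut` column restricted to `R` and `M`) can be sanity-checked before running `mm ll-nbr`. The following is a minimal sketch only and is not part of `mutation_motif`: `check_counts_table` is a hypothetical helper, the two `R` rows are taken from the README table, and the two `M` rows are made-up values for illustration.

```python
import csv
import io
from collections import Counter


def check_counts_table(handle):
    """Validate a tab delimited counts table.

    Hypothetical helper, not a mutation_motif function.
    Returns (flank_size, total count per mutation status).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    fields = reader.fieldnames or []
    pos_cols = [name for name in fields if name.startswith("pos")]
    if "count" not in fields or "mut" not in fields:
        raise ValueError("expected 'count' and 'mut' columns")
    # The mutated base is excluded, so there is one pos column per flanking base.
    if not pos_cols or len(pos_cols) % 2:
        raise ValueError("expected an even, non-zero number of pos columns")
    totals = Counter()
    for row in reader:
        if row["mut"] not in ("R", "M"):
            raise ValueError(f"mutation status must be R or M, got {row['mut']!r}")
        totals[row["mut"]] += int(row["count"])
    return len(pos_cols) // 2, dict(totals)


# Toy table: R rows come from the README example, M rows are invented.
example = "\n".join(
    [
        "count\tpos0\tpos1\tpos2\tpos3\tmut",
        "6932\tA\tG\tT\tG\tR",
        "10550\tA\tA\tA\tA\tR",
        "5\tA\tG\tT\tG\tM",
        "7\tA\tA\tA\tA\tM",
    ]
)
flank_size, totals = check_counts_table(io.StringIO(example))
print(flank_size, totals)
```

With four `pos` columns the inferred flank size is 2, matching the README statement that the mutation lies between `pos1` and `pos2`. For a real file, pass an open file handle in place of the `io.StringIO` object.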