DOC: updated readme to new api

HuttleyLab · Sep 22, 2024 · 770c08a
1 parent d3b3394
commit 770c08a
Showing 1 changed file with 196 additions and 103 deletions.
diff --git a/README.md b/README.md
@@ -9,73 +9,121 @@
 
 # Mutation Motif
 
+`mutation_motif` provides capabilities for analysis of point mutation counts data. It includes commands for preparing sequence data, log-linear analyses of the resulting counts and sequence logo style visualisations. Two different analysis approaches are supported:
 
-This library provides capabilities for analysis of mutation properties.
-Two different analysis approaches are supported: (1) log-linear analysis
-of neighbourhood base influences on mutation coupled with a sequence
-logo like representation of influences (illustrated above); (2)
-log-linear analysis of mutation spectra, the relative proportions of
-different mutation directions from a starting base. A logo-like
-visualisation of the latter is also supported.
+1. log-linear analysis of neighbourhood base influences on mutation coupled with a sequence logo like representation of influences (illustrated above)
+2. log-linear analysis of mutation spectra, the relative proportions of different mutation directions from a starting base. A logo-like visualisation of the latter is also supported.
 
-The models and applications of them are described in [Zhu, Neeman, Yap
-and Huttley 2017 Statistical methods for identifying sequence motifs
-affecting point
-mutations](https://www.ncbi.nlm.nih.gov/pubmed/27974498).
+The models and their applications are described in [Zhu, Neeman, Yap and Huttley 2017 Statistical methods for identifying sequence motifs affecting point mutations](https://www.ncbi.nlm.nih.gov/pubmed/27974498).
 
 ## Installation
 
-We recommend installation of dependencies via conda since you also need to have the python bindings to [R
-installed](https://rpy2.readthedocs.io/en/latest/overview.html#installation).
-Follow the [miniconda install
-instructions](https://docs.conda.io/en/latest/miniconda.html) for your
-platform.
-
-Having installed miniconda, the following command creates a new conda
-environment `myenv` into which we install the essential requirements
-using conda, then use pip to install `mutation_motif`.
+Install with pip:
 
 ```
-$ conda env create -n myenv -c python=3.11
-$ conda activate myenv
-$ conda install -c conda-forge rpy2
-$ python -m pip install "mutation_motif @ git+https://github.com/HuttleyLab/MutationMotif.git@develop"
+$ pip install mutation_motif
 ```
 
-**Note:** The above installs the developer version. To use the release, change
-`develop` to `master`.
+## The commands
 
-## Usage
+The primary tool is installed as a command line executable, `mm`.
 
-The primary tool is installed as a command line executable,
-`mutation_analysis`. It requires a counts table where the table contains
-counts for a specified flank size (maximum of 2 bases, presumed to be
-either side of the mutated base). It assumes the counts all reflect a
-specific mutation direction (e.g. A to G) and that counts from a control
-distribution are also included. Two subcommands are available: `nbr` and
-`spectra`. The first examines the influence of neighbouring bases up to
-fourth order interactions. The latter contrasts the mutations from
-specified starting bases between groups.
+### Preparing data for analyses
 
-Data processing command line tools are `aln_to_counts` and `all_counts`.
-The first converts a fasta formatted alignment of equal length sequences
-to the required counts table format. The latter combines the separate
-counts tables into a larger table suitable for spectra analyses.
+#### The input sequence file format
 
-Visualisation of mutation motifs, or mutation spectra, in a grid is
-provided by `mutation_draw` with `nbr_grid` and `spectra_grid`
-subcommands.
+At present, the code reads in a fasta formatted file where each sequence has identical length. The length must be an odd number, with the mutation at the middle base. `mm` assumes each sequence file contains sequences that experienced the same point mutation at this central position, e.g. `seqs-CtoT.fasta` contains only sequences that have a C to T mutation at the central position. The sequence flanking the mutated base is used to derive a paired "unmutated" reference. The details of this sampling are in Zhu et al.
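
The layout described above can be checked before running `mm`. The following is a minimal sketch, not part of `mutation_motif`: the helper names `read_fasta` and `check_centred` and the example records are hypothetical. It only confirms that every sequence has the same odd length and reports the index of the middle base.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from a fasta formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def check_centred(handle):
    """Return (length, middle_index) or raise ValueError if the layout is wrong."""
    lengths = {len(seq) for _, seq in read_fasta(handle)}
    if len(lengths) != 1:
        raise ValueError(f"expected one sequence length, found {sorted(lengths)}")
    length = lengths.pop()
    if length % 2 == 0:
        raise ValueError(f"sequence length {length} is even, so there is no middle base")
    return length, length // 2


# Hypothetical records: 5 bases each, so the mutated position is index 2.
example = StringIO(">s1\nACCGT\n>s2\nTTCAA\n")
length, middle = check_centred(example)
print(length, middle)
```

With a flank size of 2, the bases at indices `middle - 2` to `middle + 2` are the ones that would contribute to the counts.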
+
+Two data preparatory subcommands are available: `prep-nbr` and `prep-spectra`. 
+
+<details>
+<summary>prep-nbr: converts aligned sequences to counts</summary>
+
+`prep-nbr` converts a fasta formatted alignment of equal length sequences to the required counts table format.
+
+<!-- [[[cog
+import cog
+from mutation_motif.cli import main
+from click.testing import CliRunner
+runner = CliRunner()
+result = runner.invoke(main, ["prep-nbr"])
+help = result.output.replace("Usage: main", "Usage: mm")
+cog.out(
+    "```\n{}\n```".format(help)
+)
+]]] -->
+```
+Usage: mm prep-nbr [OPTIONS]
+
+  Export tab delimited counts table from alignment centred on SNP position.
+
+  Output file is written to the same path with just the file suffix changed from
+  fasta to txt.
+
+Options:
+  -a, --align_path TEXT           fasta aligned file centred on mutated
+                                  position.  [required]
+  -o, --output_path TEXT          Path to write data.  [required]
+  -f, --flank_size INTEGER        Number of bases per side to include.
+                                  [required]
+  --direction [AtoC|AtoG|AtoT|CtoA|CtoG|CtoT|GtoA|GtoC|GtoT|TtoA|TtoC|TtoG]
+                                  Mutation direction.  [required]
+  -S, --seed TEXT                 Seed for random number generator (e.g. 17, or
+                                  2015-02-13). Defaults to system time.
+  -R, --randomise                 Randomises the observed data, observed and
+                                  reference counts distributions should match.
+  --step [1|2|3]                  Specifies a "frame" for selecting the random
+                                  base.  [default: 1]
+  -D, --dry_run                   Do a dry run of the analysis without writing
+                                  output.
+  -F, --force_overwrite           Overwrite existing files.
+  --help                          Show this message and exit.
 
-To see the options for the above commands do, for example:
 ```
-$ mutation_analysis --help
-$ aln_to_counts --help
+<!-- [[[end]]] -->
+</details>
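
Since each input file holds a single mutation direction, `prep-nbr` is typically run once per direction. The sketch below only builds and prints the command lines, using the options listed in the help text above. The `data/` and `counts/` paths are hypothetical, the file naming follows the `seqs-CtoT.fasta` example, and passing a directory to `-o` is an assumption that should be checked against `mm prep-nbr --help`.

```python
import shlex

BASES = "ACGT"
# The 12 values accepted by --direction, e.g. "CtoT".
DIRECTIONS = [f"{a}to{b}" for a in BASES for b in BASES if a != b]


def prep_nbr_command(direction, flank_size=2, in_dir="data", out_dir="counts"):
    """Return the argument list for one `mm prep-nbr` run."""
    return [
        "mm", "prep-nbr",
        "-a", f"{in_dir}/seqs-{direction}.fasta",  # hypothetical input layout
        "-o", out_dir,  # assumed to be a directory
        "-f", str(flank_size),
        "--direction", direction,
    ]


# Print the commands rather than running them, so they can be reviewed first.
for direction in DIRECTIONS:
    print(shlex.join(prep_nbr_command(direction)))
```

Each printed line can be pasted into a shell, or the argument lists can be passed to `subprocess.run` once the paths are confirmed.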
+
+<details>
+<summary>prep-spectra: combining mutation counts from multiple files</summary>
+
+This command combines the separate counts tables of `prep-nbr` into a larger table suitable for analyses by `ll-spectra`.
+
+<!-- [[[cog
+import cog
+from mutation_motif.cli import main
+from click.testing import CliRunner
+runner = CliRunner()
+result = runner.invoke(main, ["prep-spectra"])
+help = result.output.replace("Usage: main", "Usage: mm")
+cog.out(
+    "```\n{}\n```".format(help)
+)
+]]] -->
 ```
+Usage: mm prep-spectra [OPTIONS]
+
+  export tab delimited combined counts table by appending the 12 mutation
+  direction tables, adding a new column ``direction``.
+
+Options:
+  -c, --counts_pattern TEXT  glob pattern uniquely identifying all 12 mutation
+                             counts files.
+  -o, --output_path TEXT     Path to write combined_counts data.
+  -s, --strand_symmetric     produces table suitable for strand symmetry test.
+  -p, --split_dir TEXT       path to write individual direction strand symmetric
+                             tables.
+  -D, --dry_run              Do a dry run of the analysis without writing
+                             output.
+  -F, --force_overwrite      Overwrite existing files.
+  --help                     Show this message and exit.
 
-### Counts table format
+```
+<!-- [[[end]]] -->
+</details>
 
-The counts table format has a simple structure, illustrated by the
-following:
+#### The output counts table format
+
+The counts table format has a simple structure, illustrated by the following:
 
  | count  | pos0  | pos1  | pos2  | pos3  | mut |
  |--------| ------| ------| ------| ------| -----  |
@@ -87,82 +135,127 @@ following:
  | 6932   | A     | G     | T     | G     | R |
  | 10550  | A     | A     | A     | A     | R |
 
-The mutation status **must** be indicated by `R` (reference) and `M`
-(mutated). In this instance, the flank size is 2 and mutation was
-between `pos1` and `pos2`. Tables with this format are generated by
-`aln_to_counts`.
+The mutation status **must** be indicated by `R` (reference) and `M` (mutated). In this instance, the flank size is 2 and mutation was between `pos1` and `pos2`. Tables with this format are generated by `mm prep-nbr`.
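
A counts table can be sanity checked with the standard library alone. This is a minimal sketch, assuming the file on disk is tab delimited with the column names shown in the table above; `totals_by_status` is a hypothetical helper, not part of `mutation_motif`. The two `R` rows come from the table above and the `M` rows are invented for illustration.

```python
import csv
from collections import Counter
from io import StringIO


def totals_by_status(handle):
    """Sum the count column separately for reference (R) and mutated (M) rows."""
    totals = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        status = row["mut"]
        if status not in ("R", "M"):
            raise ValueError(f"unexpected mutation status {status!r}")
        totals[status] += int(row["count"])
    return totals


# The M rows below are made up; only the R rows appear in the README table.
table = StringIO(
    "count\tpos0\tpos1\tpos2\tpos3\tmut\n"
    "6932\tA\tG\tT\tG\tR\n"
    "10550\tA\tA\tA\tA\tR\n"
    "120\tA\tG\tT\tG\tM\n"
    "80\tA\tA\tA\tA\tM\n"
)
totals = totals_by_status(table)
print(totals["R"], totals["M"])
```

Any status other than `R` or `M` raises a `ValueError`, which mirrors the requirement stated above.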
 
-### Sequence file format
+### Statistical analyses of mutations
 
-At present, the code reads in a fasta formatted file where each sequence
-has identical length. The length is an odd number and the mutation
-occurred at the middle base. The application assumes each sequence file
-contains sequences that experienced the same point mutation at this
-central position.
+The log-linear analyses require a counts table from the prep steps. The table contains counts for a specified flank size (maximum of 2 bases, assumed to be either side of the mutated base). The analyses assume the counts all reflect a specific mutation direction (e.g. AtoG) and that counts from a control distribution are also included.
 
-## Evaluating the effect of neighbours on mutation
+Two subcommands are available: `ll-nbr` and `ll-spectra`. 
+
+<details>
+<summary>ll-nbr: for detecting the influence of neighbouring bases on mutation</summary>
+
+`ll-nbr` examines the influence of neighbouring bases up to fourth order interactions.
+
+<!-- [[[cog
+import cog
+from mutation_motif.cli import main
+from click.testing import CliRunner
+runner = CliRunner()
+result = runner.invoke(main, ["ll-nbr"])
+help = result.output.replace("Usage: main", "Usage: mm")
+cog.out(
+    "```\n{}\n```".format(help)
+)
+]]] -->
+```
+Usage: mm ll-nbr [OPTIONS]
+
+  log-linear analysis of neighbouring base influence on point mutation
 
-Sample data files are included as `tests/data/counts-CtoT.txt` and
-`tests/data/counts-CtoT-ss.txt` with the latter being appropriate for
-analysis of the occurrence of strand asymmetric neighbour effects.
+  Writes estimated statistics, figures and a run log to the specified directory
+  outpath.
 
+  See documentation for count table format requirements.
+
+Options:
+  -1, --countsfile TEXT   tab delimited file of counts.
+  -o, --outpath TEXT      Directory path to write data.
+  -2, --countsfile2 TEXT  second group motif counts file.
+  --first_order           Consider only first order effects. Defaults to
+                          considering up to 4th order interactions.
+  -s, --strand_symmetry   single counts file but second group is strand.
+  -g, --group_label TEXT  second group label.
+  -r, --group_ref TEXT    reference group value for results presentation.
+  -v, --verbose           Display more output.
+  -D, --dry_run           Do a dry run of the analysis without writing output.
+  --help                  Show this message and exit.
+
+```
+<!-- [[[end]]] -->
+</details>
+
+<details>
+<summary>ll-spectra: detect differences in mutation spectra between groups</summary>
+
+Contrasts the mutations from specified starting bases between groups.
+
+<!-- [[[cog
+import cog
+from mutation_motif.cli import main
+from click.testing import CliRunner
+runner = CliRunner()
+result = runner.invoke(main, ["ll-spectra"])
+help = result.output.replace("Usage: main", "Usage: mm")
+cog.out(
+    "```\n{}\n```".format(help)
+)
+]]] -->
+```
+Usage: mm ll-spectra [OPTIONS]
+
+  log-linear analysis of mutation spectra between groups
+
+Options:
+  -1, --countsfile TEXT   tab delimited file of counts.
+  -o, --outpath TEXT      Directory path to write data.
+  -2, --countsfile2 TEXT  second group motif counts file.
+  -s, --strand_symmetry   single counts file but second group is strand.
+  -F, --force_overwrite   Overwrite existing files.
+  -D, --dry_run           Do a dry run of the analysis without writing output.
+  -v, --verbose           Display more output.
+  --help                  Show this message and exit.
+
+```
+<!-- [[[end]]] -->
+</details>
+
+Visualisation of mutation motifs, or mutation spectra, in a grid is provided by the `draw-` subcommands.
+
+## Evaluating the effect of neighbours on mutation
+
+Sample data files are included as `tests/data/counts-CtoT.txt` and `tests/data/counts-CtoT-ss.txt` with the latter being appropriate for analysis of the occurrence of strand asymmetric neighbour effects.
 
 The simple analysis is invoked as:
 ```
-$ mutation_analysis nbr -1 path/to/tests/data/counts-CtoT.txt -o path/for/results/
+$ mm ll-nbr -1 path/to/tests/data/counts-CtoT.txt -o path/for/results/
 ```
 
-This will write 11 files into the results directory. Files such as
-`1.pdf` and `2.pdf` are the mutation motifs for the first and second
-order effects from the log-linear models. Files ending in `.json`
-contain the raw data used to produce these figures and may be used for
-subsequent analyses, such as generating grids of mutation motifs. The
-summary files summarises the full log-linear modelling hierarchy. The
-`.log` files track the command used to generate these files, including
+This will write 11 files into the results directory. Files such as `1.pdf` and `2.pdf` are the mutation motifs for the first and second order effects from the log-linear models. Files ending in `.json` contain the raw data used to produce these figures and may be used for subsequent analyses, such as generating grids of mutation motifs. The summary files include the full log-linear modelling hierarchy. The `.log` files track the command used to generate these files, including the input files and the settings used.
 
 Testing for strand symmetry (or asymmetry) is done as:
 ```
-$ mutation_analysis nbr -1 path/to/tests/data/counts-CtoT.txt -o path/for/results/ --strand_symmetry
+$ mm ll-nbr -1 path/to/tests/data/counts-CtoT.txt -o path/for/results/ --strand_symmetry
 ```
-Similar output to the above is generated. The difference here is that
-the reference group for display are bases on the `+` strand.
+Similar output to the above is generated. The difference here is that the reference group for display is the bases on the `+` strand.
 
-If comparing between groups, such as chromosomal regions, then there are
-two separate counts files and the second count file is indicated using a
-`-2` command line option.
+If comparing between groups, such as chromosomal regions, then there are two separate counts files and the second count file is indicated using a `-2` command line option.
 
 ## Testing Full Spectra
 
-Testing for strand symmetry requires the combined counts file, produced
-using the provided `all_counts` script. A sample such file is included
-as `tests/data/counts-combined.txt`. In this instance, a test of
-consistency in mutation spectra between strands is specified.
+Testing for strand symmetry requires the combined counts file, produced using `mm prep-spectra`. A sample file is included as `tests/data/counts-combined.txt`. In this instance, a test of consistency in mutation spectra between strands is specified.
 
 This analysis is run as:
 ```
-$ mutation_analysis spectra -1 path/to/tests/data/counts-combined.txt -o another/path/for/results/ --strand_symmetry
+$ mm ll-spectra -1 path/to/tests/data/counts-combined.txt -o another/path/for/results/ --strand_symmetry
 ```
 ## Drawing
 
-The `mutation_draw` command provides support for drawing either spectra
-or neighbour mutation motif logos. The subcommands are:
-
--   `grid`: draws an arbitrary shaped grid of mutation motifs based on a
-    config file
--   `nbr`: makes motifs for independent or higher order interactions
--   `nbr-matrix`: draws square matrix of sequence logo\'s from neighbour
-    analysis
--   `spectra-grid`: draws logo from mutation spectra analysis
--   `mi`: draws conventional sequence logo, using MI
--   `export-cfg`: exports the sample config files to the nominated path
-
-## Interpreting logo\'s
-
-If the plot is derived from a group comparison, the relative entropy
-terms (which specify the stack height, letter size and orientation) are
-taken from the mutated class belonging to group 1 (which is the counts
-file path assigned to the `-1` option). For example, if you specified
-`-1 file_a.txt -2 file_b.txt`, then large upright letters in the display
-indicate an excess in the mutated class from `file_a.txt` relative to
-`file_b.txt`.
+`mm` provides support for drawing either spectra or neighbour mutation motif logos.
+
+### Interpreting logos
+
+If the plot is derived from a group comparison, the relative entropy terms (which specify the stack height, letter size and orientation) are taken from the mutated class belonging to group 1 (which is the counts file path assigned to the `-1` option). For example, if you specified `-1 file_a.txt -2 file_b.txt`, then large upright letters in the display indicate an excess in the mutated class from `file_a.txt` relative to `file_b.txt`.