diff --git a/README.md b/README.md
index 1bb17db..27c1f6f 100644
--- a/README.md
+++ b/README.md
@@ -9,73 +9,121 @@
# Mutation Motif
+`mutation_motif` provides capabilities for analysis of point mutation counts data. It includes commands for preparing sequence data, log-linear analyses of the resulting counts and sequence logo style visualisations. Two different analysis approaches are supported:
-This library provides capabilities for analysis of mutation properties.
-Two different analysis approaches are supported: (1) log-linear analysis
-of neighbourhood base influences on mutation coupled with a sequence
-logo like representation of influences (illustrated above); (2)
-log-linear analysis of mutation spectra, the relative proportions of
-different mutation directions from a starting base. A logo-like
-visualisation of the latter is also supported.
+1. log-linear analysis of neighbourhood base influences on mutation coupled with a sequence logo like representation of influences (illustrated above)
+2. log-linear analysis of mutation spectra, the relative proportions of different mutation directions from a starting base. A logo-like visualisation of the latter is also supported.
-The models and applications of them are described in [Zhu, Neeman, Yap
-and Huttley 2017 Statistical methods for identifying sequence motifs
-affecting point
-mutations](https://www.ncbi.nlm.nih.gov/pubmed/27974498).
+The models and their applications are described in [Zhu, Neeman, Yap and Huttley 2017 Statistical methods for identifying sequence motifs affecting point mutations](https://www.ncbi.nlm.nih.gov/pubmed/27974498).
## Installation
-We recommend installation of dependencies via conda since you also need to have the python bindings to [R
-installed](https://rpy2.readthedocs.io/en/latest/overview.html#installation).
-Follow the [miniconda install
-instructions](https://docs.conda.io/en/latest/miniconda.html) for your
-platform.
-
-Having installed miniconda, the following command creates a new conda
-environment `myenv` into which we install the essential requirements
-using conda, then use pip to install `mutation_motif`.
+Install with pip:
```
-$ conda env create -n myenv -c python=3.11
-$ conda activate myenv
-$ conda install -c conda-forge rpy2
-$ python -m pip install "mutation_motif @ git+https://github.com/HuttleyLab/MutationMotif.git@develop"
+$ pip install mutation_motif
```
-**Note:** The above installs the developer version. To use the release, change
-`develop` to `master`.
+## The commands
-## Usage
+The primary tool is installed as a command line executable, `mm`.
-The primary tool is installed as a command line executable,
-`mutation_analysis`. It requires a counts table where the table contains
-counts for a specified flank size (maximum of 2 bases, presumed to be
-either side of the mutated base). It assumes the counts all reflect a
-specific mutation direction (e.g. A to G) and that counts from a control
-distribution are also included. Two subcommands are available: `nbr` and
-`spectra`. The first examines the influence of neighbouring bases up to
-fourth order interactions. The latter contrasts the mutations from
-specified starting bases between groups.
+### Preparing data for analyses
-Data processing command line tools are `aln_to_counts` and `all_counts`.
-The first converts a fasta formatted alignment of equal length sequences
-to the required counts table format. The latter combines the separate
-counts tables into a larger table suitable for spectra analyses.
+#### The input sequence file format
-Visualisation of mutation motifs, or mutation spectra, in a grid is
-provided by `mutation_draw` with `nbr_grid` and `spectra_grid`
-subcommands.
+At present, the code reads in a fasta formatted file where each sequence has identical length. The length is an odd number and the mutation occurred at the middle base. `mm` assumes each sequence file contains sequences that experienced the same point mutation at this central position, e.g. `seqs-CtoT.fasta` contains only sequences that have a C to T mutation at the central position. The sequence flanking the mutated base is used to derive a paired "unmutated" reference. The details of this sampling are in Zhu et al.
+
+Two data preparation subcommands are available: `prep-nbr` and `prep-spectra`.
+
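The input constraints described above (equal length, odd length, mutation at the middle base) can be checked before running the prep commands. The following is a minimal sketch using only the Python standard library. The helper names `read_fasta` and `check_centred_seqs` are hypothetical and are not part of `mutation_motif`.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (label, sequence) pairs from a fasta formatted file handle."""
    label, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if label is not None:
        yield label, "".join(chunks)


def check_centred_seqs(handle):
    """Return (length, centre_index, central_bases) for a fasta file.

    Raises ValueError if the sequences differ in length or the shared
    length is not an odd number.
    """
    seqs = [seq for _, seq in read_fasta(handle)]
    lengths = {len(seq) for seq in seqs}
    if len(lengths) != 1:
        raise ValueError(f"expected one shared length, found {sorted(lengths)}")
    length = lengths.pop()
    if length % 2 == 0:
        raise ValueError(f"length {length} is even, so there is no middle base")
    centre = length // 2  # 0-based index of the mutated position
    central_bases = {seq[centre] for seq in seqs}
    return length, centre, central_bases


# Two made-up sequences, used only for illustration
demo = StringIO(">s1\nACGTA\n>s2\nTTGCA\n")
length, centre, bases = check_centred_seqs(demo)
print(length, centre, sorted(bases))  # prints: 5 2 ['G']
```

A file that mixes several central bases is probably not restricted to a single mutation direction, so the returned set is a quick sanity check.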
+
+prep-nbr: converts aligned sequences to counts
+
+`prep-nbr` converts a fasta formatted alignment of equal length sequences to the required counts table format.
+
+
+```
+Usage: mm prep-nbr [OPTIONS]
+
+ Export tab delimited counts table from alignment centred on SNP position.
+
+ Output file is written to the same path with just the file suffix changed from
+ fasta to txt.
+
+Options:
+ -a, --align_path TEXT fasta aligned file centred on mutated
+ position. [required]
+ -o, --output_path TEXT Path to write data. [required]
+ -f, --flank_size INTEGER Number of bases per side to include.
+ [required]
+ --direction [AtoC|AtoG|AtoT|CtoA|CtoG|CtoT|GtoA|GtoC|GtoT|TtoA|TtoC|TtoG]
+ Mutation direction. [required]
+ -S, --seed TEXT Seed for random number generator (e.g. 17, or
+ 2015-02-13). Defaults to system time.
+ -R, --randomise Randomises the observed data, observed and
+ reference counts distributions should match.
+ --step [1|2|3] Specifies a "frame" for selecting the random
+ base. [default: 1]
+ -D, --dry_run Do a dry run of the analysis without writing
+ output.
+ -F, --force_overwrite Overwrite existing files.
+ --help Show this message and exit.
-To see the options for the above commands do, for example:
```
-$ mutation_analysis --help
-$ aln_to_counts --help
+
+
+
+
+prep-spectra: combining mutation counts from multiple files
+
+This command combines the separate counts tables of `prep-nbr` into a larger table suitable for analyses by `ll-spectra`.
+
+
```
+Usage: mm prep-spectra [OPTIONS]
+
+ export tab delimited combined counts table by appending the 12 mutation
+ direction tables, adding a new column ``direction``.
+
+Options:
+ -c, --counts_pattern TEXT glob pattern uniquely identifying all 12 mutation
+ counts files.
+ -o, --output_path TEXT Path to write combined_counts data.
+ -s, --strand_symmetric produces table suitable for strand symmetry test.
+ -p, --split_dir TEXT path to write individual direction strand symmetric
+ tables.
+ -D, --dry_run Do a dry run of the analysis without writing
+ output.
+ -F, --force_overwrite Overwrite existing files.
+ --help Show this message and exit.
-### Counts table format
+```
+
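The combining step described in the help text (appending the per-direction tables and adding a `direction` column) can be pictured with a small Python sketch. This illustrates the idea only and is not the implementation used by `mm prep-spectra`. The function name `combine_counts` and the counts in the demo are made up.

```python
def combine_counts(tables):
    """Append per-direction tables into one, tagging rows with their direction.

    tables maps a direction label (e.g. "CtoT") to a list of row dicts.
    """
    combined = []
    for direction, rows in tables.items():
        for row in rows:
            # copy the row so the input tables are left untouched
            combined.append({**row, "direction": direction})
    return combined


# Hypothetical counts for two of the 12 mutation directions
tables = {
    "CtoT": [{"count": 10, "pos0": "A", "pos1": "G", "mut": "M"}],
    "AtoG": [{"count": 4, "pos0": "C", "pos1": "T", "mut": "M"}],
}
for row in combine_counts(tables):
    print(row["direction"], row["count"])
```

The real command reads the tables from files matched by `--counts_pattern`. The in-memory dictionaries here only keep the sketch self-contained.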
+
-The counts table format has a simple structure, illustrated by the
-following:
+#### The output counts table format
+
+The counts table format has a simple structure, illustrated by the following:
| count | pos0 | pos1 | pos2 | pos3 | mut |
|--------| ------| ------| ------| ------| ----- |
@@ -87,82 +135,127 @@ following:
| 6932 | A | G | T | G | R |
| 10550 | A | A | A | A | R |
-The mutation status **must** be indicated by `R` (reference) and `M`
-(mutated). In this instance, the flank size is 2 and mutation was
-between `pos1` and `pos2`. Tables with this format are generated by
-`aln_to_counts`.
+The mutation status **must** be indicated by `R` (reference) and `M` (mutated). In this instance, the flank size is 2 and the mutated base lies between `pos1` and `pos2`. Tables with this format are generated by `mm prep-nbr`.
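A counts table in this layout can be read and checked with the Python standard library. The sketch below assumes the file is tab delimited with the header names shown in the table above. The function names are hypothetical, and the `M` row in the demo is made up for illustration.

```python
import csv
from io import StringIO


def load_counts(handle):
    """Read a tab delimited counts table, checking the mut column."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["mut"] not in ("R", "M"):
            raise ValueError(f"mutation status must be R or M, got {row['mut']!r}")
        row["count"] = int(row["count"])
        rows.append(row)
    return rows


def totals_by_status(rows):
    """Sum the counts separately for reference (R) and mutated (M) rows."""
    totals = {"R": 0, "M": 0}
    for row in rows:
        totals[row["mut"]] += row["count"]
    return totals


demo = StringIO(
    "count\tpos0\tpos1\tpos2\tpos3\tmut\n"
    "6932\tA\tG\tT\tG\tR\n"
    "10550\tA\tA\tA\tA\tR\n"
    "120\tA\tG\tT\tG\tM\n"  # made-up mutated row
)
print(totals_by_status(load_counts(demo)))  # prints: {'R': 17482, 'M': 120}
```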
-### Sequence file format
+### Statistical analyses of mutations
-At present, the code reads in a fasta formatted file where each sequence
-has identical length. The length is an odd number and the mutation
-occurred at the middle base. The application assumes each sequence file
-contains sequences that experienced the same point mutation at this
-central position.
+The log-linear analyses require a counts table from the prep steps. The table contains counts for a specified flank size (maximum of 2 bases, assumed to be either side of the mutated base). It assumes the counts all reflect a specific mutation direction (e.g. AtoG) and that counts from a control distribution are also included.
-## Evaluating the effect of neighbours on mutation
+Two subcommands are available: `ll-nbr` and `ll-spectra`.
+
+
+ll-nbr: detect the influence of neighbouring bases on mutation
+
+`ll-nbr` examines the influence of neighbouring bases up to fourth order interactions.
+
+
+```
+Usage: mm ll-nbr [OPTIONS]
+
+ log-linear analysis of neighbouring base influence on point mutation
-Sample data files are included as `tests/data/counts-CtoT.txt` and
-`tests/data/counts-CtoT-ss.txt` with the latter being appropriate for
-analysis of the occurrence of strand asymmetric neighbour effects.
+ Writes estimated statistics, figures and a run log to the specified directory
+ outpath.
+ See documentation for count table format requirements.
+
+Options:
+ -1, --countsfile TEXT tab delimited file of counts.
+ -o, --outpath TEXT Directory path to write data.
+ -2, --countsfile2 TEXT second group motif counts file.
+ --first_order Consider only first order effects. Defaults to
+ considering up to 4th order interactions.
+ -s, --strand_symmetry single counts file but second group is strand.
+ -g, --group_label TEXT second group label.
+ -r, --group_ref TEXT reference group value for results presentation.
+ -v, --verbose Display more output.
+ -D, --dry_run Do a dry run of the analysis without writing output.
+ --help Show this message and exit.
+
+```
+
+
+
+
+ll-spectra: detect differences in mutation spectra between groups
+
+Contrasts the mutations from specified starting bases between groups.
+
+
+```
+Usage: mm ll-spectra [OPTIONS]
+
+ log-linear analysis of mutation spectra between groups
+
+Options:
+ -1, --countsfile TEXT tab delimited file of counts.
+ -o, --outpath TEXT Directory path to write data.
+ -2, --countsfile2 TEXT second group motif counts file.
+ -s, --strand_symmetry single counts file but second group is strand.
+ -F, --force_overwrite Overwrite existing files.
+ -D, --dry_run Do a dry run of the analysis without writing output.
+ -v, --verbose Display more output.
+ --help Show this message and exit.
+
+```
+
+
+
+Visualisation of mutation motifs, or mutation spectra, in a grid is provided by the `draw-`
+subcommands.
+
+## Evaluating the effect of neighbours on mutation
+
+Sample data files are included as `tests/data/counts-CtoT.txt` and `tests/data/counts-CtoT-ss.txt` with the latter being appropriate for analysis of the occurrence of strand asymmetric neighbour effects.
The simple analysis is invoked as:
```
-$ mutation_analysis nbr -1 path/to/tests/data/counts-CtoT.txt -o path/for/results/
+$ mm ll-nbr -1 path/to/tests/data/counts-CtoT.txt -o path/for/results/
```
-This will write 11 files into the results directory. Files such as
-`1.pdf` and `2.pdf` are the mutation motifs for the first and second
-order effects from the log-linear models. Files ending in `.json`
-contain the raw data used to produce these figures and may be used for
-subsequent analyses, such as generating grids of mutation motifs. The
-summary files summarises the full log-linear modelling hierarchy. The
-`.log` files track the command used to generate these files, including
+This will write 11 files into the results directory. Files such as `1.pdf` and `2.pdf` are the mutation motifs for the first and second order effects from the log-linear models. Files ending in `.json` contain the raw data used to produce these figures and may be used for subsequent analyses, such as generating grids of mutation motifs. The summary files include the full log-linear modelling hierarchy. The `.log` files track the command used to generate these files, including
the input files and the settings used.
Testing for strand symmetry (or asymmetry) is done as:
```
-$ mutation_analysis nbr -1 path/to/tests/data/counts-CtoT.txt -o path/for/results/ --strand_symmetry
+$ mm ll-nbr -1 path/to/tests/data/counts-CtoT.txt -o path/for/results/ --strand_symmetry
```
-Similar output to the above is generated. The difference here is that
-the reference group for display are bases on the `+` strand.
+Similar output to the above is generated. The difference here is that the reference group for display is the bases on the `+` strand.
-If comparing between groups, such as chromosomal regions, then there are
-two separate counts files and the second count file is indicated using a
-`-2` command line option.
+If comparing between groups, such as chromosomal regions, then there are two separate counts files and the second count file is indicated using a `-2` command line option.
## Testing Full Spectra
-Testing for strand symmetry requires the combined counts file, produced
-using the provided `all_counts` script. A sample such file is included
-as `tests/data/counts-combined.txt`. In this instance, a test of
-consistency in mutation spectra between strands is specified.
+Testing for strand symmetry requires the combined counts file, produced using `mm prep-spectra`. A sample file of this kind is included as `tests/data/counts-combined.txt`. In this instance, a test of consistency in mutation spectra between strands is specified.
This analysis is run as:
```
-$ mutation_analysis spectra -1 path/to/tests/data/counts-combined.txt -o another/path/for/results/ --strand_symmetry
+$ mm ll-spectra -1 path/to/tests/data/counts-combined.txt -o another/path/for/results/ --strand_symmetry
```
## Drawing
-The `mutation_draw` command provides support for drawing either spectra
-or neighbour mutation motif logos. The subcommands are:
-
-- `grid`: draws an arbitrary shaped grid of mutation motifs based on a
- config file
-- `nbr`: makes motifs for independent or higher order interactions
-- `nbr-matrix`: draws square matrix of sequence logo\'s from neighbour
- analysis
-- `spectra-grid`: draws logo from mutation spectra analysis
-- `mi`: draws conventional sequence logo, using MI
-- `export-cfg`: exports the sample config files to the nominated path
-
-## Interpreting logo\'s
-
-If the plot is derived from a group comparison, the relative entropy
-terms (which specify the stack height, letter size and orientation) are
-taken from the mutated class belonging to group 1 (which is the counts
-file path assigned to the `-1` option). For example, if you specified
-`-1 file_a.txt -2 file_b.txt`, then large upright letters in the display
-indicate an excess in the mutated class from `file_a.txt` relative to
-`file_b.txt`.
+`mm` provides support for drawing either spectra or neighbour mutation motif logos.
+
+### Interpreting logos
+
+If the plot is derived from a group comparison, the relative entropy terms (which specify the stack height, letter size and orientation) are taken from the mutated class belonging to group 1 (which is the counts file path assigned to the `-1` option). For example, if you specified `-1 file_a.txt -2 file_b.txt`, then large upright letters in the display indicate an excess in the mutated class from `file_a.txt` relative to `file_b.txt`.
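To give a feel for how the sign of a relative entropy term relates to letter orientation, here is a minimal sketch using the generic definition p * log2(p / q). This is an assumption for illustration: the terms `mm` draws come from the fitted log-linear models described in Zhu et al. and may be computed differently. The function name and the counts are hypothetical.

```python
import math


def relative_entropy_terms(observed, expected):
    """Return the per-category terms p * log2(p / q).

    observed and expected map a category (e.g. a base) to a count.
    A positive term marks an excess relative to expectation, a negative
    term a deficit.
    """
    total_obs = sum(observed.values())
    total_exp = sum(expected.values())
    terms = {}
    for key, obs in observed.items():
        p = obs / total_obs
        q = expected[key] / total_exp
        # a category with no observations contributes nothing
        terms[key] = p * math.log2(p / q) if p > 0 else 0.0
    return terms


# Made-up counts: A is over-represented, C is under-represented
terms = relative_entropy_terms({"A": 75, "C": 25}, {"A": 50, "C": 50})
for base, term in terms.items():
    print(base, round(term, 3), "excess" if term > 0 else "deficit")
```

Under this generic definition, a positive term corresponds to the excess that the text above associates with upright letters.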