Commit

Rename “translate” to more correct “complement”
Franklin Delehelle committed Oct 3, 2018
1 parent 4797959 commit 09ea92d
Showing 12 changed files with 153 additions and 111 deletions.
122 changes: 82 additions & 40 deletions README.md
@@ -1,21 +1,27 @@
# ASGART: a large duplications finder

`asgart` (A Segmental duplications Gathering and Refinement Tool) is a multiplatform (GNU/Linux, macOS, Windows) tool designed to search for large duplications amongst one or two DNA strands.
`asgart` (A Segmental duplications Gathering and Refinement Tool) is a
multiplatform (GNU/Linux, macOS, Windows) tool designed to search for
large duplications amongst one or two DNA strands.


## Licensing

Asgart is distributed under the GPLv3 license. Please see the LICENSE file.
Asgart is distributed under the GPLv3 license. Please see the LICENSE
file.

# Why should I use ASGART?

![A map of the Human genome long segmental duplications](screenshots/chord.png)
![A map of the Human genome long segmental
duplications](screenshots/chord.png)

You should use ASGART if

- you want to find segmental duplications, either direct, reversed and/or translate in a DNA sequence;
- you want to find segmental duplications, either direct, reversed
and/or complement in a DNA sequence;

- you want to find highly similar parts inbetween sequences up to the genome scale;
- you want to find highly similar parts inbetween sequences up to the
genome scale;

- you want to map highly similar sequences amongst genomes;

@@ -37,7 +43,8 @@ Binaries for Windows are not yet available.

## From sources

To build ASGART from sources, you need CMake, a C compiler and the [Rust compiler](https://www.rust-lang.org/en-US/install.html).
To build ASGART from sources, you need CMake, a C compiler and the
[Rust compiler](https://www.rust-lang.org/en-US/install.html).

Once these requirements are installed, clone the repository

@@ -54,7 +61,7 @@ You can then build ASGART by running the Rust building tool
cargo build --release
```

Once the build is finished, you'll find the binary in `target/release/`.
Once the build is finished, you will find the binary in `target/release/`.


# Usage
@@ -64,31 +71,35 @@ Once the build is finished, you'll find the binary in `target/release/`.
First, let us take a look at a simple example:

```
asgart seq.fasta seq.fasta 20 100
asgart seq.fasta seq.fasta
```

This command will look for duplications in the `seq.fasta` file, then
write them in a JSON file in the folder where it was launched. ASGART
will probe using 20-mers, and guarantee that no duplication will
include gaps longer than 100bp in their arm-to-arm pairwise alignment.

If you wish to look for reversed-translated duplications, use the
`-RT` option. The `-v` option will give you more information, as
well as a visual overview of the progress.
If you wish to look for reversed-complemented duplications, use the
`-R` and `-C` options, which can be combined as `-RC`. The `-v`
option will give you more information, as well as a visual overview
of the progress.

```
asgart seq.fasta seq.fasta 20 100 -RTv
asgart seq.fasta seq.fasta -RCv
```
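
To make the renamed terminology concrete: a reversed arm is read backwards, a complemented arm has each base swapped for its pairing partner (A with T, C with G), and `-RC` combines both. The sketch below only illustrates these three transformations on a plain string. The function names are hypothetical and this is not ASGART's internal code.

```rust
/// Complement a single nucleotide, preserving case.
/// Any other symbol (for instance `N`) is returned unchanged.
fn complement_base(b: char) -> char {
    match b {
        'A' => 'T',
        'T' => 'A',
        'C' => 'G',
        'G' => 'C',
        'a' => 't',
        't' => 'a',
        'c' => 'g',
        'g' => 'c',
        other => other,
    }
}

/// Complement every base, keeping the reading direction.
fn complement(seq: &str) -> String {
    seq.chars().map(complement_base).collect()
}

/// Reverse the reading direction, keeping the bases.
fn reverse(seq: &str) -> String {
    seq.chars().rev().collect()
}

/// Reverse the reading direction and complement every base.
fn reverse_complement(seq: &str) -> String {
    seq.chars().rev().map(complement_base).collect()
}
```

For the input `AACG`, `complement` yields `TTGC`, `reverse` yields `GCAA`, and `reverse_complement` yields `CGTT`.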

## Input

As input, ASGART takes FASTA files containing the sequences within which to look for duplications. They can be either in the FASTA or multiFASTA format. If the input files are `s2
As input, ASGART takes FASTA files containing the sequences within
which to look for duplications. They can be either in the FASTA or
multiFASTA format.

## Output

### JSON

By default, ASGART will write its result in a JSON file in the folder where it was launched, with the following structure:
By default, ASGART will write its result in a JSON file in the folder
where it was launched, with the following structure:

```
{
@@ -116,16 +127,22 @@ By default, ASGART will write its result in a JSON file in the folder where it w
]
},
"kmer": probing kmer size,
"gap": maximum gap inbetween duplication arms,
"settings": {
"probe_size": probe size used,
"max_gap_size": maximal gap size used,
"min_duplication_length": minimal length for a duplicon,
"max_cardinality": maximal size of a family,
"skip_masked": were masked nucleotides skipped?,
"interlaced": were interlaced looked for?
},
"sds": [
{
"left": position of the left arm in the first file,
"right": position of the right arm in the second file,
"length": length of the duplication (bp),
"reversed": true if the duplication is reversed, false else,
"translated": true if the duplication is translated, false else
"complemented": true if the duplication is complemented, false else
},
...
]
@@ -134,8 +151,9 @@ By default, ASGART will write its result in a JSON file in the folder where it w

### GFF

ASGART can also write its results in GFF2 or GFF3 files by using the `--format` option. For instance, use `--format gff3` to
save the results in a GFF3 file.
ASGART can also write its results in GFF2 or GFF3 files by using the
`--format` option. For instance, use `--format gff3` to save the
results in a GFF3 file.

## Options

@@ -145,32 +163,42 @@ save the results in a GFF3 file.

- `--reverse`/`-R` look for duplications whose second arm is reversed

- `--translate`/`-T` look for duplications whose second arm is translated
- `--complement`/`-C` look for duplications whose second arm is
complemented

- `--max-cardinality` specifies the maximal count of members in a duplication family (default: 1000)
- `--max-cardinality` specifies the maximal count of members in a
duplication family (default: 1000)

- `--min-length SIZE` specifies the minimal length (in bp) over which a duplication is kept in the final result and not discarded (default: 1000)
- `--min-length SIZE` specifies the minimal length (in bp) over
which a duplication is kept in the final result and not discarded
(default: 1000)

- `--skip-masked`/`-S` skip soft-masked zones, _i.e._ lowercased parts of the input files (default: no)
- `--skip-masked`/`-S` skip soft-masked zones, _i.e._ lowercased
parts of the input files (default: no)
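
Soft-masking marks regions by writing them in lowercase. As a rough illustration of what `--skip-masked` has to recognise, the following sketch lists the lowercase runs of a sequence as half-open `(start, end)` ranges. `masked_ranges` is a hypothetical helper written for this example, not a function taken from ASGART.

```rust
/// Return the half-open byte ranges `(start, end)` of every run of
/// lowercase ASCII letters in `seq`, i.e. its soft-masked zones.
fn masked_ranges(seq: &str) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    // Start of the lowercase run currently being scanned, if any.
    let mut start: Option<usize> = None;

    for (i, b) in seq.bytes().enumerate() {
        if b.is_ascii_lowercase() {
            if start.is_none() {
                start = Some(i);
            }
        } else if let Some(s) = start.take() {
            // An unmasked symbol closes the current run.
            ranges.push((s, i));
        }
    }

    // A run that extends to the end of the sequence is closed here.
    if let Some(s) = start {
        ranges.push((s, seq.len()));
    }
    ranges
}
```

For `ACgtTTaa`, this reports the ranges `(2, 4)` and `(6, 8)`.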

### Technical

- `-h`, `--help` display a help screen

- `--out FILENAME` specifies the file in which the results will be written
- `--out FILENAME` specifies the file in which the results will be
written

- `--prefix NAME` defines a prefix to prepend to the standard out file name
- `--prefix NAME` defines a prefix to prepend to the standard out
file name

- `--format OUT_FORMAT` sets the output format. Default is `json`, but can be set to gff2 or gff3
- `--format OUT_FORMAT` sets the output format. Default is `json`,
but can be set to gff2 or gff3

- `--threads COUNT` set the number of threads to use. Defaults to the number of cores available on the CPU
- `--threads COUNT` set the number of threads to use. Defaults to
the number of cores available on the CPU

- `--trim START END` run ASGART only on the specified area of the first file
- `--trim START END` run ASGART only on the specified area of the
first file

# Plotting

ASGART comes with a plotting tool, producing a visual overview of the
duplications. Currently, two types of graphs are available: chord
duplications. Currently, two types of graphs are available: chord
graphs, or flat graphs.

## Options
@@ -179,28 +207,36 @@ graphs, or flat graphs.

- `--out FILENAME` set output file name

- `--min-length` set the minimal length (in bp) for a duplication to be plotted (default: 5000bp)
- `--min-length` set the minimal length (in bp) for a duplication to
be plotted (default: 5000bp)

- `--min-identity` set the minimal identity rate (in %) for a duplication to be plotted (default: 0%).
- `--min-identity` set the minimal identity rate (in %) for a
duplication to be plotted (default: 0%).

- `--no-direct` do not plot direct duplications

- `--no-reversed` do not plot reversed duplications

- `--no-untranslated` do not plot non-translated duplications
- `--no-uncomplemented` do not plot non-complemented duplications

- `--no-translated` do not plot translated duplications
- `--no-complemented` do not plot complemented duplications

- `--features FILE` add an additional track containing features to plot alongside the duplications.
- `--features FILE` add an additional track containing features to
plot alongside the duplications.

- `--filter-features DISTANCE` don't plot duplications that are farther away than `DISTANCE` bp from the features in the track.
- `--filter-features DISTANCE` don't plot duplications that are
farther away than `DISTANCE` bp from the features in the track.
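
The filtering flags above can be read as one predicate applied to each duplication. The sketch below restates the filter chain visible in this commit's `src/bin/asgart-plot.rs` hunk in a self-contained form. The `Dup` and `Filters` types and the `keep` function are simplified, hypothetical stand-ins, not the project's actual structs.

```rust
/// One duplication, reduced to the fields the filters look at.
#[derive(Debug, Clone, PartialEq)]
struct Dup {
    length: usize,
    identity: f32,
    reversed: bool,
    complemented: bool,
}

/// Hypothetical mirror of the plotting command-line flags.
struct Filters {
    no_direct: bool,
    no_reversed: bool,
    no_uncomplemented: bool,
    no_complemented: bool,
    min_length: usize,
    min_identity: f32,
}

/// True if the duplication survives every active filter.
fn keep(sd: &Dup, f: &Filters) -> bool {
    !(f.no_direct && !sd.reversed)
        && !(f.no_reversed && sd.reversed)
        && !(f.no_uncomplemented && !sd.complemented)
        && !(f.no_complemented && sd.complemented)
        && sd.length >= f.min_length
        && sd.identity >= f.min_identity
}

/// Keep only the duplications that pass `keep`.
fn apply_filters(sds: Vec<Dup>, f: &Filters) -> Vec<Dup> {
    sds.into_iter().filter(|sd| keep(sd, f)).collect()
}
```

With `no_direct` and `no_uncomplemented` both set and `min_length` at 2000, only duplications that are reversed, complemented and at least 2000 bp long remain.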

### Feature file format

The feature file format contains a list of lines with three values separated by semi-colons.
The feature file format contains a list of lines with three values
separated by semi-colons.

1. The label of the feature.
2. The start of the feature. It may either be a single integer representing its absolute coordinate, or be of the form `NAME+OFFSET`, defining a start position at `OFFSET` from the start of `NAME` chromosomes (from the input FASTA file).
2. The start of the feature. It may either be a single integer
representing its absolute coordinate, or be of the form
`NAME+OFFSET`, defining a start position at `OFFSET` from the start
of `NAME` chromosomes (from the input FASTA file).
3. The length of the feature in base pairs.

Comment lines start with a `#`.
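
As an illustration of the format described above, here is a minimal parser sketch for a single line. It is written from the description only and is not the project's `read_feature_file`. The `Start` and `FeatureLine` types and the function name are hypothetical, and trimming whitespace around fields is an assumption.

```rust
/// Start of a feature: an absolute coordinate, or an offset from the
/// start of a named chromosome (`NAME+OFFSET`).
#[derive(Debug, PartialEq)]
enum Start {
    Absolute(usize),
    Relative { name: String, offset: usize },
}

#[derive(Debug, PartialEq)]
struct FeatureLine {
    label: String,
    start: Start,
    length: usize,
}

/// Parse one line of a feature file.
/// Returns `Ok(None)` for blank lines and `#` comments.
fn parse_feature_line(line: &str) -> Result<Option<FeatureLine>, String> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return Ok(None);
    }

    // Exactly three semicolon-separated fields: label, start, length.
    let fields: Vec<&str> = line.split(';').map(|f| f.trim()).collect();
    if fields.len() != 3 {
        return Err(format!("expected 3 fields, found {}: {}", fields.len(), line));
    }

    // A `+` in the start field means `NAME+OFFSET`; otherwise it is absolute.
    let start = match fields[1].rsplit_once('+') {
        Some((name, offset)) => Start::Relative {
            name: name.trim().to_owned(),
            offset: offset
                .trim()
                .parse::<usize>()
                .map_err(|e| format!("bad offset in `{}`: {}", fields[1], e))?,
        },
        None => Start::Absolute(
            fields[1]
                .parse::<usize>()
                .map_err(|e| format!("bad start `{}`: {}", fields[1], e))?,
        ),
    };

    let length = fields[2]
        .parse::<usize>()
        .map_err(|e| format!("bad length `{}`: {}", fields[2], e))?;

    Ok(Some(FeatureLine { label: fields[0].to_owned(), start, length }))
}
```

A line such as `Foo;123456789;1250` parses to an absolute start, while a made-up line such as `Bar;chrY+1000;500` parses to a start 1000 bp into `chrY`.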
@@ -217,7 +253,10 @@ Foo;123456789;1250

## Chord graphs

A chord graph represents duplications within a DNA fragment as arcs linking points on a circle, which depicts the fragment bent over itself. Their width is directly proportional to the length of the duplications they represent.
A chord graph represents duplications within a DNA fragment as arcs
linking points on a circle, which depicts the fragment bent over
itself. Their width is directly proportional to the length of the
duplications they represent.

### Example

@@ -227,10 +266,13 @@ A chord graph represents duplications amongst a DNA fragment as arcs linking poi

## Flat graphs

Flat graphs are made of two superposed horizontal lines, representing the two fragments analyzed by ASGART, with lines linking left and right parts of the duplications found, their width proportional to the length of the duplication.
Flat graphs are made of two superposed horizontal lines, representing
the two fragments analyzed by ASGART, with lines linking left and
right parts of the duplications found, their width proportional to the
length of the duplication.

### Example

`asgart-plot human_Y.json flat --out=flat.svg --no-direct --no-untranslated --min-length 2000`
`asgart-plot human_Y.json flat --out=flat.svg --no-direct --no-uncomplemented --min-length 2000`

![Flat graph example](screenshots/flat.png)
2 changes: 1 addition & 1 deletion src/automaton.rs
@@ -79,7 +79,7 @@ fn make_duplications(psd: &ProtoSD,
length: size,
identity: 0.0,
reversed: false,
translated: false,
complemented: false,
});
}

45 changes: 22 additions & 23 deletions src/bin/asgart-plot.rs
@@ -16,7 +16,6 @@ use std::fs::File;
use std::path::Path;
use clap::{App, AppSettings};
use colored::Colorize;
use bio::io::gff;
use asgart::structs::*;
use asgart::plot::*;
use asgart::plot::chord_plot::ChordPlotter;
@@ -110,7 +109,7 @@ fn read_feature_file(r: &RunResult, file: &str) -> Result<Vec<Feature>> {
}
}

fn read_gff3_feature_file(r: &RunResult, file: &str) -> Result<Vec<Feature>> {
fn read_gff3_feature_file(_r: &RunResult, file: &str) -> Result<Vec<Feature>> {
let f = File::open(file).chain_err(|| format!("Unable to open {}", file))?;
let f = BufReader::new(f);

@@ -238,37 +237,37 @@ fn run() -> Result<()> {
let mut features_tracks = features_tracks.unwrap();


if args.is_present("no-direct") { result.sds.retain(|sd| sd.reversed) }
if args.is_present("no-reversed") { result.sds.retain(|sd| !sd.reversed) }
if args.is_present("no-untranslated") { result.sds.retain(|sd| sd.translated) }
if args.is_present("no-translated") { result.sds.retain(|sd| !sd.translated) }
if args.is_present("no-direct") { result.sds.retain(|sd| sd.reversed) }
if args.is_present("no-reversed") { result.sds.retain(|sd| !sd.reversed) }
if args.is_present("no-uncomplemented") { result.sds.retain(|sd| sd.complemented) }
if args.is_present("no-complemented") { result.sds.retain(|sd| !sd.complemented) }

if args.is_present("filter_duplications") {filter_sds_in_features(&mut result, &features_tracks, value_t!(args, "filter_duplications", usize).unwrap());}
if args.is_present("filter_features") {filter_features_in_sds(&mut result, &mut features_tracks, value_t!(args, "filter_features", usize).unwrap());}

let settings = Settings {
out_file: out_file,
out_file: out_file,

min_length: value_t!(args, "min_length", usize).unwrap(),
min_identity: value_t!(args, "min_identity", f32).unwrap(),
filter_direct: args.is_present("no-direct"),
filter_non_translated: args.is_present("no-untranslated"),
filter_reversed: args.is_present("no-reversed"),
filter_translated: args.is_present("no-translated"),
min_length: value_t!(args, "min_length", usize).unwrap(),
min_identity: value_t!(args, "min_identity", f32).unwrap(),
filter_direct: args.is_present("no-direct"),
filter_non_complemented: args.is_present("no-uncomplemented"),
filter_reversed: args.is_present("no-reversed"),
filter_complemented: args.is_present("no-complemented"),

size: 200.0,
thickness: 1.0,
color1: "#ff5b00".to_owned(),
color2: "#00b2ae".to_owned(),
size: 200.0,
thickness: 1.0,
color1: "#ff5b00".to_owned(),
color2: "#00b2ae".to_owned(),

feature_tracks: features_tracks,
feature_tracks: features_tracks,
};
result.sds = result.sds
.into_iter()
.filter(|sd| !(settings.filter_direct && !sd.reversed))
.filter(|sd| !(settings.filter_reversed && sd.reversed))
.filter(|sd| !(settings.filter_non_translated && !sd.translated))
.filter(|sd| !(settings.filter_translated && sd.translated))
.into_iter()
.filter(|sd| !(settings.filter_direct && !sd.reversed))
.filter(|sd| !(settings.filter_reversed && sd.reversed))
.filter(|sd| !(settings.filter_non_complemented && !sd.complemented))
.filter(|sd| !(settings.filter_complemented && sd.complemented))
.filter(|sd| sd.length >= settings.min_length)
.filter(|sd| sd.identity >= settings.min_identity)
.collect();