Merge pull request #70 from NAL-i5K/add_user_defined
add user-defined example files
dytk2134 authored Sep 27, 2018
2 parents 892a97a + 057792e commit 7b990ec
Showing 6 changed files with 67 additions and 3 deletions.
3 changes: 3 additions & 0 deletions .appveyor.yml
@@ -30,6 +30,9 @@ test_script:
- gff3_QC -g example_file/example.gff3 -f example_file/reference.fa -o error.txt
- gff3_fix -qc_r error.txt -g example_file/example.gff3 -og corrected.gff3
- gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -r merged_report.txt
- gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -u1 example_file/u1.txt -u2 example_file/u2.txt -r merged_report.txt
- gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -u1 example_file/u1.txt -r merged_report.txt
- gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -u2 example_file/u2.txt -r merged_report.txt
- gff3_merge -g1 example_file/new_models_w_replace.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -r merged_report.txt -noAuto
- gff3_sort -g example_file/example.gff3 -og example-sorted.gff3
- ps: Write-Host "Test scripts are finished ..."
3 changes: 3 additions & 0 deletions .travis.yml
@@ -30,6 +30,9 @@ script:
- gff3_QC -g example_file/example.gff3 -f example_file/reference.fa -o error.txt
- gff3_fix -qc_r error.txt -g example_file/example.gff3 -og corrected.gff3
- gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -r merged_report.txt
- gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -u1 example_file/u1.txt -u2 example_file/u2.txt -r merged_report.txt
- gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -u1 example_file/u1.txt -r merged_report.txt
- gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -u2 example_file/u2.txt -r merged_report.txt
- gff3_merge -g1 example_file/new_models_w_replace.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -r merged_report.txt -noAuto
- gff3_sort -g example_file/example.gff3 -og example-sorted.gff3
- gff3_to_fasta -g example_file/example.gff3 -f example_file/reference.fa -st all -d simple -o test_sequences
51 changes: 48 additions & 3 deletions docs/Merge-two-GFF3-files.md
@@ -8,6 +8,8 @@

[Automatically assigning replace tags](#automatically-assigning-replace-tags)

[Rules for using user-defined files](#rules-for-using-user-defined-files)

[Rules for adding a replace tag on your own](#rules-for-adding-a-replace-tag-on-your-own)

[Replacing and adding models with multiple isoforms](#replacing-and-adding-models-with-multiple-isoforms)
@@ -70,13 +72,56 @@ LGIB01000001.1 Gnomon CDS 359515 359920 . - 1 ID=cds33
### Automatically assigning replace tags
([back](#table-of-contents))

You can choose to have the program auto-assign [replace tags](#replace-tags) for you. (This is the default behavior.) **The auto-assignment program ONLY works for mRNA features.** For all other feature types, if there is no replace tag, the program will add 'replace=NA'. The program will identify which mRNA models from the modified GFF3 file overlap in coding sequence with models from the reference GFF3 file. The program will add a 'replace' attribute with the IDs of overlapping models. Specifically, the program will do the following:
- Extract CDS and pre-mRNA sequences from mRNA features from both GFF3 files.
- Use blastn to determine which sequences from the modified and reference GFF3 file align to each other **in their coding sequence**. These parameters are used: `-evalue 1e-10 -penalty -15 -ungapped`
You can choose to have the program auto-assign [replace tags](#replace-tags) for you. (This is the default behavior.) The program will identify which models from the modified GFF3 file overlap in coding/non-coding sequence with models from the reference GFF3 file. The program will add a 'replace' attribute with the IDs of overlapping models. Specifically, the program will do the following:
- Extract CDS and pre-mRNA sequences from mRNA features in both GFF3 files. (For all other feature types, the program extracts transcript and pre-transcript sequences from both GFF3 files.)
- Use blastn to determine which sequences from the modified and reference GFF3 file align to each other **in their coding/non-coding sequence**. These parameters are used: `-evalue 1e-10 -penalty -15 -ungapped`
- If two models pass the alignment step, the program will add a 'replace' attribute with the ID of each overlapping model to the modified gff3 file.
- If no reference model overlaps with a new model, then the program will add 'replace=NA'.
- If one model overlaps another in an intron or UTR (but not within the coding sequence), the auto-assignment program will NOT assign a replace tag. This is because it's not always clear whether the overlapping model should be replaced. You will receive a warning message that this model does not have a replace tag and therefore was not incorporated into the merged gff3 file. You can then go back and manually add a replace tag to the original gff3 file.
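
The alignment step above can be sketched as a small helper that assembles a blastn command line. This is only an illustration, not the toolkit's actual code: the function name, the file names, and the use of `-query`, `-subject`, `-outfmt 6`, and `-out` are assumptions. Only `-evalue 1e-10 -penalty -15 -ungapped` come from the text.

```python
def build_blastn_command(query_fasta, subject_fasta, out_path):
    """Return an argv list for blastn with the parameters quoted above.

    Hypothetical helper: the file arguments and the tabular output format
    are assumptions made for this sketch.
    """
    return [
        "blastn",
        "-query", query_fasta,      # sequences from the modified GFF3 file (assumed name)
        "-subject", subject_fasta,  # sequences from the reference GFF3 file (assumed name)
        "-evalue", "1e-10",         # from the documentation
        "-penalty", "-15",          # from the documentation
        "-ungapped",                # from the documentation
        "-outfmt", "6",             # tabular output, assumed for easy parsing
        "-out", out_path,
    ]


cmd = build_blastn_command("updated_cds.fa", "reference_cds.fa", "hits.tsv")
print(" ".join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ is installed.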

### Rules for using user-defined files
([back](#table-of-contents))

By default, the program uses only exon features to generate spliced sequences for transcripts. If you have the program auto-assign replace tags but a model in your GFF3 files has no exon features, you must provide user-defined files that specify the parent and child features to use for sequence extraction.

**Example**: a user-defined file that extracts CDS sequences from mRNA, uses exon to generate spliced sequences for miRNA, and uses pseudogenic_exon to generate spliced sequences for pseudogenic_transcript.

User-defined file:
```
mRNA CDS
miRNA exon
pseudogenic_transcript pseudogenic_exon
```
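
A file in this format could be read into a parent-to-children mapping along the following lines. This is a minimal sketch rather than the toolkit's own parser: the function name is hypothetical, and it assumes each non-empty line holds exactly two whitespace-separated fields (parent type, then child type).

```python
from collections import defaultdict


def parse_user_defined(text):
    """Map each parent feature type to the list of child types given for it.

    Hypothetical helper; assumes two whitespace-separated columns per line.
    """
    pairs = defaultdict(list)
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(
                "line %d: expected 'parent child', got %r" % (line_number, raw)
            )
        parent, child = fields
        if child not in pairs[parent]:
            pairs[parent].append(child)  # keep first-seen order, skip duplicates
    return dict(pairs)


example = "mRNA CDS\nmiRNA exon\npseudogenic_transcript pseudogenic_exon\n"
pairs = parse_user_defined(example)
print(pairs)
```

A parent may appear on several lines, as in `example_file/u1.txt`, in which case its child types are collected into one list.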

**Usage**: User-defined files are specified with the **--user_defined_file1** and **--user_defined_file2** arguments. Give --user_defined_file1 to control sequence extraction from the updated GFF3 file, --user_defined_file2 to control sequence extraction from the reference GFF3 file, or both. The program then uses blastn to determine which sequences from the updated and reference GFF3 files align to each other. Specifically, the program runs blastn with the following query and subject sequences:

- If **--user_defined_file1** is given

Query sequence | Subject sequence
--- | ---
user-defined sequences from updated GFF3 file | CDS sequences from reference GFF3 file
user-defined sequences from updated GFF3 file | transcript sequences from reference GFF3 file
pre-transcript sequences from updated GFF3 file | pre-transcript sequences from reference GFF3 file

- If **--user_defined_file2** is given

Query sequence | Subject sequence
--- | ---
CDS sequences from updated GFF3 file | user-defined sequences from reference GFF3 file
transcript sequences from updated GFF3 file | user-defined sequences from reference GFF3 file
pre-transcript sequences from updated GFF3 file | pre-transcript sequences from reference GFF3 file

- If both **--user_defined_file1** and **--user_defined_file2** are given

Query sequence | Subject sequence
--- | ---
user-defined sequences from updated GFF3 file | user-defined sequences from reference GFF3 file
pre-transcript sequences from updated GFF3 file | pre-transcript sequences from reference GFF3 file
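
The three tables above can be summarized as a lookup from the supplied arguments to the (query, subject) pairs that are compared. The sketch below is illustrative only: the function name and the string labels are hypothetical, and the case where neither file is given is left out because the tables do not cover it.

```python
def blast_pairs(user_defined1=False, user_defined2=False):
    """Return (query, subject) labels for the blastn comparisons tabulated above.

    Hypothetical helper; returns None when neither user-defined file is given,
    since that case is not described by the tables.
    """
    pre_transcript = ("pre-transcript (updated)", "pre-transcript (reference)")
    if user_defined1 and user_defined2:
        return [
            ("user-defined (updated)", "user-defined (reference)"),
            pre_transcript,
        ]
    if user_defined1:
        return [
            ("user-defined (updated)", "CDS (reference)"),
            ("user-defined (updated)", "transcript (reference)"),
            pre_transcript,
        ]
    if user_defined2:
        return [
            ("CDS (updated)", "user-defined (reference)"),
            ("transcript (updated)", "user-defined (reference)"),
            pre_transcript,
        ]
    return None


for query, subject in blast_pairs(user_defined1=True):
    print(query, "vs", subject)
```

In every tabulated case the pre-transcript comparison is kept, and only the spliced-sequence comparisons change.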

**Note**:
- In each parent-child pair, the parent feature should be a transcript (e.g. mRNA, ncRNA) and the child feature should be one of its children (e.g. exon, CDS).
- The program generates sequences only for the parent-child pairs listed in the user-defined file.

### Rules for adding a replace tag on your own
([back](#table-of-contents))

3 changes: 3 additions & 0 deletions example_file/u1.txt
@@ -0,0 +1,3 @@
transcript exon
mRNA CDS
mRNA exon
2 changes: 2 additions & 0 deletions example_file/u2.txt
@@ -0,0 +1,2 @@
mRNA CDS
mRNA exon
8 changes: 8 additions & 0 deletions gff3tool/lib/gff3_merge/auto_replace_tag.py
@@ -76,7 +76,15 @@ def main(gff1, gff2, fasta, outdir, scode, logger, all_assign=False, user_define
transcripts.add(id)
    gff2_transcripts_type = set()
    if user_defined2 is None:
        roots = []
        for line in gff3_2.lines:
            try:
                if line['line_type'] == 'feature':
                    if 'Parent' not in line['attributes'] and len(line['attributes']) != 0:
                        roots.append(line)
            except KeyError:
                pass
        for root in roots:
            for child in root['children']:
                if 'type' in child:
                    gff2_transcripts_type.add(child['type'])
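
The logic in this hunk can be exercised in isolation with plain dictionaries. The sketch below is not part of the commit: the function name and the toy records are hypothetical, and they assume each parsed line is a dict with `line_type`, `type`, `attributes`, and `children` keys, as the hunk suggests.

```python
def child_types_of_roots(lines):
    """Collect the feature types of the direct children of root features.

    A root is a feature line with non-empty attributes and no 'Parent'
    attribute, mirroring the condition in the hunk above.
    """
    roots = []
    for line in lines:
        try:
            if line['line_type'] == 'feature':
                if 'Parent' not in line['attributes'] and len(line['attributes']) != 0:
                    roots.append(line)
        except KeyError:
            pass  # lines lacking the expected keys are skipped
    types = set()
    for root in roots:
        for child in root['children']:
            if 'type' in child:
                types.add(child['type'])
    return types


# Toy records standing in for parsed GFF3 lines (hypothetical data).
mrna = {'line_type': 'feature', 'type': 'mRNA',
        'attributes': {'ID': 'm1', 'Parent': ['g1']}, 'children': []}
gene = {'line_type': 'feature', 'type': 'gene',
        'attributes': {'ID': 'g1'}, 'children': [mrna]}
directive = {'line_type': 'directive'}

found = child_types_of_roots([directive, gene, mrna])
print(found)
```

With this input only `gene` qualifies as a root, so the result contains the type of its child, `mRNA`.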