Merge pull request #70 from NAL-i5K/add_user_defined

add user-defined example files
NAL-i5K · Sep 27, 2018 · 7b990ec
2 parents 892a97a + 057792e
Showing 6 changed files with 67 additions and 3 deletions.
diff --git a/.appveyor.yml b/.appveyor.yml
@@ -30,6 +30,9 @@ test_script:
   - gff3_QC -g example_file/example.gff3 -f example_file/reference.fa -o error.txt
   - gff3_fix -qc_r error.txt -g example_file/example.gff3 -og corrected.gff3
   - gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -r merged_report.txt
+  - gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -u1 example_file/u1.txt -u2 example_file/u2.txt -r merged_report.txt
+  - gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -u1 example_file/u1.txt -r merged_report.txt
+  - gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -u2 example_file/u2.txt -r merged_report.txt
   - gff3_merge -g1 example_file/new_models_w_replace.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -r merged_report.txt -noAuto
   - gff3_sort -g example_file/example.gff3 -og example-sorted.gff3
   - ps: Write-Host "Test scripts are finished ..."

diff --git a/.travis.yml b/.travis.yml
@@ -30,6 +30,9 @@ script:
   - gff3_QC -g example_file/example.gff3 -f example_file/reference.fa -o error.txt
   - gff3_fix -qc_r error.txt -g example_file/example.gff3 -og corrected.gff3
   - gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -r merged_report.txt
+  - gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -u1 example_file/u1.txt -u2 example_file/u2.txt -r merged_report.txt
+  - gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -u1 example_file/u1.txt -r merged_report.txt
+  - gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -u2 example_file/u2.txt -r merged_report.txt
   - gff3_merge -g1 example_file/new_models_w_replace.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -r merged_report.txt -noAuto
   - gff3_sort -g example_file/example.gff3 -og example-sorted.gff3
   - gff3_to_fasta -g example_file/example.gff3 -f example_file/reference.fa -st all -d simple -o test_sequences
diff --git a/docs/Merge-two-GFF3-files.md b/docs/Merge-two-GFF3-files.md
@@ -8,6 +8,8 @@
 
 [Automatically assigning replace tags](#automatically-assigning-replace-tags)
 
+[Rules for using user-defined files](#rules-for-using-user-defined-files)
+
 [Rules for adding a replace tag on your own](#rules-for-adding-a-replace-tag-on-your-own)
 
 [Replacing and adding models with  multiple isoforms](#replacing-and-adding-models-with-multiple-isoforms)
@@ -70,13 +72,56 @@ LGIB01000001.1  Gnomon  CDS     359515  359920  .       -       1       ID=cds33
 ### Automatically assigning replace tags
 ([back](#table-of-contents))
 
-You can choose to have the program auto-assign [replace tags](#replace-tags) for you. (This is the default behavior.) **The auto-assignment program ONLY works for mRNA features.** For all other feature types, if there is no replace tag, the program will add 'replace=NA'.  The program will identify which mRNA models from the modified GFF3 file overlap in coding sequence with models from the reference GFF3 file. The program will add a 'replace' attribute with the IDs of overlapping models. Specifically, the program will do the following:
-- Extract CDS and pre-mRNA sequences from mRNA features from both GFF3 files.
-- Use blastn to determine which sequences from the modified and reference GFF3 file align to each other **in their coding sequence**. These parameters are used: `-evalue 1e-10 -penalty -15 -ungapped`
+You can choose to have the program auto-assign [replace tags](#replace-tags) for you. (This is the default behavior.)  The program will identify which models from the modified GFF3 file overlap in coding/non-coding sequence with models from the reference GFF3 file. The program will add a 'replace' attribute with the IDs of overlapping models. Specifically, the program will do the following:
+- Extract CDS and pre-mRNA sequences from mRNA features in both GFF3 files. (For all other feature types, the program extracts transcript and pre-transcript sequences from both GFF3 files.)
+- Use blastn to determine which sequences from the modified and reference GFF3 file align to each other **in their coding/non-coding sequence**. These parameters are used: `-evalue 1e-10 -penalty -15 -ungapped`
 - If two models pass the alignment step, the program will add a 'replace' attribute with the ID of each overlapping model to the modified gff3 file.
 - If no reference model overlaps with a new model, then the program will add 'replace=NA'.
 - If one model overlaps another in an intron or UTR (but not within the coding sequence), the auto-assignment program will NOT assign a replace tag. This is because it's not always clear whether the overlapping model should be replaced. You will receive a warning message that this model does not have a replace tag and therefore was not incorporated into the merged gff3 file. You can then go back and manually add a replace tag to the original gff3 file. 
 
+### Rules for using user-defined files
+([back](#table-of-contents))
+
+By default, the program uses only exon features to generate spliced sequences for transcripts. If you have the program auto-assign replace tags but your GFF3 files contain a model without exon features, you must provide user-defined files that specify the parent and child features to use for sequence extraction.
+
+**Example**: a user-defined file that extracts CDS sequences from mRNA, uses exon to generate spliced sequences for miRNA, and uses pseudogenic_exon to generate spliced sequences for pseudogenic_transcript.
+
+User-defined file:
+```
+mRNA CDS
+miRNA exon
+pseudogenic_transcript pseudogenic_exon
+```
+
+**Usage**: The user-defined files can be specified via the **--user_defined_file1** and **--user_defined_file2** arguments. Give --user_defined_file1 to control sequence extraction from the updated GFF3 file, --user_defined_file2 to control sequence extraction from the reference GFF3 file, or both. The program then uses blastn to determine which sequences from the updated and reference GFF3 files align to each other. Specifically, the program runs blastn with the following query and subject sequences:
+
+- If **--user_defined_file1** is given
+
+Query sequence | Subject sequence
+--- | ---
+user-defined sequences from updated GFF3 file | CDS sequences from reference GFF3 file
+user-defined sequences from updated GFF3 file | transcript sequences from reference GFF3 file
+pre-transcript sequences from updated GFF3 file | pre-transcript sequences from reference GFF3 file
+
+- If **--user_defined_file2** is given
+
+Query sequence | Subject sequence
+--- | ---
+CDS sequences from updated GFF3 file | user-defined sequences from reference GFF3 file
+transcript sequences from updated GFF3 file | user-defined sequences from reference GFF3 file
+pre-transcript sequences from updated GFF3 file | pre-transcript sequences from reference GFF3 file
+
+- If both **--user_defined_file1** and **--user_defined_file2** are given
+
+Query sequence | Subject sequence
+--- | ---
+user-defined sequences from updated GFF3 file | user-defined sequences from reference GFF3 file
+pre-transcript sequences from updated GFF3 file | pre-transcript sequences from reference GFF3 file
+
+**Note**:
+- In each parent-child pair, the parent feature should be a transcript type (e.g. mRNA, ncRNA) and the child feature should be one of its child feature types (e.g. exon, CDS).
+- The program will only generate sequences for the parent-child pairs listed in the user-defined file.
+
 ### Rules for adding a replace tag on your own
 ([back](#table-of-contents))
 

diff --git a/example_file/u1.txt b/example_file/u1.txt
@@ -0,0 +1,3 @@
+transcript exon
+mRNA CDS
+mRNA exon
diff --git a/example_file/u2.txt b/example_file/u2.txt
@@ -0,0 +1,2 @@
+mRNA CDS
+mRNA exon
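
The two example files above each list one parent-child pair per line. As a rough illustration of how such a file could be read, here is a minimal sketch. It is not the parser gff3_merge uses, which this diff does not show. The function names `parse_user_defined` and `children_by_parent` are hypothetical, and splitting on any whitespace is an assumption based on the space-separated examples.

```python
from collections import defaultdict


def parse_user_defined(text):
    """Parse 'parent child' lines into a list of (parent, child) tuples.

    Hypothetical helper: assumes one whitespace-separated pair per line
    and skips blank lines.
    """
    pairs = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        fields = raw.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 2:
            raise ValueError(
                "line %d: expected 'parent child', got %r" % (lineno, raw)
            )
        pairs.append((fields[0], fields[1]))
    return pairs


def children_by_parent(pairs):
    """Group child feature types under each parent feature type."""
    grouped = defaultdict(list)
    for parent, child in pairs:
        grouped[parent].append(child)
    return dict(grouped)


# Content of example_file/u1.txt as added in this pull request.
u1_text = "transcript exon\nmRNA CDS\nmRNA exon\n"
u1_pairs = parse_user_defined(u1_text)
u1_grouped = children_by_parent(u1_pairs)
```

With the u1.txt content, `u1_grouped` maps `transcript` to `['exon']` and `mRNA` to `['CDS', 'exon']`, which makes it easy to see that mRNA sequences would be built from both CDS and exon children.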
diff --git a/gff3tool/lib/gff3_merge/auto_replace_tag.py b/gff3tool/lib/gff3_merge/auto_replace_tag.py
@@ -76,7 +76,15 @@ def main(gff1, gff2, fasta, outdir, scode, logger, all_assign=False, user_define
                     transcripts.add(id)
     gff2_transcripts_type = set()
     if user_defined2 is None:
+        roots = []
         for line in gff3_2.lines:
+            try:
+                if line['line_type'] == 'feature':
+                    if 'Parent' not in line['attributes'] and len(line['attributes']) != 0:
+                        roots.append(line)
+            except KeyError:
+                pass
+        for root in roots:
             for child in root['children']:
                 if 'type' in child:
                     gff2_transcripts_type.add(child['type'])
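
The change above first collects root features (feature lines that have attributes but no `Parent`) and then reads the types of their children. The sketch below replays that logic on plain dictionaries so it can be run in isolation. The dictionary layout (`line_type`, `attributes`, `children`, `type`) mirrors the keys used in the diff, while the function names and the toy data are assumptions made for illustration only.

```python
def collect_root_features(lines):
    """Return feature lines that have attributes but no 'Parent' attribute."""
    roots = []
    for line in lines:
        # Lines without a 'line_type' key, or of another type, are skipped,
        # matching the KeyError handling in the diff.
        if line.get('line_type') != 'feature':
            continue
        attributes = line.get('attributes', {})
        if attributes and 'Parent' not in attributes:
            roots.append(line)
    return roots


def child_types(roots):
    """Collect the feature types of the direct children of each root."""
    types = set()
    for root in roots:
        for child in root.get('children', []):
            if 'type' in child:
                types.add(child['type'])
    return types


# Toy data: one gene with an mRNA child, plus lines that must be ignored.
mrna = {'line_type': 'feature', 'type': 'mRNA',
        'attributes': {'ID': 'm1', 'Parent': 'g1'}, 'children': []}
gene = {'line_type': 'feature', 'type': 'gene',
        'attributes': {'ID': 'g1'}, 'children': [mrna]}
toy_lines = [
    {'line_type': 'directive'},                      # not a feature
    gene,                                            # root feature
    mrna,                                            # has a Parent
    {'line_type': 'feature', 'type': 'gene',
     'attributes': {}, 'children': []},              # empty attributes
]

toy_roots = collect_root_features(toy_lines)
toy_types = child_types(toy_roots)
```

Here only the `g1` gene qualifies as a root, so `toy_types` contains just `mRNA`.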