Add argument to generate non-unique CDS IDS for a given mRNA parent feature #78

Closed
mpoelchau opened this issue Nov 16, 2018 · 8 comments

@mpoelchau
Contributor

mpoelchau commented Nov 16, 2018

The GFF3 specification states that discontinuous features, such as CDS, need not have unique IDs. Instead, their lines can share an ID to indicate that they are all part of one discontinuous feature. Whether you want unique or shared IDs for the individual CDS lines of a given CDS feature usually depends on what you'll do with the GFF downstream. For example, for Tripal ingest, CDS lines corresponding to a single feature should share an ID. So, it would be great if gff3_ID_generator.py had an option to not generate unique IDs for features that share a parent feature. For the user, I'd envision this as something like '-n'. The program would then generate only one ID for all CDS features that share a parent feature.

Example result for one gene with two isoforms using the proposed flag '-n CDS':

KZ848496.1      .       gene    715     17058   .       +       .       ID=LSTR000001;
KZ848496.1      .       mRNA    715     7345    .       +       .       Parent=LSTR000001;ID=LSTR000001-RA;
KZ848496.1      .       exon    715     899     .       +       .       ID=LSTR000001-RA-exon001;Parent=LSTR000001-RA
KZ848496.1      .       CDS     1418    1584    .       +       0       ID=LSTR000001-RA-CDS001;Parent=LSTR000001-RA
KZ848496.1      .       exon    7255    7345    .       +       .       ID=LSTR000001-RA-exon002;Parent=LSTR000001-RA
KZ848496.1      .       CDS     7255    7345    .       +       1       ID=LSTR000001-RA-CDS001;Parent=LSTR000001-RA
KZ848496.1      .       mRNA    13242   17058   .       +       .       Parent=LSTR000001;ID=LSTR000001-RB;
KZ848496.1      .       exon    13242   13331   .       +       .       ID=LSTR000001-RB-exon001;Parent=LSTR000001-RB;
KZ848496.1      .       CDS     13242   13331   .       +       1       ID=LSTR000001-RB-CDS001;Parent=LSTR000001-RB;
KZ848496.1      .       exon    15348   17058   .       +       .       ID=LSTR000001-RB-exon002;Parent=LSTR000001-RB;
KZ848496.1      .       CDS     15348   15540   .       +       1       ID=LSTR000001-RB-CDS001;Parent=LSTR000001-RB;
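
The ID pattern in the example above can be sketched in a few lines of Python. This is only an illustration of the requested behavior, not the code in gff3_ID_generator.py: the function name `assign_child_ids`, the `shared_types` parameter, and the `(type, parent)` input format are all hypothetical, and the sketch assumes each mRNA has at most one CDS feature.

```python
from collections import defaultdict


def assign_child_ids(features, shared_types=("CDS",)):
    """Return one new ID per row of `features`.

    features: list of (feature_type, parent_id) tuples in file order.
    Types listed in shared_types get a single ID per parent; every other
    type gets a running number per parent. (Hypothetical helper.)
    """
    counters = defaultdict(int)  # (parent, type) -> rows seen so far
    ids = []
    for feature_type, parent in features:
        key = (parent, feature_type)
        counters[key] += 1
        # Shared types always reuse number 1; others use the running count.
        number = 1 if feature_type in shared_types else counters[key]
        ids.append(f"{parent}-{feature_type}{number:03d}")
    return ids


rows = [
    ("exon", "LSTR000001-RA"),
    ("CDS", "LSTR000001-RA"),
    ("exon", "LSTR000001-RA"),
    ("CDS", "LSTR000001-RA"),
    ("CDS", "LSTR000001-RB"),
]
for row, new_id in zip(rows, assign_child_ids(rows)):
    print(row, new_id)
```

Passing an empty `shared_types` falls back to a unique number for every line, which corresponds to the current behavior described above.
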
@tony006469
Contributor

I printed out all the dictionaries to help me understand the data processing, but I couldn't get any information from the loop over root.

screenshot

I'm curious what this loop does.

@mpoelchau
Contributor Author

@tony006469 here is the internal issue discussion for ID requirements: https://gitlab.com/i5k_Workspace/workspace_roadmap/issues/525

@tony006469
Contributor

tony006469 commented Dec 20, 2018

I divided the work into two parts. In the first part, feature types that are not specified each get a UUID. In the second part, features of the specified type share an ID. The first part was completed yesterday, and the second is in progress.

The screenshot shows that when I specify -t EXON, the other types still get UUIDs.
original

ac
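
The two-part behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in the actual branch: `assign_uuids` and its `shared_type` parameter are hypothetical names, and the input is simplified to `(type, parent)` tuples.

```python
import uuid


def assign_uuids(features, shared_type=None):
    """Return one UUID string per row of `features`.

    features: list of (feature_type, parent_id) tuples in file order.
    Rows whose type equals shared_type reuse one UUID per parent;
    all other rows get their own UUID. (Hypothetical helper.)
    """
    shared = {}  # parent_id -> UUID reused by rows of shared_type
    ids = []
    for feature_type, parent in features:
        if shared_type is not None and feature_type == shared_type:
            # The first row under this parent creates the UUID; later rows reuse it.
            if parent not in shared:
                shared[parent] = str(uuid.uuid4())
            ids.append(shared[parent])
        else:
            # Part one: unspecified types always get a fresh UUID.
            ids.append(str(uuid.uuid4()))
    return ids


rows = [
    ("exon", "mRNA1"),
    ("CDS", "mRNA1"),
    ("exon", "mRNA1"),
    ("CDS", "mRNA1"),
    ("CDS", "mRNA2"),
]
ids = assign_uuids(rows, shared_type="CDS")
print(ids[1] == ids[3])  # CDS rows under the same parent share a UUID
```
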

@tony006469
Contributor

I'm digging into gff3.py to understand the processing and the data structures used for a GFF3 file.

data.png

I'm trying to figure out which data columns are processed at each step of generator.py, and comparing the differences between the ID generator's input and output files.

@tony006469
Contributor

tony006469 commented Mar 13, 2019

I made a draft of this feature: adding the argument -t CDS makes features of type CDS share an ID.
python gff3tool/lib/gff3_ID_generator.py -g merged.gff -og id.gff -uuid -r report.txt -t CDS
The screenshots of the output.gff and the report.txt are as follows.

out.gff
screenshot(draft).png

report.txt
screenshot(draft2).png

With the original command (without -t), a UUID is generated for each feature.
python gff3tool/lib/gff3_ID_generator.py -g merged.gff -og id.gff -uuid -r report.txt
uuid.png

@tony006469
Contributor

tony006469 commented Mar 14, 2019

https://github.com/NAL-i5K/GFF3toolkit/tree/uuid_cds

python gff3tool/lib/gff3_ID_generator.py -g merged.gff -og id.gff -uuid -r report.txt -t CDS

@tony006469
Contributor

new_gff3.png

@tony006469
Contributor

Pull Request: #90
python gff3tool/lib/gff3_ID_generator.py -g merged.gff -og id.gff -uuid -r report.txt -t CDS

gff3_ID.png
