Implemented changes described in comments in the previous PR #13

luissian · 2024-01-08T20:38:26Z

The following changes were made for this PR:

Biopython is used to verify whether a nucleotide sequence can be converted to protein.
Error messages raised during conversion are taken from Biopython.
A new "cpus" parameter was added to analyze schema.
Allele names in the Prokka GTF output are matched against the ones in the gene file.
Created a box plot for the variability in allele length.
Created a bar plot showing the range of alleles (x axis) and the number of genes (y axis).
Stored the annotation for each gene in memory and saved a tar.gz file with all of them.
Included linting in actions.
Implemented in code the comments requested on the previous PR.
Replaced the deprecated setup.py with pyproject.toml.
Updated actions to test analyze schema.

saramonzon

Made some comments for you to check!
Let's see if that test finishes correctly!

.github/workflows/tests.yml

environment.yml

taranis/__main__.py

taranis/prediction.py

taranis/analyze_schema.py

saramonzon · 2024-01-19T14:55:54Z

taranis/analyze_schema.py

            for rec_id, seq_value in allele_seq.items():
-                unique_seq.remove(seq_value)
-                if seq_value in unique_seq:
+                if seq_value not in tmp_dict:


I feel that maybe this can be done without a tmp_dict?
Something like this?

value_to_keys = defaultdict(list)
# Iterate over items and group keys by value
for key, value in my_dict.items():
    value_to_keys[value].append(key)
# Filter out values with only one key (no duplicates)
duplicates = {value: keys for value, keys in value_to_keys.items() if len(keys) > 1}

And then use the duplicates to create the a_quality

Yes, the code is nicer for getting only the duplicated ones. However, I would need another loop to get the allele_id from the sequence value so I can update the a_quality dictionary with the bad quality reason, and that costs more CPU time. It would have to look like this:

value_to_keys = defaultdict(list)
# Iterate over items and group keys by value
for key, value in my_dict.items():
    value_to_keys[value].append(key)
# Filter out values with only one key (no duplicates)
duplicates = {value: keys for value, keys in value_to_keys.items() if len(keys) > 1}
# Additional code
for seq_id, seq_value in allele_seq.items():
    if seq_value in duplicates:
        for id in duplicates[seq_value]:
            a_quality[id]["quality"] = "Bad quality"
            a_quality[id]["reason"] = "Duplicate allele"

Please correct me if I misunderstood.

Hmm, I don't understand. If you keep the seq_id in the duplicates list, you don't need to iterate over that, right?
Maybe you need to tweak this a little in order to keep the seq_ids instead of the seq_values:

duplicates = {value: keys for value, keys in value_to_keys.items() if len(keys) > 1}

I was thinking of this code:

value_to_keys = defaultdict(list)
for rec_id, seq_value in allele_seq.items():
    value_to_keys[seq_value].append(rec_id)
    if len(value_to_keys[seq_value]) > 1:
        a_quality[rec_id]["quality"] = "Bad quality"
        a_quality[rec_id]["reason"] = "Duplicate allele"
        if self.remove_duplicated:
            bad_quality_record.append(rec_id)

I updated this part of the code so that it uses only one loop and no temporary dictionary.
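
For reference, a minimal standalone sketch of the single-loop idea discussed in this thread. It is not the code merged into taranis/analyze_schema.py: the function name flag_duplicates, the default "Good quality" / "-" entries and the sample sequences are assumptions made only for illustration. Note that with this approach the first allele seen with a given sequence keeps its original quality, and only the later copies are flagged.

```python
from collections import defaultdict


def flag_duplicates(allele_seq, remove_duplicated=False):
    """Flag, in one pass, every allele whose sequence was already seen.

    Hypothetical helper: the name and the default quality values are
    assumptions for this sketch, not part of taranis.
    """
    value_to_keys = defaultdict(list)
    # Assumed starting state: every allele is good until proven otherwise.
    a_quality = {
        rec_id: {"quality": "Good quality", "reason": "-"} for rec_id in allele_seq
    }
    bad_quality_record = []
    for rec_id, seq_value in allele_seq.items():
        value_to_keys[seq_value].append(rec_id)
        # More than one id for this sequence means rec_id is a repeated copy.
        if len(value_to_keys[seq_value]) > 1:
            a_quality[rec_id]["quality"] = "Bad quality"
            a_quality[rec_id]["reason"] = "Duplicate allele"
            if remove_duplicated:
                bad_quality_record.append(rec_id)
    return a_quality, bad_quality_record


# Made-up sample data: a3 repeats the sequence of a1.
allele_seq = {"a1": "ATGAAATAA", "a2": "ATGCCCTAA", "a3": "ATGAAATAA"}
a_quality, bad_quality_record = flag_duplicates(allele_seq, remove_duplicated=True)
```

Here only a3 ends up in bad_quality_record, because a1 is the first occurrence of the shared sequence.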

saramonzon · 2024-01-19T14:56:26Z

taranis/analyze_schema.py

+                    if self.remove_duplicated:
+                        bad_quality_record.append(rec_id)
+
+        for rec_id, seq_value in allele_seq.items():


Can't you include this check above so we only iterate through the dict once?
Maybe changing the if clause position?

I cannot, because I need to remove it, in order to include it in the bad quality record, once I know that it is duplicated.

saramonzon · 2024-01-19T14:57:31Z

taranis/analyze_schema.py

-        for item in possible_bad_quality:
-            record_data[item] =  bad_quality_reason[item] if item in bad_quality_reason else 0
-        # record_data["bad_quality_reason"] = bad_quality_reason
+        for item in taranis.utils.POSIBLE_BAD_QUALITY:


I think you can do a double list comprehension here, but I'm not sure

Yes, I will modify it with this code

labels = taranis.utils.POSIBLE_BAD_QUALITY
values = [stats_df[item].sum() for item in labels]

This solution is now implemented in the code.
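
To make the per-label totals concrete, here is a small sketch of the same pattern using only plain dictionaries instead of stats_df (which is presumably a pandas DataFrame). The label strings and the counts are made up for the example, and the labels list merely stands in for taranis.utils.POSIBLE_BAD_QUALITY, whose real contents are not shown in this thread.

```python
# Hypothetical labels standing in for taranis.utils.POSIBLE_BAD_QUALITY.
labels = ["not a start codon", "not a stop codon", "duplicate", "sub set"]

# Hypothetical per-gene counts; a label missing from a record counts as zero,
# which mirrors the "... if item in bad_quality_reason else 0" of the old loop.
records = [
    {"not a start codon": 2, "duplicate": 1},
    {"duplicate": 3, "sub set": 4},
]

# One total per label, in the same order as labels.
values = [sum(rec.get(item, 0) for rec in records) for item in labels]
```

Keeping values in the same order as labels means the two lists can be passed directly as the categories and heights of a bar plot.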

saramonzon · 2024-01-23T12:00:07Z

.github/workflows/tests.yml


 jobs:
-  push_dockerhub:
-    name: Push new Docker image to Docker Hub (dev)
+  create-conda-env:


Suggested change

create-conda-env:

tests:

Test file updated to use the pyproject.toml file instead of setup.py:

- name: Activate env and install taranis
  run: |
    source $CONDA/etc/profile.d/conda.sh
    conda activate taranis_env
    poetry install
    taranis analyze-schema -i test/MLST_listeria -o analyze_schema_test --cpus 1 --output-allele-annot --remove-no-cds --remove-duplicated --remove-subset

luissian added 7 commits January 8, 2024 19:42

Update analyze schema with Comments in previous PR. Added liting work…

c29337a

…flow

fixiing some liting

92f379b

fixiing more liting errors

e1ed738

changed file extension for old python files

1d86220

fixing latest liting

c18365d

fixing latest liting 2

54533b1

fixing latest liting 3

1da9102

luissian marked this pull request as ready for review January 8, 2024 20:40

luissian added 9 commits January 16, 2024 16:47

checking liting in functions which are defined the type of variable

cfcc652

Checking testing file

eb00782

first draft to run test

d732fa0

first draft to run test

dccedc1

added deps to github action test

b19266d

Updated test and environment for conda installation

5f4e2ef

Removed python packages from conda and move to pip

8b70dbe

remove Self from annotation

e8eecd0

liting

f2ef7ad

saramonzon reviewed Jan 19, 2024

luissian added 12 commits January 22, 2024 14:14

Updated with comments in PR#13. Adding testing analyze schema with te…

84a52d3

…st data

fixing liting and error testing

917df0a

Again trying to fix liting and testing

e599433

modified schema input parameter

174da64

correcting wrong path of schema

12cfabc

including echo ls to know which is the working path

dc9b272

including ls to know which is the working path

8678110

removing variable

a6254bc

activate conda environment

88f4de8

added conda init before activate conda base

b022f43

activte conda env with the source command and the activate

db76c5b

testing how to run prokka

e372796

luissian added 9 commits January 22, 2024 18:55

testing how to run prokka_2

002ca49

testing 1

397da09

test2

58d7bad

test3

d3f53c4

test4

ba3276e

test5

983a7d5

test5

00f8692

test6

0003585

added kaleido package

55dbd45

saramonzon reviewed Jan 23, 2024

luissian added 13 commits January 23, 2024 19:17

Udpated code with latest comment in PR

573a253

replace deprecate setup.py for pyproject.toml

a740b91

reduce the 2 loop for checking the duplicated and sub allele

c8dcc7f

remove unnecesary comments

33130e2

testing new pyproject.toml file

98a1f63

fixing liting and update test

6cabd24

update installation taranis for testing

ef0f2b7

include all command in the same run

a8ae894

Split the number of cpus used for prokka and app

9b34397

commit to start with reference allele feature

22096fa

Including poetry.lock in gitignore

86fbcdc

Fixing liting

4a0a754

Fixing liting

0572e69

saramonzon approved these changes Jan 30, 2024

saramonzon merged commit 54fc96f into BU-ISCIII:develop Jan 30, 2024
4 of 5 checks passed
