Implemented changes described in comments in the previous PR #13
Conversation
Made some comments for you to check!
Let's see if that test finishes correctly!
taranis/analyze_schema.py
Outdated
for rec_id, seq_value in allele_seq.items():
    unique_seq.remove(seq_value)
    if seq_value in unique_seq:
        if seq_value not in tmp_dict:
I feel that maybe this can be done without a tmp_dict?
Something like this?
from collections import defaultdict

value_to_keys = defaultdict(list)
# Iterate over items and group keys by value
for key, value in my_dict.items():
    value_to_keys[value].append(key)
# Filter out values with only one key (no duplicates)
duplicates = {value: keys for value, keys in value_to_keys.items() if len(keys) > 1}
And then use the duplicates to create the a_quality.
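For illustration, a minimal self-contained sketch of this grouping idea, using hypothetical toy data in place of the real allele dictionary (the keys and sequences here are made up):

```python
from collections import defaultdict

# Hypothetical toy data: allele id -> sequence
my_dict = {"allele_1": "ATGC", "allele_2": "GGCC", "allele_3": "ATGC"}

# Group keys by value
value_to_keys = defaultdict(list)
for key, value in my_dict.items():
    value_to_keys[value].append(key)

# Keep only values shared by more than one key, i.e. the duplicated sequences
duplicates = {value: keys for value, keys in value_to_keys.items() if len(keys) > 1}
print(duplicates)  # {'ATGC': ['allele_1', 'allele_3']}
```

The result maps each duplicated sequence to all allele ids that carry it, so one lookup recovers every member of a duplicate group.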
Yes, the code is nicer for getting only the entries that are duplicated. However, I would need another loop to get the allele_id from the sequence value in order to update the a_quality dictionary with the bad quality reason, which requires more CPU time, because it would have to look like this:
from collections import defaultdict

value_to_keys = defaultdict(list)
# Iterate over items and group keys by value
for key, value in my_dict.items():
    value_to_keys[value].append(key)
# Filter out values with only one key (no duplicates)
duplicates = {value: keys for value, keys in value_to_keys.items() if len(keys) > 1}
# Additional loop to map duplicated sequences back to allele ids
for seq_id, seq_value in allele_seq.items():
    if seq_value in duplicates:
        for dup_id in duplicates[seq_value]:
            a_quality[dup_id]["quality"] = "Bad quality"
            a_quality[dup_id]["reason"] = "Duplicate allele"
Please correct me if I misunderstood.
Hmm, I don't understand. If you keep the seq_id in the duplicates dict, you don't need to iterate over allele_seq again, right?
Maybe you need to tweak this a little in order to keep the seq_ids instead of the seq_values:
duplicates = {value: keys for value, keys in value_to_keys.items() if len(keys) > 1}
I was thinking of this code:
value_to_keys = defaultdict(list)
for rec_id, seq_value in allele_seq.items():
    value_to_keys[seq_value].append(rec_id)
    if len(value_to_keys[seq_value]) > 1:
        a_quality[rec_id]["quality"] = "Bad quality"
        a_quality[rec_id]["reason"] = "Duplicate allele"
        if self.remove_duplicated:
            bad_quality_record.append(rec_id)
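Exercised with hypothetical toy data (the allele_seq contents, the a_quality entry layout, and the remove_duplicated flag as a plain variable are all assumptions for the sketch), the single-pass version behaves like this:

```python
from collections import defaultdict

# Toy stand-ins for the real attributes
allele_seq = {"allele_1": "ATGC", "allele_2": "GGCC", "allele_3": "ATGC"}
a_quality = {rec_id: {"quality": "Good quality", "reason": "-"} for rec_id in allele_seq}
bad_quality_record = []
remove_duplicated = True  # stands in for self.remove_duplicated

value_to_keys = defaultdict(list)
for rec_id, seq_value in allele_seq.items():
    value_to_keys[seq_value].append(rec_id)
    # As soon as a sequence has been seen before, flag the current record
    if len(value_to_keys[seq_value]) > 1:
        a_quality[rec_id]["quality"] = "Bad quality"
        a_quality[rec_id]["reason"] = "Duplicate allele"
        if remove_duplicated:
            bad_quality_record.append(rec_id)

print(bad_quality_record)  # ['allele_3']
```

Note that only the later occurrences are flagged: the first allele carrying a duplicated sequence keeps its original quality, which fits if one representative of each duplicate group should be kept.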
Updated this part of the code to use only one loop and no temporary dictionary.
if self.remove_duplicated:
    bad_quality_record.append(rec_id)

for rec_id, seq_value in allele_seq.items():
Can't you include this check above so we only iterate through the dict once?
Maybe by changing the position of the if clause?
I cannot, because I need to remove it for inclusion in the bad quality record once I know that it is duplicated.
for item in possible_bad_quality:
    record_data[item] = bad_quality_reason[item] if item in bad_quality_reason else 0
    # record_data["bad_quality_reason"] = bad_quality_reason
for item in taranis.utils.POSIBLE_BAD_QUALITY:
I think you can do a double list comprehension here, but I'm not sure.
Yes, I will modify it with this code:
labels = taranis.utils.POSIBLE_BAD_QUALITY
values = [stats_df[item].sum() for item in labels]
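As a quick sketch of that summing pattern, using a hypothetical stats_df and label list (the real labels come from taranis.utils.POSIBLE_BAD_QUALITY and the real columns from the schema statistics):

```python
import pandas as pd

# Hypothetical per-locus statistics, one column per bad-quality reason
labels = ["Duplicate allele", "Not a CDS", "Subset allele"]
stats_df = pd.DataFrame(
    {"Duplicate allele": [2, 0], "Not a CDS": [1, 1], "Subset allele": [0, 3]}
)

# One total per label, in the same order as the labels
values = [stats_df[item].sum() for item in labels]
```

The comprehension keeps labels and values aligned by position, which is convenient when feeding both lists to a plotting call.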
This solution is now implemented in the code.
.github/workflows/tests.yml
Outdated
jobs:
  push_dockerhub:
    name: Push new Docker image to Docker Hub (dev)
  create-conda-env:
Suggested change: rename "create-conda-env:" to "tests:"
Test file updated to use the pyproject.toml file instead of setup.py.
- name: Activate env and install taranis
  run: |
    source $CONDA/etc/profile.d/conda.sh
    conda activate taranis_env
    poetry install
    taranis analyze-schema -i test/MLST_listeria -o analyze_schema_test --cpus 1 --output-allele-annot --remove-no-cds --remove-duplicated --remove-subset
The following changes were made for this PR: