Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented changes described in comments in the previous PR #13

Merged
merged 50 commits into from
Jan 30, 2024

Conversation

luissian
Copy link
Member

@luissian luissian commented Jan 8, 2024

The following changes were made for this PR:

  • Biopython is used to verify if a nucleotide seq can be converted to protein.
  • Error messages while converting are fetched from biopython
  • new parameter in analyze schema "cpus" is added
  • matching by allele name in prokka gtf output with the one in the gene file
  • Crated box plot for variability on allele length
  • Bar plot created for graphic showing range of alleles (x axis) and number of genes( y axis)
  • Stored in memory the annotation for each gene and save a tar.gz file with all
  • Included liting in actions
  • Implemented in code the comments requested on the previous PR
  • Replaced deprecated setup.py with pyproject.toml
  • Update actions to test analyze schema

@luissian luissian marked this pull request as ready for review January 8, 2024 20:40
Copy link
Member

@saramonzon saramonzon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some comments for you to check!
Let's see if that test finishes correctly!

.github/workflows/tests.yml Outdated Show resolved Hide resolved
environment.yml Outdated Show resolved Hide resolved
taranis/__main__.py Outdated Show resolved Hide resolved
taranis/__main__.py Show resolved Hide resolved
taranis/__main__.py Outdated Show resolved Hide resolved
taranis/prediction.py Outdated Show resolved Hide resolved
taranis/analyze_schema.py Outdated Show resolved Hide resolved
for rec_id, seq_value in allele_seq.items():
unique_seq.remove(seq_value)
if seq_value in unique_seq:
if seq_value not in tmp_dict:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feell that maybe this can be done without a tmp_dict?
Something like this?

 value_to_keys = defaultdict(list)

    # Iterate over items and group keys by value
    for key, value in my_dict.items():
        value_to_keys[value].append(key)

    # Filter out values with only one key (no duplicates)
    duplicates = {value: keys for value, keys in value_to_keys.items() if len(keys) > 1}

And the use the duplicates to create the a_quality

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, code is nicer for getting only which are duplicated, however, I need to create another loop for getting the allele_id based on the sequence value to update the a_quality dictionary for setting the bad quality reason, which requires more CPU time. Because it should be like this:

value_to_keys = defaultdict(list)

# Iterate over items and group keys by value
for key, value in my_dict.items():
    value_to_keys[value].append(key)

# Filter out values with only one key (no duplicates)
duplicates = {value: keys for value, keys in value_to_keys.items() if len(keys) > 1}

# Additional code
for seq_id, seq_value in allele_seq.items():
      if seq_value in duplicates:
          for id in duplicates[seq_value]:
              a_quality[id]["quality"] = "Bad quality"
              a_quality[id]["reason"] = "Duplicate allele"

Please correct me if I misunderstood.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mm don't understand, if you keep the seq_id in the duplicates list, you don't need to iterate over that, right?
Maybe you need to twist a little this in order to keep the seq_ids instead of the seq_values:

duplicates = {value: keys for value, keys in value_to_keys.items() if len(keys) > 1}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking on this code

            value_to_keys = defaultdict(list)
            for rec_id, seq_value in allele_seq.items():
                value_to_keys[seq_value].append(rec_id)
                if len(value_to_keys[seq_value]) > 1:
                    a_quality[rec_id]["quality"] = "Bad quality"
                    a_quality[rec_id]["reason"] = "Duplicate allele"
                    if self.remove_duplicated:
                        bad_quality_record.append(rec_id)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this part of code having only 1 loop and without any temporary dictionary

if self.remove_duplicated:
bad_quality_record.append(rec_id)

for rec_id, seq_value in allele_seq.items():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't you include this check above so we only iterate one time through the dict?
Maybe changing the if clause position?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can not as I need to remove it, for including in the bad quality record , once I know that it is duplicated

for item in possible_bad_quality:
record_data[item] = bad_quality_reason[item] if item in bad_quality_reason else 0
# record_data["bad_quality_reason"] = bad_quality_reason
for item in taranis.utils.POSIBLE_BAD_QUALITY:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can do a double list comprenhension here, but not sure

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will modify it with this code

        labels = taranis.utils.POSIBLE_BAD_QUALITY
        values = [stats_df[item].sum() for item in labels]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is implemented this solution in code


jobs:
push_dockerhub:
name: Push new Docker image to Docker Hub (dev)
create-conda-env:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
create-conda-env:
tests:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test file updated to use the pyroject.toml file instead of setup.py

- name: Activate env and install taranis
        run: |
          source $CONDA/etc/profile.d/conda.sh
          conda activate taranis_env
          poetry install
          taranis analyze-schema -i  test/MLST_listeria -o analyze_schema_test  --cpus 1 --output-allele-annot --remove-no-cds --remove-duplicated --remove-subset

@saramonzon saramonzon merged commit 54fc96f into BU-ISCIII:develop Jan 30, 2024
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants