Skip to content

06 infercnv update #1013

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 15 commits into from
Closed

Conversation

maud-p
Copy link
Contributor

@maud-p maud-p commented Jan 30, 2025

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

This PR is following the comment from the PR#994

What is the goal of this pull request?

-[ ] Major change: I modify the infercnv input gene_order to split the chromosoms into p and q arms.
To do so, I updated the script 06a_build-geneposition.R

The reason of this major change is that Wilms tumor have specific loss/gain of chromosome arms. I would like to see if having the arm information, I would be able to reproduce similar results as presented in Cresswell et al. Fig 1A. This would allow us to gain confidence in the infercnv step, which is a crucial step for the annotation.

-[ ] Minor change: I added a parameters at the top of 00_workflow.sh to set the predicted_celltype_threshold.

06_infercnv.R has been re-ran with the new gene_order and predicted_celltype_threshold. Results will be updated to the s3 bucket.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes, finishing PR#994
and another related PR investigating more into details the infercnv results.

Results

What is the name of your results bucket on S3?

researcher-008971640512-us-east-2

What types of results does your code produce (e.g., table, figure)?

What is your summary of the results?

-[ ] in results/reference, the gene and arms position files, in a .txt format
-[ ] in results/SCPCS{sample_id}, the seurat object 06_infercnv_HMM-i3_{sample_id}_reference-both.rds containing the output of 06_infercnv.R and a 06_infercnv subfolder with some outputs of 06_infercnv.R that we decided to save.

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@maud-p maud-p requested a review from jaclyn-taroni as a code owner January 30, 2025 13:44
@jaclyn-taroni jaclyn-taroni requested review from sjspielman and removed request for jaclyn-taroni January 30, 2025 14:17
@sjspielman
Copy link
Member

@maud-p I started reviewing this code, but some of the spacing in the script makes this difficult for me to read and run clearly. I was commenting in spots where I think this can be fixed to make the code easier for me to follow, but I decided it will actually be faster if I just update the spacing myself and file a PR to your branch. Then you can approve it and merge into this branch, and we'll continue with review here. Working on it right now!

@sjspielman
Copy link
Member

While working on this, I realized there was a way to dramatically simplify the code as well. But, while working on it, I noticed there are some discrepancies and I want to make sure the references you are using are from compatible genome annotations.

For example, consider this gene ENSG00000171163.

In the gene_order, I see these coordinates:

   ensembl_id      chrom gene_start  gene_end
   <chr>           <chr>      <dbl>     <dbl>
 1 ENSG00000171163 chr1   249144205 249153343

But, in chromosome_arms, I see these coordinates for chr1 overall:

   chrom                arm       start       end
   <chr>                <chr>     <int>     <int>
 1 chr1                 "p"           0 123400000
 2 chr1                 "q"   123400000 248956422

Therefore, it looks like the coordinates for this gene are higher than the end position for the q-arm. We used the Ensembl 104 annotation in the scpca-nf pipeline, so you'll want to find annotation files that are compatible with that reference. I see now that your files are at different versions - v19 and v29 - but the version of gencode we need is v38 which corresponds to Ensembl 104. Therefore, we need to back up here and get the right files - I imagine this might also effect the cell type annotation results too!

What I suggest we do now is: You can please find and update the URLs in this script to point to a download for annotation files at the correct version. Alternatively, we can use the same GTF file that was used in the scpca-nf pipeline to find coordinates. This is publicly available from S3 - let me know if you want to go that route and I can help you with the code to download and/or parse it.

Once we get the correct annotation files here, I can return to updating the script spacing to file a PR to you.

@maud-p
Copy link
Contributor Author

maud-p commented Jan 31, 2025

Dear @sjspielman ,

Thank you for pointing out the versions inconsistency!!!! Should be fixed!

It might improve the 1q profile that wasn't so great, let's hope 🤞

My apologies for the criptic code, I tried to improved it now using mosty dplyr, let me know if something isn't clear. I am alsways a bit afraid with merging and manipulating tables... 😨

Thank you!

correct wrong arm annotation
(genes that are both on X and Y chromosome need to be remove before infercnv)
the changes in cnv-threshold-low/high will be made in the next PR#994. To keep the workflow running without error, I went a step back with the parameter of `07_combined_annotation_across_sample_exploration.Rmd`
@maud-p
Copy link
Contributor Author

maud-p commented Jan 31, 2025

@sjspielman the workflow keeps failing as the changes made for 06_infercnv.R impact the following notebook 07_combined_annotation_across_samples_exploration.Rmd. Do we want to have the entire workflow running before merging this PR or could we still split the changes into 2 PR?

Most important, the script 06_infercnv.R is now running with the new gene position. 🥳 The output on the first 3 samples do not show huge differencies, which is quite reinsuring, but still some differences. Will be running over night/weekend!

Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely heading in the right direction and code is easier to read, so thank you! That said, there are still some issues with the parsing as I note in my comments. I'd like to try again to help refactor this to make sure we're exporting the correct information, so what I'd like to confirm with you here is: Which columns are needed for inferCNV and how does this differ from what you'll need for the 07 notebook? Another way to ask, if we provide a column with arm to inferCNV, will it still work, or does that information need to be tracked separately?

As I noted in one of my comments, currently it doesn't seem that arm is ever exported, but I'm sure you'll want it since that's a reason we're doing this, right?

In terms of the CI, we can temporarily comment out that notebook code from running in 00_run_workflow.sh, and revive it in the next PR. So, we should aim to get this to pass with the understanding that we'll have to update in the next PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this file was accidentally duplicated - this copy should be removed.

output_format = 'html_document',
output_file = '07_combined_annotation_across_samples_exploration.html',
output_dir = '${notebook_dir}')"
output_dir = '${notebook_dir}')"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub likes having new lines at the end of files, and it loves to remind us with this aggressive red symbol! 😃

Suggested change
output_dir = '${notebook_dir}')"
output_dir = '${notebook_dir}')"

@@ -7,19 +7,126 @@ module_base <- file.path(repository_base, "analyses", "cell-type-wilms-tumor-06"

# load library
library(tidyverse)
library(dplyr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed since dplyr is loaded along with the core tidyverse in the previous line

Comment on lines 13 to 19
gene_order_file <- file.path(module_base, "results","references", 'gencode_v19_gene_pos.txt')

# Define the arm order file
arm_order_file <- file.path(module_base, "results","references", 'hg38_cytoBand.txt')

# Define the gene/arm combined information file
gene_arm_order_file <- file.path(module_base, "results","references", 'gencode_v29_gene_pos_arm.txt')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's do this...

Suggested change
gene_order_file <- file.path(module_base, "results","references", 'gencode_v19_gene_pos.txt')
# Define the arm order file
arm_order_file <- file.path(module_base, "results","references", 'hg38_cytoBand.txt')
# Define the gene/arm combined information file
gene_arm_order_file <- file.path(module_base, "results","references", 'gencode_v29_gene_pos_arm.txt')
reference_dir <- file.path(module_base, "results", "references")
gene_order_file <- file.path(reference_dir, 'gencode_v19_gene_pos.txt')
# Define the arm order file
arm_order_file <- file.path(reference_dir, 'hg38_cytoBand.txt')
# Define the gene/arm combined information file
gene_arm_order_file <- file.path(reference_dir, 'gencode_v29_gene_pos_arm.txt')

@@ -1,25 +1,87 @@
# This script downloads and adapt the genome position file for use with infercnv

# load library
library(tidyverse)
library(stringr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed since included in tidyverse

Suggested change
library(stringr)

}

# Read the GTF file and extract gene positions
gtf_data <- read.table(gzfile(gene_order_file), header = FALSE, sep = "\t", comment.char = "#", stringsAsFactors = FALSE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be faster, but importantly, it uses X instead of V for column names. I do think you should use read_tsv as I suggest, but rather than changing everything from V -> X, instead rename the columns immediately to something informative, and then start working with the data. This will make code much easier to work with and maintain.

Suggested change
gtf_data <- read.table(gzfile(gene_order_file), header = FALSE, sep = "\t", comment.char = "#", stringsAsFactors = FALSE)
gtf_data <- read_tsv(gene_order_file, col_names = FALSE, comment = "#")

gtf_data <- read.table(gzfile(gene_order_file), header = FALSE, sep = "\t", comment.char = "#", stringsAsFactors = FALSE)

# Filter for gene entries only
gene_order <- gtf_data %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before entering this code, rename the data frame and then use those names directly - this is what you do for the cytoband file, and it would help here too.

}

# Load cytoBand file
cytoBand <- read.table(gzfile(arm_order_file), header = FALSE, sep = "\t", stringsAsFactors = FALSE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cytoBand <- read.table(gzfile(arm_order_file), header = FALSE, sep = "\t", stringsAsFactors = FALSE)
cytoBand <- read_tsv(arm_order_file, col_names = FALSE)

Comment on lines +63 to +65
gene_order <- gene_order %>%
# Join chromosome arm information with gene_order based on Chromosome
left_join(chromosome_arms, by = c("Chromosome" = "chrom")) %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something is wrong here with the joining because you are ending up with duplicate columns. This can be addressed, but it's not the only issue; there are no p arms after the join.
Also in the end here, it seems you never actually export the arm, but isn't that we we need for the eventual downstream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, in the next version I'll simplify this. The only information that I want to position a gene on the p or q arm is the centromere position = Start of the q arm = End of the p arm. If I only keep the p arm from the chromosome_arms before joining with gene_order, the centromere information will be in the End column and I avoid duplicating columns. Will be updated soon-ish 😄

I export the arm information when I do mutate(Chromosome = paste0(Chromosome, arm)), because for infercnv, the gene_order input is a table with 4 columns: the gene name/ID, the chromosome info to segment the heatmaps (in our case chromosome+arm) and the start/end position of the gene.

Comment on lines 20 to 24
# Download the updated gene order file for GENCODE v38 (Ensembl 104)
if (!file.exists(gene_order_file)) {
download.file(url = "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz",
destfile = gene_order_file)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to use a GTF file directly here, we should go ahead and use the exact one used with scpca-nf that prepared the data you're working with. That is publicly available from S3 and can be obtained with this command:

aws s3 cp s3://scpca-references/homo_sapiens/ensembl-104/annotation/Homo_sapiens.GRCh38.104.gtf.gz {LOCAL TARGET DESTINATION} --no-sign-request

We should go ahead and replace the download.file with a system call to sync the file. I can potentially also take care of this in my refactor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now added the downloading command in the 00_run_worflow.sh, but not sure if this is the right place or the right way of doing it, please let me know 😄

@maud-p
Copy link
Contributor Author

maud-p commented Feb 5, 2025

@sjspielman I just see now that you worked on it yesterday and open a PR. I'll try to include your changes with my changes now! Thank you!

@maud-p maud-p mentioned this pull request Feb 5, 2025
@sjspielman
Copy link
Member

This PR has been replaced with #1026, so we can close this now.

@sjspielman sjspielman closed this Feb 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants