-
Notifications
You must be signed in to change notification settings - Fork 20
06 infercnv update #1013
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
06 infercnv update #1013
Conversation
@maud-p I started reviewing this code, but some of the spacing in the script makes this difficult for me to read and run clearly. I was commenting in spots where I think this can be fixed to make the code easier for me to follow, but I decided it will actually be faster if I just update the spacing myself and file a PR to your branch. Then you can approve it and merge into this branch, and we'll continue with review here. Working on it right now! |
While working on this, I realized there was a way to dramatically simplify the code as well. But, while working on it, I noticed there are some discrepancies and I want to make sure the references you are using are from compatible genome annotations. For example, consider this gene In the
But, in
Therefore, it looks like the coordinates for this gene are higher than the end position for the q-arm. We used the Ensembl 104 annotation in the What I suggest we do now is: You can please find and update the URLs in this script to point to a download for annotation files at the correct version. Alternatively, we can use the same GTF file that was used in the Once we get the correct annotation files here, I can return to updating the script spacing to file a PR to you. |
Dear @sjspielman , Thank you for pointing out the versions inconsistency!!!! Should be fixed! It might improve the 1q profile that wasn't so great, let's hope 🤞 My apologies for the criptic code, I tried to improved it now using mosty Thank you! |
correct wrong arm annotation
(genes that are both on X and Y chromosome need to be remove before infercnv)
the changes in cnv-threshold-low/high will be made in the next PR#994. To keep the workflow running without error, I went a step back with the parameter of `07_combined_annotation_across_sample_exploration.Rmd`
@sjspielman the workflow keeps failing as the changes made for Most important, the script |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely heading in the right direction and code is easier to read, so thank you! That said, there are still some issues with the parsing as I note in my comments. I'd like to try again to help refactor this to make sure we're exporting the correct information, so what I'd like to confirm with you here is: Which columns are needed for inferCNV and how does this differ from what you'll need for the 07 notebook? Another way to ask, if we provide a column with arm
to inferCNV, will it still work, or does that information need to be tracked separately?
As I noted in one of my comments, currently it doesn't seem that arm
is ever exported, but I'm sure you'll want it since that's a reason we're doing this, right?
In terms of the CI, we can temporarily comment out that notebook code from running in 00_run_workflow.sh
, and revive it in the next PR. So, we should aim to get this to pass with the understanding that we'll have to update in the next PR.
analyses/00_run_workflow.sh
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this file was accidentally duplicated - this copy should be removed.
output_format = 'html_document', | ||
output_file = '07_combined_annotation_across_samples_exploration.html', | ||
output_dir = '${notebook_dir}')" | ||
output_dir = '${notebook_dir}')" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GitHub likes having new lines at the end of files, and it loves to remind us with this aggressive red symbol! 😃
output_dir = '${notebook_dir}')" | |
output_dir = '${notebook_dir}')" | |
@@ -7,19 +7,126 @@ module_base <- file.path(repository_base, "analyses", "cell-type-wilms-tumor-06" | |||
|
|||
# load library | |||
library(tidyverse) | |||
library(dplyr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed since dplyr
is loaded along with the core tidyverse
in the previous line
gene_order_file <- file.path(module_base, "results","references", 'gencode_v19_gene_pos.txt') | ||
|
||
# Define the arm order file | ||
arm_order_file <- file.path(module_base, "results","references", 'hg38_cytoBand.txt') | ||
|
||
# Define the gene/arm combined information file | ||
gene_arm_order_file <- file.path(module_base, "results","references", 'gencode_v29_gene_pos_arm.txt') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's do this...
gene_order_file <- file.path(module_base, "results","references", 'gencode_v19_gene_pos.txt') | |
# Define the arm order file | |
arm_order_file <- file.path(module_base, "results","references", 'hg38_cytoBand.txt') | |
# Define the gene/arm combined information file | |
gene_arm_order_file <- file.path(module_base, "results","references", 'gencode_v29_gene_pos_arm.txt') | |
reference_dir <- file.path(module_base, "results", "references") | |
gene_order_file <- file.path(reference_dir, 'gencode_v19_gene_pos.txt') | |
# Define the arm order file | |
arm_order_file <- file.path(reference_dir, 'hg38_cytoBand.txt') | |
# Define the gene/arm combined information file | |
gene_arm_order_file <- file.path(reference_dir, 'gencode_v29_gene_pos_arm.txt') |
@@ -1,25 +1,87 @@ | |||
# This script downloads and adapt the genome position file for use with infercnv | |||
|
|||
# load library | |||
library(tidyverse) | |||
library(stringr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed since included in tidyverse
library(stringr) |
} | ||
|
||
# Read the GTF file and extract gene positions | ||
gtf_data <- read.table(gzfile(gene_order_file), header = FALSE, sep = "\t", comment.char = "#", stringsAsFactors = FALSE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be faster, but importantly, it uses X
instead of V
for column names. I do think you should use read_tsv
as I suggest, but rather than changing everything from V
-> X
, instead rename the columns immediately to something informative, and then start working with the data. This will make code much easier to work with and maintain.
gtf_data <- read.table(gzfile(gene_order_file), header = FALSE, sep = "\t", comment.char = "#", stringsAsFactors = FALSE) | |
gtf_data <- read_tsv(gene_order_file, col_names = FALSE, comment = "#") |
gtf_data <- read.table(gzfile(gene_order_file), header = FALSE, sep = "\t", comment.char = "#", stringsAsFactors = FALSE) | ||
|
||
# Filter for gene entries only | ||
gene_order <- gtf_data %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before entering this code, rename the data frame and then use those names directly - this is what you do for the cytoband file, and it would help here too.
} | ||
|
||
# Load cytoBand file | ||
cytoBand <- read.table(gzfile(arm_order_file), header = FALSE, sep = "\t", stringsAsFactors = FALSE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cytoBand <- read.table(gzfile(arm_order_file), header = FALSE, sep = "\t", stringsAsFactors = FALSE) | |
cytoBand <- read_tsv(arm_order_file, col_names = FALSE) |
gene_order <- gene_order %>% | ||
# Join chromosome arm information with gene_order based on Chromosome | ||
left_join(chromosome_arms, by = c("Chromosome" = "chrom")) %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something is wrong here with the joining because you are ending up with duplicate columns. This can be addressed, but it's not the only issue; there are no p
arms after the join.
Also in the end here, it seems you never actually export the arm, but isn't that we we need for the eventual downstream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, in the next version I'll simplify this. The only information that I want to position a gene on the p or q arm is the centromere position = Start
of the q arm = End
of the p arm. If I only keep the p arm from the chromosome_arms
before joining with gene_order
, the centromere information will be in the End
column and I avoid duplicating columns. Will be updated soon-ish 😄
I export the arm information when I do mutate(Chromosome = paste0(Chromosome, arm))
, because for infercnv
, the gene_order
input is a table with 4 columns: the gene name/ID, the chromosome info to segment the heatmaps (in our case chromosome+arm) and the start/end position of the gene.
# Download the updated gene order file for GENCODE v38 (Ensembl 104) | ||
if (!file.exists(gene_order_file)) { | ||
download.file(url = "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz", | ||
destfile = gene_order_file) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to use a GTF file directly here, we should go ahead and use the exact one used with scpca-nf
that prepared the data you're working with. That is publicly available from S3 and can be obtained with this command:
aws s3 cp s3://scpca-references/homo_sapiens/ensembl-104/annotation/Homo_sapiens.GRCh38.104.gtf.gz {LOCAL TARGET DESTINATION} --no-sign-request
We should go ahead and replace the download.file
with a system call to sync the file. I can potentially also take care of this in my refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now added the downloading command in the 00_run_worflow.sh
, but not sure if this is the right place or the right way of doing it, please let me know 😄
@sjspielman I just see now that you worked on it yesterday and open a PR. I'll try to include your changes with my changes now! Thank you! |
This PR has been replaced with #1026, so we can close this now. |
Purpose/implementation Section
Please link to the GitHub issue that this pull request addresses.
This PR is following the comment from the PR#994
What is the goal of this pull request?
-[ ] Major change: I modify the
infercnv
inputgene_order
to split the chromosoms into p and q arms.To do so, I updated the script
06a_build-geneposition.R
The reason of this major change is that Wilms tumor have specific loss/gain of chromosome arms. I would like to see if having the arm information, I would be able to reproduce similar results as presented in Cresswell et al. Fig 1A. This would allow us to gain confidence in the
infercnv
step, which is a crucial step for the annotation.-[ ] Minor change: I added a parameters at the top of
00_workflow.sh
to set thepredicted_celltype_threshold
.06_infercnv.R
has been re-ran with the newgene_order
andpredicted_celltype_threshold
. Results will be updated to the s3 bucket.If known, do you anticipate filing additional pull requests to complete this analysis module?
Yes, finishing PR#994
and another related PR investigating more into details the
infercnv
results.Results
What is the name of your results bucket on S3?
researcher-008971640512-us-east-2
What types of results does your code produce (e.g., table, figure)?
What is your summary of the results?
-[ ] in
results/reference
, the gene and arms position files, in a.txt
format-[ ] in
results/SCPCS{sample_id}
, the seurat object06_infercnv_HMM-i3_{sample_id}_reference-both.rds
containing the output of06_infercnv.R
and a06_infercnv
subfolder with some outputs of06_infercnv.R
that we decided to save.Provide directions for reviewers
What are the software and computational requirements needed to be able to run the code in this PR?
Are there particularly areas you'd like reviewers to have a close look at?
Is there anything that you want to discuss further?
Author checklists
Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.
Analysis module and review
README.md
has been updated to reflect code changes in this pull request.Reproducibility checklist
Dockerfile
.environment.yml
file.renv.lock
file.