06 infercnv update #1013

maud-p · 2025-01-30T13:44:26Z

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

This PR follows up on the comment from PR#994.

What is the goal of this pull request?

-[ ] Major change: I modified the infercnv input gene_order to split the chromosomes into p and q arms.
To do so, I updated the script 06a_build-geneposition.R

The reason for this major change is that Wilms tumors have specific losses/gains of chromosome arms. I would like to see whether, with the arm information, I can reproduce results similar to those presented in Cresswell et al. Fig 1A. This would allow us to gain confidence in the infercnv step, which is crucial for the annotation.

-[ ] Minor change: I added a parameter at the top of 00_workflow.sh to set the predicted_celltype_threshold.

06_infercnv.R has been re-run with the new gene_order and predicted_celltype_threshold. Results will be uploaded to the S3 bucket.
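The arm split described above can be sketched as follows. This is a minimal illustration in Python, not the code in 06a_build-geneposition.R (which is R). It assumes that a gene is assigned to the arm containing its midpoint, that arm intervals are half-open, and that the arm is encoded by appending it to the chromosome name (e.g. `chr1p`) so the gene order table keeps four columns. The function names and the two example genes are hypothetical; the chr1 arm boundaries are the ones quoted later in this thread.

```python
# Arm boundaries for chr1 as quoted in this thread; other chromosomes omitted.
ARMS = [
    ("chr1", "p", 0, 123_400_000),
    ("chr1", "q", 123_400_000, 248_956_422),
]


def assign_arm(chrom, gene_start, gene_end, arms=ARMS):
    """Return the arm containing the gene midpoint, or None if no arm matches.

    Assumption: arm intervals are treated as half-open [start, end).
    """
    midpoint = (gene_start + gene_end) // 2
    for arm_chrom, arm, arm_start, arm_end in arms:
        if arm_chrom == chrom and arm_start <= midpoint < arm_end:
            return arm
    return None


def gene_order_rows(genes, arms=ARMS):
    """Build (gene_id, chrom+arm, start, end) rows, skipping unplaceable genes."""
    rows = []
    for gene_id, chrom, start, end in genes:
        arm = assign_arm(chrom, start, end, arms)
        if arm is not None:
            rows.append((gene_id, f"{chrom}{arm}", start, end))
    return rows


# Hypothetical genes (made-up ids and coordinates), for illustration only.
genes = [
    ("GENE_A", "chr1", 1_000_000, 1_010_000),
    ("GENE_B", "chr1", 200_000_000, 200_050_000),
]

# Print tab-separated rows without a header.
for row in gene_order_rows(genes):
    print("\t".join(str(field) for field in row))
```

Genes that fall outside every arm are dropped here rather than guessed at; whether that is the right behavior for this module is a design choice for the actual script.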

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes, finishing PR#994
and another related PR investigating the infercnv results in more detail.

Results

What is the name of your results bucket on S3?

researcher-008971640512-us-east-2

What types of results does your code produce (e.g., table, figure)?

What is your summary of the results?

-[ ] in results/reference, the gene and arm position files, in .txt format
-[ ] in results/SCPCS{sample_id}, the Seurat object 06_infercnv_HMM-i3_{sample_id}_reference-both.rds containing the output of 06_infercnv.R, and a 06_infercnv subfolder with the outputs of 06_infercnv.R that we decided to save.

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particular areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

sjspielman · 2025-01-30T15:32:58Z

@maud-p I started reviewing this code, but some of the spacing in the script makes this difficult for me to read and run clearly. I was commenting in spots where I think this can be fixed to make the code easier for me to follow, but I decided it will actually be faster if I just update the spacing myself and file a PR to your branch. Then you can approve it and merge into this branch, and we'll continue with review here. Working on it right now!

sjspielman · 2025-01-30T16:30:03Z

While working on this, I realized there was a way to dramatically simplify the code as well. But, while working on it, I noticed there are some discrepancies and I want to make sure the references you are using are from compatible genome annotations.

For example, consider this gene ENSG00000171163.

In the gene_order, I see these coordinates:

   ensembl_id      chrom gene_start  gene_end
   <chr>           <chr>      <dbl>     <dbl>
 1 ENSG00000171163 chr1   249144205 249153343

But, in chromosome_arms, I see these coordinates for chr1 overall:

   chrom                arm       start       end
   <chr>                <chr>     <int>     <int>
 1 chr1                 "p"           0 123400000
 2 chr1                 "q"   123400000 248956422

Therefore, it looks like the coordinates for this gene are higher than the end position for the q-arm. We used the Ensembl 104 annotation in the scpca-nf pipeline, so you'll want to find annotation files that are compatible with that reference. I see now that your files are at different versions (v19 and v29), but the version of GENCODE we need is v38, which corresponds to Ensembl 104. Therefore, we need to back up here and get the right files. I imagine this might also affect the cell type annotation results too!
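For illustration, the kind of consistency check described above can be sketched as below. This is a hedged Python sketch rather than code from the module (which uses R); the function names are hypothetical, and the only data are the chr1 rows and the ENSG00000171163 coordinates quoted in this comment. It takes each chromosome's length to be the largest arm end and flags any gene that ends beyond it.

```python
def chromosome_lengths(arms):
    """Map each chromosome to the largest arm end coordinate seen for it."""
    lengths = {}
    for chrom, _arm, _start, end in arms:
        lengths[chrom] = max(lengths.get(chrom, 0), end)
    return lengths


def genes_out_of_bounds(genes, arms):
    """Return ids of genes that end past their chromosome or lack arm data."""
    lengths = chromosome_lengths(arms)
    return [
        gene_id
        for gene_id, chrom, _start, gene_end in genes
        if chrom not in lengths or gene_end > lengths[chrom]
    ]


# Values quoted in this comment.
arms = [
    ("chr1", "p", 0, 123_400_000),
    ("chr1", "q", 123_400_000, 248_956_422),
]
genes = [("ENSG00000171163", "chr1", 249_144_205, 249_153_343)]

print(genes_out_of_bounds(genes, arms))  # ['ENSG00000171163']
```

A non-empty result like this one is a quick hint that the gene and arm tables come from incompatible annotation versions.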

What I suggest we do now is this: could you please find and update the URLs in this script to point to a download for annotation files at the correct version? Alternatively, we can use the same GTF file that was used in the scpca-nf pipeline to find coordinates. This is publicly available from S3. Let me know if you want to go that route, and I can help you with the code to download and/or parse it.

Once we get the correct annotation files here, I can return to updating the script spacing to file a PR to you.

maud-p · 2025-01-31T11:34:18Z

Dear @sjspielman ,

Thank you for pointing out the version inconsistency!!!! It should be fixed now!

It might improve the 1q profile that wasn't so great, let's hope 🤞

My apologies for the cryptic code. I have tried to improve it, now using mostly dplyr; let me know if something isn't clear. I am always a bit afraid of merging and manipulating tables... 😨

Thank you!

correct wrong arm annotation

(genes that are on both the X and Y chromosomes need to be removed before infercnv)

The changes in cnv-threshold-low/high will be made in the next PR#994. To keep the workflow running without error, I reverted the parameter in `07_combined_annotation_across_samples_exploration.Rmd`
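The commit note above about genes present on both the X and Y chromosomes can be illustrated with a small sketch. This is Python for illustration only, not the module's R code; the function name, gene ids, and coordinates are all hypothetical. It assumes the goal is to drop every gene id that is listed on more than one chromosome, so that each remaining gene appears on exactly one.

```python
from collections import defaultdict


def drop_multi_chromosome_genes(genes):
    """Keep only genes whose id is listed on exactly one chromosome."""
    chroms_by_gene = defaultdict(set)
    for gene_id, chrom, _start, _end in genes:
        chroms_by_gene[gene_id].add(chrom)
    return [row for row in genes if len(chroms_by_gene[row[0]]) == 1]


# Hypothetical rows: GENE_XY is listed on both chrX and chrY.
genes = [
    ("GENE_XY", "chrX", 100, 200),
    ("GENE_XY", "chrY", 100, 200),
    ("GENE_A", "chr1", 1_000, 2_000),
]

print(drop_multi_chromosome_genes(genes))
```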

maud-p · 2025-01-31T17:15:01Z

@sjspielman the workflow keeps failing because the changes made to 06_infercnv.R affect the following notebook, 07_combined_annotation_across_samples_exploration.Rmd. Do we want to have the entire workflow running before merging this PR, or could we still split the changes into 2 PRs?

Most importantly, the script 06_infercnv.R is now running with the new gene positions. 🥳 The outputs for the first 3 samples do not show huge differences, which is quite reassuring, but there are still some differences. It will be running overnight/over the weekend!

sjspielman

This is definitely heading in the right direction and the code is easier to read, so thank you! That said, there are still some issues with the parsing, as I note in my comments. I'd like to try again to help refactor this to make sure we're exporting the correct information, so what I'd like to confirm with you here is: which columns are needed for inferCNV, and how does this differ from what you'll need for the 07 notebook? Another way to ask: if we provide a column with arm to inferCNV, will it still work, or does that information need to be tracked separately?

As I noted in one of my comments, currently it doesn't seem that arm is ever exported, but I'm sure you'll want it since that's a reason we're doing this, right?

In terms of the CI, we can temporarily comment out that notebook code from running in 00_run_workflow.sh, and revive it in the next PR. So, we should aim to get this to pass with the understanding that we'll have to update in the next PR.

sjspielman · 2025-01-30T15:15:22Z

analyses/00_run_workflow.sh

It looks like this file was accidentally duplicated - this copy should be removed.

sjspielman · 2025-01-30T15:16:24Z

analyses/cell-type-wilms-tumor-06/00_run_workflow.sh

                        output_format = 'html_document',
                        output_file = '07_combined_annotation_across_samples_exploration.html',
-                        output_dir = '${notebook_dir}')"
+                        output_dir = '${notebook_dir}')"


GitHub likes having new lines at the end of files, and it loves to remind us with this aggressive red symbol! 😃

Suggested change

output_dir = '${notebook_dir}')"

output_dir = '${notebook_dir}')"

sjspielman · 2025-01-30T15:17:40Z