https://github.com/marbl/Primates#data-reuse-and-license
Cactus version: v2.7.1
Note that this uses the new --noSplit option, which puts all contigs into one big graph without any chromosome decomposition (as was done for the HPRC). Also, we're requesting tons of memory (see the log for actual usage). Most processes don't need that much, but the distance indexing takes over a terabyte for this graph.
In terms of graph construction, the 4 orang genomes can be added without much issue (I've tried this on the older assemblies). Not sure about the indexing though.
TOIL_SLURM_ARGS="--partition=long --time=8000" cactus-pangenome ./js-pg ./10-t2t-apes-mc-2023v2.seqfile --outDir 10-t2t-apes-mc-2023v2 --outName 10-t2t-apes-mc-2023v2 --reference hs1 hg38 --noSplit --gbz clip full --gfa clip full --xg clip full --odgi --vcf --giraffe clip --haplo clip --vcfReference hs1 hg38 --logFile 10-t2t-apes-mc-2023v2.log --batchSystem slurm --coordinationDir /data/tmp --caching false --batchLogsDir ./batch-logs --consMemory 1500Gi --indexMemory 1500Gi --mgMemory 500Gi --mgCores 72 --mapCores 8 --consCores 128 --indexCores 72
The same processing was used to make the "decomposed" HPRC vcfs. Note that this process will filter out giant SVs as a 100kb threshold was used.
docker run -it --rm -v $(pwd)/:/data --user $(id -u):$(id -g) ghcr.io/pangenome/pggb:202402032147026ffe7f bash -c "/data/vcf-bubwave.sh /data/10-t2t-apes-mc-2023v2.vcf.gz /data/10-t2t-apes-mc-2023v2-vcfbub.a100k.wave.vcf.gz"
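The contents of vcf-bubwave.sh are not shown here. As a rough guess at what such a wrapper could look like, the sketch below pipes vcfbub (dropping sites with alleles over 100kb, to match the a100k in the output name) into vcfwave. The function name and the exact flag values are assumptions, not the script that was actually run:

```shell
# Hypothetical stand-in for vcf-bubwave.sh: <input.vcf.gz> <output.vcf.gz>
bubwave() {
    local in_vcf="$1" out_vcf="$2"
    # vcfbub: keep only top-level bubbles (-l 0) and drop sites with an
    # allele longer than 100kb (-a 100000).
    # vcfwave: realign and decompose the remaining alleles.
    # -I 1000 (minimum allele length for inversion alignment) is an assumed setting.
    vcfbub -l 0 -a 100000 --input "$in_vcf" \
        | vcfwave -I 1000 \
        | bgzip > "$out_vcf"
}
```

Both tools ship in the pggb image used above, so a wrapper along these lines would run inside the same container.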
Left-align indels
vg paths -x 10-t2t-apes-mc-2023v2.xg -FQ hs1 | sed -e "s/hs1#0#//g" | bgzip > hs1.fa.gz
samtools faidx hs1.fa.gz
bcftools norm -f hs1.fa.gz 10-t2t-apes-mc-2023v2-vcfbub.a100k.wave.vcf.gz | bcftools sort | bgzip > 10-t2t-apes-mc-2023v2-vcfbub.a100k.wave.norm.vcf.gz
tabix -fp vcf 10-t2t-apes-mc-2023v2-vcfbub.a100k.wave.norm.vcf.gz
halRenameGenomes 10-t2t-apes-mc-2023v2.full.hal rename-hal-to-gca.tsv
for i in hs1 hg38 ; do TOIL_SLURM_ARGS="--partition=long --time=8000" cactus-hal2maf ./js ./10-t2t-apes-mc-2023v2.full.hal ./10-t2t-apes-mc-2023v2.${i}.maf.gz --filterGapCausingDupes --refGenome
Then make a bigmaf (Note: using cactus commit a8bd77e65d7f7c26fd7a6d69a110d1fe23b275c9 for this one -- reproduce with v2.7.2)
for i in hs1 hg38; do TOIL_SLURM_ARGS="--partition=long --time=8000" cactus-maf2bigmaf ./js-bb ./10-t2t-apes-mc-2023v2.${i}.maf.gz ./10-t2t-apes-mc-2023v2.${i}.bigmaf.bb --refGenome
The GCA accessions don't play nicely with the pangenome haplotype naming conventions, so the alignment was generated with human-readable names. To flip to accessions in the HAL
halRenameGenomes 10-t2t-apes-mc-2023v2.full.hal rename-hal-to-gca.tsv
And to go back
halRenameGenomes 10-t2t-apes-mc-2023v2.full.hal rename-hal-from-gca.tsv
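The reverse mapping does not have to be maintained by hand. A minimal sketch, assuming rename-hal-to-gca.tsv is a two-column, tab-separated file (current name, then new name) with no header line; the function name swap_columns is made up for illustration:

```shell
# Swap the two columns of a tab-separated mapping file.
# Assumption: each line is "<current name><TAB><new name>", no header.
swap_columns() {
    awk -F'\t' -v OFS='\t' 'NF == 2 { print $2, $1 }' "$1"
}

# Hypothetical usage:
# swap_columns rename-hal-to-gca.tsv > rename-hal-from-gca.tsv
```

Lines that do not have exactly two fields are skipped, so blank lines in the input do not end up in the output.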
For the MAFs, use
zcat 10-t2t-apes-mc-2023v2.hs1.maf.gz | sed -e's/hg002.1/GCA_018852605.2/g' \
-e 's/hg002.2/GCA_018852615.2/g' \
-e 's/mPanTro3.1/GCA_028858775.2/g' \
-e 's/mPanTro3.2/GCA_028858805.2/g' \
-e 's/mPanPan1.1/GCA_028858825.2/g' \
-e 's/mPanPan1.2/GCA_028858845.2/g' \
-e 's/mGorGor1.1/GCA_028885475.2/g' \
-e 's/mGorGor1.2/GCA_028885495.2/g' | bgzip > 10-t2t-apes-mc-2023v2.hs1.renamed.maf.gz
and
zcat 10-t2t-apes-mc-2023v2.hs1.renamed.maf.gz | sed -e's/GCA_018852605.2/hg002.1/g' \
-e 's/GCA_018852615.2/hg002.2/g' \
-e 's/GCA_028858775.2/mPanTro3.1/g' \
-e 's/GCA_028858805.2/mPanTro3.2/g' \
-e 's/GCA_028858825.2/mPanPan1.1/g' \
-e 's/GCA_028858845.2/mPanPan1.2/g' \
-e 's/GCA_028885475.2/mGorGor1.1/g' \
-e 's/GCA_028885495.2/mGorGor1.2/g' | bgzip > 10-t2t-apes-mc-2023v2.hs1.maf.gz
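The name pairs in the two sed commands above can also be generated from a mapping file instead of being typed out twice. A sketch, assuming a two-column, tab-separated file (old name, then new name); rename_maf is a hypothetical helper, and unlike the commands above it wraps each dot in a bracket expression so that `.` only matches a literal dot:

```shell
# Rewrite genome names in a MAF stream using a two-column TSV (old<TAB>new).
# Assumption: names contain only letters, digits, '_' and '.', as in the list above.
rename_maf() {
    local map="$1"
    local script
    # Turn each "old<TAB>new" line into one sed substitution. Dots in the
    # old name become [.] so they match only a literal dot.
    script=$(awk -F'\t' 'NF == 2 { old = $1; gsub(/\./, "[.]", old); print "s/" old "/" $2 "/g" }' "$map")
    sed -e "$script"
}

# Hypothetical usage:
# zcat in.maf.gz | rename_maf rename-hal-to-gca.tsv | bgzip > out.maf.gz
```

If the layout assumption holds for the halRenameGenomes mapping files, the same TSV could drive both the HAL and the MAF renaming.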