-
Notifications
You must be signed in to change notification settings - Fork 0
Home
- Move to folder ATAC-seq, and create folders to store bigBed data files and peaks analyses files. Make sure the files are organized in a consistent way as done for ChIP-seq. Retrieve from a newly generated metadata file ATAC-seq peaks (bigBed narrow, pseudoreplicated peaks, assembly GRCh38) for stomach and sigmoid_colon for the same donor used in the previous sections. Hint: have a look at what we did here. Make sure your md5sum values coincide with the ones provided by ENCODE.
IMPORTANT: if you are running your analyses on a Mac, exit the Docker container to download the metadata file. Re-enter the container right after to perform the rest of the analyses.
- For each tissue, run an intersection analysis using BEDTools: report 1) the number of peaks that intersect promoter regions, 2) the number of peaks that fall outside gene coordinates (whole gene body, not just the promoter regions). Hint: have a look at what we did here and here.
Run the following docker container
sudo docker run -v $PWD:$PWD -w $PWD --rm -it dgarrimar/epigenomics_course
Create a directories into ATAC directory:
cd epigenomics_uvic
mkdir analyses annotation data
To download the metadata file, use this command:
../bin/download.metadata.sh "https://www.encodeproject.org/metadata/?replicates.library.biosample.donor.uuid=d370683e-81e7-473f-8475-7716d027849b&status=released&status=submitted&status=in+progress&assay_title=ATAC-seq&biosample_ontology.term_name=sigmoid+colon&biosample_ontology.term_name=stomach&type=Experiment&files.analyses.status=released&files.preferred_default=true"
Download bigBed peak calling.
Prepare a folder to store these types of data.
mkdir data/bigBed.files
Again, we will first parse the metadata file to retrieve the corresponding IDs for ATAC-seq, and then download the files in the corresponding folders (bigBed.files).
- bigBed peak calling files (ATAC-seq, bigBed narrow, pseudoreplicated peaks, assembly GRCh38, most recent file for each tissue):
grep -F ATAC-seq metadata.tsv |\
grep -F "bigBed_narrowPeak" |\
grep -F "pseudoreplicated_peaks" |\
grep -F "GRCh38" |\
awk 'BEGIN{FS=OFS="\t"}{print $1, $11, $23}' |\
sort -k2,2 -k1,1r |\
sort -k2,2 -u > analyses/bigBed.peaks.ids.txt
cut -f1 analyses/bigBed.peaks.ids.txt |\
while read filename; do
wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed"
done
Check the integrity of the downloaded files:
for file_type in bigBed; do
# retrieve original MD5 hash from the metadata
../bin/selectRows.sh <(cut -f1 analyses/"$file_type".*.ids.txt) metadata.tsv | cut -f1,46 > data/"$file_type".files/md5sum.txt
# compute MD5 hash on the downloaded files
cat data/"$file_type".files/md5sum.txt |\
while read filename original_md5sum; do
md5sum data/"$file_type".files/"$filename"."$file_type" |\
awk -v filename="$filename" -v original_md5sum="$original_md5sum" 'BEGIN{FS=" "; OFS="\t"}{print filename, original_md5sum, $1}'
done > tmp
mv tmp data/"$file_type".files/md5sum.txt
# make sure there are no files for which original and computed MD5 hashes differ
awk '$2!=$3' data/"$file_type".files/md5sum.txt
done
Genes with peaks of ATAC-seq in each tissue
- Create a folder peaks.analysis inside analyses
mkdir analyses/peaks.analysis
mkdir data/bed.files
cut -f1 analyses/bigBed.peaks.ids.txt |\
while read filename; do
bigBedToBed data/bigBed.files/"$filename".bigBed data/bed.files/"$filename".bed
done
- Download from here (https://public-docs.crg.es/rguigo/Data/bborsari/UVIC/epigenomics_course/gencode.v24.protein.coding.non.redundant.TSS.bed) the list of promoters ([-2 kb, +2 Kb] from TSS) of protein-coding genes. Store this file inside the annotation folder.
Retrieve genes with ATAC peaks at the promoter region in each tissue.
cut -f-2 analyses/bigBed.peaks.ids.txt |\
while read filename tissue; do
bedtools intersect -a data/bed.files/"$filename".bed -b ../ChIP-seq/annotation/gencode.v24.protein.coding.non.redundant.TSS.bed -u > analyses/peaks.analysis/genes.with.peaks."$tissue".ATAC.bed
done
wc -l analyses/peaks.analysis/genes.with.peaks.*.ATAC.bed
47871 genes.with.peaks.sigmoid_colon.ATAC.txt 44749 genes.with.peaks.stomach.ATAC.txt 92620 total
Create a folder annotation and download there the gencode.v24.primary_assembly.annotation file.
mkdir annotation
wget -P annotation "https://www.encodeproject.org/files/gencode.v24.primary_assembly.annotation/@@download/gencode.v24.primary_assembly.annotation.gtf.gz"
Uncompress the gtf.gz file:
gunzip annotation/gencode.v24.primary_assembly.annotation.gtf.gz
Prepare a BED file with fall outside gene coordinates:
awk '$3=="gene"' annotation/gencode.v24.primary_assembly.annotation.gtf |\
grep -F "protein_coding" |\
cut -d ";" -f1 |\
awk 'BEGIN{OFS="\t"}{print $1, $4, $5, $10, 0, $7, $10}' |\
sed 's/\"//g' |\
awk 'BEGIN{FS=OFS="\t"}$1!="chrM"{$2=($2-1); print $0}' > annotation/gencode.v24.protein.coding.gene.body.bed
Retrieve genes with ATAC peaks at fall outside gene coordinates in each tissue.
cut -f-2 analyses/bigBed.peaks.ids.txt |\
while read filename tissue; do
bedtools intersect -a data/bed.files/"$filename".bed -b annotation/gencode.v24.protein.coding.gene.body.bed -v > analyses/peaks.analysis/outside.genes.with.peaks."$tissue".ATAC.bed
done
wc -l analyses/peaks.analysis/outside.genes.with.peaks.*.ATAC.bed
37035 outside.genes.with.peaks.sigmoid_colon.ATAC.bed 34537 outside.genes.with.peaks.stomach.ATAC.bed 71572 total
Create a folder regulatory_elements inside epigenomics_uvic. This will be the folder where you store all your subsequent results.
mkdir regulatory_elements
Distal regulatory regions are usually found to be flanked by both H3K27ac and H3K4me1. From your starting catalogue of open regions in each tissue, select those that overlap peaks of H3K27ac AND H3K4me1 in the corresponding tissue. You will get a list of candidate distal regulatory elements for each tissue. How many are they?
Create a directories into regulatory_elements directory:
cd regulatory_elements
mkdir analyses data
mkdir data/bigBed.files
Download peak calling BigBed peak calling files (H3K27ac and H3K4me1, bigBed narrow, pseudoreplicated peaks, assembly GRCh38, most recent file for each tissue):
- H3K27ac:
grep -E H3K27ac ../ChIP-seq/metadata.tsv |\
grep -F "bigBed_narrowPeak" |\
grep -F "pseudoreplicated_peaks" |\
grep -F "GRCh38" |\
awk 'BEGIN{FS=OFS="\t"}{print $1, $11, $23}' |\
sort -k2,2 -k1,1r |\
sort -k2,2 -u > analyses/bigBed.peaks.ids.txt
awk '$3=="H3K27ac-human"{print $1}' analyses/bigBed.peaks.ids.txt |\
while read filename; do
wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed"
done
- H3K4me1:
grep -E H3K4me1 ../ChIP-seq/metadata.tsv |\
grep -F "bigBed_narrowPeak" |\
grep -F "pseudoreplicated_peaks" |\
grep -F "GRCh38" |\
awk 'BEGIN{FS=OFS="\t"}{print $1, $11, $23}' |\
sort -k2,2 -k1,1r |\
sort -k2,2 -u >> analyses/bigBed.peaks.ids.txt
awk '$3=="H3K4me1-human"{print $1}' analyses/bigBed.peaks.ids.txt |\
while read filename; do
wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed"
done
Check the integrity of the downloaded files:
for file_type in bigBed; do
# retrieve original MD5 hash from the metadata
../bin/selectRows.sh <(cut -f1 analyses/"$file_type".*.ids.txt) ../ChIP-seq/metadata.tsv | cut -f1,46 > data/"$file_type".files/md5sum.txt
# compute MD5 hash on the downloaded files
cat data/"$file_type".files/md5sum.txt |\
while read filename original_md5sum; do
md5sum data/"$file_type".files/"$filename"."$file_type" |\
awk -v filename="$filename" -v original_md5sum="$original_md5sum" 'BEGIN{FS=" "; OFS="\t"}{print filename, original_md5sum, $1}'
done > tmp
mv tmp data/"$file_type".files/md5sum.txt
# make sure there are no files for which original and computed MD5 hashes differ
awk '$2!=$3' data/"$file_type".files/md5sum.txt
done
Convert the bigBed files to BED files:
cut -f1 analyses/bigBed.peaks.ids.txt |\
while read filename; do
bigBedToBed data/bigBed.files/"$filename".bigBed data/bed.files/"$filename".bed
done
Select those that overlap peaks of H3K27ac AND H3K4me1 in the corresponding tissue:
bedtools intersect -a data/bed.files/ENCFF872UHN.bed -b data/bed.files/ENCFF724ZOF.bed -u > analyses/peaks.analysis/peaks.sigmoid_colon.DRE.bed
bedtools intersect -a data/bed.files/ENCFF977LBD.bed -b data/bed.files/ENCFF844XRN.bed -u > analyses/peaks.analysis/peaks.stomach.DRE.bed
Intersect the previous ATAC output:
bedtools intersect -a ../ATAC-seq/analyses/peaks.analysis/outside.genes.with.peaks.sigmoid_colon.ATAC.bed -b analyses/peaks.analysis/peaks.sigmoid_colon.DRE.bed -u > analyses/peaks.analysis/peaks.sigmoid_colon.DRE.filter.bed
bedtools intersect -a ../ATAC-seq/analyses/peaks.analysis/outside.genes.with.peaks.stomach.ATAC.bed -b analyses/peaks.analysis/peaks.stomach.DRE.bed -u > analyses/peaks.analysis/peaks.stomach.DRE.filter.bed
How many are they?
wc -l analyses/peaks.analysis/*filter.bed
15570 peaks.sigmoid_colon.DRE.filter.bed 9966 /peaks.stomach.DRE.filter.bed 25536 total
Focus on regulatory elements that are located on chromosome 1 (hint: to parse a file based on the value of a specific column, have a look at what we did here), and generate a file regulatory.elements.starts.tsv that contains the name of the regulatory region (i.e. the name of the original ATAC-seq peak) and the start (5') coordinate of the region.
The following code can be used to examine the columns:
head -1 analyses/peaks.analysis/peaks.sigmoid_colon.DRE.filter.bed|\
awk 'BEGIN{FS=OFS="\t"}{for (i=1;i<=NF;i++){print $i, i}}'
As we can see, the start position is in column 2, while column 1 has the chromosome number. Now that we have these two columns, we can construct the tsv file (one file for each tissue):
for tissue in sigmoid_colon stomach; do
awk 'BEGIN{FS=OFS="\t"}$1=="chr1"{print $1, $2}' analyses/peaks.analysis/peaks."$tissue".DRE.filter.bed > analyses/regulatory.elements.starts."$tissue".tsv
done
Focus on protein-coding genes located on chromosome 1. From the BED file of gene body coordinates that you generated here, prepare a tab-separated file called gene.starts.tsv which will store the name of the gene in the first column, and the start coordinate of the gene on the second column (REMEMBER: for genes located on the minus strand, the start coordinate will be at the 3'). Use the command below as a starting point: For each peak we need to identify the colosest protein coding gene.
grep -w chr1 ../ATAC-seq/annotation/gencode.v24.protein.coding.gene.body.bed |\
awk 'BEGIN{FS=OFS="\t"}{if ($6=="+"){start=$2} else {start=$3}; print $4, start}' > analyses/gene.starts.tsv
Download this python script inside the epigenomics_uvic/bin folder. Have a look at the help page of this script to understand how it works.
wget -P ../bin https://public-docs.crg.es/rguigo/Data/bborsari/UVIC/epigenomics_course/get.distance.py
python ../bin/get.distance.py -h
python ../bin/get.distance.py -h
Usage: get.distance.py [options]
Options: -h, --help show this help message and exit -i INPUT, --input=INPUT -s START, --start=START
Check that it works
python get.distance.py --input regulatory_elements/gene.starts.tsv --start 980000
0 1000000
After failing to get the desired outcome, I decided to complete the script:
nano get.distance.py
for line in open_input.readlines(): # for each line in the input file gene, position = line.strip().split('\t') # split the line into two columns based on a tab position = int(position) # define a variable called position that correspond to the integer of the start of the gene distance = abs(position - enhancer_start) # compute the absolute value of the difference between position and enhancer_start
if distance < x: # if this absolute value is lower than x x = distance # this value will now be your current x selectedGene = gene # save gene as selectedGene selectedGeneStart = position # save position as selectedGeneStart
For each regulatory element contained in the file regulatory.elements.starts.tsv, retrieve the closest gene and the distance to the closest gene using the python script you created above. Use the command below as a starting point.
for tissue in sigmoid_colon stomach; do
cat analyses/regulatory.elements.starts.$tissue.tsv | while read element start; do
python ../bin/get.distance.py --input analyses/gene.starts.tsv --start $start
done > analyses/regulatoryElements.genes.distances.$tissue.tsv
done
Use R to compute the mean and the median of the distances stored in regulatoryElements.genes.distances.tsv.
Go folder analyses/
cd analyses/
Open R
R
- Mean distance.sigmoid_colon <- read.delim("regulatoryElements.genes.distances.sigmoid_colon.tsv", header = FALSE) mean(distance.sigmoid_colon[,3])
[1] 71457.05
distance.stomach <- read.delim("regulatoryElements.genes.distances.stomach.tsv", header = FALSE) mean(distance.stomach[,3])
[1] 44981.41
- Median median(distance.sigmoid_colon[,3])
[1] 35021
median(distance.stomach[,3])
[1] 27004