Skip to content
Antonio Gómez-Martín edited this page Apr 11, 2023 · 1 revision

TASK 4. EN‐TEx ATAC‐seq data: downstream analyses

  • Move to folder ATAC-seq, and create folders to store bigBed data files and peaks analyses files. Make sure the files are organized in a consistent way as done for ChIP-seq. Retrieve from a newly generated metadata file ATAC-seq peaks (bigBed narrow, pseudoreplicated peaks, assembly GRCh38) for stomach and sigmoid_colon for the same donor used in the previous sections. Hint: have a look at what we did here. Make sure your md5sum values coincide with the ones provided by ENCODE.

IMPORTANT: if you are running your analyses on a Mac, exit the Docker container to download the metadata file. Re-enter the container right after to perform the rest of the analyses.

  • For each tissue, run an intersection analysis using BEDTools: report 1) the number of peaks that intersect promoter regions, 2) the number of peaks that fall outside gene coordinates (whole gene body, not just the promoter regions). Hint: have a look at what we did here and here.

Run the following docker container

sudo docker run -v $PWD:$PWD -w $PWD --rm -it dgarrimar/epigenomics_course

Create a directories into ATAC directory:

cd epigenomics_uvic
mkdir analyses annotation data

To download the metadata file, use this command:

../bin/download.metadata.sh "https://www.encodeproject.org/metadata/?replicates.library.biosample.donor.uuid=d370683e-81e7-473f-8475-7716d027849b&status=released&status=submitted&status=in+progress&assay_title=ATAC-seq&biosample_ontology.term_name=sigmoid+colon&biosample_ontology.term_name=stomach&type=Experiment&files.analyses.status=released&files.preferred_default=true"

Download bigBed peak calling.

Prepare a folder to store these types of data.

mkdir data/bigBed.files

Again, we will first parse the metadata file to retrieve the corresponding IDs for ATAC-seq, and then download the files in the corresponding folders (bigBed.files).

  • bigBed peak calling files (ATAC-seq, bigBed narrow, pseudoreplicated peaks, assembly GRCh38, most recent file for each tissue):
grep -F ATAC-seq metadata.tsv |\
grep -F "bigBed_narrowPeak" |\
grep -F "pseudoreplicated_peaks" |\
grep -F "GRCh38" |\
awk 'BEGIN{FS=OFS="\t"}{print $1, $11, $23}' |\
sort -k2,2 -k1,1r |\
sort -k2,2 -u > analyses/bigBed.peaks.ids.txt

cut -f1 analyses/bigBed.peaks.ids.txt |\
while read filename; do
  wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed"
done

Check the integrity of the downloaded files:

for file_type in bigBed; do

  # retrieve original MD5 hash from the metadata
  ../bin/selectRows.sh <(cut -f1 analyses/"$file_type".*.ids.txt) metadata.tsv | cut -f1,46 > data/"$file_type".files/md5sum.txt

  # compute MD5 hash on the downloaded files 
  cat data/"$file_type".files/md5sum.txt |\
  while read filename original_md5sum; do 
    md5sum data/"$file_type".files/"$filename"."$file_type" |\
    awk -v filename="$filename" -v original_md5sum="$original_md5sum" 'BEGIN{FS=" "; OFS="\t"}{print filename, original_md5sum, $1}' 
  done > tmp 
  mv tmp data/"$file_type".files/md5sum.txt

  # make sure there are no files for which original and computed MD5 hashes differ
  awk '$2!=$3' data/"$file_type".files/md5sum.txt

done

Genes with peaks of ATAC-seq in each tissue

  • Create a folder peaks.analysis inside analyses
mkdir analyses/peaks.analysis
mkdir data/bed.files

cut -f1 analyses/bigBed.peaks.ids.txt |\
while read filename; do
  bigBedToBed data/bigBed.files/"$filename".bigBed data/bed.files/"$filename".bed
done

Task 4.1. Number of peaks that intersect promoter regions:

Retrieve genes with ATAC peaks at the promoter region in each tissue.

cut -f-2 analyses/bigBed.peaks.ids.txt |\
while read filename tissue; do 
  bedtools intersect -a data/bed.files/"$filename".bed -b ../ChIP-seq/annotation/gencode.v24.protein.coding.non.redundant.TSS.bed -u > analyses/peaks.analysis/genes.with.peaks."$tissue".ATAC.bed
done
wc -l analyses/peaks.analysis/genes.with.peaks.*.ATAC.bed

47871 genes.with.peaks.sigmoid_colon.ATAC.txt 44749 genes.with.peaks.stomach.ATAC.txt 92620 total

Task 4.2. Number of peaks that fall outside gene coordinates:

Create a folder annotation and download there the gencode.v24.primary_assembly.annotation file.

mkdir annotation
wget -P annotation "https://www.encodeproject.org/files/gencode.v24.primary_assembly.annotation/@@download/gencode.v24.primary_assembly.annotation.gtf.gz"

Uncompress the gtf.gz file:

gunzip annotation/gencode.v24.primary_assembly.annotation.gtf.gz

Prepare a BED file with fall outside gene coordinates:

awk '$3=="gene"' annotation/gencode.v24.primary_assembly.annotation.gtf |\
grep -F "protein_coding" |\
cut -d ";" -f1 |\
awk 'BEGIN{OFS="\t"}{print $1, $4, $5, $10, 0, $7, $10}' |\
sed 's/\"//g' |\
awk 'BEGIN{FS=OFS="\t"}$1!="chrM"{$2=($2-1); print $0}' > annotation/gencode.v24.protein.coding.gene.body.bed

Retrieve genes with ATAC peaks at fall outside gene coordinates in each tissue.

cut -f-2 analyses/bigBed.peaks.ids.txt |\
while read filename tissue; do 
  bedtools intersect -a data/bed.files/"$filename".bed -b annotation/gencode.v24.protein.coding.gene.body.bed -v > analyses/peaks.analysis/outside.genes.with.peaks."$tissue".ATAC.bed
done
wc -l analyses/peaks.analysis/outside.genes.with.peaks.*.ATAC.bed

37035 outside.genes.with.peaks.sigmoid_colon.ATAC.bed 34537 outside.genes.with.peaks.stomach.ATAC.bed 71572 total

5. Distal regulatory activity

Task 5.1

Create a folder regulatory_elements inside epigenomics_uvic. This will be the folder where you store all your subsequent results.

mkdir regulatory_elements

Task 5.2

Distal regulatory regions are usually found to be flanked by both H3K27ac and H3K4me1. From your starting catalogue of open regions in each tissue, select those that overlap peaks of H3K27ac AND H3K4me1 in the corresponding tissue. You will get a list of candidate distal regulatory elements for each tissue. How many are they?

Create a directories into regulatory_elements directory:

cd regulatory_elements
mkdir analyses data
mkdir data/bigBed.files

Download peak calling BigBed peak calling files (H3K27ac and H3K4me1, bigBed narrow, pseudoreplicated peaks, assembly GRCh38, most recent file for each tissue):

  • H3K27ac:
grep -E H3K27ac ../ChIP-seq/metadata.tsv |\
grep -F "bigBed_narrowPeak" |\
grep -F "pseudoreplicated_peaks" |\
grep -F "GRCh38" |\
awk 'BEGIN{FS=OFS="\t"}{print $1, $11, $23}' |\
sort -k2,2 -k1,1r |\
sort -k2,2 -u > analyses/bigBed.peaks.ids.txt

awk '$3=="H3K27ac-human"{print $1}' analyses/bigBed.peaks.ids.txt |\
while read filename; do 
  wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed"
done
  • H3K4me1:
grep -E H3K4me1 ../ChIP-seq/metadata.tsv |\
grep -F "bigBed_narrowPeak" |\
grep -F "pseudoreplicated_peaks" |\
grep -F "GRCh38" |\
awk 'BEGIN{FS=OFS="\t"}{print $1, $11, $23}' |\
sort -k2,2 -k1,1r |\
sort -k2,2 -u >> analyses/bigBed.peaks.ids.txt

awk '$3=="H3K4me1-human"{print $1}' analyses/bigBed.peaks.ids.txt |\
while read filename; do 
  wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed"
done

Check the integrity of the downloaded files:

for file_type in bigBed; do

  # retrieve original MD5 hash from the metadata
  ../bin/selectRows.sh <(cut -f1 analyses/"$file_type".*.ids.txt) ../ChIP-seq/metadata.tsv | cut -f1,46 > data/"$file_type".files/md5sum.txt

  # compute MD5 hash on the downloaded files 
  cat data/"$file_type".files/md5sum.txt |\
  while read filename original_md5sum; do 
    md5sum data/"$file_type".files/"$filename"."$file_type" |\
    awk -v filename="$filename" -v original_md5sum="$original_md5sum" 'BEGIN{FS=" "; OFS="\t"}{print filename, original_md5sum, $1}' 
  done > tmp 
  mv tmp data/"$file_type".files/md5sum.txt

  # make sure there are no files for which original and computed MD5 hashes differ
  awk '$2!=$3' data/"$file_type".files/md5sum.txt

done

Convert the bigBed files to BED files:

cut -f1 analyses/bigBed.peaks.ids.txt |\
while read filename; do
  bigBedToBed data/bigBed.files/"$filename".bigBed data/bed.files/"$filename".bed
done

Select those that overlap peaks of H3K27ac AND H3K4me1 in the corresponding tissue:

  bedtools intersect -a data/bed.files/ENCFF872UHN.bed -b data/bed.files/ENCFF724ZOF.bed -u > analyses/peaks.analysis/peaks.sigmoid_colon.DRE.bed

  bedtools intersect -a data/bed.files/ENCFF977LBD.bed -b data/bed.files/ENCFF844XRN.bed -u > analyses/peaks.analysis/peaks.stomach.DRE.bed

Intersect the previous ATAC output:

  bedtools intersect -a ../ATAC-seq/analyses/peaks.analysis/outside.genes.with.peaks.sigmoid_colon.ATAC.bed -b analyses/peaks.analysis/peaks.sigmoid_colon.DRE.bed -u > analyses/peaks.analysis/peaks.sigmoid_colon.DRE.filter.bed
  
  bedtools intersect -a ../ATAC-seq/analyses/peaks.analysis/outside.genes.with.peaks.stomach.ATAC.bed -b analyses/peaks.analysis/peaks.stomach.DRE.bed -u > analyses/peaks.analysis/peaks.stomach.DRE.filter.bed

How many are they?

wc -l analyses/peaks.analysis/*filter.bed

15570 peaks.sigmoid_colon.DRE.filter.bed 9966 /peaks.stomach.DRE.filter.bed 25536 total

Task 5.3

Focus on regulatory elements that are located on chromosome 1 (hint: to parse a file based on the value of a specific column, have a look at what we did here), and generate a file regulatory.elements.starts.tsv that contains the name of the regulatory region (i.e. the name of the original ATAC-seq peak) and the start (5') coordinate of the region.

The following code can be used to examine the columns:

head -1 analyses/peaks.analysis/peaks.sigmoid_colon.DRE.filter.bed|\
awk 'BEGIN{FS=OFS="\t"}{for (i=1;i<=NF;i++){print $i, i}}'

As we can see, the start position is in column 2, while column 1 has the chromosome number. Now that we have these two columns, we can construct the tsv file (one file for each tissue):

for tissue in sigmoid_colon stomach; do
  awk 'BEGIN{FS=OFS="\t"}$1=="chr1"{print $1, $2}' analyses/peaks.analysis/peaks."$tissue".DRE.filter.bed > analyses/regulatory.elements.starts."$tissue".tsv
done

Task 5.4

Focus on protein-coding genes located on chromosome 1. From the BED file of gene body coordinates that you generated here, prepare a tab-separated file called gene.starts.tsv which will store the name of the gene in the first column, and the start coordinate of the gene on the second column (REMEMBER: for genes located on the minus strand, the start coordinate will be at the 3'). Use the command below as a starting point: For each peak we need to identify the colosest protein coding gene.

grep -w chr1 ../ATAC-seq/annotation/gencode.v24.protein.coding.gene.body.bed |\
awk 'BEGIN{FS=OFS="\t"}{if ($6=="+"){start=$2} else {start=$3}; print $4, start}' > analyses/gene.starts.tsv

Task 5.5

Download this python script inside the epigenomics_uvic/bin folder. Have a look at the help page of this script to understand how it works.

wget -P ../bin https://public-docs.crg.es/rguigo/Data/bborsari/UVIC/epigenomics_course/get.distance.py
python ../bin/get.distance.py -h

python ../bin/get.distance.py -h

Usage: get.distance.py [options]

Options: -h, --help show this help message and exit -i INPUT, --input=INPUT -s START, --start=START

Check that it works

python get.distance.py --input regulatory_elements/gene.starts.tsv --start 980000
    0       1000000

After failing to get the desired outcome, I decided to complete the script:

nano get.distance.py

for line in open_input.readlines(): # for each line in the input file gene, position = line.strip().split('\t') # split the line into two columns based on a tab position = int(position) # define a variable called position that correspond to the integer of the start of the gene distance = abs(position - enhancer_start) # compute the absolute value of the difference between position and enhancer_start

if distance < x: # if this absolute value is lower than x x = distance # this value will now be your current x selectedGene = gene # save gene as selectedGene selectedGeneStart = position # save position as selectedGeneStart

Task 5.6

For each regulatory element contained in the file regulatory.elements.starts.tsv, retrieve the closest gene and the distance to the closest gene using the python script you created above. Use the command below as a starting point.

for tissue in sigmoid_colon stomach; do
cat analyses/regulatory.elements.starts.$tissue.tsv | while read element start; do 
   python ../bin/get.distance.py --input analyses/gene.starts.tsv --start $start
done > analyses/regulatoryElements.genes.distances.$tissue.tsv
done

Task 5.7

Use R to compute the mean and the median of the distances stored in regulatoryElements.genes.distances.tsv.

Go folder analyses/

cd analyses/

Open R

R
  • Mean distance.sigmoid_colon <- read.delim("regulatoryElements.genes.distances.sigmoid_colon.tsv", header = FALSE) mean(distance.sigmoid_colon[,3])

[1] 71457.05

distance.stomach <- read.delim("regulatoryElements.genes.distances.stomach.tsv", header = FALSE) mean(distance.stomach[,3])

[1] 44981.41

  • Median median(distance.sigmoid_colon[,3])

[1] 35021

median(distance.stomach[,3])

[1] 27004