A collection of functions to facilitate analysis of HiC data based on the cooler and cooltools interfaces.
assignRegions(window, binsize, chroms, positions, arms)
: Constructs a 2d region around a series of chromosomal locations.
window specifies the window size for the constructed regions. The total region
assigned will span pos-window to pos+window. binsize specifies the size
of the HiC bins. The positions that represent the centers of the regions
are given by the chroms series and the positions series.
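The window arithmetic described above can be sketched roughly as follows. This is not the library's implementation: the helper name `sketch_windows`, the output column names, and the snapping of each position to the start of its bin are all assumptions made for illustration, and the arms argument is left out.

```python
import pandas as pd


def sketch_windows(window, binsize, chroms, positions):
    """Hypothetical helper: build pos-window to pos+window intervals."""
    chroms = pd.Series(chroms).reset_index(drop=True)
    positions = pd.Series(positions).reset_index(drop=True)
    # Assumption: snap each center down to the start of its HiC bin
    centers = (positions // binsize) * binsize
    return pd.DataFrame(
        {
            "chrom": chroms,
            "start": centers - window,
            "end": centers + window,
        }
    )


regions = sketch_windows(
    window=100_000,
    binsize=10_000,
    chroms=["chr1", "chr2"],
    positions=[1_234_567, 5_000_000],
)
print(regions)
```

Note that this sketch does not clip intervals at chromosome or arm boundaries, so positions close to a chromosome start can yield negative coordinates.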
assignRegions2d(window, binsize, chroms1, positions1, chroms2, positions2, arms)
: Constructs a 2d region around a series of chromosomal location pairs.
window specifies the window size for the constructed regions. The total region
assigned will span pos-window to pos+window. binsize specifies the size
of the HiC bins. The positions that represent the centers of the regions
are given by the chroms1 and chroms2 series as well as the
positions1 and positions2 series.
doPileupICCF(clr, snipping_windows, proc=5, collapse=True)
: Takes a cooler file handle and snipping windows constructed
by assignRegions and performs a pileup on all these regions
based on the corrected HiC counts. Returns a numpy array
that contains averages of all selected regions. The collapse
parameter specifies whether to return
the average window over all piles (collapse=True), or the individual
windows (collapse=False).
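To illustrate the difference between the two return modes, here is a minimal numpy sketch. It assumes the stack layout described under pileToFrame (window rows, window columns, window number) and assumes that the average ignores NaN pixels; neither detail is confirmed for the actual implementation.

```python
import numpy as np

# Hypothetical stack of 4 windows, each 5 x 5 pixels
rng = np.random.default_rng(0)
stack = rng.random((5, 5, 4))

# collapse=True analogue: average over the window axis
# (nan-aware averaging is an assumption)
collapsed = np.nanmean(stack, axis=2)

print(collapsed.shape)  # (5, 5)
print(stack.shape)      # (5, 5, 4), the collapse=False analogue
```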
doPileupObsExp(clr, expected_df, snipping_windows, proc=5, collapse=True)
: Takes a cooler file handle, an expected dataframe
constructed by getExpected, snipping windows constructed
by assignRegions and performs a pileup on all these regions
based on the obs/exp value. Returns a numpy array
that contains averages of all selected regions.
The collapse parameter specifies whether to return
the average window over all piles (collapse=True), or the individual
windows (collapse=False).
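The obs/exp value divides each observed pixel by the expected value at the same genomic distance. A toy example on a dense matrix, with made-up numbers and a hypothetical `expected_by_distance` array standing in for the expected dataframe:

```python
import numpy as np

# Toy observed contact matrix (3 bins)
observed = np.array(
    [
        [8.0, 4.0, 2.0],
        [4.0, 8.0, 4.0],
        [2.0, 4.0, 8.0],
    ]
)
# Hypothetical expected value for each distance (in bins) from the diagonal
expected_by_distance = np.array([8.0, 4.0, 1.0])

# Distance in bins between the row bin and the column bin of every pixel
rows, cols = np.indices(observed.shape)
distance = np.abs(rows - cols)

# Divide every pixel by the expected value at its distance
obs_exp = observed / expected_by_distance[distance]
print(obs_exp)
```

In this toy matrix only the two corner pixels at distance 2 are enriched over expected; all other pixels equal 1.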
downSamplePairs(sampleDict, Distance=10000)
: Downsamples the cis and trans reads of every sample in sampleDict so that
each sample contains as many combined cis and trans reads as the sample
with the lowest read number at the specified distance.
getArmsHg19()
: Downloads the coordinates for chromosomal arms of the
genome assembly hg19 and returns it as a dataframe.
getDiagIndices(arr)
: Helper function that returns the indices of the diagonal
of a given array into a flattened representation of the array.
For example, the 3 by 3 array:
[0, 1, 2]
[3, 4, 5]
[6, 7, 8]
would have diagonal indices [0, 4, 8].
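A possible way to compute such indices is sketched below. The function name `diag_indices_flat` is hypothetical and row-major (C order) flattening is assumed; the actual helper may be implemented differently.

```python
import numpy as np


def diag_indices_flat(arr):
    """Hypothetical re-implementation: flat indices of the main diagonal."""
    n_rows, n_cols = arr.shape
    rows = np.arange(min(n_rows, n_cols))
    # Element (i, i) sits at position i * n_cols + i in row-major order
    return rows * n_cols + rows


arr = np.arange(9).reshape(3, 3)
idx = diag_indices_flat(arr)
print(idx)               # [0 4 8]
print(arr.ravel()[idx])  # [0 4 8]
```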
getExpected(clr, arms, proc=20, ignoreDiagonals=2)
: Takes a clr file handle and a pandas dataframe
with chromosomal arms (generated by getArmsHg19()) and calculates
the expected read number at a certain genomic distance.
The proc parameter defines how many processes should be used
for the calculations. ignoreDiagonals specifies how many diagonals
to ignore (0 means the main diagonal, 1 means the main diagonal
and the two flanking diagonals, and so on).
getPairingScore(clr, windowsize=40000, func=<function mean>, regions=Empty DataFrame Columns: [] Index: [], norm=True, blankDiag=True)
: Takes a cooler file (clr),
a window size (windowsize), a summary
function (func) and a set of genomic
regions, and calculates the pairing score
as follows: a square with side length windowsize
is created for each entry in the supplied genomic
regions, and the summary function is applied to the Hi-C pixels
at that location in the supplied cooler file. The results are
returned as a dataframe. If no regions are supplied, regions
are constructed for each bin in the cooler file to
calculate a genome-wide pairing score. norm specifies whether the median of the
calculated pairing scores should be subtracted from the individual values, and blankDiag
specifies whether the diagonal should be blanked before calculating the pairing score.
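The procedure can be sketched on a dense matrix as shown below. This is an illustration under several assumptions, not the library's code: the function name `sketch_pairing_score` is hypothetical, the window is given as a half width in bins instead of a size in base pairs, blanking is taken to mean setting the diagonal to NaN, and a NaN-aware mean is used as the summary function.

```python
import numpy as np


def sketch_pairing_score(matrix, half_width, func=np.nanmean,
                         norm=True, blank_diag=True):
    """Hypothetical sketch: summarize a square window around each diagonal bin."""
    mat = np.array(matrix, dtype=float)  # copy, so the input stays untouched
    if blank_diag:
        # Assumption: blanking means replacing the main diagonal with NaN
        np.fill_diagonal(mat, np.nan)
    n = mat.shape[0]
    scores = np.full(n, np.nan)
    # Bins too close to the matrix edge keep a NaN score
    for i in range(half_width, n - half_width):
        lo, hi = i - half_width, i + half_width + 1
        scores[i] = func(mat[lo:hi, lo:hi])
    if norm:
        # Subtract the median score so that values are centered around 0
        scores = scores - np.nanmedian(scores)
    return scores


# Toy matrix with an enriched block covering bins 2 to 4
matrix = np.ones((6, 6))
matrix[2:5, 2:5] = 3.0
scores = sketch_pairing_score(matrix, half_width=1)
print(scores)
```

In this toy example the highest score falls on bin 3, the center of the enriched block.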
getPairingScoreObsExp(clr, expected, windowsize=40000, func=<function mean>, regions=Empty DataFrame Columns: [] Index: [], norm=True)
: Takes a cooler file (clr), an expected dataframe (expected; may be generated by getExpected),
a window size (windowsize), a summary
function (func) and a set of genomic
regions, and calculates the pairing score
as follows: a square with side length windowsize
is created for each entry in the supplied genomic
regions, and the summary function is applied to the Hi-C pixels (obs/exp values)
at that location in the supplied cooler file. The results are
returned as a dataframe. If no regions are supplied, regions
are constructed for each bin in the cooler file to
calculate a genome-wide pairing score.
loadPairs(path)
: Function to load a .pairs or .pairsam file
into a pandas dataframe.
This only works for relatively small files!
pileToFrame(pile)
: Takes a pile of pileup windows produced
by doPileupObsExp/doPileupICCF (with collapse set to False;
this is a numpy ndarray with the following dimensions:
pile.shape = [windowSize, windowSize, windowNumber])
and arranges them as a dataframe with the pixels of each
window flattened into columns and each individual window
being a row.
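The rearrangement can be pictured with the following numpy/pandas sketch. It assumes row-major flattening of each window and default integer column labels; the real function may order or name the columns differently.

```python
import numpy as np
import pandas as pd

# Hypothetical pile: 3 x 3 pixel windows, 4 windows in total
pile = np.arange(3 * 3 * 4, dtype=float).reshape(3, 3, 4)

# Move the window axis to the front, then flatten each window into one row
n_windows = pile.shape[2]
flat = np.moveaxis(pile, 2, 0).reshape(n_windows, -1)
frame = pd.DataFrame(flat)

print(frame.shape)  # (4, 9)
```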
slidingDiamond(array, sideLen=6, centerX=True)
: Slides a diamond of side length 'sideLen'
down the diagonal of the passed array and returns
the average value for each position and
the relative position of each value with respect
to the center of the array (in bin units).
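One way to picture the sliding window is the sketch below. It rests on assumptions: the diamond is modelled as a sideLen x sideLen square sub-matrix anchored on the diagonal, a NaN-aware mean is used, and positions are returned before values. The function name `sketch_sliding_diamond` is hypothetical, and the actual geometry and return order may differ.

```python
import numpy as np


def sketch_sliding_diamond(array, side_len=6, center_x=True):
    """Hypothetical sketch: average side_len x side_len squares along the diagonal."""
    arr = np.asarray(array, dtype=float)
    n = arr.shape[0]
    values, positions = [], []
    for i in range(n - side_len + 1):
        window = arr[i:i + side_len, i:i + side_len]
        values.append(np.nanmean(window))
        # Midpoint of the bins covered by this window
        positions.append(i + (side_len - 1) / 2)
    positions = np.array(positions)
    if center_x:
        # Express positions relative to the center of the array (in bins)
        positions = positions - (n - 1) / 2
    return positions, np.array(values)


x, y = sketch_sliding_diamond(np.ones((10, 10)), side_len=4)
print(x)
print(y)
```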