This repository has been archived by the owner on Oct 15, 2020. It is now read-only.

Estimate centrality

Gabriele Girelli edited this page Aug 23, 2018 · 13 revisions

The gpseqc_estimate script estimates regional nuclear centrality from a multi-condition GPSeq experiment. Run gpseqc_estimate -h for details on the parameters, and gpseqc_estimate -H for a more readable description of the pipeline.

Minimum options

The minimum input to the gpseqc_estimate script is an output directory (specified with the -o option) and at least 2 (or 3 when using -n) bed files, provided in order of increasing digestion condition.
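As an illustration of the rule above, the following sketch assembles an argument list for a minimal run and checks the bed file count before anything is launched. The helper name `build_estimate_command`, the file names, and the placement of the options before the bed files are assumptions made for this example; only the -o and -n options come from this page.

```python
def build_estimate_command(bed_files, out_dir, normalize=False):
    """Assemble an argument list for gpseqc_estimate (illustrative helper)."""
    # At least 2 bed files are required, or 3 when normalizing with -n.
    minimum = 3 if normalize else 2
    if len(bed_files) < minimum:
        raise ValueError(
            f"need at least {minimum} bed files, got {len(bed_files)}"
        )
    cmd = ["gpseqc_estimate"]
    if normalize:
        cmd.append("-n")
    cmd.extend(["-o", out_dir])
    # The caller is responsible for listing the bed files in order of
    # increasing digestion condition; this helper does not reorder them.
    cmd.extend(bed_files)
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, ordered from shortest to longest digestion.
    print(" ".join(build_estimate_command(["1min.bed", "10min.bed"], "out")))
```

The resulting list can be passed to `subprocess.run` once the script is installed.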

Binning

  • For chromosome-wide bins: use the default, no need to change anything.
  • For sub-chromosomal non-overlapping bins: set the size of your bins using the -s option.
  • For sub-chromosomal overlapping bins: set the size and step of your bins with the -s and -p options respectively.
  • To estimate centrality of a custom list of bins: provide a bed file with the bins of interest using the -b option.
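To make the difference between bin size (-s) and step (-p) concrete, here is a minimal sketch that generates bins along one chromosome. The function `make_bins` is hypothetical and is not taken from the gpseqc code; in particular, truncating the last bins at the chromosome end is an assumption about edge handling.

```python
def make_bins(chrom_size, size, step=None):
    """Return (start, end) bins covering [0, chrom_size).

    With step equal to size (the default here) bins do not overlap;
    with step smaller than size, consecutive bins overlap.
    """
    if step is None:
        step = size
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    bins = []
    start = 0
    while start < chrom_size:
        # Assumption: bins reaching past the chromosome end are truncated.
        bins.append((start, min(start + size, chrom_size)))
        start += step
    return bins


if __name__ == "__main__":
    print(make_bins(10, 5))     # non-overlapping bins
    print(make_bins(10, 4, 2))  # overlapping bins, step of 2
```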

Coverage considerations - grouping

If you do not sequence with deep coverage, it is advisable to group the reads into bins (groups) that will be used as cutsites. Specify the size of the groups with the -g option.
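One way to picture grouping is to sum the read counts of all cutsites that fall in the same fixed-width window, and then treat each window as a single cutsite. The sketch below does that; the function `group_reads` and the `(chrom, position, count)` input layout are assumptions for illustration and may differ from what the pipeline does internally.

```python
from collections import defaultdict


def group_reads(cutsites, group_size):
    """Sum read counts of cutsites that fall in the same fixed-width group.

    cutsites: iterable of (chrom, position, count) tuples.
    Returns a dict mapping (chrom, group_start, group_end) to the summed count.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    totals = defaultdict(int)
    for chrom, pos, count in cutsites:
        # Integer division assigns each position to its group.
        start = (pos // group_size) * group_size
        totals[(chrom, start, start + group_size)] += count
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    sites = [("chr1", 10, 2), ("chr1", 90, 3), ("chr1", 150, 1)]
    print(group_reads(sites, 100))
```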

Normalization

Use the -n option to automatically normalize over the last condition (expected to be digested at least overnight, i.e., assumed to approximate accessibility).

Outlier removal

Outliers (cutsites with an abnormal number of unique reads) in the input bed files can be automatically removed.

Use the --bed-outlier option to specify the outlier detection method: Z (Z-score), t (Student's t, computed as Z*sqrt(n-2)/sqrt(n-1-Z^2)), chi2 (chi-square, computed as Z^2, default), IQR or MAD. Specify the significance level with --bed-alpha (default: 0.01) or, for the IQR mode only, the limit with --bed-lim (default: 1.5). Other flags:

  • -k to turn off the outlier detection/filter.
  • -C to remove only outliers common to all bed files, not all outliers.
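The Z, t and chi2 statistics quoted above, and an IQR rule with a limit of 1.5, can be computed as in the following sketch. It only reproduces the formulas given on this page; the function names, the use of the sample standard deviation, the quartile method, and the reading of the limit as a multiple of the IQR beyond the quartiles are assumptions, and how the pipeline converts the statistics into a decision at the chosen significance level is not shown.

```python
import math
import statistics


def outlier_scores(counts):
    """Return per-cutsite Z, t and chi2 statistics for a list of read counts."""
    n = len(counts)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = statistics.mean(counts)
    sd = statistics.stdev(counts)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("all values are identical")
    scores = []
    for x in counts:
        z = (x - mean) / sd
        # t = Z*sqrt(n-2)/sqrt(n-1-Z^2); with the sample standard deviation,
        # Z^2 stays below n-1, so the denominator is positive.
        t = z * math.sqrt(n - 2) / math.sqrt(n - 1 - z ** 2)
        scores.append({"value": x, "z": z, "t": t, "chi2": z ** 2})
    return scores


def iqr_outliers(counts, lim=1.5):
    """Return values further than lim * IQR from the first or third quartile."""
    q1, _, q3 = statistics.quantiles(counts, n=4)
    iqr = q3 - q1
    low, high = q1 - lim * iqr, q3 + lim * iqr
    return [x for x in counts if x < low or x > high]


if __name__ == "__main__":
    reads = [10, 12, 11, 13, 12, 11, 100]  # made-up read counts
    for row in outlier_scores(reads):
        print(row)
    print(iqr_outliers(reads))
```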

Masking input/output

It is possible to mask the input, the output, or both. To mask the input bed files, use the -m option to provide a bed file with the regions to be masked. To mask the output files, provide a similar bed file with the -M option. In the latter case, the mask is applied only to the estimated.*.tsv, ranked.*.tsv and rescaled.*.tsv output files, not to the combined.*.tsv one.

Which cutsites to consider? (cutsite domain)

To change the cutsites considered in the analysis, use the -c option and select one of the following:

  1. Universe: all cutsites in the genome are considered. Requires a bed file containing the cutsite locations, provided with the -l option.
  2. Union: cutsites restricted in any of the conditions are considered.
  3. Separated (default): the cutsite domain is condition specific and includes all the cutsites restricted in that condition.
  4. Intersection: only cutsites restricted in all conditions are considered.
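The four domains listed above map naturally onto set operations. The sketch below treats the cutsites restricted in each condition as a Python set; the function `cutsite_domain` and the mode strings are hypothetical names for this example and are not necessarily the values accepted by the -c option.

```python
def cutsite_domain(mode, condition_sites, all_sites=None):
    """Return one cutsite set per condition for the chosen domain.

    condition_sites: list of sets, the cutsites restricted in each condition.
    all_sites: every cutsite in the genome, needed for the universe mode only.
    """
    if not condition_sites:
        raise ValueError("at least one condition is required")
    if mode == "separated":
        # Each condition keeps its own cutsites.
        return [set(sites) for sites in condition_sites]
    if mode == "universe":
        if all_sites is None:
            raise ValueError("universe mode needs the full cutsite list")
        shared = set(all_sites)
    elif mode == "union":
        shared = set().union(*condition_sites)
    elif mode == "intersection":
        shared = set(condition_sites[0]).intersection(*condition_sites[1:])
    else:
        raise ValueError(f"unknown mode: {mode}")
    # In the shared modes every condition uses the same domain.
    return [set(shared) for _ in condition_sites]


if __name__ == "__main__":
    conditions = [{"a", "b"}, {"b", "c"}]  # made-up cutsite identifiers
    for mode in ("union", "separated", "intersection"):
        print(mode, cutsite_domain(mode, conditions))
    print("universe", cutsite_domain("universe", conditions, {"a", "b", "c", "d"}))
```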

Additional settings

  • -t to provide a number of threads to be used for parallelization (default: 1).
  • -r and -u to provide, respectively, a prefix and a suffix for the output files.
  • -d to trigger debug mode.
  • -e to specify a list of scores not to be calculated.
  • -i to specify the only scores that should be calculated.
  • -T to provide the path to the temporary folder. If the temporary folder is located on a solid state drive, the pipeline will run faster than on a conventional hard drive.
  • -O to select the outlier detection method. By default, only outliers common to all the provided bed files are removed.