Skip to content
GuyKha edited this page Oct 10, 2016 · 21 revisions

General

The ChARM_v4.m file in WGSeq and RADseq is initiated by processing2.m during dataset installation.
The algorithm is an implementation of the algorithm as descirbed in the article (with multiple changes):
Accurate detection of aneuploidies in array CGH and gene expression microarray data

Note: Italic text represents changes from the original paper.

The algorithm use the following default values: (These values appear to come from the original code but are not stated in the article)

% default values from ChARMRunner.java.   Values in brackets represent values chosen for Seq data.
median_window_width   = 9;		% Median filter window size.        [9]
smooth_window_width   = 9;		% Smoothing filter window size.     [9]
smooth_gaussian_sigma = 4;		% Sigma factor for Gaussian-kernal smoothing.

adjacent_dist         = 2;		% The minimum index (in genes) between any two initial edge estimates from filtering stage.
init_snr              = 0.5;		% Edge removal SNR criteria BEFORE edge convergence.
max_snr               = 0.75;		% Edge removal SNR criteria AFTER edge convergence.
percent_edges         = 0.2;		% The number of initial edge estimates reported from filtering stage: num_estimates = percent_edges*chromosome_length.
num_permutations      = 20;		% Number of permutations required for mean permutation test (with Gaussian approximation). [200]
merge_snr_thresh      = 1.5;		% (?)

% default values from Edge.java
max_ROI               = 20;		% The maximum allowable radius of influence (i.e. the number of genes on either side of an edge that influence edge placement). [20]
percent_window_size   = 0.5;		% The radius of influence is determined as follows: min(percent_window_size*window_size,max_ROI).
start_likelihood      = -1000000;	% Starting likelihood value for EM iterations.
min_window            = 2;		% Minimum allowable window size (windows smaller than this are discarded during EM iterations).

edge_nearness_limit   = 2;		% Edge pairs closer than this are converted to a single edge.
SNR_threshold         = 4;		% Signal to Noise threshold for identifying side windows as the same.

Detailed explanation of the steps in ChARM algorithm

  1. Loading CNVplot2 variable - the variable contains a cell for each chromosome and each chromosome cell contains a vector in the length of the number of bins in that chromosome. Each value in the vector is the average read count for the current bin divided by the median of the read count across all of the bins in all of the chromosomes taking into account the biases if a bias was chosen.
    The variable is loaded from Common_CNV.mat that was created by processing1.m (in CNV_v6_6.m)
  2. Edge detection filter - the following filters are been applied to CNVplot2 according to the article:
    • Median filter
    • Smoothing filter
    • Differentiation Filter
  3. Applying another smoothing filter after diffrentation - this was added by darren and doesn't appear in the article.
  4. Finding local maxima & minima - for each chromosome local maximum and minimum is found by the following function:
function [pks,locs] = findpeaks(X);
%
% From comment at: http://stackoverflow.com/questions/4376636/unexpected-behaviour-of-function-findpeaks-in-matlabs-signal-processing-toolbox
%
% This script should not be used for data with extensive flat regions.
% Before examination, input data vector should be smoothed sufficiently to minimize small flat regions.
%
	M    = numel(X);
	pks  = [];
	locs = [];
	if (M < 4)
		datamsgid = generatemsgid('emptyDataSet');
		error(datamsgid,'Data set must contain at least 4 samples.');
	else
		for idx = 1:M-3
			if ((X(idx) < X(idx+1)) && (X(idx+1) >= X(idx+2)) && (X(idx+2) > X(idx+3)))
				pks = [pks X(idx)];
				locs = [locs idx];
			end
		end
	end
end

the local minimum and maximum for each chromosome is been saved to a variable named locs where each char has a cell
5. Generating temp figures - the following figures are been generated (if temp_figures variable is set to true) fig.ChARM_test.1 - the basic CNV data for each chromosome, fig.ChARM_test.2 - the CNV data after median filter, fig.ChARM_test.3 the CNV data after smoothing, fig.ChARM_test.4 - the CNV data after diffrentation and second smoothing, with blue bars for maximum points
6. Expectation-Maximization Edge-Placement Algorithm - preformed as described in the article, _according to the article this part should do number of iterations, but currently only one iteration is been done. _
7. Window Similarity test - According to the article this step should be done by calculating SNR and determining SNR threshold that will change after number of iterations until convergence. the current implementation calculate the SNR and doesn't include the convergence code.
instead the following steps are been done: (edge means suspected breakpoints)

  • Converting edge pairs that are too close into single edges - if there are edges with 1 bin space between them then they will be converted to one by saving only the higher edge position. (unless it is the first edge, if so the higher edge will be removed)
  • med_delta calculation - running over all of the edges inner edges (meaning not the first and the last) and finding for each edge the median of it's adjacent windows (the adjacent windows are defined by the next and previous edges) then save the difference into a variable named med_delta (where each edge has a cell) then the first value in med_delta is been overwrite with 1000 (extremely large value) and a 1000 value is also been added at the end.
  • Removing edges - sorting med_delta by size (lower first) and then removing the current edge if the delta between it's windows is lower then 0.25 (which translate to lower the 25% more then the median across all of the chromosomes) this is been done in a loop to allow multiple removals if number of edges are below the thresold. these 3 steps are done in a until we passed over all of the edges for each chromosome.

Window significance analysis - this step is described in the article but it's not implemented.

  1. Drawing final temp figure - drawing a figure with the base CNV and blue lines for each of the edges
  2. saving to common_ChARM.mat - segmental_aneuploidy variable which contain for each chromosome the location of the break in precentages.

This data is later been loaded by processing4.m when creating the figures of CNV and SNP and there segments with the same rounded ploidy (0.5 is rounded to 1) are merged also and that is the final segmentation.