permutation based statistics

Neil Horner edited this page Jun 24, 2021 · 26 revisions

Introduction

todo
  • A minimum number of lines and baselines required for this method needs to be worked out.

From Wikipedia:

A permutation test (also called a randomization test, re-randomization test, or an exact test) is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points.

Motivation

In LAMA, permutation testing has been implemented for the organ volume analysis only. The voxel-based data contains many millions of data points, so permuting it would be too computationally expensive. Organ volume data, on the other hand, has only n data points per specimen (where n is the total number of labels in the atlas).

Pipeline overview

The following procedure is performed per organ label:

Generate null-distribution
  • For each mutant line, obtain the mutant sample number (n)
  • Relabel n baselines as synthetic mutants
  • Fit a multiple linear regression: organ volume ~ genotype + whole embryo volume
  • Obtain the p-value for the genotype effect and add it to the null distribution
  • Repeat until the desired number of permutations has been done
Generate alternative distribution
  • For each mutant line, do regression as above
  • Obtain p-value for genotype effect and add to alternative distribution
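
The two steps above can be sketched in Python as follows. This is a minimal illustration on synthetic data, not LAMA's actual implementation: the function names are hypothetical, the regression is a plain ordinary least squares fit with NumPy and a two-sided t-test from SciPy, and it assumes that synthetic mutants are compared against the remaining baselines.

```python
import numpy as np
from scipy import stats


def genotype_p_value(organ_vol, genotype, embryo_vol):
    """Fit organ_vol ~ genotype + embryo_vol by OLS; return the genotype p-value."""
    y = np.asarray(organ_vol, dtype=float)
    # Design matrix columns: intercept, genotype (0 = baseline, 1 = mutant), embryo volume
    X = np.column_stack([np.ones(len(y)), genotype, embryo_vol]).astype(float)
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t_stat = beta[1] / np.sqrt(cov[1, 1])
    return float(2 * stats.t.sf(abs(t_stat), dof))


def null_distribution(wt_organ, wt_embryo, n_mutants, n_permutations, rng):
    """Relabel n_mutants random baselines as synthetic mutants, n_permutations times."""
    wt_organ = np.asarray(wt_organ, dtype=float)
    wt_embryo = np.asarray(wt_embryo, dtype=float)
    p_values = []
    for _ in range(n_permutations):
        genotype = np.zeros(len(wt_organ))
        picked = rng.choice(len(wt_organ), size=n_mutants, replace=False)
        genotype[picked] = 1  # these baselines play the role of mutants
        p_values.append(genotype_p_value(wt_organ, genotype, wt_embryo))
    return np.array(p_values)


def alternative_p_value(wt_organ, wt_embryo, mut_organ, mut_embryo):
    """Regress one real mutant line against all baselines."""
    organ = np.concatenate([wt_organ, mut_organ])
    embryo = np.concatenate([wt_embryo, mut_embryo])
    genotype = np.concatenate([np.zeros(len(wt_organ)), np.ones(len(mut_organ))])
    return genotype_p_value(organ, genotype, embryo)


# Synthetic data for illustration only: 40 baselines and one line of 5 mutants
# whose organ is enlarged by 20 voxels on top of the embryo-size trend.
rng = np.random.default_rng(0)
wt_embryo = rng.normal(1000.0, 50.0, size=40)
wt_organ = 0.1 * wt_embryo + rng.normal(0.0, 2.0, size=40)
mut_embryo = rng.normal(1000.0, 50.0, size=5)
mut_organ = 0.1 * mut_embryo + 20.0 + rng.normal(0.0, 2.0, size=5)

null_p = null_distribution(wt_organ, wt_embryo, n_mutants=5, n_permutations=200, rng=rng)
alt_p = alternative_p_value(wt_organ, wt_embryo, mut_organ, mut_embryo)
```

In a real run the alternative distribution would hold one p-value per mutant line, and the null distribution would be built with the sample number of each line in turn.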
Calculate false discovery rate (FDR)

Search for a p-value cutoff where: (proportion of null p-values under the threshold) / (proportion of alternative p-values under the threshold) is < 0.05

This gives us the FDR for that organ, i.e. the proportion of the mutant lines called significant at that cutoff that we expect to be false positives.
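
A minimal sketch of the cutoff search, using only the Python standard library. Several details are assumptions made for illustration: the function name is hypothetical, the candidate cutoffs are the observed alternative p-values, "under the threshold" is read as less than or equal to, and the largest qualifying cutoff is returned.

```python
def fdr_threshold(null_p, alt_p, target_fdr=0.05):
    """Return (cutoff, fdr) for the largest cutoff whose estimated FDR is below
    target_fdr, or None if no candidate cutoff qualifies."""
    best = None
    for cutoff in sorted(set(alt_p)):
        prop_null = sum(p <= cutoff for p in null_p) / len(null_p)
        prop_alt = sum(p <= cutoff for p in alt_p) / len(alt_p)
        # prop_alt is never zero because the cutoff is itself one of alt_p
        fdr = prop_null / prop_alt
        if fdr < target_fdr:
            best = (cutoff, fdr)
    return best


# Made-up p-values: a uniform null and ten mutant lines, three with small p-values
null_p = [i / 1000 for i in range(1, 1001)]
alt_p = [0.001, 0.002, 0.004, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]

result = fdr_threshold(null_p, alt_p)
print(result)
```

With these made-up numbers the search settles on the cutoff 0.004: 0.4% of the null p-values and 30% of the alternative p-values fall at or below it, giving an estimated FDR of roughly 0.013.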

Running the script

After running registration using job_runner you should have 2 folders in your output directory

  1. baseline

  2. mutant

Each will have an output folder. For mutants this contains one folder per line; for baselines it contains a single folder named 'baseline'.

The individual specimen folders will each contain the following CSV files that hold the organ volume information:
  • organ_volumes.csv - The organ volume in voxels for each label present in the label map
  • staging_info_volume.csv - The whole embryo volume based on the mask supplied during registration (stats_mask)

Open a terminal and do the following

$ lama_permutation_stats.py -c config.yaml

These are the available options for the config

required arguments

wildtype_dir: wild type registration output directory
mutant_dir: mutant registration output directory

optional arguments

output_dir: output directory [default is this config's parent directory]
n_permutations: number of permutations [default 1000]
label_metadata: atlas metadata file. see input data
label_map: atlas/label map. see input data
norm_to_whole_embryo_vol: normalise organ volume to whole embryo volume before fitting the linear model [default False]
voxel_size: Voxel size of the input data [default 1]
qc_file: A CSV file indicating labels from specimens that should be excluded from the analysis.

    columns:
    - id: the specimen id
    - line: the line id
    - label: the label to exclude (int)
    - label_name (optional)
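
As an illustration of the qc_file columns and the config options listed above, the sketch below renders a small QC file and a matching config as text. Everything specific in it is hypothetical: the specimen ids, line ids, label numbers and paths are invented, and it assumes the QC file has a header row using the column names above. Only the option and column names come from this page.

```python
import csv
import io

# Hypothetical specimen ids, line ids and label numbers, for illustration only.
qc_rows = [
    {"id": "specimen_001", "line": "line_a", "label": 12, "label_name": "heart"},
    {"id": "specimen_002", "line": "line_a", "label": 7, "label_name": ""},  # label_name is optional
]


def qc_csv_text(rows):
    """Render QC rows as CSV text with the id, line, label, label_name columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "line", "label", "label_name"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical paths; only the option names come from the list above.
config = {
    "wildtype_dir": "output/baseline",
    "mutant_dir": "output/mutant",
    "n_permutations": 1000,
    "qc_file": "qc.csv",
}


def config_text(options):
    """Render a flat mapping as 'key: value' lines, which is valid YAML for simple scalars."""
    return "\n".join(f"{key}: {value}" for key, value in options.items()) + "\n"


print(qc_csv_text(qc_rows))
print(config_text(config))
```

Saving the two printed texts as qc.csv and config.yaml would give a config that can be passed to the command shown earlier with -c.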

TODO: Describe the output