-
Notifications
You must be signed in to change notification settings - Fork 35
STATISTICS
Beside the base statistic function of mean and variance that are highly optimized - the classes stated in corr.pas and statistics.pas try to extend the statisitcs capabilites of this library. Many of the functions are originaly published in nr and converted to Pascal.
The Student T test tests if two sample datasets are significantly different from each other. (see https://en.wikipedia.org/wiki/Student%27s_t-test for further information) There are 3 versions of the Studen T test implemented in this library:
- Unequal sample size but assumed samed variance
- Unequal sample size and assumed different variances
- Equal/"paired" sample size.
The involved functions are:
// Student t test: Returns the probobability in prob and t = (x_A_mean - x_B_mean)/(sD) , sD the "pooled variance".
procedure StudentTTest( data1 : PDouble; n1 : integer; data2 : PDouble; n2 : integer; var t, prob : double ); overload;
procedure StudentTTest( meanData1, varData1 : double; n1 : integer; data2 : PDouble; n2 : integer; var t, prob : double ); overload;
// student t test with significantly different variances
procedure StudentTUTest( data1 : PDouble; n1 : integer; data2 : PDouble; n2 : integer; var t, prob : double ); overload;
procedure StudentTUTest( meanData1, varData1 : double; n1 : integer; data2 : PDouble; n2 : integer; var t, prob : double ); overload;
// student t test for the case of paired samples
procedure StudentTPTest( data1 : PDouble; data2 : PDouble; n : integer; var t, prob : double ); overload;
procedure StudentTPTest( data1 : PDouble; meanData1, varData1 : double; data2 : PDouble; n : integer; var t, prob : double ); overload;
The resulting data is:
- t: The "t" statistics value
- prob: The p value (probability) that two sample sets have the same distribution
Testing for distributions that follow a F distribution. The F Test can be used to check if two distributions have 2 different variances. Check out the wiki page https://en.wikipedia.org/wiki/F-test for more information. The implementation here follows the implementation in NR:
// f-Test for signifacantly different variances
procedure FTest( data1 : PDouble; n1 : integer; data2 : PDouble; n2 : integer; var f, prob : double );
Returned is the F value max(var(data1), var(data2)/min(var(data1), var(data2)) and the probability.
Testing if two distributions are the same input is a binned probability density distribution. As usual Wiki gives a thorough explanation: https://en.wikipedia.org/wiki/Chi-squared_test
Given observed events (bins) and expected distribution (expectedBins) the routine geturns the number of degrees of freedom (df) and the chi-square chsq as well as the significance prob. A small value of prob indicates a signifacant difference between the distrubition bins and expectBins.
3 different routines are implemented:
- ChiSquareOne: Expects as input two binned probability density histogram where the second is a reference (e.g. stems from a calculated normal distribution...)
- ChiSquareTwo: Tests if two binned probability density histograms are the same both are coming from real observations.
- ChiSquareTwoUnequal: Same as above but the input may be unequally sampled.
procedure ChiSquareOne(bins, expectBins : PConstDoubleArr; nBins : integer; numConstraints : integer; var df, chsq, prob : double );
// case of two binned data sets (no expected data known)
procedure ChiSquareTwo(bins1, bins2 : PConstDoubleArr; nBins : integer; numConstraints : integer; var df, chsq, prob : double);
// case of unequal number of elements in both dataset
procedure ChiSquareTwoUnequal(bins1, bins2 : PConstDoubleArr; nBins : integer; numConstraints : integer; var df, chsq, prob : double);
Returns the probability (p-Value) if two distributions are the same as well as the distance (chsq) and the number of degrees of freedom df.
These functions create a pdf (histogram) from the measurements from the input data.
The functiont try to put all values >= maxVal in the last bin and put all others into the bins below. Indexing is done by calculating a step: step := (maxVal - minVal)/(nBins - 1).
procedure Hist( data : PConstDoubleArr; n : integer; histogram : PConstDoubleArr; nBins : integer); overload;
procedure Hist( data : PConstDoubleArr; n : integer; histogram : PConstDoubleArr; nBins : integer; minVal, maxVal : double); overload;
This test is based on the maximum distance of PDF of two unbinned distributions that are functions of a single independent variable. The maximum distance and the probability that both distributions are drawn from the same statistics is returned. 2 different implementations are provided here:
- kolmogorovSmirnovOne: Given a dataset of observations and a function to test against (the function returns a probability based on the input value x) it returns the maximum distance and p value. Can be e.g. used to to compare against a multivariate distribution (e.g. the overall distribution is drawn from 3 normal distribuions).
- kolmogorovSmirnovTwo: This function has as input two datasets and calculates the maximum distance of the pdf's as well as the p value that these are drawn from the same distribution.
Note that the order of the input data is destroyed since the above functions sort the data before further computations.
The library also contains a class that captures the handling of matrix objects and the statistics functions. Many functions are class functions but also many implementations allow to precalculate for one dataset and then compare this dataset against many others. It also allows to calculate different pdf functions (normal, weibull, exponention and many more...)
// ######################################################
// #### Class implementation of various statistic functions
type
TStatTest = class(TBaseMathPersistence)
private
fN : integer;
fMeanVar : TMeanVarRec;
fRefData : IMatrix;
fMinVal, fMaxVal : Double;
fNumBins : integer;
protected
class function ClassIdentifier : String; override;
procedure DefineProps; override;
procedure OnLoadDoubleProperty(const Name : String; const Value : double); override;
procedure OnLoadIntProperty(const Name : String; Value : integer); override;
function OnLoadObject(const Name : String; Obj : TBaseMathPersistence) : boolean; override;
public
// initialize the mean var record with one db so we can compare against others more than once
procedure InitStudentTTest( data1 : TDoubleMatrix );
procedure InitStudentTPTest( data1 : TDoubleMatrix);
procedure InitStudentFTest( data1 : TDoubleMatrix );
procedure InitChisquareTest( data1 : TDoubleMatrix; nBins : integer );
procedure InitKolmogorv2Test( data1 : TDoubleMatrix );
// #########################################
// #### Statistical tests available after the "init" functions (for multiple tests
// against the same data1 source)
procedure StudentTTest( data2 : IMatrix; var t, prob : double ); overload;
procedure StudentTTest( data2 : TDoubleMatrix; var t, prob : double ); overload;
// student t test with significantly different variances
procedure StudentTUTest( data2 : IMatrix; var t, prob : double ); overload;
procedure StudentTUTest( data2 : TDoubleMatrix; var t, prob : double ); overload;
// student t test for the case of paired samples
procedure StudentTPTest( data2 : IMatrix; var t, prob : double ); overload;
procedure StudentTPTest( data2 : TDoubleMatrix; var t, prob : double ); overload;
// f-Test for signifacntly different variances
procedure FTest( data2 : IMatrix; var f, prob : double ); overload;
procedure FTest( data2 : TDoubleMatrix; var f, prob : double ); overload;
// chi square test
procedure ChiSquare1( data2 : IMatrix; numConstraints : integer; var df, chsq, prob : double ); overload;
procedure ChiSquare1( data2 : TDoubleMatrix; numConstraints : integer; var df, chsq, prob : double ); overload;
// chi square2 test
procedure ChiSquare2( data2 : IMatrix; numConstraints : integer; var df, chsq, prob : double ); overload;
procedure ChiSquare2( data2 : TDoubleMatrix; numConstraints : integer; var df, chsq, prob : double ); overload;
// kolmogroovSmirnov2 test
procedure KolmogorovSmirnov2( data2 : IMatrix; var d, prob : double ); overload;
procedure KolmogorovSmirnov2( data2 : TDoubleMatrix; var d, prob : double ); overload;
public
// note that all data is treated as all one big vector!
// student t test with same variances
class procedure StudentTTest( data1, data2 : IMatrix; var t, prob : double ); overload;
class procedure StudentTTest( data1, data2 : TDoubleMatrix; var t, prob : double ); overload;
// student t test with significantly different variances
class procedure StudentTUTest( data1, data2 : IMatrix; var t, prob : double ); overload;
class procedure StudentTUTest( data1, data2 : TDoubleMatrix; var t, prob : double ); overload;
// student t test for the case of paired samples
class procedure StudentTPTest( data1, data2 : IMatrix; var t, prob : double ); overload;
class procedure StudentTPTest( data1, data2 : TDoubleMatrix; var t, prob : double ); overload;
// f-Test for signifacntly different variances
class procedure FTest( data1, data2 : IMatrix; var f, prob : double ); overload;
class procedure FTest( data1, data2 : TDoubleMatrix; var f, prob : double ); overload;
// chi square test on one dimensional datapoints,
class procedure ChiSquare1( data1, data2 : TDoubleMatrix; nBins : integer; numConstraints : integer; var df, chsq, prob : double ); overload;
class procedure ChiSquare1( data1, data2 : IMatrix; nBins : integer; numConstraints : integer; var df, chsq, prob : double ); overload;
// chi square test on one dimensional datapoints -> no assumption on the expected histogram (data1 in chisquare1) is raised
class procedure ChiSquare2( data1, data2 : TDoubleMatrix; nBins : integer; numConstraints : integer; var df, chsq, prob : double ); overload;
class procedure ChiSquare2( data1, data2 : IMatrix; nBins : integer; numConstraints : integer; var df, chsq, prob : double ); overload;
class procedure KolmogorovSmirnov1( data : TDoubleMatrix; func : TKSTestFunc; var d, prob : double ); overload;
class procedure KolmogorovSmirnov1( data : IMatrix; func : TKSTestFunc; var d, prob : double ); overload;
// returns the K-S statisticas d and the signifacance level prob for the null hypothesis that the data sets are drawn from the same
// distribution. Small values of prob show that the cumulative distribution function of data1 is significantly different fro mthat of data2
// the arrays data1 and data2 are modified by being sorted into ascending order
class procedure KolmogorovSmirnov2( data1, data2 : TDoubleMatrix; var d, prob : double ); overload;
class procedure KolmogorovSmirnov2( data1, data2 : IMatrix; var d, prob : double ); overload;
// histogram from the input data
// Creates a histogram from the input data between the
// given borders. Elements >= the max value are put into the last bin
// elements < minValue + (maxVal - minVal)/(nbins - 1) into the first bin
// the histogram operates along the given dimension (rowWise is default)
// returns a nbins x height (RowWise True) matrix
class function HistogramIntf( data : IMatrix; nBins : integer; RowWise : Boolean = True ) : IMatrix; overload;
class function HistogramIntf( data : IMatrix; nBins : integer; minVal, maxVal : double; RowWise : Boolean = True ) : IMatrix; overload;
class function Histogram( data : TDoubleMatrix; nBins : integer; RowWise : Boolean = True ) : TDoubleMatrix; overload;
class function Histogram( data : TDoubleMatrix; nBins : integer; minVal, maxVal : double; RowWise : Boolean = True ) : TDoubleMatrix; overload;
// some probability density functions
class function UniformPDF( x : TDoubleMatrix; a, b : double ) : TDoubleMatrix;
class function NormalPDF( x : TDoubleMatrix; mu, sigma : double ) : TDoubleMatrix;
class function LogNormalPDF( X : TDoubleMatrix; mu, sigma : double ) : TDoubleMatrix;
class function PoissonPDF( x : TDoubleMatrix; lambda : double ) : TDoubleMatrix;
class function RayleighPDF( x : TDoubleMatrix; sigma : double ) : TDoubleMatrix;
class function WeibullPDF( x : TDoubleMatrix; lambda, k : double ) : TDoubleMatrix;
class function ExpPDF( x : TDoubleMatrix; lambda : double ) : TDoubleMatrix;
class function BinomPDF( x : TDoubleMatrix; n, p : integer ) : TDoubleMatrix;
end;
Create a few distributions and calculate the StudentTValue:
procedure TTestStatFunc.TestStudentT1;
var rnd : TRandomGenerator;
dat1 : Array[0..5] of TDoubleDynArray;
y : Integer;
x: Integer;
t, prob : double;
const cMean : Array[0..5] of double = (0.5, -0.1, 2, 0.4, 0.5, -0.101);
cNum : Array[0..5] of integer = (13000, 14400, 5000, 1000, 8000, 10000);
begin
InitMathFunctions(itFPU, False);
// test with same variance -> test just checks if the routine does not crash
rnd := TRandomGenerator.Create(raMersenneTwister);
rnd.Init( 455 );
try
for y := 0 to High(dat1) do
begin
SetLength( dat1[y], cNum[y] );
for x := 0 to Length(dat1[y]) - 1 do
dat1[y][x] := sqr(0.7)*rnd.RandGauss + cMean[y];
end;
finally
rnd.Free;
end;
for y := 0 to High(dat1) - 1 do
begin
for x := y + 1 to High(dat1) do
begin
StudentTTest(@dat1[y][0], cNum[y], @dat1[x][0], cNum[x], t, prob);
Status( Format('Student T %.2f, %d, %.2f, %d, %.3f, %.3f', [cMean[y], cNum[y], cMean[x], cNum[x], t, prob] ) );
if abs(cMean[y] - cMean[x]) < 0.02 then
begin
StudentTTest(@dat1[y][0], cNum[y], @dat1[x][0], cNum[x], t, prob);
Check( prob > 0.1, 'Student t Probability failed');
end;
end;
end;
end;
Kolmogorov-Smirnov Test with a function:
function UniformPDF(x : double) : double;
begin
Result := x; // just a linear relationship between 0 and 1
end;
procedure TTestStatFunc.TestKomogorov1;
var rnd : TRandomGenerator;
data : TDoubleDynArray;
i : Integer;
d, prob : double;
const cPVal = 0.85746; // from octave... maybe off by a bit
begin
rnd := TRandomGenerator.Create(raMersenneTwister);
try
rnd.Init(763);
// uniformly distributed -> linear increasing when sorted
SetLength(data, 10000);
for i := 0 to Length(data) - 1 do
data[i] := rnd.Random; // i/1000 + 0.00001*(rnd.Random - 0.5);
finally
rnd.Free;
end;
//WriteMatlabData('D:\kolm.txt', data, Length(data));
kolmogorovSmirnovOne(@data[0], Length(data), {$IFDEF FPC}@{$ENDIF}UniformPDF, d, prob);
Status(Format('Kolmogoriv result on uniform distributions: %.4f , %.3f', [d, prob]));
Check(SameValue( prob, cPVal, 1e-3 ), 'Error uniform distribution p value wrong' );
end;