Contingency Tables

David Wright edited this page Apr 16, 2018 · 3 revisions

Contingency tables show the observed counts of all possible combinations of outcomes in a discrete, bivariate sample. They are the most common analysis tool for such samples, and arise often in clinical trials and classifiers.

Suppose we have developed a test for some underlying condition. The result of the test is either P or N. The subject either has the underlying condition (True) or does not (False). Here is the data:

Test \ Actual    True    False
P                35      65
N                4       896

How do I create a contingency table from count data?

Very easily:

using System;
using Meta.Numerics.Statistics;

ContingencyTable<string, bool> contingency = new ContingencyTable<string, bool>(
    new string[] { "P", "N" }, new bool[] { true, false }
);
contingency["P", true] = 35;
contingency["P", false] = 65;
contingency["N", true] = 4;
contingency["N", false] = 896;

Just give the constructor two lists of the distinct values that will be the row and column labels, then set each cell count through the indexer.

How do I create a contingency table from bivariate data?

If you have the data columns, just give them to the Bivariate.Crosstabs method and it will compute your contingency table. For example:

using System.Collections.Generic;

IReadOnlyList<string> x = new string[] { "N", "P", "N", "N", "P", "N", "N", "N", "P" };
IReadOnlyList<bool> y = new bool[] { false, false, false, true, true, false, false, false, true };
ContingencyTable<string, bool> contingencyFromLists = Bivariate.Crosstabs(x, y);

Notice that both the constructor and the Crosstabs method take lists of objects, so don't confuse them. The constructor takes lists of distinct values that form the row and column labels, and initializes all counts to zero. The Crosstabs method takes (usually much longer) lists of paired measurements (which typically include repeats). It extracts the row and column labels and computes the counts.
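To make the distinction concrete, here is a minimal sketch of the kind of tally a crosstabs operation performs on the paired lists above. It uses only the .NET base library, not Meta.Numerics, and is not the library's actual implementation; the variable names are illustrative, and it assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Collections.Generic;

string[] x = { "N", "P", "N", "N", "P", "N", "N", "N", "P" };
bool[] y = { false, false, false, true, true, false, false, false, true };

// Tally how often each (row label, column label) pair occurs.
Dictionary<(string, bool), int> counts = new Dictionary<(string, bool), int>();
for (int i = 0; i < x.Length; i++) {
    (string, bool) key = (x[i], y[i]);
    counts.TryGetValue(key, out int current); // current is 0 if the pair is new
    counts[key] = current + 1;
}

// For this data: ("N", false) occurs 5 times, ("N", true) once,
// ("P", false) once, and ("P", true) twice.
foreach (KeyValuePair<(string, bool), int> entry in counts) {
    Console.WriteLine($"{entry.Key.Item1}, {entry.Key.Item2}: {entry.Value}");
}
```

The nine pairs collapse into four cell counts, and the distinct labels ("N", "P" and true, false) are discovered from the data rather than supplied up front.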

How do I obtain row and column totals?

Very easily:

foreach (string row in contingency.Rows) {
    Console.WriteLine($"Total count of {row}: {contingency.RowTotal(row)}");
}
foreach (bool column in contingency.Columns) {
    Console.WriteLine($"Total count of {column}: {contingency.ColumnTotal(column)}");
}
Console.WriteLine($"Total counts: {contingency.Total}");

Notice that, because our ContingencyTable type is generic with row and column type parameters, its methods accept typed arguments to refer to row and column labels. This makes code clearer and less error-prone.

How do I obtain estimates of the population probabilities?

using Meta.Numerics;

foreach (string row in contingency.Rows) {
    UncertainValue probability = contingency.ProbabilityOfRow(row);
    Console.WriteLine($"Estimated probability of {row}: {probability}");
}
foreach (bool column in contingency.Columns) {
    UncertainValue probability = contingency.ProbabilityOfColumn(column);
    Console.WriteLine($"Estimated probability of {column}: {probability}");
}

Notice that you are not just given best estimates, but also error bars. (The best estimates are just what you would expect: the fraction of the total count represented by each row and column total. The computation of the error bars is more complicated, but Meta.Numerics handles it for you.)
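As a check, the best estimates for the example table can be worked by hand from the row and column totals. The error-bar formula in the last line is the simple binomial standard error, shown only as an assumption for illustration; this page does not say which formula Meta.Numerics uses, so the uncertainties it reports may differ.

```latex
\begin{align*}
\hat{p}(\mathrm{P}) &= \frac{35 + 65}{1000} = 0.100, &
\hat{p}(\mathrm{N}) &= \frac{4 + 896}{1000} = 0.900, \\
\hat{p}(\mathrm{True}) &= \frac{35 + 4}{1000} = 0.039, &
\hat{p}(\mathrm{False}) &= \frac{65 + 896}{1000} = 0.961, \\
\sigma_{\hat{p}} &\approx \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}}, &
\sigma_{\hat{p}(\mathrm{P})} &\approx \sqrt{\frac{0.1 \times 0.9}{1000}} \approx 0.0095.
\end{align*}
```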

How do I obtain estimates of the conditional population probabilities?

UncertainValue sensitivity = contingency.ProbabilityOfRowConditionalOnColumn("P", true);
Console.WriteLine($"Chance of P result given true condition: {sensitivity}");
UncertainValue precision = contingency.ProbabilityOfColumnConditionalOnRow(true, "P");
Console.WriteLine($"Chance of true condition given P result: {precision}");

Notice that our example exhibits a very common characteristic of tests for rare conditions: even though the test is sensitive (i.e. when the condition is present, it usually gives a P result), it is not precise (i.e. given a P result, there is nonetheless a high chance that the condition is not actually present). Confusing these two conditional probabilities is infamously common, and is usually called the prosecutor's fallacy. At the price of a couple of long method names, our API clearly distinguishes between them.
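The two best estimates can be read straight off the example table, which makes the difference between them easy to see. Only the central values are worked here; the error bars are left to the library.

```latex
\begin{align*}
\text{sensitivity} &= \hat{p}(\mathrm{P} \mid \mathrm{True})
  = \frac{35}{35 + 4} = \frac{35}{39} \approx 0.897, \\
\text{precision} &= \hat{p}(\mathrm{True} \mid \mathrm{P})
  = \frac{35}{35 + 65} = \frac{35}{100} = 0.350.
\end{align*}
```

The same numerator, 35, is divided by a column total in one case and by a row total in the other. Because only 39 of the 1000 subjects have the condition, the 65 false positives outnumber the 35 true positives.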

How do I do tests for association?

The canonical way to test for an association between discrete variables is the Pearson chi-squared test, and Meta.Numerics can do that for you:

TestResult pearson = contingency.PearsonChiSquaredTest();
Console.WriteLine($"Pearson χ² = {pearson.Statistic.Value} has P = {pearson.Probability}.");

If some cell entries are small, as is the case with our example data, the assumptions of Pearson's test will not be well-fulfilled, and it is better to use Fisher's exact test. For 2X2 tables like this one, Meta.Numerics can do that for you too:

TestResult fisher = contingency.Binary.FisherExactTest();
Console.WriteLine($"Fisher exact test has P = {fisher.Probability}.");

For any half-decent classifier (or any decent treatment in a clinical trial), there will be a statistically significant association between row and column values, so the P-values will be tiny.
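For reference, the Pearson statistic for the example table can be reproduced by hand using the textbook definition: compare each observed count O with the count E expected if rows and columns were independent. The symbols R, C, O, and E are introduced here for illustration and are not library names.

```latex
\begin{align*}
E_{ij} &= \frac{R_i\,C_j}{n}: &
E_{\mathrm{P,True}} &= \frac{100 \times 39}{1000} = 3.9, &
E_{\mathrm{P,False}} &= \frac{100 \times 961}{1000} = 96.1, \\
 & &
E_{\mathrm{N,True}} &= \frac{900 \times 39}{1000} = 35.1, &
E_{\mathrm{N,False}} &= \frac{900 \times 961}{1000} = 864.9, \\
\chi^2 &= \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
  = 31.1^2 \left( \frac{1}{3.9} + \frac{1}{96.1} + \frac{1}{35.1} + \frac{1}{864.9} \right)
  \hspace{-12em} & & \\
 &\approx 286.7 .
\end{align*}
```

Every cell differs from its expected count by 31.1 in absolute value, which is why the sum factors so neatly. The smallest expected count, 3.9, falls below the common rule-of-thumb threshold of 5, which is the sense in which Pearson's assumptions are strained for this table.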

How do I get an odds ratio?

The odds ratio, or its log, is the usual way to quantify the degree of association for a 2X2 table. Meta.Numerics can compute it for you:

UncertainValue logOddsRatio = contingency.Binary.LogOddsRatio;
Console.WriteLine($"log(r) = {logOddsRatio}");

Notice that we give you an estimate with uncertainty. For any table with a statistically significant association between rows and columns, the error bars should exclude 0 for the log odds ratio.
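The central value is easy to verify by hand for the example table. The uncertainty shown is the standard large-sample approximation for the log odds ratio; that Meta.Numerics uses exactly this formula, and this orientation of the ratio, is an assumption here and not something this page states.

```latex
\begin{align*}
r &= \frac{35 \times 896}{65 \times 4} = \frac{31360}{260} \approx 120.6,
  \qquad \ln r \approx 4.79, \\
\sigma_{\ln r} &\approx \sqrt{\frac{1}{35} + \frac{1}{65} + \frac{1}{4} + \frac{1}{896}}
  \approx 0.54 .
\end{align*}
```

Under that approximation the log odds ratio is roughly 4.79 ± 0.54, so even two standard deviations below the central value is still well above 0.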

What about contingency tables larger than 2X2?

No problem. You can construct and manipulate them in exactly the same way. The only difference is that the Binary property, which is used to access the API surface specific to 2X2 tables, is not available. If you try to access the Binary property on a non-binary table, you will get an InvalidOperationException.
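A minimal sketch of what this looks like, using only the members shown earlier on this page. The third row label ("Inconclusive") and all of the counts are made up for illustration, and the snippet assumes C# 9 or later for top-level statements.

```csharp
using System;
using Meta.Numerics.Statistics;

// Hypothetical 3X2 table: the labels and counts are illustrative, not real data.
ContingencyTable<string, bool> larger = new ContingencyTable<string, bool>(
    new string[] { "P", "N", "Inconclusive" }, new bool[] { true, false }
);
larger["P", true] = 30;
larger["P", false] = 60;
larger["N", true] = 4;
larger["N", false] = 850;
larger["Inconclusive", true] = 5;
larger["Inconclusive", false] = 51;

// Members that do not depend on the table being 2X2 work as before.
Console.WriteLine($"Total counts: {larger.Total}");
TestResult pearson = larger.PearsonChiSquaredTest();
Console.WriteLine($"Pearson test has P = {pearson.Probability}.");

// The 2X2-specific surface is not available for this table.
try {
    var binary = larger.Binary;
    Console.WriteLine($"log(r) = {binary.LogOddsRatio}");
} catch (InvalidOperationException) {
    Console.WriteLine("Binary is not available for a non-binary table.");
}
```

Guarding the Binary access like this is only needed when the table's shape is not known in advance; for a table you construct yourself, simply avoid the property unless it is 2X2.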
