nathan-lindstedt/randomization_tests

randomization_tests

THE SIGNIFICANCE OF PERMUTATION TESTS FOR PROGRAM ASSESSMENT WITH OBSERVATIONAL DATA: ADDRESSING THE ISSUE OF STATISTICAL INFERENCE WITH NONPROBABILITY SAMPLES

Abstract

Nonprobability samples and self-selected group memberships are frequent aspects of "people data" that analysts must navigate to infer the proper conclusions. The application of randomization tests or permutation tests to experimental data is more familiar to researchers within the behavioral sciences and the medical sciences, primarily as a remedy for their often less-than-ideal random samples. However, its relevance to nonprobability samples from observational data in the social sciences is less recognized. One reason for this unfamiliarity is that inferences under these conditions are limited to conclusions within the sample. Fortunately, this limitation is not a significant constraint for common research questions related to program assessment in nonexperimental settings, where there is no need to generalize findings to a larger hypothetical population. If an analyst wants to evaluate whether participants in a program experienced a significant change in outcomes using a nonprobability sample, their concern is with the outcome for that specific sample and its statistical significance. There is no need to generalize to a broader population to answer a sample-specific research inquiry.

[Figure: permutation_test_img]

Body Text

In the book Randomization Tests, Edgington (1980) opens with a trenchant critique of the twinned myth of experimental design and statistical inference:

Experimental design books and others on the application of statistical tests to experimental data perpetuate the long-standing fiction of random sampling in experimental research. Statistical inferences are said to require random sampling and to concern population parameters. In experimentation, however, random sampling is very infrequent; consequently, statistical inferences about populations are usually irrelevant. Thus there is no logical connection between the random sampling model and its application to data from the typical experiment. The artificiality of the random sampling assumption has undoubtedly contributed to the skepticism of some experimenters regarding the value of statistical tests. What is a more important consequence of failure to recognize the prevalence of nonrandom sampling in experimentation, however, is overlooking the need for special statistical procedures that are appropriate for nonrandom samples. As a result, the development and application of randomization tests have suffered.

Randomization tests are statistical tests in which the data are repeatedly divided, a test statistic (e.g., t or F) is computed for each data division, and the proportion of the data divisions with as large a test statistic value as the value for the obtained results determines the significance of the results. For testing hypotheses about experimental treatment effects, random assignment but not random sampling is required. In the absence of random sampling the statistical inferences are restricted to the subjects actually used in the experiment, and generalization to other subjects must be justified by non-statistical argument.

Random assignment is the only random element necessary for determining the significance of experimental results by the randomization test procedure; therefore assumptions regarding random sampling and those regarding normality, homogeneity of variance, and other characteristics of randomly sampled populations, are unnecessary. Thus, any statistical test, no matter how simple or complex, is transformed into a distribution-free test when significance is determined by the randomization test procedure. For any experiment with random assignment, the experimenter can guarantee the validity of any test [they want] to use by determining significance by the randomization test procedure. Chapter 1 summarizes various advantages of the randomization test procedure, including its potential for developing statistical tests to meet the special requirements of a particular experiment, and its usefulness in providing for the valid use of statistical tests on experimental data from a single subject.

A great deal of computation is involved in performing a randomization test and, for that reason, such a means of determining significance was impractical until recent years, when computers became accessible to experimenters. As the use of computers is essential for the practical application of randomization tests, computer programs for randomization tests accompany discussions throughout the book. The programs will be useful for a number of practical applications of randomization tests, but their main purpose is to show how programs for randomization tests are written.

Inasmuch as the determination of significance by the randomization test procedure makes any of the hundreds (perhaps thousands) of published statistical tests into randomization tests, the discussion of application of randomization tests in this book cannot be exhaustive. Applications in the book have been selected to illustrate different facets of randomization tests so that the experimenter will have a good basis for generalizing to other applications. (P. v-vii)

He then continues by sketching the outline of a solution, describing the intuition behind a simple but expensive test that leverages the notions of permutation and random assignment to address the issue of nonprobability (or nonrandom) samples:

A randomization test is a permutation test based on randomization (random assignment), where the test is carried out in the following manner. A test statistic is computed for the experimental data, then the data are permuted (divided or rearranged) repeatedly in a manner consistent with the random assignment procedure, and the test statistic is computed for each of the resulting data permutations. These data permutations, including the one representing the obtained results, constitute the reference set for determining significance. The proportion of data permutations in the reference set that have test statistic values greater than or equal to (or, for certain test statistics, less than or equal to) the value for the experimentally obtained results is the P-value (significance or probability value). If, for example, the proportion is 0.02, the P-value is 0.02, and the results are significant at the 0.05 but not the 0.01 level of significance. Determining significance on the basis of a distribution of test statistics generated by permuting the data is characteristic of all permutation tests; it is when the basis for permuting the data is random assignment that a permutation test is called a randomization test. (P. 1)
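
The procedure Edgington describes can be sketched in a few lines of Python. This is a minimal illustration, not code from this repository: the function names, the choice of a difference in means as the test statistic, and the eight data values are all assumptions made up for the example. It enumerates every division of the pooled data into two groups of the original sizes, so the reference set includes the division that was actually obtained, and reports the one-sided proportion of divisions whose statistic is at least as large as the observed one.

```python
from itertools import combinations
from statistics import mean


def mean_difference(group_a, group_b):
    """Test statistic: difference between the two group means."""
    return mean(group_a) - mean(group_b)


def exact_randomization_test(treated, control, statistic=mean_difference):
    """Enumerate every division of the pooled data into two groups of the
    original sizes and return (observed statistic, one-sided P-value)."""
    pooled = list(treated) + list(control)
    indices = range(len(pooled))
    observed = statistic(treated, control)

    n_divisions = 0
    n_as_large = 0
    for treated_idx in combinations(indices, len(treated)):
        chosen = set(treated_idx)
        group_a = [pooled[i] for i in indices if i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen]
        n_divisions += 1
        # Small tolerance so floating-point noise cannot break exact ties.
        if statistic(group_a, group_b) >= observed - 1e-12:
            n_as_large += 1

    # The obtained division is itself one of the enumerated divisions,
    # so the P-value can never be zero.
    return observed, n_as_large / n_divisions


# Hypothetical outcome scores, invented for illustration only.
treated = [12, 15, 14, 16]
control = [8, 9, 11, 10]

observed, p_value = exact_randomization_test(treated, control)
print(f"observed difference = {observed}, P-value = {p_value:.4f}")
```

With four observations per group there are 70 possible divisions, so full enumeration is cheap. For larger samples the number of divisions grows quickly, and a random subset of permutations is usually drawn instead.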

Given the language of 'experimentation' used throughout these passages, it is perhaps unsurprising that the application of randomization tests or permutation tests to experimental data is more familiar to researchers within the behavioral sciences (e.g., Mewhort, Johns, and Kelly 2010) and the medical sciences (e.g., Rigdon and Hudgens 2015) as a corrective for regularly non-ideal sampling conditions. Their relevance for observational data within the social sciences is, on its face, less well known (yet see Taylor 2020; Taylor 2024). However, randomization tests and permutation tests have an established history within social network analysis, where they underpin the QAP (Hubert and Schultz 1976) and MRQAP (Krackhardt 1988) techniques. Moreover, the idea that randomization tests or permutation tests can be applied to observational data and not just experimental data is well founded (see Box and Andersen 1954; Chung and Fraser 1958; Rubin 1974).

Part of the reason for this unfamiliarity in the context of observational data is that hypothesis tests on nonprobability samples are limited to within-sample conclusions. Fortunately, common research questions on program assessment in nonexperimental settings can be answered without generalizing to a larger hypothetical population, so this limitation constrains them no more than the questions themselves require. For example, if an analyst wants to assess whether participants in a program experienced a significantly changed outcome using a nonprobability sample, all that analyst cares about is the outcome for that sample and whether it was statistically significant. There is no need to generalize to a larger hypothetical population to complete that assessment. As Taylor (2024) succinctly explains regarding the practicalities of hypothesis testing with nonprobability data in his application of randomization inference techniques to a study of 34 white nationalist organizations (WNOs):

Since the data used here constitute a nonprobability sample, asymptotically derived standard errors and generalizations to a 'population' are not appropriate. I instead rely on p-values derived from Monte Carlo permutation tests to draw sample-specific inferences (Darlington & Hayes, 2016, pp. 513-514; Ernst, 2004; Ludbrook & Dudley, 1998; Manly, 2018). In this case, p-values indicate the proportion of estimates of a particular coefficient after randomly permuting the data that are greater than or equal to the absolute size of the observed ('real') coefficient. If the association between the outcome and some predictor were random, then we should expect any random reshuffling of one of those variables across cases to produce a regression estimate of a similar absolute magnitude. If the observed estimate is consistently larger than the ones produced after a series of random shuffles, then the observed estimate is likely generated from nonrandom mechanism. The outcome was the permuted variable, and permutations were done within organizations to reflect the fact that only within-organization variance is being modeled. (PP. 10-11)
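
A rough sketch of the kind of Monte Carlo permutation test Taylor describes is given below. It is not his code and not this repository's API. Several simplifications are assumed: a single predictor rather than a full regression, demeaning within groups as a stand-in for modeling only within-organization variance, invented data for three hypothetical groups, and a P-value that counts the observed estimate as part of the reference set (following Edgington's convention above). The outcome is shuffled only among cases in the same group, and the P-value is the share of permuted slopes whose absolute size is at least that of the observed slope.

```python
import random


def demean_within(values, groups):
    """Subtract each group's mean from its members."""
    totals, counts = {}, {}
    for value, group in zip(values, groups):
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]


def within_slope(x, y, groups):
    """Slope of y on x using only within-group variation."""
    xd = demean_within(x, groups)
    yd = demean_within(y, groups)
    return sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)


def permute_within(y, groups, rng):
    """Shuffle the outcome separately inside each group."""
    positions = {}
    for i, group in enumerate(groups):
        positions.setdefault(group, []).append(i)
    shuffled = list(y)
    for idx in positions.values():
        values = [y[i] for i in idx]
        rng.shuffle(values)
        for i, value in zip(idx, values):
            shuffled[i] = value
    return shuffled


def permutation_p_value(x, y, groups, n_permutations=999, seed=0):
    """Return (observed slope, two-sided Monte Carlo P-value)."""
    rng = random.Random(seed)
    observed = within_slope(x, y, groups)
    hits = sum(
        abs(within_slope(x, permute_within(y, groups, rng), groups))
        >= abs(observed)
        for _ in range(n_permutations)
    )
    # Assumption: the observed estimate counts as one member of the
    # reference set, hence the +1 in numerator and denominator.
    return observed, (hits + 1) / (n_permutations + 1)


# Invented data: three hypothetical groups of five cases each.
groups = ["A"] * 5 + ["B"] * 5 + ["C"] * 5
x = [1, 2, 3, 4, 5] * 3
y = [2.1, 3.9, 6.2, 8.0, 9.8,
     12.0, 14.1, 15.9, 18.2, 20.1,
     6.9, 9.1, 11.0, 12.8, 15.2]

slope, p_value = permutation_p_value(x, y, groups)
print(f"within-group slope = {slope:.3f}, P-value = {p_value:.3f}")
```

Shuffling within groups keeps each group's set of outcome values intact, so differences in level between groups never enter the reference distribution. Only the pairing of outcome and predictor inside each group is broken.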

Analyst concerns about 'self-selection' in experimental designs can be assuaged by reframing the kind of hypotheses tested in terms of more limited observational assessments, where data are collected by recording events as they 'naturally' take place without manipulation. As program participants are no longer assigned to intervention and control groups, but are instead observed to take part in some behavior, the analyst can only offer evidence that those engaging in that behavior differed significantly from their counterparts, and does so under the assumption of exchangeability. That is, if the null hypothesis of no difference between the groups is true, the group labels are arbitrary, and any reassignment of them across the sample was equally likely to be observed. Here, random assignment is induced by permuting the data. Importantly, the assumption of exchangeable data is weaker than the assumption of independent and identically distributed data. All independent and identically distributed data are exchangeable, but not all exchangeable data are independent and identically distributed.
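
A small worked example may help make the last point concrete. The sketch below uses a textbook case that is not drawn from this repository: drawing without replacement from a hypothetical urn holding two balls marked 1 and one ball marked 0. It enumerates every equally likely ordering and computes the exact joint distribution of the first two draws. The draws are exchangeable, because swapping their order leaves the joint probabilities unchanged, yet they are not independent, because the joint probability of two 1s is not the product of the marginals.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

# Hypothetical urn: two balls marked 1 and one ball marked 0.
urn = (1, 1, 0)

# Every ordering of the (labelled) balls is equally likely.
orderings = list(permutations(range(len(urn))))
weight = Fraction(1, len(orderings))

# Exact joint distribution of the first two draws.
joint = Counter()
for order in orderings:
    first, second = urn[order[0]], urn[order[1]]
    joint[(first, second)] += weight

# Marginal probability that each draw shows a 1.
p_first = joint[(1, 0)] + joint[(1, 1)]
p_second = joint[(0, 1)] + joint[(1, 1)]

# Exchangeable: the joint distribution is unchanged when the draws swap.
is_exchangeable = all(
    joint[(a, b)] == joint[(b, a)] for a in (0, 1) for b in (0, 1)
)

# Independent would require P(1, 1) == P(first = 1) * P(second = 1).
is_independent = joint[(1, 1)] == p_first * p_second

print("P(first = 1)  =", p_first)
print("P(second = 1) =", p_second)
print("P(1, 1)       =", joint[(1, 1)])
print("exchangeable:", is_exchangeable)
print("independent: ", is_independent)
```

The two draws share the same marginal distribution and can be swapped freely, which is all a permutation test needs, even though knowing the first draw changes the probabilities for the second.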

Enter the randomization test or permutation test: a nonparametric method free from distributional assumptions. An overview of permutation methods is given in an article by Berry, Johnston, and Mielke (2011).

References:

Berry, K. J., Johnston, J. E., and P. W. Mielke. 2011. "Permutation methods." Wiley Interdisciplinary Reviews: Computational Statistics, 3(6):527-542.

Box, G. E. and S. L. Andersen. 1954. "Robust tests for variances and effect of non-normality and variance heterogeneity on standard tests." Technical Report, North Carolina State University Institute of Statistics Mimeo Series.

Chung, J. H. and D. A. S. Fraser. 1958. "Randomization tests for a multivariate two-sample problem." Journal of the American Statistical Association, 53(283):729-735.

Edgington, E. S. 1980. Randomization tests. 2nd Ed. New York, NY: Marcel Dekker, Inc.

Hubert, L. J. and J. Schultz. 1976. "Quadratic assignment as a general data analysis strategy." British Journal of Mathematical and Statistical Psychology, 29:190-241.

Krackhardt, D. 1988. "Predicting with networks: nonparametric multiple regression analysis of dyadic data." Social Networks, 10:359-381.

Mewhort, D. J. K., Johns, B. T., and M. A. Kelly. 2010. "Applying the permutation test to factorial designs." Behavior Research Methods, 42:366-372.

Rigdon, J. and M. G. Hudgens. 2015. "Randomization inference for treatment effects on a binary outcome." Statistics in Medicine, 34(6):924-935.

Rubin, D. B. 1974. "Estimating causal effects of treatments in randomized and nonrandomized studies." Journal of Educational Psychology, 66:688-701.

Taylor, M. A. 2020. "Visualization strategies for regression estimates with randomization inference." The Stata Journal, 20(2):309-335.

Taylor, M. A. 2024. "Attention, shocks, and relevance judgements: the case of white nationalism in the U.S. South, 1980-2008." Social Movement Studies, 1-18.

Dataset Citation:

Yeh, I. "Real Estate Valuation," UCI Machine Learning Repository, 2018. [Online]. Available: https://doi.org/10.24432/C5J30W.