Gazing into the Abyss of p-Hacking: A Shiny App for p-Hacking Simulation
What is p-Hacking?
The p-value is a core component of null hypothesis significance testing (NHST), a statistical framework that has found ubiquitous use across many scientific disciplines. A p-value is defined as the probability of obtaining a result at least as extreme as the observed one if the null hypothesis is true (i.e., if there is no effect). If the p-value is smaller than a certain threshold, the alpha level (α), the test result is labeled “significant” and the null hypothesis is rejected. Researchers who are interested in showing an effect in their data (e.g., that a new medicine improved the health of patients) are therefore eager to obtain small p-values that allow them to reject the null hypothesis and claim the existence of an effect.
In recent years, failed attempts to replicate experiments have instigated investigations into how researchers use NHST in practice. These studies found that many researchers apply questionable research practices to render previously non-significant results significant. We summarize these practices under the term p-hacking.
How Does p-Hacking Work?
All p-hacking strategies are based on the principle of alpha error accumulation. Put simply, alpha error accumulation means that as more and more hypothesis tests are conducted, the probability of making at least one false decision increases. Therefore, even if there is no effect in the population, the probability that at least one hypothesis test will (erroneously) show a significant result becomes very high if a sufficiently large number of tests is conducted. Researchers then report this significant result, and claim to have found an effect.
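To make this concrete, here is a minimal sketch in R (an illustration, not part of the app's code) of how the probability of at least one false positive grows with the number of independent tests, each conducted at α = 0.05:

```r
# For k independent tests at level alpha, the probability of at least one
# (false) significant result is 1 - (1 - alpha)^k.
alpha <- 0.05
k <- c(1, 5, 10, 20)
round(1 - (1 - alpha)^k, 3)
# approx. 0.050, 0.226, 0.401, 0.642
```

With only 20 independent tests, the chance of at least one spuriously significant result already exceeds 60%.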
Obvious Warning: Thou Shalt Not p-Hack!
Given the explanation above, it almost seems needless to say that p-hacking is detrimental and you should not do it. p-Hacking slows down scientific progress by increasing the number of false positive results in the literature. Additionally, p-hacking leads to an inflation of published effect sizes because only “extreme” results are reported. This means that p-hacking increases the number of cases where research wrongly claims an effect, and even if an effect exists, the reported effect size is likely to be larger than the true effect size.
Sounds bad? It actually is. What makes it even worse is that p-hacking is difficult to detect in the literature. How can we tell whether a reported effect is real or p-hacked? How can we tell that a significant result a researcher found after running many hypothesis tests is not a genuine discovery? The truth is: for a single finding, it is impossible to know. However, if we know which p-hacking strategies researchers employ, it is possible to predict what the distributions of p-values and effect sizes will look like, and how the rate of false positive results will change compared to a situation without p-hacking. The purpose of this app is to showcase these scenarios using simulated data.
A Compendium of p-Hacking Strategies
In the literature, p-hacking has typically been described as comprising different strategies that researchers can use to tinker with their statistical results until they achieve statistical significance. To learn more about the effects of p-hacking, it is important to understand all of these strategies and how each affects the reported scientific results. However, a comprehensive description of these strategies has been missing so far.
Here, we provide an overview of different p-hacking strategies that have been mentioned in the literature, together with a Shiny app that lets users explore the effects of p-hacking on the distribution of hypothesis testing results.
Exploring the Effects of p-Hacking
Each tab of this Shiny app lets the user explore the effects of a different p-hacking strategy. All tabs have the same structure: First, we describe the p-hacking strategy and how we applied it in our simulations. Below that, we present the simulation results, specifically the distribution of p-values, the distribution of effect sizes (if applicable), and the rate of false positive results. In a panel on the right, users can adjust the settings of the simulation, including the severity of the p-hacking.
Common Settings
Several settings are common to the simulation of (almost) all p-hacking strategies. To avoid unnecessary repetition, we will describe these settings here.
p-Value selection method
In all simulation functions, it is necessary to specify how the final p-value is determined. There are three options: first significant simulates a situation where the researcher conducts a series of hypothesis tests and stops as soon as the result is significant, that is, at the first significant p-value. In a comment on Simonsohn et al. (2014), Ulrich and Miller (2015) argued that researchers might instead engage in “ambitious” p-hacking, where the researcher conducts a series of hypothesis tests and selects the smallest significant p-value from the set. This strategy is implemented in the smallest significant option. Simonsohn (private comm.) argues that there might exist a third p-hacking strategy where the researcher tries a number of different analysis options and selects the smallest p-value, regardless of whether it is significant. This strategy is implemented in the smallest option. The default strategy is first significant.
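The following R sketch illustrates the three selection rules. It is an illustration only, not the package's actual implementation; in particular, we assume here that the first p-value is reported when no test reaches significance:

```r
# ps: p-values in the order in which the tests were conducted
select_p <- function(ps, method = c("first.sig", "smallest.sig", "smallest"),
                     alpha = 0.05) {
  method <- match.arg(method)
  sig <- ps[ps < alpha]                    # all significant p-values
  switch(method,
    first.sig    = if (length(sig)) sig[1] else ps[1],
    smallest.sig = if (length(sig)) min(sig) else ps[1],
    smallest     = min(ps)
  )
}

select_p(c(0.20, 0.04, 0.01), method = "first.sig")     # 0.04
select_p(c(0.20, 0.04, 0.01), method = "smallest.sig")  # 0.01
```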
True effect size
The true effect size in all simulations is equal to zero, that is, the null hypothesis is true by construction. Every significant result in the simulations is therefore a false positive.
Significance level
The significance level α determines the significance level for each hypothesis test. For example, if the significance level is set to α = 0.05 (the default), the simulation assumes that a researcher would call the result of a hypothesis test significant if p < 0.05.
Iterations
The iterations option determines the number of iterations in the simulation. The default setting is 1000.
Alternative
Whenever the simulations are based on t-tests, the option alternative can be specified. This option relates to the sidedness of the alternative hypothesis in the t-test. It can either be two-sided or greater. The default setting is two-sided.
Number of observations
The number of observations determines the sample size in the test. In the case of a t-test, the specified number refers to the observations per group. In the case of a linear regression, the specified number refers to the overall sample size.
Start simulation
A new simulation is started when you click the Start simulation button at the bottom of the options panel in each tab. The progress of the simulation is displayed in a small progress bar in the bottom right corner of the screen.
Resources
The code for this Shiny app as well as for the simulations can be found at https://github.com/nicebread/phacking_compendium.
About
This Shiny app and the underlying R package were created by Angelika Stefan and Felix Schönbrodt. If you have questions or feature requests, submit a GitHub issue at https://github.com/nicebread/phacking_compendium or write an e-mail to a.m.stefan[at]uva.nl.
Scale Redefinition
The scale redefinition strategy assumes that one of the variables in the hypothesis test in question is a composite score (e.g., the mean of items in a personality inventory), and that a researcher manipulates which items are included in the composite score to obtain a significant result.
Here, we assume that the focal hypothesis test is a univariate linear regression, and that items are excluded based on the reliability coefficient Cronbach’s α in an iterative fashion. The underlying idea is to delete the item that contributes least to a reliable score, i.e., the item leading to the highest Cronbach’s α when deleted. After a candidate item for deletion has been found, the regression is recomputed with (1) the reduced score as a predictor, (2) the deleted item as a predictor, and (3) the score of all deleted items as a predictor, and the p-values are recorded.
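As a minimal sketch of the deletion step (not the package's actual code; data and function names are illustrative), one could compute Cronbach's α for each leave-one-out version of the score and drop the item whose removal improves α the most:

```r
# Cronbach's alpha for a numeric matrix with one column per item
cronbach_alpha <- function(items) {
  k <- ncol(items)
  (k / (k - 1)) * (1 - sum(apply(items, 2, var)) / var(rowSums(items)))
}

# One iteration of the deletion heuristic described above
drop_worst_item <- function(items) {
  alpha_if_deleted <- sapply(seq_len(ncol(items)), function(i)
    cronbach_alpha(items[, -i, drop = FALSE]))
  items[, -which.max(alpha_if_deleted), drop = FALSE]
}
```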
The simulation function in this Shiny app allows the specification of the total number of items in the score, as well as their correlation. Users can also specify the maximum number of items deleted from the score. Naturally, this number should be smaller than the total number of items. Other options users can specify are the number of observations, the p-value selection method, the significance level α, and the number of simulation iterations.
[App output: false positive rate (p-hacked) vs. false positive rate (original)]
Controlling for Covariates
This p-hacking strategy exploits the common practice of controlling for covariates in statistical analyses. Here, we assume that a researcher is interested in an independent samples t-test. If this test does not yield a significant result, the researcher introduces a number of continuous covariates into the analysis (turning it into an ANCOVA). We assume that all covariates are first entered into the analysis individually, and if this does not yield a significant result, they are added sequentially as y ~ x + cov1, y ~ x + cov1 + cov2, … (in decreasing order of correlation with the dependent variable).
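A minimal sketch of the sequential variant (illustrative data and names; the individual-covariate models are omitted for brevity):

```r
set.seed(1)
n <- 50
dat <- data.frame(group = rep(0:1, each = n), y = rnorm(2 * n),
                  cov1 = rnorm(2 * n), cov2 = rnorm(2 * n))
covs <- c("cov1", "cov2")  # assumed to be ordered by correlation with y

# p-value of the group effect after adding the first k covariates
p_values <- sapply(seq_along(covs), function(k) {
  fit <- lm(reformulate(c("group", covs[1:k]), response = "y"), data = dat)
  summary(fit)$coefficients["group", "Pr(>|t|)"]
})
```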
The simulation function in this Shiny app allows the specification of the number of covariates, as well as their correlation. Users can also specify whether the ANCOVA models should include interaction terms. Note that the inclusion of interaction terms will slow down the computation considerably. Other options users can specify are the number of observations per group, the p-value selection method, the significance level α, and the number of simulation iterations.
[App output: false positive rate (p-hacked) vs. false positive rate (original)]
Discretizing Variables
This p-hacking strategy is based on splitting a continuous variable into two or more categories at arbitrary cutoff values. Here, we assume that at the start a researcher plans to conduct a univariate linear regression. If this analysis does not yield a significant result, the researcher discretizes the independent variable and compares the means of the resulting groups on the dependent variable. We simulate three approaches: (1) Compare high-scorers and low-scorers based on a median split; (2) conduct a three-way split of the independent variable and compare the two extreme groups; (3) conduct a three-way split of the independent variable and compare all three groups using an ANOVA.
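A minimal sketch of the median-split variant (1), with illustrative data:

```r
set.seed(1)
x <- rnorm(100)
y <- rnorm(100)
p_original <- summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]

g <- ifelse(x > median(x), "high", "low")  # discretize the predictor
p_split <- t.test(y ~ g)$p.value           # compare group means instead
```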
The simulation function in this Shiny app allows the specification of the sample size, as well as of the p-value selection method, the significance level α, and the number of iterations in the simulation.
[App output: false positive rate (p-hacked) vs. false positive rate (original)]
Favorable Imputation of Missing Values
This p-hacking strategy assumes that the original dataset a researcher is confronted with contains missing values. A researcher engaging in p-hacking can now try out different imputation methods to replace the missing values, until (possibly) a significant result is obtained. Here, we simulate this p-hacking strategy based on a univariate linear regression, because many imputation methods assume a regression context.
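As a rough illustration (not the package's actual imputation options), one might try several simple imputation rules and keep each resulting p-value:

```r
set.seed(1)
x <- rnorm(100); y <- rnorm(100)
x[sample(100, 10)] <- NA                  # 10% missing in the predictor
y[sample(100, 10)] <- NA                  # 10% missing in the outcome

impute <- list(
  mean   = function(v) replace(v, is.na(v), mean(v, na.rm = TRUE)),
  median = function(v) replace(v, is.na(v), median(v, na.rm = TRUE))
)
p_values <- sapply(impute, function(f)
  summary(lm(f(y) ~ f(x)))$coefficients[2, "Pr(>|t|)"])
```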
The simulation function in this Shiny app allows the specification of the total number of observations (observations with missing values are included in this number), the percentage of missing values, and the imputation methods that are used. The specified percentage of missing values applies to both the predictor and the outcome variable (e.g., if the percentage is set to 10%, there will be ten percent missing values in both variables). Additionally, users can specify the p-value selection method, the significance level α, and the number of simulation iterations.
[App output: false positive rate (p-hacked) vs. false positive rate (original)]
Incorrect Rounding
This p-hacking strategy is not based on tinkering with the data or the analyses, but on misreporting the analysis outcome. Usually, the result of a hypothesis test is significant if p ≤ α. However, as has been shown (e.g., Hartgerink, van Aert, van Nuijten, Wicherts, & van Assen, 2016), sometimes p-values that are slightly larger than the significance level are reported as significant, that is, p-values are incorrectly rounded down to p = α.
In the simulation function in this Shiny app, the user can specify the margin in which p-values should be rounded down, as well as the significance level. For example, if the significance level is specified as α = 0.05, and the margin is specified as 0.001, then all p-values below 0.05+0.001=0.051 will be reported as significant and rounded down to p = 0.05. Additionally, users can specify the direction of the test, and the number of simulation iterations.
Note that the type I error rate of this p-hacking strategy can also be determined analytically: the theoretical α level after p-hacking is simply the sum of the original α level and the rounding margin.
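Because p-values are uniformly distributed under the null hypothesis, this can be verified with a one-line simulation (illustrative, not the app's code):

```r
set.seed(1)
p <- runif(1e6)          # null p-values are uniform on [0, 1]
mean(p < 0.05 + 0.001)   # close to the analytic rate of 0.051
```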
[App output: false positive rate (p-hacked) vs. false positive rate (original)]
Optional Stopping
Researchers engaging in optional stopping repeatedly inspect the results of the statistical tests during data collection. They stop data collection as soon as a significant result has been obtained or a maximum sample size is reached. Here, we assume that the underlying statistical test is an independent-samples t-test.
In the simulation function provided in this Shiny app, the user can specify the minimum sample size (per group), the maximum sample size (per group), and the number of observations that are collected at each step of the sampling process (step size). For example, if the minimum sample size is specified to be 10, the maximum sample size 30, and the step size 5, then interim analyses will be conducted at N = 10, N = 15, N = 20, N = 25, and N = 30. Additionally, users can define the direction of the hypothesis test, the significance level α, and the number of simulation iterations.
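A minimal sketch of a single optional-stopping run under the null hypothesis (all names and defaults are illustrative):

```r
optional_stop <- function(n_min = 10, n_max = 30, step = 5, alpha = 0.05) {
  x <- rnorm(n_max); y <- rnorm(n_max)    # two groups, null is true
  for (n in seq(n_min, n_max, by = step)) {
    p <- t.test(x[1:n], y[1:n])$p.value   # interim analysis at size n
    if (p < alpha) return(p)              # stop at the first significant look
  }
  p                                       # otherwise report the final test
}

set.seed(1)
mean(replicate(2000, optional_stop()) < 0.05)  # noticeably above 0.05
```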
[App output: false positive rate (p-hacked) vs. false positive rate (original)]
Outlier Exclusion
In this p-hacking strategy, a researcher applies different outlier exclusion criteria to their data with the goal of obtaining a significant result in a focal hypothesis test. Here, we assume that the hypothesis test in question is a univariate linear regression. Further, we assume that the researcher first checks for potential outliers in the predictor variable (x) and in the outcome variable (y), and then reruns the analysis (1) without the xy pairs where x is an outlier, (2) without the xy pairs where y is an outlier, (3) without the xy pairs where x and y are outliers. We assume that this is done for each outlier exclusion method.
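A minimal sketch with a single, arbitrary exclusion criterion (values beyond 2 SD; purely illustrative):

```r
set.seed(1)
x <- rnorm(200); y <- rnorm(200)
out_x <- abs(as.vector(scale(x))) > 2     # "outliers" in the predictor
out_y <- abs(as.vector(scale(y))) > 2     # "outliers" in the outcome

p_of <- function(keep) summary(lm(y[keep] ~ x[keep]))$coefficients[2, 4]
ps <- c(p_of(!out_x),           # (1) drop pairs where x is an outlier
        p_of(!out_y),           # (2) drop pairs where y is an outlier
        p_of(!(out_x & out_y))) # (3) drop pairs where both are outliers
```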
In the simulation function provided in this Shiny app, users can define the outlier exclusion methods that are applied, as well as the sample size, the p-value selection method, the significance level α, and the number of simulation iterations.
[App output: false positive rate (p-hacked) vs. false positive rate (original)]
Selective Reporting of the Dependent Variable
This p-hacking strategy assumes that the dataset contains multiple candidate dependent variables. For example, in a clinical trial, the treatment and control group could be compared on different outcome variables, such as mental and physical well-being. A researcher engaging in p-hacking would conduct one hypothesis test for each dependent variable, and selectively report the significant results. Here, we assume that the hypothesis test in question is an independent-samples t-test.
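A minimal sketch with correlated dependent variables (parameters are illustrative):

```r
set.seed(1)
n <- 30; n_dv <- 5; r <- 0.3
sigma <- matrix(r, n_dv, n_dv); diag(sigma) <- 1   # DV correlation matrix
treat <- MASS::mvrnorm(n, mu = rep(0, n_dv), Sigma = sigma)
ctrl  <- MASS::mvrnorm(n, mu = rep(0, n_dv), Sigma = sigma)

ps <- sapply(1:n_dv, function(i) t.test(treat[, i], ctrl[, i])$p.value)
min(ps)   # a p-hacker would report only the significant test(s)
```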
The simulation function in this Shiny app allows the specification of the number of dependent variables as well as their correlation. Additionally, users can define the number of observations per group, the direction of the test, the p-value selection method, the significance level α, and the number of simulation iterations.
[App output: false positive rate (p-hacked) vs. false positive rate (original)]
Selective Reporting of the Independent Variable
This p-hacking strategy assumes that an experiment or clinical trial contains multiple experimental groups and one control group. A researcher engaging in p-hacking statistically compares all experimental groups to the control group, and reports only the significant results. Here, we assume that all conducted hypothesis tests are t-tests.
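A minimal sketch with independent experimental groups (the correlation between groups that the app allows is omitted here for brevity):

```r
set.seed(1)
n <- 30; n_groups <- 4
control <- rnorm(n)
ps <- replicate(n_groups, t.test(rnorm(n), control)$p.value)
any(ps < 0.05)   # the probability of this event grows with n_groups
```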
The simulation function in this Shiny app allows the specification of the number of experimental groups (independent variables), and their correlation. Additionally, users can set the number of observations per group, the direction of the test, the p-value selection method, the significance level α, and the number of simulation iterations.
[App output: false positive rate (p-hacked) vs. false positive rate (original)]
Exploiting Alternative Hypothesis Tests
Often, different statistical analysis techniques can be used to answer the same research question. This p-hacking strategy assumes that a researcher tries out different statistical analysis options and settles on the one yielding a significant result. Here, we assume that the hypothesis tests in question are an independent-samples t-test, a Welch test, a Wilcoxon test, and a Yuen test (with different levels of trimming).
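A minimal sketch using the base R tests (a Yuen test would require an additional package, e.g. WRS2, and is omitted here):

```r
set.seed(1)
x <- rnorm(30); y <- rnorm(30)
ps <- c(student  = t.test(x, y, var.equal = TRUE)$p.value,
        welch    = t.test(x, y)$p.value,
        wilcoxon = wilcox.test(x, y)$p.value)
ps[ps < 0.05]    # a p-hacker reports whichever test "worked"
```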
The simulation function in this Shiny app allows users to specify the number of observations per group, the direction of the test, the p-value selection method, the significance level α, and the number of simulation iterations.
[App output: false positive rate (p-hacked) vs. false positive rate (original)]
Subgroup Analyses
This p-hacking strategy assumes that if an initial hypothesis test does not yield a significant result, a researcher would repeat the same hypothesis test on subgroups of the sample (e.g., right-handed and left-handed participants). Here, we assume that all subgroup variables have two levels, and that the hypothesis test is conducted on each level of the subgroup variables. Additionally, we assume that the hypothesis test in question is a t-test (e.g., between an experimental and a control condition). Note that we do not assume that the experimental and control condition are balanced within the subgroups. Therefore, within a subgroup, the number of participants in the experimental and control group can differ.
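A minimal sketch with a single two-level subgroup variable (illustrative data; note the unbalanced conditions within subgroups):

```r
set.seed(1)
n <- 40
dat <- data.frame(y = rnorm(2 * n),
                  condition = rep(c("treat", "ctrl"), each = n),
                  sub = sample(c("left", "right"), 2 * n, replace = TRUE))

p_all   <- t.test(y ~ condition, data = dat)$p.value
p_left  <- t.test(y ~ condition, data = subset(dat, sub == "left"))$p.value
p_right <- t.test(y ~ condition, data = subset(dat, sub == "right"))$p.value
```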
In the simulation function in this Shiny app, users can specify the number of observations per group in the original t-test, the number of subgroup variables, the direction of the test, the p-value selection method, the significance level α, and the number of simulation iterations.
[App output: false positive rate (p-hacked) vs. false positive rate (original)]
Variable Transformation
This p-hacking strategy assumes that if an initial hypothesis test does not yield significant results, a researcher would apply transformations to the variables involved in the test. Here, we assume that the test in question is a univariate linear regression, and that the transformations are a natural log transformation (ln(x)), a square root transformation (√x), and an inverse transformation (1/x). Transformations can be applied to the predictor variable, to the outcome variable, or both.
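A minimal sketch applying the three transformations to the predictor (a positive predictor is assumed so that ln(x) and 1/x are defined; data are illustrative):

```r
set.seed(1)
x <- rlnorm(100)   # positive predictor
y <- rnorm(100)

transforms <- list(log = log, sqrt = sqrt, inverse = function(v) 1 / v)
ps <- sapply(transforms, function(f)
  summary(lm(y ~ f(x)))$coefficients[2, "Pr(>|t|)"])
```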
In the simulation function in this Shiny app, users can specify which of the variables should be transformed. Additionally, they can specify the number of observations, the p-value selection method, the significance level α, and the number of simulation iterations.