Experiments carry a lot of risk: you have a hypothesis that you need to support or reject by determining whether your data deviates enough from the null hypothesis. But what if you don’t find anything statistically significant? What if you find something significant, but it wasn’t what you were originally researching? Today I want to explain P-hacking, illustrate its real-world manifestations, quantify its mathematical consequences, and outline ethical standards and best practices necessary to safeguard scientific rigor.

What is P-hacking?

P-hacking, also referred to as data dredging, data snooping, or data mining, describes the practice of analyzing data until a statistically significant result is discovered, often to substantiate a pre-existing hypothesis. This process does not necessarily involve fabricating data; rather, it entails manipulating or rearranging existing results in various ways until something that appears “significant” emerges purely by chance. It typically involves conducting multiple analyses or data manipulations until p-values fall below a predetermined threshold. This approach fundamentally deviates from the established scientific method, which mandates the formulation of a hypothesis before data analysis, allowing the data to objectively inform conclusions. As one colloquialism aptly puts it, “if you torture the data long enough, it will confess”.

P-hacking is frequently discussed in conjunction with HARKing (Hypothesizing After the Results are Known) and publication bias. HARKing involves presenting hypotheses that were formulated after examining the data as if they were established before data collection. This is often characterized as an “opportunistic malpractice” where a researcher adjusts the narrative of their paper to align with the observed data. In contrast, P-hacking is described as a “procrustean practice,” meaning “making the data fit the hypothesis.” Here, the researcher begins with a specific hypothesis and then modifies the analytical pipeline to achieve the desired statistically significant result. The primary distinction lies in the conceptual target hypothesis: it remains constant in P-hacking but may undergo considerable alteration in HARKing. However, the boundary between these two practices is not always clear; extreme forms of P-hacking may necessitate post hoc hypothesizing, blurring the lines. Both P-hacking and HARKing contribute to an increase in false-positive rates. The overlap between P-hacking and HARKing, and their shared outcome of inflated false positives, reveals that these are not isolated phenomena but rather components of a broader array of “questionable research practices” that compromise scientific rigor.

Methods of P-hacking

Main methods of P-hacking include:

  • Selective Reporting: This involves reporting only favorable outcomes or statistically significant results from a series of multiple hypothesis tests, while omitting those that do not support the initial hypothesis. This is one of the most frequently cited examples of P-hacking.

  • Multiple Testing Without Correction: Performing numerous statistical tests on a dataset without appropriately adjusting the significance level dramatically increases the probability of obtaining false positives (illustrated in the sketch after this list).

  • Discretizing Variables: Continuous variables may be arbitrarily split into categories post-hoc to generate significant findings.

  • Controlling for Covariates: Researchers might include or exclude covariates in a statistical model based on whether their inclusion yields a statistically significant result.
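To make the first two items concrete, here is a minimal sketch (my own toy example, not taken from any real study): a simulated experiment in which the treatment has no effect at all, twenty unrelated outcomes are measured, and only the tests that happen to land below p < 0.05 get “reported.”

```python
# A minimal sketch of multiple testing plus selective reporting:
# the treatment has NO real effect, yet measuring many unrelated
# outcomes and reporting only the "significant" ones still produces findings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group = 30      # participants per group (chosen for illustration)
n_outcomes = 20       # number of unrelated outcome variables measured

significant = []
for i in range(n_outcomes):
    control = rng.normal(0, 1, n_per_group)    # pure noise
    treatment = rng.normal(0, 1, n_per_group)  # same distribution: the null is true
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:
        significant.append((i, p))

print(f"'Significant' outcomes found by chance: {len(significant)} of {n_outcomes}")
for i, p in significant:
    print(f"  outcome {i}: p = {p:.3f}  <- the only result a p-hacker would report")
```

With 20 tests at α = 0.05 you expect about one spurious “discovery” per run, even though nothing real is going on.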

Examples of Potential P-hacking:

Several prominent cases illustrate the pervasive impact of P-hacking:

  • Power Posing: Early research suggested that adopting specific “power poses” could lead to improvements in mood and performance. This initial finding gained widespread attention, inspiring a best-selling book and a highly popular TED Talk. However, subsequent independent replication efforts largely failed to confirm these initial results, leading many to suspect that the original findings may have been influenced by P-hacking.

  • Brian Wansink: A well-known food researcher formerly at Cornell University, Brian Wansink was found to have engaged in scientific misconduct, including P-hacking, which resulted in the retraction of 13 of his papers that had accumulated over 20,000 citations. His documented practices included re-running analyses on adjusted datasets, stopping data collection prematurely upon observing a “lucky” significant result, and selectively reporting only findings with p-values below the conventional 0.05 threshold.

These examples underscore that P-hacking is not merely a theoretical concern but a tangible problem with substantial real-world consequences, influencing public policy, clinical guidelines, and the overall credibility of entire research fields. The recurring nature of P-hacking, particularly in high-profile cases, and the persistent pressure on researchers to publish positive findings point to a fundamental flaw in the academic and publishing incentive structures. The current system, which often prioritizes “positive” and “significant” results for publication in prestigious journals, inadvertently creates an environment where researchers feel compelled to produce such findings. This pressure, combined with the inherent flexibility in data analysis, can encourage P-hacking and selective reporting. Addressing P-hacking effectively, therefore, requires more than just educating individual researchers; it demands a systemic reform of how journals make editorial decisions, how funding bodies evaluate proposals, and how academic institutions assess career progression.

Why P-Hacking Undermines Research Rigor

In hypothesis testing, a p-value represents the probability of observing results as extreme as, or more extreme than, those obtained in a study, assuming that the null hypothesis (i.e., no true effect or difference) is true. The conventionally accepted threshold for statistical significance is a p-value of 0.05 (or α = 0.05). This threshold implies that if the null hypothesis is indeed true, there remains a 5% chance of observing a result that appears statistically significant purely by random chance. A Type I Error, commonly known as a false positive, occurs when a researcher incorrectly rejects the null hypothesis, concluding that an effect exists when, in reality, it does not. The significance level α sets the acceptable risk of committing a Type I error; choosing α = 0.05 means accepting a 5% probability of making such an error whenever the null hypothesis is true. In higher-stakes scenarios, the threshold is often lowered to 0.01 or 0.001 to further reduce the risk of committing a Type I error.
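You can check this behavior directly. Here is a small sketch (my own illustration, assuming two-sample t-tests on pure noise) showing that the observed false-positive rate of a single test tracks whatever α you choose:

```python
# Simulate many experiments where the null hypothesis is TRUE and check
# how often a single test falls below each significance threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments = 20_000
pvals = np.empty(n_experiments)

for i in range(n_experiments):
    a = rng.normal(0, 1, 30)
    b = rng.normal(0, 1, 30)   # same distribution, so any "effect" is a false positive
    pvals[i] = stats.ttest_ind(a, b).pvalue

for alpha in (0.05, 0.01, 0.001):
    rate = np.mean(pvals < alpha)
    print(f"alpha = {alpha:<6} observed false-positive rate = {rate:.4f}")
# Each observed rate should sit close to its alpha, e.g. ~0.05 for alpha = 0.05.
```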

The underlying vulnerability that P-hacking exploits is the “multiple comparisons problem.” The ability to P-hack stems from the statistical likelihood that if one searches diligently enough for patterns in enough different places within a dataset, something statistically significant will eventually emerge purely by chance. When numerous statistical tests are conducted simultaneously on the same dataset without appropriate adjustments, the probability of obtaining at least one false positive result escalates dramatically.

To illustrate this mathematically, consider a scenario where a researcher conducts k independent hypothesis tests, each with a standard significance level α (e.g., 0.05). The probability P of obtaining at least one statistically significant result (a false positive) when, in truth, no real effects exist, is given by the formula:

\[P = 1 - (1 - \alpha)^k\]
| Number of Tests | Probability of at Least One False Positive |
| --- | --- |
| 1 | $1 - (1 - 0.05)^1 = 5\%$ |
| 5 | $1 - (1 - 0.05)^5 = 22.6\%$ |
| 10 | $1 - (1 - 0.05)^{10} = 40.1\%$ |
| 20 | $1 - (1 - 0.05)^{20} = 64.2\%$ |
| 50 | $1 - (1 - 0.05)^{50} = 92.3\%$ |

This table visually reinforces the rapid escalation of false positives, making the “probabilistic nightmare” tangible. Seeing the probability jump from 5% for a single test to over 64% with just 20 tests, and past 92% with 50, vividly illustrates the danger in a way that the formula alone might not. It connects to real-world research scenarios where dozens or hundreds of comparisons are common, demonstrating how such practices, if left uncorrected, almost guarantee spurious findings. This quantitative evidence strengthens the argument that P-hacking is a serious threat to scientific validity.
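If you want to check these numbers yourself, here is a short sketch (my own, assuming each p-value is uniformly distributed on [0, 1] under a true null) that reproduces the table both from the formula and by simulation:

```python
# Reproduce the table from the formula P = 1 - (1 - alpha)^k and verify
# it with a Monte Carlo simulation of k independent tests per "study".
import numpy as np

alpha = 0.05
rng = np.random.default_rng(2)

for k in (1, 5, 10, 20, 50):
    analytic = 1 - (1 - alpha) ** k
    # Under a true null each p-value is uniform on [0, 1]; draw k of them
    # per simulated study and ask whether at least one falls below alpha.
    sims = rng.uniform(size=(100_000, k))
    simulated = np.mean((sims < alpha).any(axis=1))
    print(f"k = {k:>2}  formula = {analytic:6.1%}  simulation = {simulated:6.1%}")
```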

How to Avoid P-Hacking

Honestly, p-hacking can be mitigated with transparency in the hypothesis and data-collection processes: lock down the hypothesis to be tested before looking at the data (i.e., preregister it) and follow strict, preset guidelines when collecting and analyzing the data. When checking multiple hypotheses at once, the significance threshold must be adjusted to account for multiple comparisons.
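As a sketch of the simplest such adjustment, a Bonferroni correction divides α by the number of tests in the family (the p-values below are hypothetical, made up for illustration):

```python
# Bonferroni correction: with k tests, require p < alpha / k so that the
# familywise false-positive rate stays at or below alpha.
alpha = 0.05
p_values = [0.003, 0.020, 0.049, 0.310, 0.760]   # hypothetical results
k = len(p_values)
threshold = alpha / k                             # 0.01 for k = 5

for p in p_values:
    verdict = "significant" if p < threshold else "not significant"
    print(f"p = {p:.3f}  ->  {verdict} at the corrected threshold {threshold:.3f}")
```

In practice, packages such as statsmodels provide ready-made corrections (Bonferroni, Holm, false discovery rate) through functions like multipletests, which are less conservative than the manual version above.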

Even better would be to acknowledge the need for studies that fail to reject the null hypothesis. While not as flashy, such results matter: publication bias is a huge driver of p-hacking, as researchers are more likely to submit, and journals more likely to publish, findings that are significant, creating a bias towards positive results and rewarding methods that generate false positives.

Sources:

Detecting P-Hacking Methods: Number Analytics (I hate the fact it was AI generated, but was a useful starting point)

P-Hacking and Harking: How to Avoid Common Pitfalls in Scientific Replication: Simplifaster

When the Revolution Came for Amy Cuddy: New York Times

The prevalence of statistical reporting errors in psychology (1985–2013): Royal Society

P-Hacking: Health Journalism