The p curve is not what you think it is


Richard D. Morey

Psychonomic Society Meeting 2017
Vancouver, BC

What is a \(p\) curve analysis?

“[The] p-curve [is] a way to distinguish between selective reporting and truth. P-curve is the distribution of statistically significant p values for a set of independent findings. Its shape is diagnostic of the evidential value of that set of findings. We say that a set of significant findings contains evidential value when we can rule out selective reporting as the sole explanation of those findings.” (Simonsohn et al., 2014, p. 534)

What is a \(p\) curve analysis?

  • Input: test statistics (\(z\), \(t\), \(F\), etc.)
  • Output: Histogram and two (main) tests
  • Heuristic: if significant findings are too close to .05, something is wrong.

Two main critiques

  • Tests are constructed incorrectly
    • Result: incorrect assessment of evidence
  • No justification for meta-analytic grouping
    • Result: Debates over “proper” groupings are undecidable
  • (…but, other critiques too)

Building a significance test

  • Test statistic: contains evidence relevant to parameter of interest
  • Sampling distribution: Distribution of test statistic under hypotheses
  • p value: probability of obtaining at least as much evidence against a hypothesis as was observed, assuming the hypothesis is true
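
As a worked instance of these three ingredients, take a two-sided \(z\) test (the kind of test behind the \(z=2, p=.045\) example later in the talk). The symbol \(\Phi\), the standard normal distribution function, is notation introduced here for illustration. The test statistic is \(z\), its sampling distribution under \(\delta=0\) is standard normal, and the p value is

\[
p = P\left(|Z| \geq |z| \mid \delta = 0\right) = 2\left(1 - \Phi(|z|)\right),
\qquad
z = 2 \;\Rightarrow\; p = 2\,(1 - 0.97725) \approx .0455 .
\]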

Two \(z\) tests

Test 1: “evidential value”

  • Is the test statistic surprisingly large among significant test statistics if \(\delta=0\)? \(\rightarrow\) \(\delta\) is larger than 0

Test 2: “Lack of evidential value”

  • Is the test statistic surprisingly small among significant test statistics if \(\delta^2\geq 2.34/N\)? \(\rightarrow\) \(\delta^2\) is smaller than \(2.34/N\)
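
The bound \(2.34/N\) in Test 2 can be related to a power benchmark. As a minimal sketch, assume a two-sided \(z\) test at \(\alpha=.05\) whose statistic is distributed N(\(\delta\sqrt{N}\), 1). This parameterization and the helper names below are illustrative assumptions, not something shown in the talk. The code finds the noncentrality at which power equals 1/3, the benchmark Simonsohn et al. (2014) use for this test, and squares it.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.05
Z_CRIT = STD_NORMAL.inv_cdf(1 - ALPHA / 2)  # two-sided critical value, about 1.96


def power_two_sided(ncp: float) -> float:
    """Power of a two-sided z test at ALPHA when the statistic is N(ncp, 1)."""
    return (1 - STD_NORMAL.cdf(Z_CRIT - ncp)) + STD_NORMAL.cdf(-Z_CRIT - ncp)


def ncp_for_power(target: float, lo: float = 0.0, hi: float = 10.0) -> float:
    """Bisection search for the noncentrality giving the target power.

    Power is increasing in ncp on [0, 10], so bisection is sufficient.
    """
    for _ in range(200):
        mid = (lo + hi) / 2
        if power_two_sided(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


ncp = ncp_for_power(1 / 3)
# Under the assumed parameterization ncp = delta * sqrt(N),
# so delta^2 = ncp^2 / N.
print(f"ncp = {ncp:.4f}, ncp^2 = {ncp ** 2:.2f}")
```

Under these assumptions the squared noncentrality comes out at about 2.34, which is where a bound of the form \(\delta^2 = 2.34/N\) would come from.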

Combining p values

Test 1: “evidential value”

  1. Compute all right-tailed p values
  2. Transform to \(\chi^2\)/normal deviates (under null)
  3. Average
  4. Compute overall one-tailed p value

Test 2: “Lack of evidential value”

  1. Compute all left-tailed p values
  2. Transform to \(\chi^2\)/normal deviates (under null)
  3. Average
  4. Compute overall one-tailed p value
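
The two four-step recipes above can be sketched in code. This is an illustrative reconstruction under stated assumptions, not the app's actual source. It assumes two-sided \(z\) tests at \(\alpha=.05\), the normal-deviate (Stouffer) variant of step 2, and a Test 2 null with noncentrality 1.5286, whose square is about 2.34. Step 3 is implemented as the sum of deviates divided by \(\sqrt{k}\), which is the average rescaled to unit variance. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.05
Z_CRIT = STD_NORMAL.inv_cdf(1 - ALPHA / 2)
NCP_33 = 1.5286  # assumed noncentrality for about 33% power (ncp^2 is about 2.34)


def tail_prob(z: float, ncp: float) -> float:
    """P(|Z| >= |z|) when Z ~ N(ncp, 1)."""
    z = abs(z)
    return (1 - STD_NORMAL.cdf(z - ncp)) + STD_NORMAL.cdf(-z - ncp)


def pp_right(z: float, ncp: float) -> float:
    """Right-tailed p value conditional on the result being significant."""
    if abs(z) <= Z_CRIT:
        raise ValueError("only significant results (|z| > critical value) enter")
    return tail_prob(z, ncp) / tail_prob(Z_CRIT, ncp)


def stouffer(pps: list[float]) -> float:
    """Steps 2-4: normal deviates, rescaled average, overall one-tailed p."""
    deviates = [STD_NORMAL.inv_cdf(pp) for pp in pps]
    return STD_NORMAL.cdf(sum(deviates) / sqrt(len(deviates)))


def evidential_value(zs: list[float]) -> float:
    """Test 1: right-tailed conditional p values under delta = 0."""
    return stouffer([pp_right(z, 0.0) for z in zs])


def lack_of_evidential_value(zs: list[float]) -> float:
    """Test 2: left-tailed conditional p values under the 33%-power null."""
    return stouffer([1 - pp_right(z, NCP_33) for z in zs])


# The two z values used as examples later in the talk.
for zs in ([3.1], [3.1, 2.0]):
    print(zs, round(evidential_value(zs), 3), round(lack_of_evidential_value(zs), 3))
```

With these assumptions, the single study \(z=3.1\) gives values near 0.04 and 0.83, and adding \(z=2\) moves them to roughly 0.38 and 0.29, in line with the figures the talk reports for the app.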

Statistical problems

  • Problem: Failure to respect evidential asymmetry of p values
    • Result: Over-sensitivity to values near \(\alpha\)
  • Problem: Failure to use same test statistic for both tests
    • Result: Evidence in data is not respected
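
One way to see the sensitivity to values near \(\alpha\) is to watch the combined Test 1 p value as a single added study approaches the significance bound. The sketch below assumes two-sided \(z\) tests at \(\alpha=.05\) and a Stouffer-style combination of normal deviates. The \(z\) values for the added study are hypothetical, chosen only for illustration, and the function names are not from the talk.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.05


def pp_null(z: float) -> float:
    """Two-sided p value rescaled to (0, 1) among significant results."""
    p = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    if p >= ALPHA:
        raise ValueError("only significant results enter the p curve")
    return p / ALPHA


def combined_evidential_p(zs: list[float]) -> float:
    """Stouffer-style combination of the rescaled p values (assumed variant)."""
    deviates = [STD_NORMAL.inv_cdf(pp_null(z)) for z in zs]
    return STD_NORMAL.cdf(sum(deviates) / sqrt(len(deviates)))


strong = [3.1, 3.1, 3.1]  # three hypothetical studies, each with z = 3.1
print("strong studies alone:", round(combined_evidential_p(strong), 4))

# Add one more significant study and move it toward the bound (about 1.96).
for z_extra in (2.5, 2.0, 1.97, 1.961):
    combined = combined_evidential_p(strong + [z_extra])
    print(f"plus z = {z_extra}: {combined:.4f}")
```

As the rescaled p value of the added study approaches 1, its normal deviate grows without bound, so a single barely significant result can outweigh several strong ones in the combined statistic.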

Evidential asymmetry

Sensitivity to bound

Likelihood ratio

Evidential failure

What is a “set”?

“If a set of studies can be meaningfully partitioned into subsets, it is the job of the individual who is p curving to determine if such partitioning should be performed, in much the same way that it is the job of the person analyzing experimental results to decide if a given effect should be tested on all observations combined or if a moderating factor is worth exploring. Heterogeneity, then, poses a challenge of interpretation, not of statistical inference.” (Simonsohn et al., 2014, p. 536)

What is a “set”?

But what is the statistical inference?

  • Study 1: Gravitational waves, \(z=3.1, p<.0025\)



p curve                 “Evidential value”    “Lack of evidential value”
Their app (v. 4.052)    p=0.039               p=0.826
LR test                 p=0.039               p=0.826

What is a “set”?

But what is the statistical inference?

  • Study 1: Gravitational waves, \(z=3.1, p<.0025\)
  • Study 2: Power posing, \(z=2, p=.045\)


p curve                 “Evidential value”    “Lack of evidential value”
Their app (v. 4.052)    p=0.382               p=0.292
LR test                 p=0.175               p=0.486

What is a “set”?


You have to justify the joining, not the splitting!


But without a process, all sets have equal claim. (Reference class problem; Venn, 1888)

NOT a hypothetical problem