Vancouver, BC

“[The] p-curve [is] a way to distinguish between selective reporting and truth. P-curve is the distribution of statistically significant p values for a set of independent findings. Its shape is diagnostic of the evidential value of that set of findings. We say that a set of significant findings contains evidential value when we can rule out selective reporting as the sole explanation of those findings.” (Simonsohn et al., 2014, p. 534)

- Input: test statistics (\(z\), \(t\), \(F\), etc.)
- Output: Histogram and two (main) tests
- Heuristic: if significant findings are too close to .05, something is wrong.

**Tests are constructed incorrectly**

- Result: incorrect assessment of evidence

**No justification for meta-analytic grouping**

- Result: Debates over “proper” groupings are undecidable

- (…but, other critiques too)

- Test statistic: contains evidence relevant to parameter of interest
- Sampling distribution: Distribution of test statistic under hypotheses
- *p* value: probability of obtaining as much evidence against a hypothesis, assuming it is true

Test 1: “evidential value”

- Is the test statistic surprisingly large among significant test statistics if \(\delta=0\)? \(\rightarrow\) **\(\delta\) is larger than 0**

Test 2: “Lack of evidential value”

- Is the test statistic surprisingly small among significant test statistics if \(\delta^2\geq 2.34/N\)? \(\rightarrow\) **\(\delta^2\) is smaller than \(2.34/N\)**
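The constant 2.34 in Test 2 is not derived on the slide. One plausible reading, offered here as an assumption and not as the authors' stated derivation, is that it is the squared noncentrality at which a two-sided \(z\) test at \(\alpha=.05\) has power 1/3, with noncentrality \(\delta\sqrt{N}\). The sketch below solves for that noncentrality by bisection; the function names are illustrative only.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, stdlib only


def power_two_sided_z(ncp, alpha=0.05):
    """Power of a two-sided z test when the statistic is N(ncp, 1)."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (1 - STD_NORMAL.cdf(crit - ncp)) + STD_NORMAL.cdf(-crit - ncp)


def ncp_for_power(target, alpha=0.05, lo=0.0, hi=10.0, tol=1e-10):
    """Bisection search; power increases with ncp on [0, inf)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if power_two_sided_z(mid, alpha) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Assumption: the Test 2 benchmark is power = 1/3 and ncp = delta * sqrt(N).
ncp = ncp_for_power(1 / 3)
ncp_squared = ncp ** 2  # compare with the 2.34 in delta^2 >= 2.34 / N
print(round(ncp, 3), round(ncp_squared, 2))
```

Under these assumptions the squared noncentrality lands close to 2.34, so the null in Test 2 can be read as "the studies had at least one-third power".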

Test 1: “evidential value”

- Compute all *right-tailed* p values
- Transform to \(\chi^2\)/normal deviates (under null)
- Average
- Compute overall one-tailed p value

Test 2: “Lack of evidential value”

- Compute all *left-tailed* p values
- Transform to \(\chi^2\)/normal deviates (under null)
- Average
- Compute overall one-tailed p value

- Problem: Failure to respect evidential asymmetry of p values
- Result: Over-sensitivity to values near \(\alpha\)

- Problem: Failure to use same test statistic for both tests
- Result: Evidence in data is not respected
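One way to see the over-sensitivity named in the first problem is to pool normal deviates and watch what a single barely significant result does. This is a minimal sketch, assuming two-sided p values, \(\alpha=.05\), and Stouffer-style pooling (sum of deviates divided by \(\sqrt{k}\)); the p values are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.05


def stouffer_right_tail(p_values, alpha=ALPHA):
    """Pool significant p values in the style of the 'evidential value' test.

    Under the null, p / alpha is uniform on (0, 1) among significant results.
    Each value is mapped to a normal deviate and the deviates are pooled.
    """
    deviates = [STD_NORMAL.inv_cdf(p / alpha) for p in p_values]
    pooled = sum(deviates) / sqrt(len(deviates))
    return STD_NORMAL.cdf(pooled)


# Hypothetical data: three strong results (roughly z = 3.1 each).
strong = [0.0019, 0.0019, 0.0019]
p_strong = stouffer_right_tail(strong)

# Same three results plus one result just under alpha.
p_with_marginal = stouffer_right_tail(strong + [0.04999])

print(p_strong, p_with_marginal)
```

Because the deviate for \(p/\alpha\) grows without bound as \(p\) approaches \(\alpha\), one result just under .05 can outweigh several strong results in the pooled statistic.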

“If a set of studies can be meaningfully partitioned into subsets, it is the job of the individual who is p-curving to determine if such partitioning should be performed, in much the same way that it is the job of the person analyzing experimental results to decide if a given effect should be tested on all observations combined or if a moderating factor is worth exploring. Heterogeneity, then, poses a challenge of interpretation, not of statistical inference.” (Simonsohn et al., 2014, p. 536)

**But what is the statistical inference?**

- Study 1: Gravitational waves, \(z=3.1, p<.0025\)

p curve | “Evidential value” | “Lack of evidential value” |
---|---|---|
Their app (v. 4.052) | p=0.039 | p=0.826 |
LR test | p=0.039 | p=0.826 |

**But what is the statistical inference?**

- Study 1: Gravitational waves, \(z=3.1, p<.0025\)
- Study 2: Power posing, \(z=2, p=.045\)

p curve | “Evidential value” | “Lack of evidential value” |
---|---|---|
Their app (v. 4.052) | p=0.382 | p=0.292 |
LR test | p=0.175 | p=0.486 |
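The “Their app” rows in the two tables can be approximated with a few lines of code. This is a sketch under stated assumptions, not the app's actual source: two-sided \(z\) tests at \(\alpha=.05\), a Test 2 null with noncentrality 1.5286 (squared, about 2.34, i.e. power near 1/3), and Stouffer pooling. It does not attempt the LR test row. All function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()
ALPHA = 0.05
CRIT = STD.inv_cdf(1 - ALPHA / 2)
NCP_33 = 1.5286  # assumed Test 2 noncentrality (ncp^2 is about 2.34)


def prob_more_extreme(z, ncp):
    """P(|Z| >= |z|) when Z ~ N(ncp, 1)."""
    z = abs(z)
    return (1 - STD.cdf(z - ncp)) + STD.cdf(-z - ncp)


def pp_right(z):
    """Test 1: P(statistic at least this large | significant, ncp = 0)."""
    return prob_more_extreme(z, 0.0) / prob_more_extreme(CRIT, 0.0)


def pp_left(z):
    """Test 2: P(statistic at most this large | significant, ncp = NCP_33)."""
    return 1 - prob_more_extreme(z, NCP_33) / prob_more_extreme(CRIT, NCP_33)


def stouffer(pp_values):
    """Map each pp value to a normal deviate, pool, return a one-tailed p."""
    deviates = [STD.inv_cdf(pp) for pp in pp_values]
    return STD.cdf(sum(deviates) / sqrt(len(deviates)))


def p_curve_tests(z_values):
    """Return (evidential value p, lack of evidential value p)."""
    return (stouffer([pp_right(z) for z in z_values]),
            stouffer([pp_left(z) for z in z_values]))


one_study = p_curve_tests([3.1])         # gravitational waves only
two_studies = p_curve_tests([3.1, 2.0])  # plus power posing
print([round(p, 3) for p in one_study])
print([round(p, 3) for p in two_studies])
```

Under these assumptions the results come out close to the app rows above: adding the second, unrelated study moves Test 1 from significant to clearly non-significant, which is the point of the example.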

You have to justify the *joining*, not the *splitting*!

**But without a process, all sets have equal claim.** (Reference class problem; Venn, 1888)