p value

 

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2895822/

 

In Brief: The P Value: What Is It and What Does It Tell You?

 

Background

Medical papers today usually contain several statements based on the results of hypothesis tests, presented along with the associated p values. For example, a recent article by van Raaij et al. [1] compared the use of laterally wedged insoles with valgus braces for reducing pain or improving function in selected patients with osteoarthritis. One of the statements made in that randomized study was that “At 6 months, 71% of patients in the insole group complied with the treatment, which was greater (p = 0.015) than 45% for the brace group” [1].

Question

How does this hypothesis test address the issue of compliance between these two approaches, what information is supplied by the associated p value, and how should it be interpreted?

Discussion

The primary purpose of an hypothesis test is to decide whether the results of a study, based on a small sample, provide enough evidence against the null hypothesis (denoted H0) that it is reasonable to believe H0 is false in the larger target population, and thus to accept the associated alternative hypothesis (denoted H1) as true. The null hypothesis in this situation states that there is no meaningful clinical difference between the two treatment approaches in terms of percent compliance in the target population [1]; formally stated, the expected difference between the percent compliance in the two samples is zero. The alternative hypothesis is that there is a meaningful difference in percent compliance between the two treatments in the target population. van Raaij et al. reported a large difference of 26% between the two treatments [1]. The hypothesis test is designed to help determine whether a 26% difference is so large, and the resulting p value of 0.015 so small, that we should reject H0.

First and foremost, a p value is simply a probability. However, it is a conditional probability: its calculation is based on the assumption (condition) that H0 is true. This is the most critical concept to keep in mind, as it means that one cannot infer from the p value whether H0 is true or false. More specifically, after we assume H0 is true, the p value gives only the probability that, simply owing to the chance selection of patients from the larger (target) population, the experiment resulted in a difference between the samples as large as or larger than the 26% actually observed [1]. If a small p value suggests that chance was not responsible for the observed 26% difference, and the randomization of patients, as in this case [1], makes the presence of bias unlikely, then the most likely conclusion is that the treatments produce different compliance results in the target population.

Thus a p value is simply a measure of the strength of evidence against H0. A study with p = 0.531 offers much less evidence against H0 than a study with p = 0.058. However, a study with p = 0.058 provides evidence similar to one with p = 0.049, while a study with p = 0.049 in turn offers much less evidence than one with p = 0.015. Although a very small p value does provide strong evidence that H0 is not true, a very large p value, even one as large as 0.88, does not provide real evidence that H0 is true. For example, the alternative hypothesis might in fact be true, but owing to a small sample size the study may not have had enough power to detect that H0 was likely to be false. This notion, referred to as the power of the test, is discussed later.

Authors sometimes take a formal approach in evaluating the results of an hypothesis test. An artificial cut point, called the significance level, is chosen, and the result is called statistically significant if the p value is less than the significance level, leading to rejection of the null hypothesis. Although 5% usually is taken as the significance level, there is no real scientific reason for choosing it over any other small value. Always rejecting H0 when p is less than 5% means incorrectly rejecting a true null hypothesis 5% of the time. Moreover, as there is no real practical difference between a p value of 0.06 and one of 0.045 from a probability point of view, it is difficult to understand why this rigid interpretation has become the standard today. In the study by van Raaij et al. [1], the result is statistically significant at the 5% level, as p = 0.015. However, if a similar difference of 26% had been found in a study with only 24 patients with insoles and 22 patients with braces, the associated p value (chi-square test) would have been 0.081, a result that would be called not statistically significant. That would not have meant that there was no difference between the two treatments, only that, with the given small sample size, there was not enough evidence to reject H0.
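
This small-sample scenario is easy to check numerically. Below is a minimal sketch in Python (scipy), where the cell counts, 17 of 24 insole patients and 10 of 22 brace patients complying, are reconstructed from the stated 71% and 45% figures, and the Pearson chi-square statistic is computed without a continuity correction; both choices are assumptions for illustration, not data taken from the paper.

# Minimal sketch: chi-square test for the hypothetical small-sample study.
# The 2x2 counts are reconstructed from the 71% and 45% compliance figures.
from scipy.stats import chi2_contingency

table = [[17, 7],    # insole group: compliant, non-compliant (17/24 is ~71%)
         [10, 12]]   # brace group:  compliant, non-compliant (10/22 is ~45%)

# correction=False requests the plain Pearson chi-square statistic.
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi-square = {chi2:.3f}, p = {p:.3f}")  # p comes out near 0.081
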

Myths and Misconceptions

There are several misconceptions associated with the interpretation of a p value. One of the most common is that the p value gives the probability that H0 is true. As mentioned earlier, because the p value is calculated under the assumption that H0 is true, it cannot provide information about whether H0 is in fact true. By the same argument, p cannot be the probability that the alternative hypothesis is true. A second point is that the p value is highly dependent on the sample size. Third, it is not true that the p value is the probability that any observed difference is simply attributable to the chance selection of subjects from the target population: the p value is calculated under the assumption that chance is the only reason for observing any difference, so it cannot provide evidence for the truth of that statement. The concept of a p value is not simple, and any statements associated with it must be considered cautiously. A wealth of information and references concerning these and other misinterpretations of p values can be found on the Web. Finally, it is important to reemphasize that if the result of an hypothesis test is not statistically significant, it does not mean that there is no difference between the treatment groups in the target population.

Conclusion

The only question the p value addresses is whether the experiment provides enough evidence to reasonably reject H0. The actual p value always should be reported when presenting the results of a clinical study, because the p value, as a probability, provides a continuous measure of the evidence against H0. In the study by van Raaij et al. [1], the randomization of patients, the observed 26% difference between the treatments, and the very small p value of 0.015 suggest that rejection of the null hypothesis is reasonable. Finally, the question of just how much difference might exist between the treatments in the target population is not directly addressed by the p value. Although 26% is a reasonable estimate of that difference, a confidence interval is more appropriate for addressing that question.
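
As a minimal sketch of that suggestion, assuming purely hypothetical group sizes of 45 patients per arm (the excerpt does not report the actual numbers), a Wald 95% confidence interval for the difference in compliance proportions could be computed as follows.

# Minimal sketch: Wald 95% CI for a difference in proportions.
# The group sizes n1 and n2 are hypothetical, chosen for illustration.
from math import sqrt

p1, n1 = 0.71, 45   # insole group compliance, assumed sample size
p2, n2 = 0.45, 45   # brace group compliance, assumed sample size

diff = p1 - p2
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"difference = {diff:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
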

Reference

1. van Raaij TM, Reijman M, Brouwer RW, Bierma-Zeinstra SM, Verhaar JA. Medial knee osteoarthritis treated by insoles or braces: a randomized trial. Clin Orthop Relat Res. 2010 Feb 23. (Epub ahead of print)
 
 
 
 
http://www.stat.ualberta.ca/~hooper/teaching/misc/Pvalue.pdf
 
 
What is a P-value?
I have found that many students are unsure about the interpretation of P-values and other concepts related to tests of significance. These ideas are used repeatedly in various applications, so it is important that they be understood. I will explain the concepts in general terms first, then their application in the problem of assessing normality.
We wish to test a null hypothesis against an alternative hypothesis using a dataset. The two hypotheses specify two statistical models for the process that produced the data. The alternative hypothesis is what we expect to be true if the null hypothesis is false. We cannot prove that the alternative hypothesis is true, but we may be able to demonstrate that the alternative is much more plausible than the null hypothesis given the data. This demonstration is usually expressed in terms of a probability (a P-value) quantifying the strength of the evidence against the null hypothesis in favor of the alternative.
We ask whether the data appear to be consistent with the null hypothesis or whether it is unlikely that we would obtain data of this kind if the null hypothesis were true, assuming that at least one of the two hypotheses is true. We address this question by calculating the value of a test statistic, i.e., a particular real-valued function of the data. To decide whether the value of the test statistic is consistent with the null hypothesis, we need to know what sampling variability to expect in our test statistic if the null hypothesis is true. In other words, we need to know the null distribution, the distribution of the test statistic when the null hypothesis is true. In many applications, the test statistic is defined so that its null distribution is a “named” distribution for which tables are widely accessible, e.g., the standard normal distribution, the Binomial distribution with n = 100 and p = 1/2, the t distribution with 4 degrees of freedom, the chi-square distribution with 23 degrees of freedom, or the F distribution with 2 and 20 degrees of freedom.
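
As an aside, tail areas of these named null distributions no longer require printed tables. A minimal sketch using scipy's survival functions, with arbitrary illustrative statistic values:

# Minimal sketch: right-tail areas for the "named" distributions above.
# The statistic values passed in are arbitrary illustrations.
from scipy.stats import norm, binom, t, chi2, f

print(norm.sf(1.96))             # standard normal, area to the right of 1.96
print(binom.sf(59, 100, 0.5))    # Binomial(n=100, p=1/2): P(X > 59)
print(t.sf(2.0, df=4))           # t distribution with 4 degrees of freedom
print(chi2.sf(35.0, df=23))      # chi-square with 23 degrees of freedom
print(f.sf(3.5, dfn=2, dfd=20))  # F with 2 and 20 degrees of freedom
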
Now, given the value of the test statistic (a number) and the null distribution of the test statistic (a theoretical distribution usually represented by a probability density), we want to see whether the test statistic is in the middle of the distribution (consistent with the null hypothesis) or out in a tail of the distribution (making the alternative hypothesis seem more plausible). Sometimes we will want to consider the right-hand tail, sometimes the left-hand tail, and sometimes both tails, depending on how the test statistic and alternative hypothesis are defined. Suppose that large positive values of the test statistic seem more plausible under the alternative hypothesis than under the null hypothesis. Then we want a measure of how far out our test statistic is in the right-hand tail of the null distribution. The P-value provides a measure of this distance. The P-value (in this situation) is the probability to the right of our test statistic calculated using the null distribution. The further out the test statistic is in the tail, the smaller the P-value, and the stronger the evidence against the null hypothesis in favor of the alternative.
The P-value can be interpreted in terms of a hypothetical repetition of the study. Suppose the null hypothesis is true and a new dataset is obtained independently of the first dataset but using the same sampling procedure. If the new dataset is used to calculate a new value of the test statistic (same formula but new data), what is the probability that the new value will be further out in the tail (assuming a one-tailed test) than the original value? This probability is the P-value.
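
This repetition interpretation can be illustrated directly by simulation. A minimal sketch, assuming a one-sample z statistic and a standard normal null model; both choices are illustrative, not taken from the text.

# Minimal sketch: the hypothetical-repetition reading of the P-value.
# Draw many new datasets under H0 and count how often the new statistic
# lands further out in the right tail than the original one.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, reps = 25, 100_000

original = rng.normal(0.0, 1.0, size=n)  # the "first" dataset, drawn under H0
z_orig = np.sqrt(n) * original.mean()    # one-sample z statistic

new_data = rng.normal(0.0, 1.0, size=(reps, n))  # hypothetical repetitions
z_new = np.sqrt(n) * new_data.mean(axis=1)

p_mc = np.mean(z_new >= z_orig)          # fraction further out in the tail
print(f"Monte Carlo P = {p_mc:.4f}, exact P = {norm.sf(z_orig):.4f}")
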
The P-value is often incorrectly interpreted as the probability that the null hypothesis is true. Try not to make this mistake. In a frequentist interpretation of probability, there is nothing random about whether the hypothesis is true; the randomness is in the process generating the data. One can interpret “the probability that the null hypothesis is true” using subjective probability, a measure of one’s belief that the null hypothesis is true. One can then calculate this subjective probability by specifying a prior probability (subjective belief before looking at the data) that the null hypothesis is true, and then using the data and the model to update one’s subjective probability. This is called the Bayesian approach because Bayes’ Theorem is used to update subjective probabilities to reflect new information.
When reporting a P-value to persons unfamiliar with statistics, it is often necessary to use descriptive language to indicate the strength of the evidence. I tend to use the following sort of language. Obviously the cut-offs are somewhat arbitrary, and another person might use different language.
P > 0.10: No evidence against the null hypothesis. The data appear to be consistent with the null hypothesis.
0.05 < P < 0.10: Weak evidence against the null hypothesis in favor of the alternative.
0.01 < P < 0.05: Moderate evidence against the null hypothesis in favor of the alternative.
0.001 < P < 0.01: Strong evidence against the null hypothesis in favor of the alternative.
P < 0.001: Very strong evidence against the null hypothesis in favor of the alternative.
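
These cut-offs are simple to encode. A small Python sketch of the wording above (the thresholds remain as arbitrary as the table they come from):

# Minimal sketch: map a P-value to the descriptive language in the table.
def describe_evidence(p: float) -> str:
    if p > 0.10:
        return "no evidence against the null hypothesis"
    if p > 0.05:
        return "weak evidence against the null hypothesis"
    if p > 0.01:
        return "moderate evidence against the null hypothesis"
    if p > 0.001:
        return "strong evidence against the null hypothesis"
    return "very strong evidence against the null hypothesis"

print(describe_evidence(0.015))  # moderate evidence against the null hypothesis
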
In using this kind of language, one should keep in mind the difference between statistical significance and practical significance. In a large study one may obtain a small P-value even though the magnitude of the effect being tested is too small to be of importance (see the discussion of power below). It is a good idea to support a P-value with a confidence interval for the parameter being tested.
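
A minimal simulation of this point, assuming a true effect of 0.02 standard deviations (far too small to matter in most applications) and a very large sample; both numbers are illustrative.

# Minimal sketch: statistical vs practical significance.
# With huge samples, a trivially small true effect still yields a tiny P-value.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n = 200_000
a = rng.normal(0.00, 1.0, size=n)  # control
b = rng.normal(0.02, 1.0, size=n)  # tiny true effect of 0.02 SD

t_stat, p = ttest_ind(a, b)
print(f"P = {p:.2e}")  # typically very small despite the negligible effect
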
A P-value can also be reported more formally in terms of a fixed level α test. Here α is a number selected independently of the data, usually 0.05 or 0.01, more rarely 0.10. We reject the null hypothesis at level α if the P-value is smaller than α; otherwise we fail to reject the null hypothesis at level α. I am not fond of this kind of language because it suggests a more definite, clear-cut answer than is often available. There is essentially no difference between a P-value of 0.051 and 0.049. In some situations it may be necessary to proceed with some course of action based on our belief in whether the null or alternative hypothesis is true. More often, it seems better to report the P-value as a measure of evidence.
A fixed level α test can be calculated without first calculating a P-value. This is done by comparing the test statistic with a critical value of the null distribution corresponding to the level α. This is usually the easiest approach when doing hand calculations and using statistical tables, which provide percentiles for a relatively small set of probabilities. Most statistical software produces P-values, which can be compared directly with α. There is no need to repeat the calculation by hand.
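
A minimal sketch of the two equivalent ways of carrying out a fixed level α test, assuming a right-tailed z test with an illustrative statistic value:

# Minimal sketch: fixed-level test via critical value vs via P-value.
# Both comparisons always produce the same accept/reject decision.
from scipy.stats import norm

alpha = 0.05
z = 1.746                       # hypothetical observed test statistic

critical = norm.ppf(1 - alpha)  # right-tail critical value (about 1.645)
p_value = norm.sf(z)            # right-tail P-value

print(z > critical)             # True -> reject H0 at level alpha
print(p_value < alpha)          # True -> the same decision
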
Fixed level α tests are needed for discussing the power of a test, a useful concept when planning a study. Suppose we are comparing a new medical treatment with a standard treatment, the control. The null hypothesis is that of no treatment effect (no difference between treatment and control). The alternative hypothesis is that the treatment effect (mean difference of treatment minus control using some outcome variable) is positive. We want to have a good chance of reporting a small P-value assuming the alternative hypothesis is true and the magnitude of the effect is large enough to be of practical importance. The power of a level α test is defined to be the probability that the null hypothesis will be rejected at level α (i.e., the P-value will be less than α) assuming the alternative hypothesis is true. The power generally depends on the variability of the data (lower variance, higher power), the sample size (higher n, higher power), and the magnitude of the effect (larger effect, higher power).
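
Power can be estimated by simulation along exactly these lines. A minimal sketch, assuming a two-sample t test, 30 patients per group, a true treatment effect of 0.5 standard deviations, and α = 0.05; all of these values are illustrative.

# Minimal sketch: power of a fixed-level test, estimated by simulation.
# Power = fraction of studies, run under the alternative, with P < alpha.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
alpha, n, effect, sims = 0.05, 30, 0.5, 2_000

rejections = 0
for _ in range(sims):
    control = rng.normal(0.0, 1.0, size=n)
    treated = rng.normal(effect, 1.0, size=n)
    # One-sided test: the alternative says the treatment effect is positive.
    p = ttest_ind(treated, control, alternative="greater").pvalue
    rejections += p < alpha

print(f"estimated power = {rejections / sims:.2f}")  # about 0.60 here
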
posted @ 2017-11-11 18:55 papering