p value

 

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2895822/

 

In Brief: The P Value: What Is It and What Does It Tell You?

 

Background

Medical papers today usually contain several statements based on the results of hypothesis tests, presented along with the associated p values. For example, a recent article by van Raaij et al. [1] compared laterally wedged insoles with valgus braces for reducing pain and improving function in selected patients with osteoarthritis. One of the statements made in that randomized study was that “At 6 months, 71% of patients in the insole group complied with the treatment, which was greater (p = 0.015) than 45% for the brace group” [1].

Question

How does this hypothesis test address the issue of compliance between these two approaches, what information is supplied by the associated p value, and how should it be interpreted?

Discussion

The primary purpose of a hypothesis test is to decide whether the results of a study, based on a small sample, provide enough evidence against the null hypothesis (denoted H0) that it is reasonable to believe H0 is false in the larger target population, and thus to accept the associated alternative hypothesis (denoted H1) as true. The null hypothesis in this situation states that there is no meaningful clinical difference between the two treatment approaches in terms of percent compliance in the target population [1]; formally stated, the expected difference between the percent compliance in the two samples is zero. The alternative hypothesis is that there is a meaningful difference in percent compliance between the two treatments in the target population. van Raaij et al. reported a large difference of 26% between the two treatments [1]. The hypothesis test is designed to help determine whether a 26% difference is so large, and the resulting p value of 0.015 so small, that we should reject H0.

First and foremost, a p value is simply a probability. However, it is a conditional probability, in that its calculation is based on the assumption (condition) that H0 is true. This is the most critical concept to keep in mind, as it means that one cannot infer from the p value whether H0 is true or false. More specifically, after we assume H0 is true, the p value gives only the probability that, simply owing to the chance selection of patients from the larger (target) population, the experiment resulted in a difference between the samples as large as or larger than the 26% actually observed [1]. If the resulting small p value suggests that chance was not responsible for the observed difference of 26%, and the randomization of patients, as in this case [1], makes the presence of bias unlikely, then the most likely conclusion is that the treatments produce different compliance results in the target population.
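To make this conditional probability concrete, here is a minimal Python sketch that simulates the trial under the assumption that H0 is true, that is, that both arms share one common compliance rate, and counts how often chance alone produces a gap of 26% or more. The group sizes of 45 patients per arm are assumed purely for illustration; the excerpt does not give the study's actual numbers.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical group sizes; the excerpt does not give the trial's actual ns.
    n_insole, n_brace = 45, 45

    # Under H0 both arms share one compliance rate; use the pooled rate
    # implied by the observed 71% and 45%.
    pooled_rate = (0.71 * n_insole + 0.45 * n_brace) / (n_insole + n_brace)

    n_sims = 200_000
    insole_compliers = rng.binomial(n_insole, pooled_rate, n_sims)
    brace_compliers = rng.binomial(n_brace, pooled_rate, n_sims)
    diff = insole_compliers / n_insole - brace_compliers / n_brace

    # Two-sided: how often does chance alone give a gap at least as large as 26%?
    p_sim = np.mean(np.abs(diff) >= 0.26)
    print(f"simulated p value: {p_sim:.4f}")  # on the order of 0.01 for ns near 45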

Thus a p value is simply a measure of the strength of the evidence against H0. A study with p = 0.531 offers much less evidence against H0 than a study with p = 0.058. However, a study with p = 0.058 provides evidence similar to a study with p = 0.049, while a study with p = 0.049 offers much less evidence than one with p = 0.015. Although a very small p value does provide strong evidence that H0 is not true, a very large p value, even one as large as 0.88, does not provide real evidence that H0 is true. For example, the alternative hypothesis might in fact be true, but owing to a small sample size the study did not have enough power to detect that H0 was likely to be false. This notion, referred to as the power of the test, is discussed later.

Authors sometimes take a formal approach in evaluating the results of a hypothesis test. An artificial cut point, called the significance level, is chosen, and the result is called statistically significant if the p value is less than the significance level, leading to the rejection of the null hypothesis. Although 5% usually is taken as the significance level, there is no real scientific reason for choosing it over any other small value. Always rejecting H0 when p is less than 5% results in an incorrect rejection of the null hypothesis, when H0 is in fact true, 5% of the time. However, as there is no real practical difference between a p value of 0.06 and one of 0.045 from a probability point of view, it is difficult to understand why this rigid interpretation has become the standard today. In the study by van Raaij et al. [1], the result is statistically significant at the 5% level as p = 0.015. However, if a similar difference of 26% had been found in a study with only 24 patients in the insole group and 22 in the brace group, the associated p value (chi-square test) would have been 0.081, a result that would be called not statistically significant. That would not have meant that there was no difference between the two treatments, only that, with the given small sample size, there was not enough evidence to reject H0.
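That hypothetical small-sample result can be reproduced with a chi-square test on a 2 × 2 table. In the sketch below, the compliance counts (17 of 24 ≈ 71%, 10 of 22 ≈ 45%) are inferred from the quoted percentages rather than taken from the article:

    from scipy.stats import chi2_contingency

    # 2x2 table: rows = treatment group, columns = complied / did not comply.
    # Counts are inferred from the quoted percentages (17/24 ~ 71%, 10/22 ~ 45%),
    # not taken from the article itself.
    table = [[17, 7],   # insole group
             [10, 12]]  # brace group

    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # p ~ 0.081, matching the text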

Myths and Misconceptions

There are several misconceptions associated with the interpretation of a p value. One of the most common is that the p value gives the probability that H0 is true. As mentioned earlier, because the p value is calculated on the assumption that H0 is true, it cannot provide information about whether H0 is in fact true. The same argument shows that p cannot be the probability that the alternative hypothesis is true. It also is not true that the p value is the probability that any observed difference is simply attributable to the chance selection of subjects from the target population: the p value is calculated on the assumption that chance is the only reason for any observed difference, so it cannot provide evidence for the truth of that statement. In addition, the p value depends heavily on the sample size. The concept of a p value is not simple, and any statements associated with it must be considered cautiously. A wealth of information and references concerning these and other misinterpretations of p values can be found on the Web. Finally, it is important to reemphasize that if the result of a hypothesis test is not statistically significant, it does not mean that there is no difference between the treatment groups in the target population.

Conclusion

The only question the p value addresses is whether the experiment provides enough evidence to reasonably reject H0. The actual p value always should be reported when presenting the results of a clinical study because the p value, as a probability, provides a continuous measure of the evidence against H0. In the study by van Raaij et al. [1], the randomization of the patients, the observed difference of 26% between the treatments, and the very small p value of 0.015 suggest that rejection of the null hypothesis is reasonable. Finally, the question of just how much difference might exist between the treatments in the target population is not directly addressed by the p value. Although 26% is a reasonable estimate of that difference, a confidence interval is more appropriate for addressing that question.

Reference

1. van Raaij TM, Reijman M, Brouwer RW, Bierma-Zeinstra SM, Verhaar JA. Medial knee osteoarthritis treated by insoles or braces: a randomized trial. Clin Orthop Relat Res. 2010 Feb 23. [Epub ahead of print]
 
http://www.stat.ualberta.ca/~hooper/teaching/misc/Pvalue.pdf
 
 
What is a P-value?
I have found that many students are unsure about the interpretation of P-values and other
concepts related to tests of significance. These ideas are used repeatedly in various
applications so it is important that they be understood. I will explain the concepts in
general terms first, then their application in the problem of assessing normality.
We wish to test a null hypothesis against an alternative hypothesis using a dataset. The two
hypotheses specify two statistical models for the process that produced the data. The
alternative hypothesis is what we expect to be true if the null hypothesis is false. We
cannot prove that the alternative hypothesis is true but we may be able to demonstrate that
the alternative is much more plausible than the null hypothesis given the data. This
demonstration is usually expressed in terms of a probability (a P-value) quantifying the
strength of the evidence against the null hypothesis in favor of the alternative.
We ask whether the data appear to be consistent with the null hypothesis or whether it is
unlikely that we would obtain data of this kind if the null hypothesis were true, assuming
that at least one of the two hypotheses is true. We address this question by calculating the
value of a test statistic, i.e., a particular real-valued function of the data. To decide whether
the value of the test statistic is consistent with the null hypothesis, we need to know what
sampling variability to expect in our test statistic if the null hypothesis is true. In other
words, we need to know the null distribution, the distribution of the test statistic when the
null hypothesis is true. In many applications, the test statistic is defined so that its null
distribution is a “named” distribution for which tables are widely accessible; e.g., the
standard normal distribution, the Binomial distribution with n = 100 and p = 1/2, the t
distribution with 4 degrees of freedom, the chi-square distribution with 23 degrees of
freedom, the F distribution with 2 and 20 degrees of freedom.
Now, given the value of the test statistic (a number), and the null distribution of the test
statistic (a theoretical distribution usually represented by a probability density), we want to
see whether the test statistic is in the middle of the distribution (consistent with the null
hypothesis) or out in a tail of the distribution (making the alternative hypothesis seem more
plausible). Sometimes we will want to consider the right-hand tail, sometimes the left-hand
tail, and sometimes both tails, depending on how the test statistic and alternative hypothesis
are defined. Suppose that large positive values of the test statistic seem more plausible
under the alternative hypothesis than under the null hypothesis. Then we want a measure
of how far out our test statistic is in the right-hand tail of the null distribution. The P-value
provides a measure of this distance. The P-value (in this situation) is the probability to the
right of our test statistic calculated using the null distribution. The further out the test
statistic is in the tail, the smaller the P-value, and the stronger the evidence against the null
hypothesis in favor of the alternative.
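In terms of the named distributions listed above, computing a P-value is one call to a tail-probability (survival) function. A sketch in Python, with made-up statistic values chosen only to illustrate:

    from scipy.stats import norm, t, chi2, f

    # Right-tail P-values: probability, under the null distribution, of a value
    # at least as extreme as the observed test statistic (illustrative numbers).
    print(norm.sf(1.96))         # standard normal, z = 1.96  -> ~0.025
    print(t.sf(2.13, df=4))      # t with 4 degrees of freedom
    print(chi2.sf(35.2, df=23))  # chi-square with 23 degrees of freedom
    print(f.sf(3.49, 2, 20))     # F with 2 and 20 degrees of freedom

    # For a two-tailed test with a symmetric null distribution, double the
    # one-tail probability:
    print(2 * norm.sf(1.96))     # -> ~0.05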
The P-value can be interpreted in terms of a hypothetical repetition of the study. Suppose
the null hypothesis is true and a new dataset is obtained independently of the first dataset
but using the same sampling procedure. If the new dataset is used to calculate a new value
of the test statistic (same formula but new data), what is the probability that the new value
will be further out in the tail (assuming a one-tailed test) than the original value? This
probability is the P-value.
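This repetition interpretation can be checked directly by simulation. A sketch under an assumed setup not in the text: the null hypothesis that a sample of n = 25 observations comes from a standard normal distribution, tested with the sample-mean z-statistic:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    n = 25  # sample size (assumed for illustration)

    # "Original" dataset, generated here under H0, and its test statistic.
    original = rng.normal(0.0, 1.0, n)
    z_orig = original.mean() * np.sqrt(n)
    p_value = norm.sf(z_orig)  # one-tailed P-value from the null distribution

    # Hypothetical repetitions: new datasets under H0, same statistic formula.
    z_new = rng.normal(0.0, 1.0, (100_000, n)).mean(axis=1) * np.sqrt(n)
    fraction_further_out = np.mean(z_new > z_orig)

    print(f"P-value: {p_value:.4f}, fraction further out: {fraction_further_out:.4f}")
    # The two numbers agree (up to simulation noise), as the text describes.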
The P-value is often incorrectly interpreted as the probability that the null hypothesis is
true. Try not to make this mistake. In a frequentist interpretation of probability, there is
nothing random about whether the hypothesis is true, the randomness is in the process
generating the data. One can interpret “the probability that the null hypothesis is true” using
subjective probability, a measure of one’s belief that the null hypothesis is true. One can
then calculate this subjective probability by specifying a prior probability (subjective belief
before looking at the data) that the null hypothesis is true, and then use the data and the
model to update one’s subjective probability. This is called the Bayesian approach because
Bayes’ Theorem is used to update subjective probabilities to reflect new information.
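The contrast with the P-value can be made concrete with a toy Bayesian calculation: a point null against a point alternative, with all numbers assumed for illustration. Bayes' Theorem converts a prior belief and the two likelihoods into a posterior probability that the null is true, a quantity the P-value does not provide:

    from scipy.stats import norm

    # Toy setup (all values assumed): observed z-statistic, point null
    # H0: mu = 0 versus point alternative H1: mu = 2.
    z_observed = 1.8
    prior_h0 = 0.5  # subjective prior belief that H0 is true

    like_h0 = norm.pdf(z_observed, loc=0.0)  # likelihood of the data under H0
    like_h1 = norm.pdf(z_observed, loc=2.0)  # likelihood under H1

    # Bayes' Theorem: posterior probability that H0 is true.
    posterior_h0 = (prior_h0 * like_h0) / (prior_h0 * like_h0
                                           + (1 - prior_h0) * like_h1)
    print(f"posterior P(H0 | data) = {posterior_h0:.3f}")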
When reporting a P-value to persons unfamiliar with statistics, it is often necessary to use
descriptive language to indicate the strength of the evidence. I tend to use the following
sort of language. Obviously the cut-offs are somewhat arbitrary and another person might
use different language.
P > 0.10          No evidence against the null hypothesis. The data appear to be consistent with the null hypothesis.
0.05 < P < 0.10   Weak evidence against the null hypothesis in favor of the alternative.
0.01 < P < 0.05   Moderate evidence against the null hypothesis in favor of the alternative.
0.001 < P < 0.01  Strong evidence against the null hypothesis in favor of the alternative.
P < 0.001         Very strong evidence against the null hypothesis in favor of the alternative.
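These cut-offs transcribe directly into a small helper function; this Python sketch is my own phrasing of the table above, not part of the original handout:

    def evidence_language(p: float) -> str:
        """Map a P-value to the descriptive language above (cut-offs arbitrary)."""
        if p > 0.10:
            return "no evidence against the null hypothesis"
        if p > 0.05:
            return "weak evidence against the null hypothesis"
        if p > 0.01:
            return "moderate evidence against the null hypothesis"
        if p > 0.001:
            return "strong evidence against the null hypothesis"
        return "very strong evidence against the null hypothesis"

    print(evidence_language(0.015))  # moderate evidence against the null hypothesis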
In using this kind of language, one should keep in mind the difference between statistical
significance and practical significance. In a large study one may obtain a small P-value
even though the magnitude of the effect being tested is too small to be of importance (see
the discussion of power below). It is a good idea to support a P-value with a confidence
interval for the parameter being tested.
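For example, here is a minimal sketch of a 95% Wald confidence interval for the difference in compliance proportions from the small-sample example in the first article above (counts again inferred from the quoted percentages, and the simple Wald interval chosen here only for illustration):

    from scipy.stats import norm

    # Counts inferred from the quoted percentages in the 24-vs-22-patient example.
    x1, n1 = 17, 24   # insole group: compliers / total
    x2, n2 = 10, 22   # brace group
    p1, p2 = x1 / n1, x2 / n2

    diff = p1 - p2
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    z = norm.ppf(0.975)  # 1.96 for a 95% interval

    lo, hi = diff - z * se, diff + z * se
    print(f"difference = {diff:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
    # The interval includes 0, consistent with the non-significant p = 0.081.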
A P-value can also be reported more formally in terms of a fixed level α test. Here α is a number selected independently of the data, usually 0.05 or 0.01, more rarely 0.10. We reject the null hypothesis at level α if the P-value is smaller than α, otherwise we fail to reject the null hypothesis at level α. I am not fond of this kind of language because it suggests a more definite, clear-cut answer than is often available. There is essentially no difference between a P-value of 0.051 and 0.049. In some situations it may be necessary to proceed with some course of action based on our belief in whether the null or alternative hypothesis is true. More often, it seems better to report the P-value as a measure of evidence.
A fixed level α test can be calculated without first calculating a P-value. This is done by comparing the test statistic with a critical value of the null distribution corresponding to the level α. This is usually the easiest approach when doing hand calculations and using statistical tables, which provide percentiles for a relatively small set of probabilities. Most statistical software produces P-values which can be compared directly with α. There is no need to repeat the calculation by hand.
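A sketch of the critical-value approach, reusing the chi-square statistic of about 3.05 computed in the first article's small-sample example (3.84 is the standard 95th percentile of the chi-square distribution with 1 degree of freedom):

    from scipy.stats import chi2

    alpha = 0.05
    test_statistic = 3.05  # chi-square statistic from the 24-vs-22-patient example

    # Critical value: the (1 - alpha) percentile of the null distribution.
    critical_value = chi2.ppf(1 - alpha, df=1)  # ~ 3.84

    if test_statistic > critical_value:
        print(f"reject H0 at level {alpha}")
    else:
        print(f"fail to reject H0 at level {alpha}")  # 3.05 < 3.84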
Fixed level α tests are needed for discussing the power of a test, a useful concept when planning a study. Suppose we are comparing a new medical treatment with a standard treatment, the control. The null hypothesis is that of no treatment effect (no difference between treatment and control). The alternative hypothesis is that the treatment effect (mean difference of treatment minus control using some outcome variable) is positive. We want to have a good chance of reporting a small P-value assuming the alternative hypothesis is true and the magnitude of the effect is large enough to be of practical importance. The power of a level α test is defined to be the probability that the null hypothesis will be rejected at level α (i.e., the P-value will be less than α) assuming the alternative hypothesis is true. The power generally depends on the variability of the data (lower variance, higher power), the sample size (higher n, higher power), and the magnitude of the effect (larger effect, higher power).
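A minimal simulation sketch of this definition, for a one-sided two-sample z-test of means; the effect size, standard deviation, and sample size here are all made-up illustration values:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    alpha = 0.05
    n = 30          # patients per arm (illustration value)
    effect = 0.5    # true treatment effect under the alternative (illustration)
    sigma = 1.0     # common standard deviation (illustration)

    n_sims = 20_000
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, sigma, n)
        treated = rng.normal(effect, sigma, n)
        # One-sided two-sample z-statistic for H1: treatment effect > 0.
        z = (treated.mean() - control.mean()) / np.sqrt(2 * sigma**2 / n)
        p = norm.sf(z)  # right-tail P-value under the null
        if p < alpha:
            rejections += 1

    print(f"estimated power: {rejections / n_sims:.3f}")
    # Increasing n or effect, or decreasing sigma, raises this estimate.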