On the χ² Distribution and Statistic
Recently I was helping my wife review some research papers in her field, physiotherapy, some of which involve a certain amount of statistical analysis, particularly the χ² statistic, which I later found is not a trivial topic. I realized I have forgotten most of what I learned about probability and statistics at university, so now I have to pick some of it up again. Fortunately, Wikipedia is always very handy for such needs.
First of all, what is the χ² distribution? (It is what the χ² statistic is based on.)
In short, the χ² distribution with k degrees of freedom (k being a positive integer) is the distribution of the sum of the squares of k independent standard normal random variables (random variables with the standard normal (Gaussian) distribution). When k is 1, it reduces to the distribution of the square of a single standard normal random variable.
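To make the definition concrete, here is a small simulation sketch. The function names and the number of draws are my own choices for illustration. It builds χ² samples directly as sums of squared standard normals and checks that their average is close to k, which is the known mean of a χ² distribution with k degrees of freedom.

```python
import random

def chi2_sample(k, rng):
    """Draw one value: the sum of squares of k independent standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))

def simulated_mean(k, n_draws=20000, seed=0):
    """Average many simulated draws; n_draws and seed are arbitrary choices."""
    rng = random.Random(seed)
    return sum(chi2_sample(k, rng) for _ in range(n_draws)) / n_draws

if __name__ == "__main__":
    for k in (1, 2, 5):
        # The simulated mean should land near k for each k.
        print(k, round(simulated_mean(k), 2))
```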
Its probability density function and cumulative distribution function are both given in the Wikipedia article about it.
However, what is interesting is its use as a mathematical tool in statistical tests.
Consider the following scenario.
Suppose the hypothesis is that, in a certain area, the ratio of the number of men to that of women is 1.1:1. We can use a tool built on the χ² statistic to test whether observed data are consistent with this 'theory' or give us grounds to reject it.
To answer this question, the key is to construct a formula of a similar form to the definition of χ², in which the differences between observed and expected counts play the role of the individual random variables.
At the same time, we draw a sample of people from that area, of a size that makes the expected frequencies easy to work with, such as 105, since a perfect match with the theoretical ratio would then be 55 men and 50 women.
The formula mentioned above is defined as follows (note that this test variable is also called χ², as this is a χ² test):
χ² = (Number of Men from the Sample - 55)^2 / 55 + (Number of Women from the Sample - 50)^2 / 50, provided the size of the sample is 105.
Each of the two terms in the sum above resembles the square of a normalized deviation if the statement is true. However, the two terms are completely dependent rather than independent: once one of the counts is known, the other is determined.
So if we end up with 59 men and 46 women in the sample, we get χ² = 16/55 + 16/50 ≈ 0.61. Looking this up in the cdf of the χ² distribution with 1 degree of freedom, we find that the probability of χ² exceeding 0.61 is around 0.4, which is far above the conventional thresholds for statistical significance (commonly 0.05, and as strict as 0.001). This probability is usually denoted by p in the literature (the p-value). So normally we would not reject the null hypothesis.
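The arithmetic above can be reproduced with a short sketch using only the Python standard library. The helper names are mine, not from any library. For 1 degree of freedom, the upper-tail probability can be written with the complementary error function, because P(χ² > x) = P(|Z| > √x) for a standard normal Z. This shortcut only holds for 1 degree of freedom.

```python
import math

def chi2_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df1(x):
    """Upper-tail probability for chi-squared with 1 degree of freedom.

    Uses P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))

# Observed: 59 men, 46 women; expected under the 1.1:1 hypothesis: 55 and 50.
stat = chi2_statistic([59, 46], [55, 50])
p = p_value_df1(stat)
print(round(stat, 2), round(p, 2))
```

The printed values should be close to the 0.61 and roughly 0.4 quoted in the text.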
Hmm, the above interpretation does not seem to make much sense to me yet (especially treating this as a case with 1 degree of freedom when there are actually two terms involved); however, that is what I understand from the Wikipedia articles. I will review and correct this after further study of the subject.
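One observation that may help with the degrees-of-freedom question: because the two counts must add up to the sample size, the two-term sum collapses algebraically into a single squared term. The sketch below is a numerical check rather than a proof, and it assumes the number of men in the sample is modeled as a binomial count with n = 105 and success probability 55/105. That modeling choice and the function names are mine. Under this assumption, the two-term χ² equals the square of one standardized deviation, which is consistent with a single degree of freedom.

```python
import math

def two_term_chi2(men, n=105, p_men=55 / 105):
    """The two-term statistic from the text, with women = n - men."""
    exp_men = n * p_men
    exp_women = n * (1 - p_men)
    women = n - men
    return (men - exp_men) ** 2 / exp_men + (women - exp_women) ** 2 / exp_women

def squared_z(men, n=105, p_men=55 / 105):
    """Square of the standardized count, assuming a binomial model (my assumption)."""
    variance = n * p_men * (1 - p_men)  # binomial variance n * p * (1 - p)
    return (men - n * p_men) ** 2 / variance

for men in (50, 55, 59, 70):
    # Both expressions agree up to floating-point rounding.
    assert math.isclose(two_term_chi2(men), squared_z(men), rel_tol=1e-9, abs_tol=1e-12)
```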