Sample Size to Estimate Population Parameters

1 Central Limit Theorem and Confidence Interval

1.1 Central Limit Theorem

Theorem (Central Limit Theorem). Assumptions:

Let \({X_1}, \cdots, X_n\) be a random sample of size \(n\) drawn with replacement from a population with mean \({\mu}\) and finite variance \({\sigma^2}\). (No assumption is made about the population distribution.)
The population variance \({\sigma^2}\) is known.
Let \({\bar X}\) be the sample mean.

Then, as \(n \rightarrow \infty\), the distribution of the statistic \(Z\)

\[Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal N(0,1) \quad \text{as } n \rightarrow \infty \]

converges to the standard normal distribution. For large \(n\), \(Z\) can therefore be treated as a standard normal random variable.
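The theorem can be checked numerically. The sketch below is illustrative only: the Exponential(1) population, the sample size of 50, and the helper name `simulate_z` are assumptions chosen for the demonstration, not part of the original text. It standardizes many sample means and reports how often they fall inside \((-1.96, 1.96)\), which should be close to 0.95 if \(Z\) is approximately standard normal.

```python
import math
import random
import statistics


def simulate_z(n, reps, seed=0):
    """Draw `reps` samples of size n from an Exponential(1) population
    (mean 1, variance 1) and return the standardized sample means."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # known mean and standard deviation of Exponential(1)
    z_values = []
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        z_values.append((x_bar - mu) / (sigma / math.sqrt(n)))
    return z_values


z_values = simulate_z(n=50, reps=2000)
# Share of standardized means inside (-1.96, 1.96); near 0.95 if Z is roughly N(0, 1)
coverage = sum(abs(z) < 1.96 for z in z_values) / len(z_values)
print(f"share of |Z| < 1.96: {coverage:.3f}")
```

The population here is strongly skewed, so agreement with the normal value is only approximate at moderate \(n\) and improves as \(n\) grows.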

1.2 Estimating CI of population mean \(\mu\) with known population variance \(\sigma^2\)

When the population variance \(\sigma^2\) is known, the confidence interval (CI) of the population mean \(\mu\) is estimated as:

\[\mu = \bar X \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \]

where \({\alpha}\) represents the level of significance and \({1-\alpha}\) is called the confidence level. \({z_{\alpha/2}}\) is the value of \({Z}\) such that the area in each tail under the standard normal curve is \({\alpha/2}\).
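As a minimal sketch of this interval, the function below computes \(\bar X \pm z_{\alpha/2}\,\sigma/\sqrt{n}\) using `statistics.NormalDist` from the Python standard library. The function name and the input values (\(\bar x = 33.1\), \(\sigma = 5\), \(n = 100\)) are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n, alpha=0.05):
    """Two-sided CI for the mean when the population sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}, about 1.96 for alpha = 0.05
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


lower, upper = z_confidence_interval(x_bar=33.1, sigma=5.0, n=100)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")  # prints 95% CI: (32.12, 34.08)
```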

1.3 \(t\)-test

Theorem. Assumptions:

Let \({X_1, X_2, ..., X_n}\) be independent and identically distributed (i.i.d.) random variables following a normal distribution with mean \({\mu}\) and variance \({\sigma^2}\).
The variance \({\sigma^2}\) of the normal distribution is unknown.
Let \({\bar X}\) be the sample mean.
This result is typically used for small samples (\(n < 40\)). When \(n \geq 40\), \(T\) can be treated as approximately standard normal.

Then, the statistic \({T}\)

\[T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1) \]

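follows a \(t\)-distribution with \({n-1}\) degrees of freedom, where \(S\) is the sample standard deviation. Under the normality assumption this holds for every \(n \geq 2\), not only asymptotically.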

1.4 Estimating CI of population mean \(\mu\) with unknown population variance

The confidence interval (CI) of the population mean \(\mu\) is estimated as:

\[\mu = \bar X \pm t_{\alpha/2} \frac{S}{\sqrt{n}} \]

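where \(t_{\alpha/2}\) is the upper \(\alpha/2\) critical value of the \(t\)-distribution with \(n-1\) degrees of freedom, and \(S\) is the sample standard deviation, calculated as: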

\[S = \sqrt{ \dfrac{1}{(n-1)}\sum_{i=1}^{n}(X_i-\bar X)^2} \]

When \(n \geq 40\), the standard normal distribution is commonly used to approximate the \(t\)-distribution, that is:

\[\mu = \bar X \pm z_{\alpha/2} \frac{S}{\sqrt{n}} \]
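The small-sample interval can be sketched as follows. The Python standard library has no \(t\) quantile function, so the critical value is passed in from a \(t\)-table (here \(t_{0.025} = 2.776\) for 4 degrees of freedom). The function name and the five data points are hypothetical.

```python
import math
import statistics


def t_confidence_interval(sample, t_crit):
    """Two-sided CI for the mean with unknown variance.

    t_crit is t_{alpha/2} with len(sample) - 1 degrees of freedom,
    looked up from a t-table by the caller.
    """
    n = len(sample)
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    half_width = t_crit * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


sample = [10.0, 12.0, 9.0, 11.0, 13.0]  # hypothetical measurements
T_CRIT_95_DF4 = 2.776                   # table value: alpha = 0.05, 4 degrees of freedom
lower, upper = t_confidence_interval(sample, T_CRIT_95_DF4)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

For \(n \geq 40\), replacing `t_crit` with \(z_{\alpha/2} \approx 1.96\) gives the normal approximation described above.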

2 Sample Size Determination

2.1 Theory

Estimation: making an inference about a population parameter (for example, \(\mu\)) based on information contained in a sample (for example, \(\bar{x}\)).
The error of estimation is the distance between the estimate and the true (unknown) population value. When this error is expressed in absolute units, it is often called the tolerance (\(\pm\) a given value); when it is expressed relative to the estimate, it is called the precision.
Precision vs. tolerance:
- Precision: \(33.1 \pm 10\%\)
- Tolerance: \(33.1 \pm 3.3\)
Example: \(\pm 10 \%\) precision at a \(95 \%\) confidence level

The interval estimate of the population mean \(\mu\) from a sample of size \(n\) can be written as:

\[\mu = \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} = \bar{x} \pm TO \]

where \(TO\) represents the tolerance:

\[TO = z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \]

Solving for \(n\) gives the general sample size formula:

\[n = \left(\frac{ z_{\alpha/2} \sigma}{TO} \right)^2 = \left(\frac{ z_{\alpha/2} S}{TO} \right)^2 \]

where the population standard deviation \(\sigma\) is usually substituted by the sample standard deviation \(S\) when \(n \geq 40\).
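A minimal sketch of the sample size formula is given below. The function name and the inputs (\(\sigma = 10\), \(TO = 2\)) are hypothetical. The result is rounded up, since a fractional sample is not possible and rounding down would exceed the tolerance.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, tolerance, alpha=0.05):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= tolerance."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * sigma / tolerance) ** 2)


n_required = sample_size_for_mean(sigma=10.0, tolerance=2.0)
print(n_required)  # prints 97
```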

2.2 Example

Sample size for population mean

For a normally distributed population, specifying a 10% estimation error of the population mean \(\mu\) (i.e., \(\mu \pm 0.1 \mu\)) with a 95% confidence level, the sample size can be calculated as:

\[n = \left(\frac{ z_{\alpha/2} \sigma}{TO} \right)^2 = \left(\frac{ z_{0.05/2} \sigma}{0.1 \mu} \right)^2 \approx 384 \, CV^2 \]

where \(z_{0.05/2} = 1.96\), and \(CV = \sigma / \mu\) is the coefficient of variation.
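As a worked illustration, assume a hypothetical coefficient of variation \(CV = 0.5\) (this value is not from the original text). Then:

\[n = \left(\frac{1.96}{0.1}\right)^2 CV^2 = 384.16 \times 0.5^2 = 96.04 \]

Rounding up, about 97 observations would be needed. Because \(n\) scales with \(CV^2\), doubling the coefficient of variation quadruples the required sample size.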

Sample size for origin-destination (O-D) based survey

Population: Suppose there are \(k\) possible O-D relationships \(B_1, B_2, \cdots, B_k\) with corresponding probabilities \(p_1, p_2, \cdots, p_k\) such that \(\sum_{i=1}^{k} p_i = 1\).

Sample: In \(n\) trials, \(B_1\) was observed \(x_1\) times, \(B_2\) was observed \(x_2\) times, ..., \(B_k\) was observed \(x_k\) times where \(\sum_{i=1}^{k} x_i = n\).

Any \(x_i\) can be considered a binomial random variable: \(x_i \sim \text{Bin}(n, p_i)\).
The binomial distribution \(\text{Bin}(n, p)\) can be approximated by a normal distribution, provided that (i) \(n\) is sufficiently large and (ii) \(np(1 - p) \geq 9\).

Thus, for any \(x_i\) that satisfies the two conditions, we have:

\[x_i \sim \mathcal{N} \left(\mu = n p_i, \sigma^2 = n p_i (1 - p_i) \right) \]
\(\hat{p}_i = x_i/n\) is the sample estimate of the probability \(p_i\):

\[\hat{p}_i = \frac{x_i}{n}, \quad \text{and} \quad \hat{p}_i = \frac{x_i}{n} \sim \mathcal{N} \left(p_i, \frac{p_i (1 - p_i)}{n} \right), \qquad i=1,2, \ldots, k \]

The CI of \(p_i\) is estimated as:

\[p_i = \hat{p}_i \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_i(1-\hat{p}_i)}{n}} = \hat{p}_i \pm TO_i \]

Solving the tolerance expression for \(n\), the sample size can be calculated as:

\[n = z_{\alpha / 2}^{2} \frac{\hat{p}_{i}\left(1-\hat{p}_{i}\right)}{T O_{i}^{2}}, \qquad i=1,2, \ldots, k \]

And the precision can be calculated as:

\[\text{Precision} = \frac{T O_{i}}{\hat{p}_{i}}=z_{\alpha / 2} \sqrt{\frac{1-\hat{p}_{i}}{n \hat{p}_{i}}}, \qquad i=1,2, \ldots, k \]
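The two formulas above can be sketched as follows. The function names, the O-D shares, and the tolerance of 0.05 are hypothetical. Taking the largest \(n\) over all O-D pairs is one possible design choice (it makes every pair meet the tolerance) and is an assumption of this sketch, not something stated in the source.

```python
import math
from statistics import NormalDist


def od_sample_size(p_hat, tolerance, alpha=0.05):
    """n = z^2 * p(1 - p) / TO^2, rounded up."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(z ** 2 * p_hat * (1 - p_hat) / tolerance ** 2)


def od_precision(p_hat, n, alpha=0.05):
    """Relative error TO / p_hat = z * sqrt((1 - p) / (n * p))."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * math.sqrt((1 - p_hat) / (n * p_hat))


shares = [0.5, 0.3, 0.2]  # hypothetical O-D probabilities, summing to 1
sizes = [od_sample_size(p, tolerance=0.05) for p in shares]
n_survey = max(sizes)     # assumption: one survey must satisfy every O-D pair

for p in shares:
    # Normal-approximation condition from the text: n * p * (1 - p) >= 9
    ok = n_survey * p * (1 - p) >= 9
    print(f"p={p}: precision={od_precision(p, n_survey):.3f}, normal approx ok={ok}")
```

For a fixed \(n\), the precision worsens as \(\hat{p}_i\) decreases, so rarely observed O-D pairs are estimated with a larger relative error.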

Sample size for Roadside Interview Surveys

[Example 3.8 of Ref. [2], p. 84] Assume that at a control point beside a road:

\(N\) is the total number of cars crossing the control point;
\(N_1\) is the number of those \(N\) cars travelling between the origin–destination pair O–D\(_1\);
\(n\) is the size of the sample to be surveyed;
\(X_1\) is the number of cars in the sample of \(n\) that travel between O–D\(_1\).

\(X_1\) has a hypergeometric distribution \(H(N, N_1, n)\), and its expected value and variance are given by:

\[\mathbb{E}[X_1]=np, \quad \text{Var}[X_1]=np(1-p)(1-n/N); \quad \text{where} \ p = N_1/N \]

(Strictly, the exact variance is \(\text{Var}[X_1]=np(1-p)[1-(n-1)/(N-1)]\); see Ref. [4]. The expression above is a close approximation when \(N\) is large.)
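The difference between the two variance expressions can be checked by direct enumeration. The sketch below computes the variance from the hypergeometric probability mass function and compares it with both formulas. The helper names and the values \(N = 50\), \(N_1 = 20\), \(n = 10\) are hypothetical, and the population is kept small on purpose so that the gap is visible.

```python
from math import comb


def hypergeom_pmf(x, N, N1, n):
    """P(X1 = x) when n cars are drawn without replacement from N, of which N1 are O-D_1."""
    return comb(N1, x) * comb(N - N1, n - x) / comb(N, n)


def hypergeom_variance(N, N1, n):
    """Variance of X1 computed by summing over its support."""
    support = range(max(0, n - (N - N1)), min(n, N1) + 1)
    mean = sum(x * hypergeom_pmf(x, N, N1, n) for x in support)
    return sum((x - mean) ** 2 * hypergeom_pmf(x, N, N1, n) for x in support)


N, N1, n = 50, 20, 10
p = N1 / N
var_enumerated = hypergeom_variance(N, N1, n)
var_exact = n * p * (1 - p) * (1 - (n - 1) / (N - 1))  # exact formula
var_simplified = n * p * (1 - p) * (1 - n / N)         # simplified formula
print(f"{var_enumerated:.4f} {var_exact:.4f} {var_simplified:.4f}")
```

The two formulas differ by the factor \(N/(N-1)\), so the gap becomes negligible for realistic traffic volumes.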

Using a normal approximation (based on the central limit theorem) the distribution of \(X_1\) is:

\[X_1 \sim \mathcal{N}\big(np, np(1-p)(1-n/N)\big) \]

and an estimator for \(p\) is:

\[\hat{p} = X_1 /n \]

Therefore:

\[\hat{p} \sim \mathcal{N}\left(p, \frac{p(1-p)(1-n/N)}{n}\right) \]

and an approximate \(100(1-\alpha)\%\) confidence interval for \(p\) is given by:

\[p \in \hat{p} \pm z_{\alpha/2} \sqrt{\frac{p(1-p)(1-n/N)}{n}} \]

Typically, the tolerance \(TO\) associated with \(\hat{p}\) is required not to exceed a prespecified absolute error \(e\) (usually 0.1):

\[TO = z_{\alpha/2} \sqrt{\frac{p(1-p)(1-n/N)}{n}} \leq e \]

Hence, we have:

\[n \geq \dfrac{p(1-p)}{ \left( \dfrac{e}{z_{\alpha/2}} \right)^2 + \dfrac{p(1-p)}{N}} \]
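The intermediate steps behind this bound, assuming \(0 < p < 1\), can be written out as:

\[z_{\alpha/2}^2 \, \frac{p(1-p)(1-n/N)}{n} \leq e^2 \quad \Rightarrow \quad p(1-p)\left(\frac{1}{n} - \frac{1}{N}\right) \leq \left(\frac{e}{z_{\alpha/2}}\right)^2 \quad \Rightarrow \quad \frac{1}{n} \leq \frac{1}{p(1-p)}\left(\frac{e}{z_{\alpha/2}}\right)^2 + \frac{1}{N} \]

Taking reciprocals of both (positive) sides reverses the inequality, and multiplying the numerator and denominator by \(p(1-p)\) gives the stated lower bound on \(n\).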

It can be seen that, for a given \(N\), \(e\) and \(z_{\alpha/2}\), the value \(p = 0.5\) yields the highest (i.e. most conservative) value for \(n\):

\[p^{*} = \arg\max _{0 \leq p \leq 1} \dfrac{p(1-p)}{ \left( \dfrac{e}{z_{\alpha/2}} \right)^2 + \dfrac{p(1-p)}{N}} = \frac{1}{2} \]

and the most conservative \(n\) is:

\[n^* = \dfrac{0.25}{ \left( \dfrac{e}{z_{\alpha/2}} \right)^2 + \dfrac{0.25}{N}} \]
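A minimal sketch of this calculation follows. The function name and the traffic volume \(N = 1000\) are hypothetical; \(e = 0.1\) and the 95% confidence level follow the values used in the text. With \(p\) left at its default of 0.5, the function returns the conservative \(n^*\).

```python
import math
from statistics import NormalDist


def roadside_sample_size(N, e=0.1, alpha=0.05, p=0.5):
    """n >= p(1-p) / ((e/z)^2 + p(1-p)/N), rounded up to a whole number of cars."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    pq = p * (1 - p)
    return math.ceil(pq / ((e / z) ** 2 + pq / N))


n_star = roadside_sample_size(N=1000)                  # conservative case, p = 0.5
n_if_p_known = roadside_sample_size(N=1000, p=0.3)     # smaller if p is known to be 0.3
n_large_flow = roadside_sample_size(N=10**9)           # finite-population effect vanishes
print(n_star, n_if_p_known, n_large_flow)  # prints 88 75 97
```

As \(N\) grows, the term \(p(1-p)/N\) vanishes and the result approaches the infinite-population value \(0.25\,(z_{\alpha/2}/e)^2\).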

References

[1] C. Heumann, M. Schomaker and Shalabh, "10.3 Parametric Tests for Location Parameters", in Introduction to Statistics and Data Analysis. Cham: Springer International Publishing, 2016.

[2] J. de D. Ortúzar S. and L. G. Willumsen, "3.1.1.2 Sample Size to Estimate Population Parameters", in Modelling Transport, Fourth edition. Chichester, West Sussex, United Kingdom: John Wiley & Sons, 2011, pp. 57-59.

[3] A. Ceder, "2.4 Basic statistical tools", in Public transit planning and operation: theory, modelling and practice, 1st ed. Amsterdam, Heidelberg: Butterworth-Heinemann, 2007, pp. 30-37.

[4] "Hypergeometric distribution", Wikipedia.

posted @ 2022-02-14 12:50 veager