1. Overview and Descriptive Statistics

1. Populations, Samples, and Processes

An investigation will typically focus on a well-defined collection of objects constituting a population of interest.

When desired information is available for all objects in the population, we have what is called a census.

A subset of the population -- a sample -- is selected in some prescribed manner.

A variable is any characteristic whose value may change from one object to another in the population.

A univariate data set consists of observations on a single variable.
Bivariate data arises when observations are made on each of two variables.
Multivariate data arises when observations are made on more than two variables.

Branches of Statistics

An investigator who has collected data may wish simply to summarize and describe important features of the data. This entails using methods from descriptive statistics.
Techniques for generalizing from a sample to a population are gathered within the branch of our discipline called inferential statistics.

Enumerative Versus Analytic Studies

In enumerative studies, interest is focused on a finite, identifiable, unchanging collection of individuals or objects that make up a population.
Analytic studies are often carried out with the objective of improving a future product by taking action on a process of some sort.

Collecting Data

2. Pictorial and Tabular Methods in Descriptive Statistics

Notation

The number of observations in a single sample will often be denoted by n.

Given a data set consisting of n observations on some variable x, the individual observations will be denoted by x₁, x₂, x₃, ..., x_n.

Stem-and-Leaf Displays

Steps for Constructing a Stem-and-Leaf Display

Select one or more leading digits for the stem values. The trailing digits become the leaves.
List possible stem values in a vertical column.
Record the leaf for every observation beside the corresponding stem value.
Indicate the units for stems and leaves someplace in the display.
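
The four steps above can be sketched in Python. This is only an illustration under simplifying assumptions: the data are non-negative integers, the stem is everything but the last digit, and the leaf is the ones digit. The function name `stem_and_leaf` and the sample values are made up for the example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display.

    Assumes non-negative integers: stem = leading digits, leaf = ones digit.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)      # step 1: split into stem and leaf
        leaves[stem].append(leaf)

    lines = []
    # Step 2: list every possible stem, even one with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        # Step 3: record the leaves beside the stem.
        row = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {row}")
    # Step 4: state the units.
    lines.append("Stem: tens digit  Leaf: ones digit")
    return lines

# Hypothetical data, for illustration only.
for line in stem_and_leaf([12, 15, 21, 24, 24, 37, 40, 48]):
    print(line)
# 1 | 25
# 2 | 144
# 3 | 7
# 4 | 08
# Stem: tens digit  Leaf: ones digit
```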

Dotplots

A dotplot is an attractive summary of numerical data when the data set is reasonably small or there are relatively few distinct data values. Each observation is represented by a dot above the corresponding location on a horizontal measurement scale.

Histograms

A variable is discrete if its set of possible values either is finite or else can be listed in an infinite sequence.
A variable is continuous if its possible values consist of an entire interval on the number line.

Consider data consisting of observations on a discrete variable x.

The frequency of any particular x value is the number of times that value occurs in the data set.
The relative frequency of a value is the fraction or proportion of times the value occurs.
A frequency distribution is a tabulation of the frequencies and/or relative frequencies.
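
As a small illustration of these three definitions, the sketch below tabulates frequencies and relative frequencies for a discrete variable. The function name `frequency_distribution` and the data are hypothetical.

```python
from collections import Counter

def frequency_distribution(data):
    """Map each distinct value to (frequency, relative frequency)."""
    counts = Counter(data)
    n = len(data)
    return {x: (counts[x], counts[x] / n) for x in sorted(counts)}

# Hypothetical observations on a discrete variable x.
table = frequency_distribution([0, 1, 1, 2, 2, 2, 3, 1])
for x, (freq, rel) in table.items():
    print(x, freq, rel)
# 0 1 0.125
# 1 3 0.375
# 2 3 0.375
# 3 1 0.125
```

The relative frequencies always sum to 1 (up to floating-point rounding), which is a quick check on the table.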

Histogram Shapes

Histograms come in a variety of shapes.

A unimodal histogram is one that rises to a single peak and then declines.
A bimodal histogram has two different peaks.
A multimodal histogram has more than two peaks.

A histogram is symmetric if the left half is the mirror image of the right half.
A unimodal histogram is positively skewed if the right or upper tail is stretched out compared with the left or lower tail, and negatively skewed if the stretching is to the left.

Qualitative Data

Multivariate Data

3. Measures of Location

The Mean

For a given set of numbers x₁, x₂, x₃, ..., x_n, the most familiar and useful measure of the center is the mean, or arithmetic average, of the set.

The Median

The word median is synonymous with "middle", and the sample median is indeed the middle value when the observations are ordered from smallest to largest.
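
A minimal sketch of both measures follows. The notes only say "middle value" for the median; averaging the two middle values when n is even is the usual convention and is assumed here. Function names and data are made up for the example.

```python
def sample_mean(xs):
    """Arithmetic average of the observations."""
    return sum(xs) / len(xs)

def sample_median(xs):
    """Middle value of the ordered observations.

    Assumption: for even n, average the two middle values.
    """
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical sample with one very large observation.
data = [2, 4, 4, 5, 100]
print(sample_mean(data))    # 23.0
print(sample_median(data))  # 4
```

The single large value pulls the mean far to the right while the median stays put, which is why the two are often reported together.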

Other Measures of Location: Quartiles, Percentiles, and Trimmed Means

A trimmed mean is a compromise between the mean and the median. A 10% trimmed mean, for example, would be computed by eliminating the smallest 10% and the largest 10% of the sample and then averaging what is left over.
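
The trimming rule can be sketched as below. One assumption to note: when the trimming proportion times n is not a whole number, this version simply rounds down the number of observations dropped from each end, which is a simplification rather than the only possible convention. The name `trimmed_mean` and the data are hypothetical.

```python
def trimmed_mean(xs, proportion):
    """Average after dropping the given proportion from each end."""
    ordered = sorted(xs)
    n = len(ordered)
    # Number to drop from each end, rounded down (assumption).
    # The small offset guards against float error such as 0.29 * 100.
    k = int(proportion * n + 1e-9)
    kept = ordered[k:n - k]
    if not kept:
        raise ValueError("trimming proportion removes every observation")
    return sum(kept) / len(kept)

# Hypothetical sample of n = 10 with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(trimmed_mean(data, 0.10))  # 5.5  (drops 1 and 100)
print(trimmed_mean(data, 0.0))   # 14.5 (ordinary mean)
```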

Categorical Data and Sample Proportions

4. Measures of Variability

Measures of Variability for Sample Data

The simplest measure of variability in a sample is the range, which is the difference between the largest and smallest sample values.

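The sample variance, denoted by s², is given by s² = Σ(x_i − x̄)² / (n − 1), where x̄ is the sample mean.

The sample standard deviation, denoted by s, is the positive square root of the variance: s = √s².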
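
The sketch below computes the three measures, using the usual definition of the sample variance as the sum of squared deviations from the sample mean divided by n − 1. Function names and data are made up for the example.

```python
import math

def sample_range(xs):
    """Largest minus smallest sample value."""
    return max(xs) - min(xs)

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

# Hypothetical sample: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_range(data))      # 7
print(sample_variance(data))   # 32/7, about 4.571
print(sample_std(data))        # about 2.138
```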

Motivation for s²

We will use σ² to denote the population variance and σ to denote the population standard deviation.

It is customary to refer to s² as being based on n − 1 degrees of freedom (df).

This terminology results from the fact that although s² is based on the n quantities x₁ − x̄, x₂ − x̄, ..., x_n − x̄, these sum to 0, so specifying the values of any n − 1 of the quantities determines the remaining value. For example, if n = 4 and x₁ − x̄ = 8, x₂ − x̄ = −6, and x₄ − x̄ = −4, then automatically x₃ − x̄ = 2, so only 3 of the 4 values of x_i − x̄ are freely determined (3 df).
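
The n = 4 example can be checked numerically. The sample below is hypothetical, chosen only so that its deviations from the mean match the numbers in the example; the function names are likewise made up.

```python
def deviations(xs):
    """Deviations of each observation from the sample mean."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]

def remaining_deviation(known):
    """Given n - 1 deviations, return the one forced by the sum-to-zero rule."""
    return -sum(known)

# Hypothetical sample with mean 10.
sample = [18, 4, 12, 6]
print(deviations(sample))               # [8.0, -6.0, 2.0, -4.0]
print(sum(deviations(sample)))          # 0.0
print(remaining_deviation([8, -6, -4])) # 2
```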

A Computing Formula for s²

Boxplots

After the n observations in a data set are ordered from smallest to largest, the lower fourth and upper fourth are given by:

lower fourth:

median of the smallest n/2 observations, n even
median of the smallest (n+1)/2 observations, n odd

upper fourth:

median of the largest n/2 observations, n even
median of the largest (n+1)/2 observations, n odd

That is, the lower (upper) fourth is the median of the smallest (largest) half of the data, where the median is included in both halves if n is odd. A measure of spread that is resistant to outliers is the fourth spread ƒ_s, given by:

ƒ_s = upper fourth - lower fourth
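
A short sketch of the fourths and the fourth spread, following the rule above. It relies on `statistics.median` from the Python standard library; the function names `fourths` and `fourth_spread` and the data are made up for the example.

```python
from statistics import median

def fourths(xs):
    """Return (lower fourth, upper fourth).

    Each half has n/2 observations when n is even and (n+1)/2 when n
    is odd, so the median belongs to both halves for odd n.
    """
    ordered = sorted(xs)
    n = len(ordered)
    half = (n + 1) // 2          # equals n/2 when n is even
    lower = median(ordered[:half])
    upper = median(ordered[n - half:])
    return lower, upper

def fourth_spread(xs):
    """Upper fourth minus lower fourth."""
    lower, upper = fourths(xs)
    return upper - lower

# Hypothetical data: n odd, so the median 6 is in both halves.
print(fourths([1, 3, 4, 6, 8, 9, 12]))        # (3.5, 8.5)
print(fourth_spread([1, 3, 4, 6, 8, 9, 12]))  # 5.0
# n even: halves are [1, 3, 4] and [6, 8, 9].
print(fourths([1, 3, 4, 6, 8, 9]))            # (3, 8)
```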

Boxplots that Show Outliers

Any observation farther than 1.5ƒ_s from the closest fourth is an outlier. An outlier is extreme if it is more than 3ƒ_s from the nearest fourth, and it is mild otherwise.
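
The outlier rule can be sketched as follows. The fourths are computed as the medians of the lower and upper halves of the ordered data, with the median counted in both halves when n is odd. The function name `classify_outliers` and the data are hypothetical.

```python
from statistics import median

def classify_outliers(xs):
    """Return a list of (value, 'mild' or 'extreme') for each outlier."""
    ordered = sorted(xs)
    n = len(ordered)
    half = (n + 1) // 2
    lower = median(ordered[:half])       # lower fourth
    upper = median(ordered[n - half:])   # upper fourth
    fs = upper - lower                   # fourth spread

    result = []
    for x in ordered:
        # Distance beyond the closest fourth; 0 for values between the fourths.
        distance = max(lower - x, x - upper, 0)
        if distance > 3 * fs:
            result.append((x, "extreme"))
        elif distance > 1.5 * fs:
            result.append((x, "mild"))
    return result

# Hypothetical data: fourths are 5.5 and 10.5, so fs = 5.
data = [2, 4, 5, 6, 7, 8, 9, 10, 11, 19, 30]
print(classify_outliers(data))  # [(19, 'mild'), (30, 'extreme')]
```

Here 19 lies 8.5 above the upper fourth (more than 7.5 but not more than 15), while 30 lies 19.5 above it.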

Comparative Boxplots

A comparative or side-by-side boxplot is a very effective way of revealing similarities and differences between two or more data sets consisting of observations on the same variable.

posted @ 2017-05-03 14:56 cyoutetsu
