R: Graphical Summaries of Grouped Data

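Grouped data can be regarded as a special case of multi-group data. The difference is this: in multi-group data, the observations of each numeric variable refer to different objects, whereas grouped data consist of the observations of a single numeric variable split into several subsets by another variable, so all subsets refer to the same variable.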

Below, the cuckoos data set from the DAAG package is used to look at graphical methods designed for grouped data.

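Cuckoos lay their eggs in the nests of other bird species, and those birds hatch the eggs for them. We want to know the length of cuckoo eggs found in the nests of different host species. The data are as follows (only a few rows per species are shown):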

> library(DAAG)
> data(cuckoos)
> attach(cuckoos)
The following object(s) are masked from 'cuckoos (position 4)':

    breadth, id, length, species
> cuckoos
    length breadth       species id
1     21.7    16.1 meadow.pipit 21
2     22.6    17.0 meadow.pipit 22
46    22.7    16.3    tree.pipit 66
47    23.3    16.6    tree.pipit 67
61    22.0    17.0 hedge.sparrow 82
62    23.9    16.9 hedge.sparrow 83
75    21.8    16.0         robin 96
76    23.0    15.9         robin 97
91    23.0    16.3 pied.wagtail 198
92    23.4    16.7 pied.wagtail 199
106   19.8    15.0          wren 224
107   22.1    16.0          wren 225
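
To make the idea of grouped data concrete, the sketch below re-enters the twelve rows printed above into a small data frame and splits the single numeric variable `length` by the grouping variable `species`. The name `eggs` is made up here so that nothing clashes with the attached `cuckoos`; the numbers are only the printed excerpt, not the full data set.

```r
# Twelve rows copied by hand from the listing above; `eggs` is an illustrative name.
eggs <- data.frame(
  length  = c(21.7, 22.6, 22.7, 23.3, 22.0, 23.9,
              21.8, 23.0, 23.0, 23.4, 19.8, 22.1),
  species = factor(rep(c("meadow.pipit", "tree.pipit", "hedge.sparrow",
                         "robin", "pied.wagtail", "wren"), each = 2))
)

# One numeric variable, split into subsets by a second (grouping) variable
by_species <- split(eggs$length, eggs$species)
str(by_species)

# Group sizes and group means of that same variable
group_n    <- table(eggs$species)
group_mean <- tapply(eggs$length, eggs$species, mean)
print(group_mean)
```

Every plot in the sections that follow is a picture of this same structure: one vector of lengths, partitioned by species.
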
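(1) Use a conditioning scatter plot

When a data set contains one or more factor variables, the conditioning plot function coplot() can draw a separate scatter plot for each level of the factor.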

> coplot(length ~ breadth | species)
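
Calling `attach()` on the same data frame more than once is what produces the "masked" messages in the transcript. A possible alternative is to pass the data frame through the `data` argument of `coplot`, which is listed in the help excerpt at the end of this post. The sketch below assumes only the twelve printed rows, typed into a made-up data frame `eggs`; with the full data one would pass `data = cuckoos` instead.

```r
# Twelve rows copied by hand from the printed listing; `eggs` is an illustrative name.
eggs <- data.frame(
  length  = c(21.7, 22.6, 22.7, 23.3, 22.0, 23.9,
              21.8, 23.0, 23.0, 23.4, 19.8, 22.1),
  breadth = c(16.1, 17.0, 16.3, 16.6, 17.0, 16.9,
              16.0, 15.9, 16.3, 16.7, 15.0, 16.0),
  species = factor(rep(c("meadow.pipit", "tree.pipit", "hedge.sparrow",
                         "robin", "pied.wagtail", "wren"), each = 2))
)

# No attach(): the variables in the formula are looked up in `data`.
# One panel of length against breadth is drawn for each level of `species`.
coplot(length ~ breadth | species, data = eggs)
```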

(2) A simple but tedious approach is to call hist repeatedly

> library(DAAG)
> data(cuckoos)
> attach(cuckoos)
The following object(s) are masked from 'cuckoos (position 6)':

    breadth, id, length, species
> len.mp <- length[species == 'meadow.pipit']
> len.tp <- length[species == 'tree.pipit']
> len.hs <- length[species == 'hedge.sparrow']
> len.r <- length[species == 'robin']
> len.pw <- length[species == 'pied.wagtail']
> len.w <- length[species == 'wren']
> par(mfrow = c(3,2))
> hist(len.mp,
+       breaks = 6,
+       probability = T,
+       xlim = c(19,25),
+       ylim = c(0,1),
+       main = "",
+       col = 6)
> hist(len.tp,
+       breaks = 6,
+       probability = T,
+       xlim = c(19,25),
+       ylim = c(0,1),
+       main = "",
+       col = 6)
> hist(len.hs,
+       breaks = 6,
+       probability = T,
+       xlim = c(19,25),
+       ylim = c(0,1),
+       main = "",
+       col = 6)
> hist(len.r,
+       breaks = 6,
+       probability = T,
+       xlim = c(19,25),
+       ylim = c(0,1),
+       main = "",
+       col = 6)
> hist(len.pw,
+       breaks = 6,
+       probability = T,
+       xlim = c(19,25),
+       ylim = c(0,1),
+       main = "",
+       col = 6)
> hist(len.w,
+       breaks = 6,
+       probability = T,
+       xlim = c(19,25),
+       ylim = c(0,1),
+       main = "",
+       col = 6)
> par(mfrow=c(1,1))

The long procedure above can be condensed into a single function:

library(DAAG)
data(cuckoos)

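hists <- function(x, y) {
      y <- factor(y)
      n <- nlevels(y)
      # One panel per group, stacked vertically; settings are restored on exit
      op <- par(mfcol = c(n, 1), mar = c(2, 4, 1, 1))
      on.exit(par(op))
      # Common breaks from the pooled data, so the panels are comparable
      b <- hist(x, plot = FALSE)$breaks
      for (l in levels(y)) {
          hist(x[y == l],
               breaks = b,
               probability = TRUE,
               ylim = c(0, 1.0),
               main = "",
               ylab = l,
               col = 'lightblue',
               xlab = "")
          # Overlay the kernel density estimate of this group
          lines(density(x[y == l]),
                lwd = 3,
                col = 'red')
      }
      invisible(NULL)
}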

hists(cuckoos$length, cuckoos$species)

(3) Use the histogram function histogram() from the lattice package:

> library(lattice)
> histogram(~ length | species, data = cuckoos)

(4) Use the box-and-whisker plot function boxplot to examine the distributions of all groups at once:

> boxplot(length ~ species, data = cuckoos, xlab = 'length of egg', horizontal = TRUE)
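
The numbers behind each box can also be inspected: `boxplot()` returns its summary statistics invisibly, and `plot = FALSE` suppresses the drawing. A minimal sketch follows, assuming only the twelve printed rows, hand-typed into the hypothetical data frame `eggs`. With just two eggs per species the values are illustrative only.

```r
# Twelve rows copied by hand from the printed listing; `eggs` is an illustrative name.
eggs <- data.frame(
  length  = c(21.7, 22.6, 22.7, 23.3, 22.0, 23.9,
              21.8, 23.0, 23.0, 23.4, 19.8, 22.1),
  species = factor(rep(c("meadow.pipit", "tree.pipit", "hedge.sparrow",
                         "robin", "pied.wagtail", "wren"), each = 2))
)

# plot = FALSE returns the per-group summary without drawing anything
bp <- boxplot(length ~ species, data = eggs, plot = FALSE)

# Rows of bp$stats: lower whisker, lower hinge, median, upper hinge, upper whisker
stats <- bp$stats
colnames(stats) <- bp$names
print(stats)
```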

(5) Use stripchart() to draw a strip chart (a one-dimensional scatter plot, not a bar chart):

> stripchart(cuckoos$length ~ cuckoos$species, method='jitter')
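
The `method` argument decides how coincident points are separated; the help excerpt further down lists `"overplot"`, `"jitter"` and `"stack"`. The sketch below draws the three side by side, again on the twelve printed rows typed into a made-up data frame `eggs`. The layout and margin values are arbitrary choices for illustration.

```r
# Twelve rows copied by hand from the printed listing; `eggs` is an illustrative name.
eggs <- data.frame(
  length  = c(21.7, 22.6, 22.7, 23.3, 22.0, 23.9,
              21.8, 23.0, 23.0, 23.4, 19.8, 22.1),
  species = factor(rep(c("meadow.pipit", "tree.pipit", "hedge.sparrow",
                         "robin", "pied.wagtail", "wren"), each = 2))
)

# Three panels in a row; wide left margin and horizontal labels for the species names
op <- par(mfrow = c(1, 3), mar = c(4, 7, 2, 1), las = 1)
methods <- c("overplot", "jitter", "stack")
for (m in methods) {
  stripchart(length ~ species, data = eggs, method = m,
             main = m, xlab = "length of egg", pch = 16)
}
par(op)  # restore the previous graphics settings
```

With data as sparse as this excerpt the three panels look nearly alike; the difference shows up once many eggs share the same recorded length.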

(6) Use density plots

The densityplot() function in the lattice package draws a separate density curve for each group.

> densityplot( ~length | species, data = cuckoos)

——————————————————————————————————————————————————————————————————

【1】1-D Scatter Plots

Description

stripchart produces one dimensional scatter plots (or dot plots) of the given data. These plots are a good alternative to boxplots when sample sizes are small.

Usage

stripchart(x, ...)

## S3 method for class 'formula'
stripchart(x, data = NULL, dlab = NULL, ...,
           subset, na.action = NULL)


## Default S3 method:
stripchart(x, method = "overplot", jitter = 0.1, offset = 1/3,
           vertical = FALSE, group.names, add = FALSE,
           at = NULL, xlim = NULL, ylim = NULL,
           ylab=NULL, xlab=NULL, dlab="", glab="",
           log = "", pch = 0, col = par("fg"), cex = par("cex"), 
           axes = TRUE, frame.plot = axes, ...)

Arguments

`x`	the data from which the plots are to be produced. In the default method the data can be specified as a single numeric vector, or as list of numeric vectors, each corresponding to a component plot. In the `formula` method, a symbolic specification of the form `y ~ g` can be given, indicating the observations in the vector `y` are to be grouped according to the levels of the factor `g`. `NA`s are allowed in the data.
`data`	a data.frame (or list) from which the variables in `x` should be taken.
`subset`	an optional vector specifying a subset of observations to be used for plotting.
`na.action`	a function which indicates what should happen when the data contain `NA`s. The default is to ignore missing values in either the response or the group.
`...`	additional parameters passed to the default method, or by it to `plot`, `points`, `axis` and `title` to control the appearance of the plot.
`method`	the method to be used to separate coincident points. The default method `"overplot"` causes such points to be overplotted, but it is also possible to specify `"jitter"` to jitter the points, or `"stack"` to have coincident points stacked. The last method only makes sense for very granular data.
`jitter`	when `method="jitter"` is used, `jitter` gives the amount of jittering applied.
`offset`	when stacking is used, points are stacked this many line-heights (symbol widths) apart.
`vertical`	when vertical is `TRUE` the plots are drawn vertically rather than the default horizontal.
`group.names`	group labels which will be printed alongside (or underneath) each plot.
`add`	logical, if true add the chart to the current plot.
`at`	numeric vector giving the locations where the charts should be drawn, particularly when `add = TRUE`; defaults to `1:n` where `n` is the number of boxes.
`ylab, xlab`	labels: see `title`.
`dlab, glab`	alternate way to specify axis labels: see ‘Details’.
`xlim, ylim`	plot limits: see `plot.window`.
`log`	on which axes to use a log scale: see `plot.default`
`pch, col, cex`	Graphical parameters: see `par`.
`axes, frame.plot`	Axis control: see `plot.default`

【2】Conditioning Plots

Description

This function produces two variants of the conditioning plots discussed in the reference below.

Usage

coplot(formula, data, given.values, panel = points, rows, columns,
       show.given = TRUE, col = par("fg"), pch = par("pch"), 
       bar.bg = c(num = gray(0.8), fac = gray(0.95)),
       xlab = c(x.name, paste("Given :", a.name)),
       ylab = c(y.name, paste("Given :", b.name)),
       subscripts = FALSE,
       axlabels = function(f) abbreviate(levels(f)),
       number = 6, overlap = 0.5, xlim, ylim, ...) 
co.intervals(x, number = 6, overlap = 0.5)

Arguments

`formula`	a formula describing the form of conditioning plot. A formula of the form `y ~ x | a` indicates that plots of `y` versus `x` should be produced conditional on the variable `a`. A formula of the form `y ~ x | a * b` indicates that plots of `y` versus `x` should be produced conditional on the two variables `a` and `b`. All three or four variables may be either numeric or factors. When `x` or `y` are factors, the result is almost as if `as.numeric()` was applied, whereas for factor `a` or `b`, the conditioning (and its graphics if `show.given` is true) are adapted.
`data`	a data frame containing values for any variables in the formula. By default the environment where `coplot` was called from is used.
`given.values`	a value or list of two values which determine how the conditioning on `a` and `b` is to take place. When there is no `b` (i.e., conditioning only on `a`), usually this is a matrix with two columns each row of which gives an interval, to be conditioned on, but it can also be a single vector of numbers or a set of factor levels (if the variable being conditioned on is a factor). In this case (no `b`), the result of `co.intervals` can be used directly as `given.values` argument.
`panel`	a `function(x, y, col, pch, ...)` which gives the action to be carried out in each panel of the display. The default is `points`.
`rows`	the panels of the plot are laid out in a `rows` by `columns` array. `rows` gives the number of rows in the array.
`columns`	the number of columns in the panel layout array.
`show.given`	logical (possibly of length 2 for 2 conditioning variables): should conditioning plots be shown for the corresponding conditioning variables (default `TRUE`).
`col`	a vector of colors to be used to plot the points. If too short, the values are recycled.
`pch`	a vector of plotting symbols or characters. If too short, the values are recycled.
`bar.bg`	a named vector with components `"num"` and `"fac"` giving the background colors for the (shingle) bars, for numeric and factor conditioning variables respectively.
`xlab`	character; labels to use for the x axis and the first conditioning variable. If only one label is given, it is used for the x axis and the default label is used for the conditioning variable.
`ylab`	character; labels to use for the y axis and any second conditioning variable.
`subscripts`	logical: if true the panel function is given an additional (third) argument `subscripts` giving the subscripts of the data passed to that panel.
`axlabels`	function for creating axis (tick) labels when x or y are factors.
`number`	integer; the number of conditioning intervals, for a and b, possibly of length 2. It is only used if the corresponding conditioning variable is not a `factor`.
`overlap`	numeric < 1; the fraction of overlap of the conditioning variables, possibly of length 2 for x and y direction. When overlap < 0, there will be gaps between the data slices.
`xlim`	the range for the x axis.
`ylim`	the range for the y axis.
`...`	additional arguments to the panel function.
`x`	a numeric vector.

posted on 2012-12-31 09:29 by 半个馒头
