R: Graphical Summaries of Grouped Data

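Grouped data can be viewed as a special case of multi-group data. The difference is this: in multi-group data, the observations of each numeric variable refer to different objects, whereas grouped data are the observations of a single numeric variable divided into subsets by a second variable, so every subset refers to the same variable.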
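The definition can be made concrete with a minimal sketch. The data frame `eggs` and its columns `len` and `host` are hypothetical names with made-up values, used only for illustration:

```r
# Hypothetical toy data: one numeric variable (len) and one grouping factor (host).
eggs <- data.frame(
  len  = c(21.7, 22.6, 22.7, 23.3, 19.8, 22.1),
  host = factor(c("pipit", "pipit", "robin", "robin", "wren", "wren"))
)

# split() cuts the single numeric variable into one subset per factor level.
by_host <- split(eggs$len, eggs$host)
str(by_host)

# Every subset measures the same quantity, so the groups can share one scale
# and be compared directly.
group_means <- sapply(by_host, mean)
group_means
```

Because all subsets come from the same variable, plots of the groups can reasonably share axes and breaks, which is what the methods in this post rely on.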

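Below, the cuckoos dataset from the DAAG package is used to illustrate graphical methods specific to grouped data.

Cuckoos lay their eggs in the nests of other bird species, which then hatch the eggs for them. We want to compare the length of cuckoo eggs found in the nests of different host species. The data are as follows (the printout is abridged to two rows per host species):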

> library(DAAG)
> data(cuckoos)
> attach(cuckoos)
The following object(s) are masked from 'cuckoos (position 4)':

    breadth, id, length, species
> cuckoos
    length breadth       species  id
1     21.7    16.1  meadow.pipit  21
2     22.6    17.0  meadow.pipit  22
46    22.7    16.3    tree.pipit  66
47    23.3    16.6    tree.pipit  67
61    22.0    17.0 hedge.sparrow  82
62    23.9    16.9 hedge.sparrow  83
75    21.8    16.0         robin  96
76    23.0    15.9         robin  97
91    23.0    16.3  pied.wagtail 198
92    23.4    16.7  pied.wagtail 199
106   19.8    15.0          wren 224
107   22.1    16.0          wren 225
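(1) Conditioning scatter plots

When the dataset contains one or more factor variables, the conditioning plot function coplot() draws one scatter plot for each level of the factor.

> coplot(length ~ breadth | species, data = cuckoos)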

 

(2) A simple but tedious approach is to call hist() repeatedly

> library(DAAG)
> data(cuckoos)
> attach(cuckoos)
The following object(s) are masked from 'cuckoos (position 6)':

    breadth, id, length, species
> len.mp <- length[species == 'meadow.pipit']
> len.tp <- length[species == 'tree.pipit']
> len.hs <- length[species == 'hedge.sparrow']
> len.r  <- length[species == 'robin']
> len.pw <- length[species == 'pied.wagtail']
> len.w  <- length[species == 'wren']
> par(mfrow = c(3,2))
> hist(len.mp,
+       breaks = 6,
+       probability = T,
+       xlim = c(19,25),
+       ylim = c(0,1),
+       main = "",
+       col = 6)
> hist(len.tp,
+       breaks = 6,
+       probability = T,
+       xlim = c(19,25),
+       ylim = c(0,1),
+       main = "",
+       col = 6)
> hist(len.hs,
+       breaks = 6,
+       probability = T,
+       xlim = c(19,25),
+       ylim = c(0,1),
+       main = "",
+       col = 6)
> hist(len.r,
+       breaks = 6,
+       probability = T,
+       xlim = c(19,25),
+       ylim = c(0,1),
+       main = "",
+       col = 6)
> hist(len.pw,
+       breaks = 6,
+       probability = T,
+       xlim = c(19,25),
+       ylim = c(0,1),
+       main = "",
+       col = 6)
> hist(len.w,
+       breaks = 6,
+       probability = T,
+       xlim = c(19,25),
+       ylim = c(0,1),
+       main = "",
+       col = 6)
> par(mfrow=c(1,1))

The long procedure above can be condensed into a single function:

library(DAAG)
data(cuckoos)

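hists <- function(x, y) {
      y <- factor(y)
      n <- nlevels(y)
      # one panel per group, stacked in a single column
      op <- par(mfcol = c(n, 1), mar = c(2, 4, 1, 1))
      on.exit(par(op))  # restore the graphics settings on exit
      # common breaks from the pooled data so the panels are comparable
      b <- hist(x, plot = FALSE)$breaks
      for (l in levels(y)) {
          xl <- x[y == l]
          hist(xl,
               breaks = b,
               probability = TRUE,
               ylim = c(0, 1.0),
               main = "",
               ylab = l,
               col = 'lightblue',
               xlab = "")
          # overlay the kernel density estimate of this group
          lines(density(xl), lwd = 3, col = 'red')
      }
      invisible(NULL)
}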

hists(cuckoos$length, cuckoos$species)
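The fixed ylim = c(0, 1.0) in hists() clips any bar whose density exceeds 1, which can happen when a group is tightly concentrated. One possible way around this, sketched here on simulated data (the vectors `len` and `host` are hypothetical stand-ins for cuckoos$length and cuckoos$species), is to compute a shared upper limit from the groups first:

```r
set.seed(1)
# Hypothetical toy data: group "a" is tightly concentrated, group "b" is spread out.
len  <- c(rnorm(30, mean = 22, sd = 0.3), rnorm(30, mean = 21, sd = 1))
host <- factor(rep(c("a", "b"), each = 30))

# Shared breaks, computed from the pooled data as in hists().
b <- hist(len, plot = FALSE)$breaks

# Tallest density bar in each group when binned with the shared breaks.
peak <- sapply(levels(host), function(l) {
  max(hist(len[host == l], breaks = b, plot = FALSE)$density)
})

# Upper limit that fits the bars of every panel.
ylim_all <- c(0, max(peak))
ylim_all
```

Passing ylim = ylim_all to each hist() call would keep all bars inside the plotting region. The overlaid density curve can still peak slightly higher than the bars, so a small margin on top may be needed.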

(3) Using the histogram function histogram() from the lattice package:

> library(lattice)
> histogram(~ length | species, data = cuckoos)
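Two details of lattice are easy to miss: the panel arrangement can be set with layout, and the plot is only drawn when the returned object is printed. A minimal sketch on simulated data (the data frame `eggs` with columns `len` and `host` is hypothetical):

```r
library(lattice)

set.seed(1)
# Hypothetical toy data with three groups of 20 observations each.
eggs <- data.frame(
  len  = rnorm(60, mean = 22, sd = 1),
  host = factor(rep(c("a", "b", "c"), each = 20))
)

# layout = c(columns, rows): one column and three rows, so the x-axes line up.
p <- histogram(~ len | host, data = eggs, layout = c(1, 3), type = "density")

# Lattice functions return a "trellis" object; print() draws it.
# At the console this happens automatically, inside functions or loops it does not.
print(p)
```

Stacking the panels in one column makes it easier to compare the location of each group along the common x-axis.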

(4) Use the box-and-whisker plot function boxplot() to examine the distributions of all groups at once:

> boxplot(length ~ species, data = cuckoos, xlab = 'length of egg', horizontal = TRUE)
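The numbers behind each box can also be retrieved without drawing anything, since boxplot() returns them as a list. A small sketch with made-up values (the data frame `eggs` is hypothetical):

```r
# Hypothetical toy data: three groups of three observations.
eggs <- data.frame(
  len  = c(21.7, 22.6, 22.0, 22.7, 23.3, 23.0, 19.8, 22.1, 21.0),
  host = factor(rep(c("pipit", "robin", "wren"), each = 3))
)

# plot = FALSE skips the drawing and only returns the summaries.
b <- boxplot(len ~ host, data = eggs, plot = FALSE)

b$stats  # one column per group; rows: lower whisker, lower hinge, median, upper hinge, upper whisker
b$n      # number of observations in each group
b$names  # group labels, in the order of the columns of b$stats
```

Checking b$n alongside the plot is a useful habit, because boxes drawn from very few observations look just as solid as boxes drawn from many.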

(5) Draw a strip chart (a one-dimensional scatter plot) with stripchart():

> stripchart(length ~ species, data = cuckoos, method = 'jitter')
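The effect of the method argument is easiest to see on data with many ties. The following sketch draws the same made-up data (the data frame `eggs` is hypothetical) with each of the three methods:

```r
# Hypothetical toy data, rounded so that several values coincide.
eggs <- data.frame(
  len  = c(22, 22, 22, 23, 23, 21, 22, 22, 23, 24),
  host = factor(rep(c("robin", "wren"), each = 5))
)

op <- par(mfrow = c(3, 1), mar = c(3, 5, 2, 1))
for (m in c("overplot", "jitter", "stack")) {
  # Same data each time; only the handling of coincident points changes.
  stripchart(len ~ host, data = eggs, method = m, pch = 1)
  title(main = m)
}
par(op)
```

With "overplot" the tied points are drawn on top of each other and look like a single point, while "jitter" and "stack" make the repeated values visible.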

(6) Density curves

The densityplot() function in the lattice package displays the density curve of each group in its own panel:

> densityplot( ~length | species, data = cuckoos)
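As an alternative to one panel per group, the curves can be overlaid in a single panel with the groups argument, which makes differences in location and spread easier to see. A sketch on simulated data (the data frame `eggs` is hypothetical):

```r
library(lattice)

set.seed(1)
# Hypothetical toy data: three groups with different means.
eggs <- data.frame(
  len  = c(rnorm(20, 21), rnorm(20, 22), rnorm(20, 23)),
  host = factor(rep(c("a", "b", "c"), each = 20))
)

# groups = host draws one curve per level in the same panel;
# plot.points = FALSE hides the raw points, auto.key adds a legend.
p <- densityplot(~ len, groups = host, data = eggs,
                 plot.points = FALSE, auto.key = list(columns = 3))
print(p)
```

Overlaying works well for a handful of groups; with many groups the separate panels of the conditioned version are usually easier to read.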

 

——————————————————————————————————————————————————————————————————

【1】1-D Scatter Plots

Description

stripchart produces one dimensional scatter plots (or dot plots) of the given data. These plots are a good alternative to boxplots when sample sizes are small.

Usage

stripchart(x, ...)

## S3 method for class 'formula'
stripchart(x, data = NULL, dlab = NULL, ...,
           subset, na.action = NULL)


## Default S3 method:
stripchart(x, method = "overplot", jitter = 0.1, offset = 1/3,
           vertical = FALSE, group.names, add = FALSE,
           at = NULL, xlim = NULL, ylim = NULL,
           ylab=NULL, xlab=NULL, dlab="", glab="",
           log = "", pch = 0, col = par("fg"), cex = par("cex"), 
           axes = TRUE, frame.plot = axes, ...)

Arguments

x

the data from which the plots are to be produced. In the default method the data can be specified as a single numeric vector, or as a list of numeric vectors, each corresponding to a component plot. In the formula method, a symbolic specification of the form y ~ g can be given, indicating the observations in the vector y are to be grouped according to the levels of the factor g. NAs are allowed in the data.

data

a data.frame (or list) from which the variables in x should be taken.

subset

an optional vector specifying a subset of observations to be used for plotting.

na.action

a function which indicates what should happen when the data contain NAs. The default is to ignore missing values in either the response or the group.

...

additional parameters passed to the default method, or by it to plot, points, axis and title to control the appearance of the plot.

method

the method to be used to separate coincident points. The default method "overplot" causes such points to be overplotted, but it is also possible to specify "jitter" to jitter the points, or "stack" to have coincident points stacked. The last method only makes sense for very granular data.

jitter

when method="jitter" is used, jitter gives the amount of jittering applied.

offset

when stacking is used, points are stacked this many line-heights (symbol widths) apart.

vertical

when vertical is TRUE the plots are drawn vertically rather than the default horizontal.

group.names

group labels which will be printed alongside (or underneath) each plot.

add

logical, if true add the chart to the current plot.

at

numeric vector giving the locations where the charts should be drawn, particularly when add = TRUE; defaults to 1:n where n is the number of boxes.

ylab, xlab

labels: see title.

dlab, glab

alternate way to specify axis labels: see ‘Details’.

xlim, ylim

plot limits: see plot.window.

log

on which axes to use a log scale: see plot.default

pch, col, cex

Graphical parameters: see par.

axes, frame.plot

Axis control: see plot.default

 

【2】Conditioning Plots

Description

This function produces two variants of the conditioning plots discussed in the reference below.

Usage

coplot(formula, data, given.values, panel = points, rows, columns,
       show.given = TRUE, col = par("fg"), pch = par("pch"), 
       bar.bg = c(num = gray(0.8), fac = gray(0.95)),
       xlab = c(x.name, paste("Given :", a.name)),
       ylab = c(y.name, paste("Given :", b.name)),
       subscripts = FALSE,
       axlabels = function(f) abbreviate(levels(f)),
       number = 6, overlap = 0.5, xlim, ylim, ...) 
co.intervals(x, number = 6, overlap = 0.5)

Arguments

formula

a formula describing the form of conditioning plot. A formula of the form y ~ x | a indicates that plots of y versus x should be produced conditional on the variable a. A formula of the form y ~ x| a * b indicates that plots of y versus x should be produced conditional on the two variables a and b.

All three or four variables may be either numeric or factors. When x or y are factors, the result is almost as if as.numeric() was applied, whereas for factor a or b, the conditioning (and its graphics if show.given is true) are adapted.

data

a data frame containing values for any variables in the formula. By default the environment where coplot was called from is used.

given.values

a value or list of two values which determine how the conditioning on a and b is to take place.

When there is no b (i.e., conditioning only on a), usually this is a matrix with two columns, each row of which gives an interval to be conditioned on, but it can also be a single vector of numbers or a set of factor levels (if the variable being conditioned on is a factor). In this case (no b), the result of co.intervals can be used directly as the given.values argument.

panel

a function(x, y, col, pch, ...) which gives the action to be carried out in each panel of the display. The default is points.

rows

the panels of the plot are laid out in a rows by columns array. rows gives the number of rows in the array.

columns

the number of columns in the panel layout array.

show.given

logical (possibly of length 2 for 2 conditioning variables): should conditioning plots be shown for the corresponding conditioning variables (default TRUE).

col

a vector of colors to be used to plot the points. If too short, the values are recycled.

pch

a vector of plotting symbols or characters. If too short, the values are recycled.

bar.bg

a named vector with components "num" and "fac" giving the background colors for the (shingle) bars, for numeric and factor conditioning variables respectively.

xlab

character; labels to use for the x axis and the first conditioning variable. If only one label is given, it is used for the x axis and the default label is used for the conditioning variable.

ylab

character; labels to use for the y axis and any second conditioning variable.

subscripts

logical: if true the panel function is given an additional (third) argument subscripts giving the subscripts of the data passed to that panel.

axlabels

function for creating axis (tick) labels when x or y are factors.

number

integer; the number of conditioning intervals, for a and b, possibly of length 2. It is only used if the corresponding conditioning variable is not a factor.

overlap

numeric < 1; the fraction of overlap of the conditioning variables, possibly of length 2 for x and y direction. When overlap < 0, there will be gaps between the data slices.

xlim

the range for the x axis.

ylim

the range for the y axis.

...

additional arguments to the panel function.

x

a numeric vector.
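The interplay of number, overlap and given.values for a numeric conditioning variable can be tried out on a small example. This is only a sketch; the data frame `d` and its columns are hypothetical:

```r
set.seed(1)
# Hypothetical toy data: a is the numeric conditioning variable.
d <- data.frame(a = 1:12, x = runif(12), y = runif(12))

# Three intervals with no overlap, and three intervals sharing half their points.
ci0 <- co.intervals(d$a, number = 3, overlap = 0)
ci5 <- co.intervals(d$a, number = 3, overlap = 0.5)
ci0  # each row holds the lower and upper end of one interval
ci5

# As noted under given.values, the matrix can be passed to coplot() directly.
coplot(y ~ x | a, data = d, given.values = ci5)
```

With overlap = 0.5 neighbouring panels share observations, which smooths the transition from panel to panel; with overlap = 0 every observation appears in exactly one panel.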

Posted on 2012-12-31 09:29 by 半个馒头
