R Programming: Part 4 - Simulation & Base Graphics

Part 4 - Simulation & Base Graphics

4.1 Simulation
　4.1.1 Generating Random Numbers
　4.1.2 Simulating a Linear Model
　4.1.3 Random Sampling

4.2 Base Graphics
　4.2.1 Basic Histogram
　4.2.2 Basic Line Graph with Regression
　4.2.3 Scatterplot with Legend
　4.2.4 Boxplot with Reordered and Formatted Axes
　4.2.5 Barplot with Error Bars 😦
　4.2.6 Limitations of Base Graphics

4.1 Simulation

4.1.1 Generating Random Numbers

Generally, every probability distribution function is associated with four R corresponding functions, which are prefixed with

\(\mathrm{d}\) for probability density
\(\mathrm{r}\) for random number generation
\(\mathrm{p}\) for cumulative distribution
\(\mathrm{q}\) for quantile function

Take normal distribution for example, the four functions have the following prototypes:

dnorm(x, mean = 0, sd = 1, log = FALSE)
rnorm(n, mean = 0, sd = 1)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)

where the arguments mean and sd (standard deviation) stand for \(\mu\) and \(\sigma\) in the normal distribution respectively, which are 0 and 1 by default (i.e. standard normal distribution).
The argument log/log.p implies whether the probabilities p are given as log(p). Lower.tail implies whether the probabilities involved are P[X <= x] or P[X > x].

Assuming the normal distribution to be
\(\qquad\displaystyle f(x) = \frac{1}{\sqrt{2\pi}\sigma}\mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}}\)
then the cumulative distribution function is
\(\qquad\displaystyle \Phi(x) = P(X \leq x) = \int_{-\infty}^x f(t) \mathrm{d}t\)
and the above four functions give the result as follows:

dnorm(x, \(\mu\), \(\sigma\)) = \(f(x)\)
rnorm(n, \(\mu\), \(\sigma\)) = a numeric vector with n values sampled from f(x)
pnorm(q, \(\mu\), \(\sigma\)) = \(P(X \leq q)\) = \(\Phi(q)\)
qnorm(p, \(\mu\), \(\sigma\)) = \(\Phi^{-1}(p)\)

Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known average rate and independently of the time since the last event.

The probability of observing k events in an interval, if they follow Poisson distribution, is given by the equation
\(\qquad\displaystyle P(k)=\frac {\lambda ^k}{k!} e^{-\lambda}\)
where \(\lambda\) is the average number of events per interval, named the rate parameter.

For Poisson distribution, we have dpois, rpois, ppois and qpois as well. But since Poisson distribution is discrete, dpois(x, \(\lambda\)) will give a warning and return 0 if x is not an integer.

> rpois(10, 20)
[1] 17 20 16 20 25 16 19 28 27 22
> ppois(6, 2)
[1] 0.9954662
> dpois(1.1, 2)
[1] 0
Warning message:
In dpois(1.1, 2) : non-integer x = 1.100000

Reproducibility of random numbers can be achieved by setting identical random seeds with set.seed() function. For others' convenience to replicate your computation, it is highly recommended to set the seed whenever conducting a simulation.

> set.seed(5)
> rnorm(5)
[1] -0.84085548  1.38435934 -1.25549186  0.07014277  1.71144087
> rnorm(5)
[1] -0.6029080 -0.4721664 -0.6353713 -0.2857736  0.1381082
> set.seed(5)
> rnorm(5)
[1] -0.84085548  1.38435934 -1.25549186  0.07014277  1.71144087

4.1.2 Simulating a Linear Model

Suppose we want to simulate from a simple linear model as follows
\(\qquad\displaystyle y = a + bx + \varepsilon\)
where the noise \(\varepsilon \sim N(0, 2^2)\). Now assume \(x \sim N(0, 1^2)\), \(a=0.5\) and \(b=2\).

> set.seed(20)   ## Important for reproducibility
> x <- rnorm(100)
> e <- rnorm(100, 0, 2)
> y <- 0.5 + 2 * x + e
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-6.4080 -1.5400  0.6789  0.6893  2.9300  6.5050 
> plot(x, y)     ## Plot (x, y) in a graph

Plot 1

Now suppose x is binary, either 0 or 1 with 50/50 probability in each trial, then similarly

> set.seed(20)
> x <- rbinom(100, 1, 0.5)
> e <- rnorm(100, 0, 2)
> y <- 0.5 + 2 * x + e
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3.4360 -0.2675  1.7800  1.5720  2.8810  6.9170 
> plot(x, y)

Plot 2

Now try simulating from a generalized linear model, e.g. Poisson model.
Suppose we have a Poisson model where
\(\qquad\displaystyle y \sim \mathrm{Pois}(\lambda)\)
\(\qquad\displaystyle \log\lambda = a + bx\)
and assuming \(x \sim N(0,1^2)\), \(a=0.5\), \(b=0.3\). We can simulate it with rpois function.

> set.seed(20)
> x <- rnorm(100)
> log_lambda <- 0.5 + 0.3 * x
> y <- rpois(100, exp(log_lambda))
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    2.00    1.69    3.00    5.00 
> plot(x, y)

Plot 3

Note: Almost all of the arguments in functions associated with probability distribution can be vectors EXCEPT one which represents the number of variates to be generated in r* series functions.
E.g. rnorm(5, 1:5, 1:5) produces 5 variates which follow N(1, 1), N(2, 2), ..., N(5, 5) respectively.
But rnorm(1:5, 5, 1:5) does not produce the expected result. Mapply(rnorm, 1:5, 5, 1:5) can achieve the desired goal.

Other than normal, Poisson, binomial and uniform distribution, there are a lot more built-in standard distributions, like Bernoull, geometric, hypergeometric, negative binomial, Beta, Gamma, chi-squared, exponential, log-normal, multinomial, Cauchy, F (ratio of two chi-squared), etc.

4.1.3 Random Sampling

The sample() function can be used to draw random samples from arbitrary vectors.
Argument list: sample(x, size, replace = FALSE, prob = NULL)

x: an arbitrary vector or a single positive integer (treated as 1:x)
size: number of samples to be drawn, which is the length of x by default
replace: indicates whether to sample with replacement
prob: a vector of weights implying the probabilities each element in the vector is to be sampled, no need to sum to one

If only the argument x is assigned, then by default, size is equal to the length of x and replace is set to FALSE, so the sampling result is equivalent to a permutation of the original vector x.

> set.seed(20)
> sample(1:20, 5)
[1] 18 15  6  9 16
> sample(20, 5)                ## x = 20 equivalent to x = 1:20
[1] 20  2 19  6 17
> sample(letters, 5)           ## Any vector can be sampled from
[1] "s" "z" "a" "r" "e"
> sample(10)                   ## A permutation of 1:10
[1]  5  3  1  9 10  7  8  2  4  6
> sample(10, replace = TRUE)   ## Sample with replacement
[1]  1 10 10  1  7  4  5  9  2  6
> sample(10, 5, replace = TRUE, prob = c(rep(1, 9), 100))
[1] 10 10 10 10 10

4.2 Base Graphics

All data in the following sections are available here.

4.2.1 Basic Histogram

Have a look at our data about birthweight of male and female unicorns.

> unicorns <- read.table("unicorns.txt", header = T)
> head(unicorns)
  birthweight  sex longevity
1    4.478424 Male         1
2    5.753458 Male         0
3    3.277265 Male         0
4    3.929379 Male         0
5    3.972810 Male         0
6    4.912954 Male         0

We use hist() function to plot a histogram, whose arguments are listed as follows.

> hist(unicorns$birthweight,                          # Values of x-axis
       breaks = 40,                                   # Number of cells
       xlab = "Birth Weight",                         # X-axis label
       main = "Histogram of Unicorn Birth Weight",    # Plot title
	   ylim = c(0,80))                                # Limits of the y axis (min, max)

Histogram 1

By default, breaks will be computed according to a particular formula, xlab = vectorName, ylab = “Frequency”, main = “Histogram of” + vectorName.

The argument breaks can also be a vector indicating the breakpoints between the histogram cells.

> hist(unicorns$birthweight, breaks = c(0, 1, 4, 5, 7))

Histogram 2

4.2.2 Basic Line Graph with Regression

We have data on the population of moomins, a common pest species in Finland, on the island of Ruissalo from 1971 to 2001.

> moomins <- read.table("Moomin Density.txt", header = T)
> head(moomins)
  Year PopSize
1 1971     500
2 1972     562
3 1973     544
4 1974     532
5 1975     580
6 1976     590

> plot(moomins$Year, moomins$PopSize,       # X variable and y variable
       type = "l",                          # Draw a line graph
       col = "red",                         # Line colour
       lwd = 3,                             # Line width
       xlab = "Year",                       # X-axis label
       ylab = "Population Size",            # Y-axis label
       main = "Moomin Population Size on Ruissalo 1971 - 2001")
									        # Plot title

Line Graph 1

Type indicates which type of plot should be drawn. Some possible options are “p” for points, “l” for lines, “b” for lines covered by points, “o” for lines not covered by points.
Lty (line type) argument controls the type of lines, e.g. “solid“, “dashed”, “dotted”, etc.
Col argument specifies the line color. Numbers can be colors too!

To show the population trend over these years, we can add a basic linear regression to the plot using abline().

> fit <- lm(PopSize ~ Year, data = moomins)
> abline(fit, lty = "dashed")
> text(x = 1978, y = 750, labels = "R2 = 0.896\nP = 2.615e-15")

Line Graph 2

Abline() function also supports the arguments of lty, lwd, col, etc.
Text() function adds texts to the plot at the specified (x, y) coordinates.

4.2.3 Scatterplot with Legend

In this section, we make use of a built-in dataset named iris.

> data("iris")
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Use pairs() function to see plots on every two variables. We specify different colors of points according to their species.

> pairs(iris, col = iris$Species)

Scatterplot 1

Now we come to produce a scatterplot on Sepal.Length and Petal.Length.

> plot(iris$Sepal.Length, iris$Petal.Length,        # X variable, y variable
       col = iris$Species,                          # Color by species
       pch = 16,                                    # Type of points
       cex = 1,                                     # Size of points
       xlab = "Sepal Length",                       # X-axis label
       ylab = "Petal Length",                       # Y-axis label
       main = "Flower Characteristics in Iris")     # Plot title

Scatterplot 2

Pch argument specifies a particular type of point, usually by a number index or a character (e.g. “A”).
The list of indexed point types follows.

Point Types

To tell these points apart, we’d better add a legend to the plot.

> legend(x = 4.5, y = 7, legend = levels(iris$Species), col = 1:3, pch = 16)

这里写图片描述

Note 1: It is possible to specify colors to the data frame.
Note 2: It is possible to specify lines in the legend by using lty instead of pch.

> iris$Colour <- ""
> iris$Colour[iris$Species == "setosa"] <- "magenta"
> iris$Colour[iris$Species == "versicolor"] <- "cyan"
> iris$Colour[iris$Species == "virginica"] <- "yellow"

> plot(iris$Sepal.Length, iris$Petal.Length,
       col = iris$Colour, pch = 16, cex = 1)
> legend(x = 4.5, y = 7, 
         legend = levels(iris$Species),
         col = c("magenta", "cyan", "yellow"),
         lty = "solid")

Scatterplot 3

4.2.4 Boxplot with Reordered and Formatted Axes

Let us continue with the iris dataset.

> boxplot(iris$Sepal.Length ~ iris$Species,              # x variable, y variable
          notch = T,                                     # Draw notch
          las = 1,                                       # Orientate the axis tick labels
          xlab = "Species",                              # X-axis label
          ylab = "Sepal Length",                         # Y-axis label
          main = "Sepal Length by Species in Iris",      # Plot title
          cex.lab = 1.5,                                 # Size of axis labels
          cex.axis = 1.5,                                # Size of the tick mark labels
          cex.main = 2)                                  # Size of the plot title

Boxplot

In a boxplot, the bottom and top of the box are always the first and third quartiles, and the band inside is the median. But the ends of whiskers can represent some alternative values, among which the minimum and the maximum are the most common.

Notch argument allows you to compare the medians of the boxplot.
Las argument allows you to rotate the labels of axis.
Cex.lab, cex.axis and cex.main allows you to specify the size of labels, axes and the title respectively.

To reorder the species list in the label, we can reorder it by specifying a particular order like the following command.

iris$Species <- factor(iris$Species, levels = c("virginica", "versicolor", "setosa"))

4.2.5 Barplot with Error Bars 😦

Drawing barplots with base graphics is not that comfortable.

> dragons <- data.frame(
      TalonLength = c(20.9, 58.3, 35.5),
      SE = c(4.5, 6.3, 5.5),
      Population = c("England", "Scotland", "Wales"))
> dragons
  TalonLength  SE Population
1        20.9 4.5    England
2        58.3 6.3   Scotland
3        35.5 5.5      Wales
> barplot(dragons$TalonLength, names = dragons$Population)

Barplot

But it is really COMPLICATED to draw a barplot with error bars!

4.2.6 Limitations of Base Graphics

You could feasibly do anything you require in base graphics, but some common actions are not straightforward, such as

Legends
Dodged plots
Faceting
Error bars
Formatting axes and plot area

Base graphics are best for quick, simple and exploratory graphics, but they are really time-consuming when it comes to complex graphs.

It is HIGHLY RECOMMENDED to use ggplot2 for statistical graphics.

Advantages of ggplot2:

Consistent underlying grammar of graphics (Wilkinson, 2005)
Plot specification at a high level of abstraction
Very flexible
Theme system for polishing plot appearance
Mature and complete graphics system
Many users, active mailing list

A blog on introduction to ggplot2 will come out this summer.

posted @ 2016-07-20 20:12 二和鶏阅读(451) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

二和鶏与他的极大理想

一定是素理想