Alpha-ma
2016/10/7
1 Introduction to ggplot2
ggplot2 is an R package for producing statistical, or data, graphics, and it has a deep underlying grammar. This grammar is based on the Grammar of Graphics created by Wilkinson, in which he describes the deep features that underlie all statistical graphics. In brief, the grammar of graphics tells us that a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Facetting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that makes up a graphic.
All plots are composed of:
- Data that you want to visualise and a set of aesthetic mappings describing how variables in the data are mapped to aesthetic attributes that you can perceive.
- Layers made up of geometric elements and statistical transformations. Geometric objects, geoms for short, represent what you actually see on the plot: points, lines, polygons, etc. Statistical transformations, stats for short, summarise data in many useful ways.
- Scales, which map values in the data space to values in an aesthetic space, whether it be color, size or shape. Scales draw a legend or axes, which provide an inverse mapping to make it possible to read the original data values from the plot.
- A coordinate system, coord for short, which describes how data coordinates are mapped to the plane of the graphic.
- A faceting specification describes how to break up the data into subsets and how to display those subsets as small multiples. This is also known as conditioning or latticing/trellising.
- A theme which controls the finer points of display, like the font size and background color.
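The components listed above can be spelled out one by one in a single call. The following is a minimal sketch, assuming the mpg dataset that ships with ggplot2; the choice of the 'Set1' palette, the year facets and theme_bw() are illustrative assumptions, not something the grammar requires.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + # data and aesthetic mappings
  geom_point() +                          # layer: point geom with the identity stat
  scale_colour_brewer(palette = 'Set1') + # scale: maps drv to colours and draws the legend
  coord_cartesian() +                     # coordinate system (Cartesian is the default)
  facet_wrap(~year) +                     # facetting: one panel per model year
  theme_bw(base_size = 12)                # theme: fonts, background and other fine details
p
```

Dropping any of the last four lines still gives a valid plot, because ggplot2 fills in a default scale, coordinate system, facetting specification and theme.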
It is also important to talk about what the grammar doesn’t do:
- It doesn’t suggest what graphics you should use to answer the questions you are interested in.
- It doesn’t describe interactivity: the grammar of graphics describes only static graphics and there is essentially no benefit to displaying them on a computer screen as opposed to a piece of paper.
2 Getting started with ggplot2
Every ggplot2 plot has three key components:
- Data.
- A set of aesthetic mappings between variables in the data and visual properties.
- At least one layer which describes how to render each observation. Layers are usually created with a geom function.
Here’s a simple example:
library(ggplot2)
p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() ## layers are added with +; 'x =' and 'y =' can be omitted: aes(displ, hwy).
p
summary(p)
## data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy,
## fl, class [234x11]
## mapping: x = displ, y = hwy
## faceting: facet_null()
## -----------------------------------
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity
To add additional variables to a plot, we can use other aesthetics like color, shape and size.
ggplot(mpg,aes(displ,cty,color = class)) + geom_point()
If you want to set an aesthetic to a fixed value, without scaling it, do so in the individual layer, outside of aes(). Compare the following two plots:
p1<-ggplot(mpg,aes(displ,hwy)) + geom_point(aes(color = 'blue'))
p2<-ggplot(mpg,aes(displ,hwy)) + geom_point(color = 'blue')
multiplot(p1, p2, cols = 2) # multiplot() is a self-built helper (its definition is not shown in this post) that displays the plots side by side.
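Since the definition of multiplot() is not included in the post, here is one way such a helper could be written with the grid package. This is a hypothetical stand-in under stated assumptions (plots are placed row by row in a layout with `cols` columns), not the author's actual function.

```r
library(ggplot2)
library(grid)

# Hypothetical stand-in for the hidden multiplot() helper:
# draws the given ggplot objects row by row in a grid with `cols` columns.
multiplot <- function(..., cols = 1) {
  plots <- list(...)
  n <- length(plots)
  rows <- ceiling(n / cols)
  grid.newpage()
  pushViewport(viewport(layout = grid.layout(rows, cols)))
  for (i in seq_len(n)) {
    row <- (i - 1) %/% cols + 1
    col <- (i - 1) %% cols + 1
    print(plots[[i]], vp = viewport(layout.pos.row = row, layout.pos.col = col))
  }
  popViewport()
  invisible(n) # number of plots drawn
}

p1 <- ggplot(mpg, aes(displ, hwy)) + geom_point(aes(color = 'blue'))
p2 <- ggplot(mpg, aes(displ, hwy)) + geom_point(color = 'blue')
multiplot(p1, p2, cols = 2)
```

In p1 the string 'blue' is treated as data and scaled, so the points get the default first colour and a legend; in p2 the points are simply drawn blue.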
Facetting creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset. There are two types of facetting: grid and wrapped.
ggplot(mpg,aes(displ,hwy)) + geom_point() + facet_wrap(~class) # split into subsets by class
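The example above shows the wrapped type. For the grid type, a minimal sketch could look like the following; the choice of drv for rows and cyl for columns is only an illustrative assumption.

```r
library(ggplot2)

# facet_grid() lays panels out in a 2d grid: rows by drv, columns by cyl
p <- ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_grid(drv ~ cyl)
p
```

Unlike facet_wrap(), the grid keeps a panel for every combination of the two variables, even when a combination contains no data.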
Other geoms produce other types of plot. Here are some of the more important ones:
- geom_smooth() fits a smoother to the data and displays the smooth and its standard error.
- geom_boxplot() produces a box-and-whisker plot to summarise the distribution of a set of points.
- geom_histogram() and geom_freqpoly() show the distribution of continuous variables.
- geom_bar() shows the distribution of categorical variables.
- geom_path() and geom_line() draw lines between the data points.
ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth()
## The grey band is the point-wise confidence interval; it can be turned off with geom_smooth(se = FALSE).
## The wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not so wiggly).
p1<-ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(span = 0.2)
p2<- ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(span= 1)
multiplot(p1, p2, cols = 2)
# An important argument to geom_smooth() is method, which lets you choose the type of model used to fit the smooth curve. method = 'loess' is the default for small n. method = 'gam' fits a generalized additive model provided by the mgcv package; use a formula like formula = y ~ s(x), or y ~ s(x, bs = 'cs') for large data. This is what ggplot2 uses when there are more than 1000 points.
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-15. For overview type 'help("mgcv-package")'.
ggplot(mpg,aes(displ,hwy)) + geom_point() + geom_smooth(method = 'gam', formula = y ~ s(x))
# method = 'lm' fits a linear model, giving the line of best fit. method = 'rlm' uses a robust fitting algorithm so that outliers don't affect the fit as much (requires the MASS package).
ggplot(mpg,aes(displ,hwy)) + geom_point() + geom_smooth(method = 'lm')
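The robust variant mentioned above is not shown in the post. A minimal sketch, assuming the MASS package is installed, passes the fitting function directly instead of its name; the explicit formula = y ~ x is an assumption added here only to avoid the message about the default formula.

```r
library(ggplot2)

# Robust linear fit: MASS::rlm down-weights outliers instead of letting them pull the line
p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = MASS::rlm, formula = y ~ x)
p
```

Passing the function as MASS::rlm avoids attaching the whole MASS package; method = 'rlm' after library(MASS) should be an equivalent spelling.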
Boxplots and jittered points: when a dataset contains many observations with the same values, the points overplot heavily and the distribution becomes difficult to see. We can use geom_jitter() and geom_boxplot() instead.
p1<- ggplot(mpg, aes(drv, hwy)) + geom_jitter()
p2<- ggplot(mpg, aes(drv, hwy)) + geom_boxplot()
multiplot(p1,p2,cols = 2)
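The two geoms can also be layered in one plot, so the summary and the raw points are visible together. This is only a sketch; the jitter width, the transparency and the choice to hide the boxplot's own outlier points are assumptions made for readability.

```r
library(ggplot2)

p <- ggplot(mpg, aes(drv, hwy)) +
  geom_boxplot(outlier.shape = NA) +     # hide outliers so they are not drawn twice
  geom_jitter(width = 0.2, alpha = 0.3)  # raw observations, spread horizontally
p
```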
Histograms and Frequency Polygons show the distribution of a single numeric variable.
p1<- ggplot(mpg,aes(hwy)) + geom_histogram()
p2<- ggplot(mpg,aes(hwy)) + geom_freqpoly()
multiplot(p1,p2,cols = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## You can control the bin width with the binwidth argument. It is very important to experiment with the bin width: the default just splits your data into 30 bins, which is unlikely to be the best choice. You should always try many bin widths, and you may find you need multiple bin widths to tell the full story of your data. An alternative to the frequency polygon is the density plot, geom_density().
p3 <- ggplot(mpg, aes(hwy)) + geom_freqpoly(binwidth = 2.5)
p4 <- ggplot(mpg, aes(hwy)) + geom_freqpoly(binwidth = 1)
multiplot(p3, p4, cols = 2)
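The density plot mentioned as an alternative can be sketched as follows. The adjust values are illustrative assumptions; adjust scales the default bandwidth and plays a role similar to binwidth, with smaller values giving a wigglier curve.

```r
library(ggplot2)

p_smooth <- ggplot(mpg, aes(hwy)) + geom_density()              # default bandwidth
p_wiggly <- ggplot(mpg, aes(hwy)) + geom_density(adjust = 0.5)  # half the default bandwidth
p_smooth
p_wiggly
```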
# To compare the distributions of different subgroups, you can map a categorical variable to either fill (for geom_histogram()) or color (for geom_freqpoly()).
p5<- ggplot(mpg, aes(displ, color = drv)) + geom_freqpoly(binwidth = 0.5)
p6<- ggplot(mpg, aes(displ, fill = drv)) + geom_histogram(binwidth = 0.5)
p7<- ggplot(mpg, aes(displ, fill = drv)) + geom_histogram(binwidth = 0.5)+ facet_wrap(~drv, ncol = 1)
multiplot(p5, p6, p7,cols = 3)
The discrete analogue of the histogram is the bar chart, geom_bar(). It’s easy to use:
ggplot(mpg,aes(manufacturer)) +geom_bar()
Line and path plots are typically used for time series data. Line plots join the points from left to right, while path plots join them in the order that they appear in the dataset (in other words, a line plot is a path plot of the data sorted by x value).
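The difference is easiest to see on a tiny dataset whose rows are not sorted by x. The data frame below is made up for illustration; the last two plots use the economics dataset that ships with ggplot2.

```r
library(ggplot2)

# Hypothetical toy data: rows are deliberately not in x order
d <- data.frame(x = c(1, 3, 2), y = c(1, 2, 3))

p_line <- ggplot(d, aes(x, y)) + geom_line()  # connects points in order of x
p_path <- ggplot(d, aes(x, y)) + geom_path()  # connects points in row order

# Time series: unemployment rate over time as a line,
# and unemployment rate against median unemployment duration as a path
p_ts <- ggplot(economics, aes(date, unemploy / pop)) + geom_line()
p_joint <- ggplot(economics, aes(unemploy / pop, uempmed)) + geom_path()
```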
Modifying the Axes: two families of useful helpers let you make the most common modifications.
- xlab() and ylab() modify the x- and y-axis labels:
p1<- ggplot(mpg,aes(cty,hwy)) + geom_point(alpha = 1/ 3)
p2<- ggplot(mpg,aes(cty,hwy)) + geom_point(alpha = 1/ 3) + xlab('city driving (mpg)')+ylab('highway driving (mpg)')
p3<- ggplot(mpg,aes(cty,hwy)) + geom_point(alpha = 1/ 3) + xlab(NULL) + ylab(NULL) #remove the axis labels with NULL
multiplot(p1,p2,p3,cols = 3)
- xlim() and ylim() modify the limits of axes:
p4<- ggplot(mpg, aes(drv,hwy)) + geom_jitter(width = 0.25)
p5<- ggplot(mpg, aes(drv, hwy)) + geom_jitter(width = 0.25) + xlim('f','r') + ylim(20,30)
p6<- ggplot(mpg, aes(drv, hwy)) + geom_jitter(width = 0.25, na.rm = TRUE) + ylim(NA, 30)
multiplot(p4,p5,p6,cols = 3)
## Warning: Removed 138 rows containing missing values (geom_point).
Output: you can save a plot to a variable and manipulate it, as in the code above: p1 <- ..., p2 <- ...
Save plots to disk with ggsave():
ggsave('plot.png', width = 5, height = 5)
qplot() is a shortcut for quick plots:
qplot(displ, hwy, data = mpg)
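By default ggsave() writes the last plot that was displayed. A slightly more explicit sketch passes the plot object and a file path; the temporary PDF path here is an assumption chosen so the example does not leave files in the working directory, and the output format is picked from the file extension.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

out <- tempfile(fileext = '.pdf')          # hypothetical output location
ggsave(out, plot = p, width = 5, height = 5) # width and height are in inches by default
file.exists(out)
```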