Plotting in R
Plot
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 8, 9, 12)
plot(x, y)
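Other parameters:
- type='l'
- main="My Graph"
- xlab="The x-axis"
- ylab="The y axis"
- col="red"
- cex=2: point size; the default is 1.
- pch=25: point shape (values 0 to 25).
- lwd=2: line width; it also takes effect when the plot is not a line.
- lty=3: line style (values 0 to 6); it only takes effect when a line is drawn, e.g. type='l'.
  - 0 removes the line
  - 1 displays a solid line
  - 2 displays a dashed line
  - 3 displays a dotted line
  - 4 displays a "dot dashed" line
  - 5 displays a "long dashed" line
  - 6 displays a "two dashed" line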
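As a quick way to see these parameters together, here is a minimal sketch that reuses the x and y vectors from the first example. The choice of type = "b" and pch = 17 is only an assumption made for illustration, so that both the point and the line settings are visible in one plot.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 8, 9, 12)

# type = "b" draws both points and lines, so cex/pch and lwd/lty all show up
plot(x, y,
     type = "b",
     main = "My Graph",
     xlab = "The x-axis",
     ylab = "The y axis",
     col  = "red",
     cex  = 2,    # points at twice the default size
     pch  = 17,   # filled triangle
     lwd  = 2,    # line width
     lty  = 3)    # dotted line
```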
# To draw two lines, always call plot() first and then lines()
line1 <- c(1,2,3,4,5,10)
line2 <- c(2,5,7,8,9,10)
plot(line1, type = "l", col = "blue")
lines(line2, type="l", col = "red")
That covers lines.
abline(lm(y~x))
abline() draws the regression line; lm() (linear model) fits that line with a linear model.
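To make the link between lm() and abline() concrete, the sketch below fits the same x and y used earlier and draws the line twice, once from the fitted object and once from its coefficients. The variable names fit and coefs are illustrative only.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 8, 9, 12)

fit <- lm(y ~ x)       # least-squares fit of y on x
coefs <- coef(fit)     # named vector: "(Intercept)" and "x"
print(coefs)           # intercept 1.8 and slope 2 for these data

plot(x, y)
abline(fit, col = "red")                          # abline() reads the coefficients from the fit
abline(a = coefs[[1]], b = coefs[[2]], lty = 2)   # the same line, given explicitly
```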
title("AAA")
Adds a title to the plot.
ggplot
library(ggplot2)
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class))
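# The difference:
# color outside aes() sets every point to blue;
# color = "blue" inside aes() maps the constant string "blue" as a variable,
# so the points get a default palette color and a legend entry labeled "blue"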
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
p <- ggplot(data = diamonds) + geom_bar(aes(x = cut, fill = cut)) + scale_fill_brewer(palette = "Dark2")
ggsave(filename = "mygraph.png", plot = p)  # p must be assigned before it can be saved
- geom_smooth()
Lecture 5: Descriptive Statistics
A univariate variable falls into one of two types:
- categorical
- quantitative
Understanding data set
- Columns: variables \(X = (X_1, \ldots, X_p)\)
- Rows: individuals \(x_i = (x_{i1}, \ldots, x_{ip})\)
- Population & sample
- A probability model
Univariate variable
Categorical Var.
(1) factors (2) numeric data with discrete units
- Plot a histogram to get a visual impression, and overlay a density curve.
- Use count(penguins, island) %>% knitr::kable() to tabulate the counts.
library(palmerpenguins)
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
ggplot(data = penguins, aes(x = body_mass_g, y = after_stat(density), group = 1)) +
  geom_histogram(color = "black", alpha = 0.7) +
  geom_density(color = "red", fill = "red", alpha = 0.2)
Discrete probability models
A random variable is a map from a sample point to a numeric value (for example, mapping categorical data to numbers). Once that is done, you can compute the p.m.f., c.d.f., expectation, and variance.
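As a small illustration of these four quantities, here is a sketch for a fair six-sided die. The die is an assumed toy example and does not come from the lecture data.

```r
# Fair six-sided die (assumed toy example): values and their probabilities
values <- 1:6
pmf <- rep(1 / 6, 6)      # p.m.f.: P(X = k)
cdf <- cumsum(pmf)        # c.d.f.: P(X <= k)

expectation <- sum(values * pmf)                   # E[X]
variance <- sum((values - expectation)^2 * pmf)    # Var(X) = E[(X - E[X])^2]

data.frame(value = values, pmf = pmf, cdf = cdf)
c(expectation = expectation, variance = variance)
```

For this die the expectation is 3.5 and the variance is 35/12.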
During EDA, look at:
- center: sample mean, sample median
- spread: standard deviation, range, quantiles, IQR (interquartile range, i.e. the 75th percentile minus the 25th)
- skewness $$\frac{\sum_{i=1}^{n}(x_i - \bar x)^3}{(n-1)\,sd^3}$$
summary()
sd()
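For heavily skewed variables, the median and IQR are more reliable summaries than the mean and standard deviation.
Note: base R has no skewness function (packages such as e1071 and moments provide one).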
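Since base R lacks a skewness function, one option is to write the formula above directly. The helper name sample_skewness is hypothetical, and the sketch assumes missing values should simply be dropped.

```r
# Hypothetical helper implementing the skewness formula above
sample_skewness <- function(v) {
  v <- v[!is.na(v)]                           # assumption: drop missing values
  n <- length(v)
  sum((v - mean(v))^3) / ((n - 1) * sd(v)^3)
}

y <- c(3, 7, 8, 9, 12)
summary(y)            # min, quartiles, median, mean, max
sd(y)                 # standard deviation
IQR(y)                # 75th percentile minus 25th percentile
sample_skewness(y)

sym_skew <- sample_skewness(c(1, 2, 3, 4, 5))  # symmetric data, so the skewness is 0
sym_skew
```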
Quantitative
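Typically examine:
- density function: pay attention to its support \(S_X\). \(f_X(x)\) can be greater than 1! It only needs to be non-negative and integrate to 1.
- distribution function
- whether it follows a normal distribution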
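library(ggplot2)
library(palmerpenguins)
# The leading "+" on each line was a console continuation prompt, not part of the code
ggplot(data = penguins, aes(sample = body_mass_g)) +
  stat_qq() +
  stat_qq_line(col = "red")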
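The note above that a density can exceed 1 is easy to check numerically. The sketch below uses a Uniform(0, 0.5) distribution, chosen here only as a convenient example.

```r
# Uniform(0, 0.5): the density is 2 everywhere on its support [0, 0.5]
d <- dunif(0.25, min = 0, max = 0.5)
d                                   # 2, which is greater than 1

# The density still integrates to 1 over the support
area <- integrate(dunif, lower = 0, upper = 0.5, min = 0, max = 0.5)
area$value

# The distribution function always stays within [0, 1]
p <- punif(0.25, min = 0, max = 0.5)
p                                   # 0.5
```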
Multivariate random variables
Two variables: \((𝑋, 𝑌)\)
– Joint distribution
– Marginal distribution
– Conditional distribution
– Mean
– Variance
– Covariance cov()
– Correlation cor()
Both 𝑋 and 𝑌 are categorical variables
- Plot a stacked bar chart
ggplot(data = penguins, aes(x = species, fill = sex)) +
geom_bar()
- Joint count
(joint_table <- penguins %>%
xtabs(~species + sex, data = .)) %>%
addmargins()
- Joint probability
joint_table %>%
prop.table() %>%
round(digits = 3) %>% # round to three decimal places
addmargins()
- Marginal distribution
library(magrittr)
joint_table %>%
margin.table(1) %T>% # 1: marginal of X (row totals); 2: marginal of Y (column totals)
print() %>%
prop.table()
- Conditional distribution
joint_table %>% prop.table(margin = 1) # margin = 1 gives P(Y|X); margin = 2 gives P(X|Y)
If the conditional distribution equals the marginal distribution, the two variables are independent.
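The independence criterion can be checked on a small table. The sketch below builds a hypothetical joint probability table that is independent by construction (the labels a, b, u, v and the probabilities are made up) and compares each conditional row with the marginal of Y.

```r
# Hypothetical joint table built to be independent: P(X, Y) = P(X) * P(Y)
p_x <- c(a = 0.3, b = 0.7)
p_y <- c(u = 0.4, v = 0.6)
joint <- as.table(outer(p_x, p_y))

marginal_y <- as.numeric(margin.table(joint, 2))   # P(Y)
conditional <- prop.table(joint, margin = 1)       # each row is P(Y | X = x)

# Under independence every row of the conditional table equals the marginal of Y
rows_match <- apply(conditional, 1, function(row) {
  isTRUE(all.equal(unname(row), marginal_y))
})
all(rows_match)
```

With real sample counts the rows will rarely match exactly, so this comparison is only a descriptive check.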
X is categorical and Y is quantitative
- Plot it to take a look
ggplot(data = penguins, aes(x = body_mass_g, y = after_stat(density), fill = species)) + # after_stat() replaces the deprecated ..density..
geom_density(color = "black", alpha = 0.5)
ggplot(data = penguins, aes(x = species, y = body_mass_g, fill = species)) +
geom_boxplot(color = "black", alpha = 0.5)
Before calling summarize(), you can group the data with group_by().
penguins %>%
group_by(species, sex) %>%
  summarize(n = length(body_mass_g),
            mean = mean(body_mass_g, na.rm = TRUE),  # na.rm = TRUE: body_mass_g has missing values
            sd = sd(body_mass_g, na.rm = TRUE))
- Normality check (Q-Q plot)
penguins %>%
filter(species == "Chinstrap") %>%
ggplot(aes(sample = body_mass_g)) +
stat_qq() +
stat_qq_line(col = "red")
!["5-1.png"]
See P32 and P33.
\(\pi_s\) can be viewed as the probability of species \(s\), for example the probability that the species is Adelie.
Classification
- Logistic regression
- Linear Discriminant Analysis, LDA
- Support vector machine
Both X and Y are quantitative
- Covariance
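- Correlation: the variables must be standardized first (subtract the mean and divide by the standard deviation) to remove the effect of units before they can be compared.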
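A short sketch of why standardizing matters, reusing the x and y vectors from the plotting section: the covariance of the standardized variables equals the correlation, and rescaling x changes the covariance but not the correlation. The names zx and zy are illustrative.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 8, 9, 12)

# Standardize: subtract the mean, divide by the standard deviation
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

cov(x, y)      # depends on the units of x and y
cov(zx, zy)    # covariance of the standardized variables
cor(x, y)      # same value as cov(zx, zy)

# Rescaling x (a change of units) changes the covariance but not the correlation
cov(100 * x, y)
cor(100 * x, y)
```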
Random Vectors
\[s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij} - \bar X_{,j})(X_{ik} - \bar X_{,k})
\]
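library(palmerpenguins)  # penguins data set
library(dplyr)           # %>%, select(), where()
library(tidyverse)       # includes tidyr::drop_na()
data <- penguins
# Sample mean vector of the numeric columns
data %>%
  select(where(is.numeric)) %>%
  colMeans(na.rm = TRUE) %>%  # na.rm = TRUE: otherwise columns with missing values return NA
  knitr::kable()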
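The sample covariance formula for \(s_{jk}\) above can be computed by hand and compared with cov(). This sketch uses three columns of the built-in mtcars data, picked only because they contain no missing values; it is not the data set used in the lecture.

```r
# Numeric columns from the built-in mtcars data (no missing values)
X <- as.matrix(mtcars[, c("mpg", "wt", "hp")])
n <- nrow(X)

col_means <- colMeans(X)                  # column means, one per variable j
centered <- sweep(X, 2, col_means)        # subtract each column's mean
S <- t(centered) %*% centered / (n - 1)   # s_jk for all pairs (j, k)

S
all.equal(S, cov(X))                      # cov() computes the same matrix
```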
Generate Random Variables
Working with a common distribution:
- d: dnorm(x), density
- p: pnorm(x), cumulative distribution function
- q: qnorm(p), quantile function
- r: rnorm(n), random generation
pnorm(2.1, lower.tail = FALSE)
dbinom(5,10,0.5)
pbinom(5,10,0.5)
rnorm(100, mean=0,sd=1) %>% hist()
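The four prefixes are related to one another, which the following sketch checks numerically. The variable names are illustrative, and the seed is an arbitrary choice.

```r
# p and q are inverses of each other
x0 <- 2.1
round_trip <- qnorm(pnorm(x0))

# Upper tail = 1 - lower tail
upper <- pnorm(x0, lower.tail = FALSE)
upper_check <- 1 - pnorm(x0)

# For a discrete distribution, the c.d.f. is the running sum of the p.m.f.
cdf_5 <- pbinom(5, 10, 0.5)
cdf_5_check <- sum(dbinom(0:5, 10, 0.5))

# r draws random values; set.seed() makes the draw reproducible
set.seed(1234)
draws <- rnorm(100, mean = 0, sd = 1)
c(mean = mean(draws), sd = sd(draws))
```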
Generate multivariate normal random variables
library(MASS)
set.seed(1234)
mu <- c(10,20)
sigma <- matrix(c(1,0.5,0.5,1), nrow=2)
mvrnorm(100, mu, sigma) %>%
plot()
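One way to sanity-check the generated sample is to compare its mean vector and covariance matrix with mu and sigma. The sketch below also shows the empirical = TRUE argument of mvrnorm(), which makes the sample moments match exactly; the variable names draws and exact are illustrative.

```r
library(MASS)

set.seed(1234)
mu <- c(10, 20)
sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)

# Random draws: sample moments are close to, but not exactly, mu and sigma
draws <- mvrnorm(100, mu, sigma)
colMeans(draws)
cov(draws)

# empirical = TRUE rescales the draws so the sample moments match exactly
exact <- mvrnorm(100, mu, sigma, empirical = TRUE)
colMeans(exact)
cov(exact)
```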