Plotting in R

Plot

x <- c(1, 2, 3, 4, 5)  
y <- c(3, 7, 8, 9, 12)
plot(x, y)

Other parameters:

  • type='l'
  • main="My Graph"
  • xlab="The x-axis"
  • ylab="The y axis"
  • col="red"
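  • cex=2 point size; the default is 1
  • pch=25 point shape (values 0 to 25)
  • lwd=2 line width; also takes effect when the plot is not a line
  • lty=3 line type (values 0 to 6); only takes effect when lines are drawn, e.g. type='l'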
  • 0 removes the line
  • 1 displays a solid line
  • 2 displays a dashed line
  • 3 displays a dotted line
  • 4 displays a "dot dashed" line
  • 5 displays a "long dashed" line
  • 6 displays a "two dashed" line
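
As a small sketch of how these arguments combine, the call below reuses the x and y vectors from above. The choices type = "b" (points joined by lines) and pch = 17 are assumptions made only so that both the point and the line settings are visible, and pdf(NULL) is there just so the example runs without opening a window.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 8, 9, 12)

pdf(NULL)  # draw to a throwaway device (no file is written)
plot(x, y,
     type = "b",          # both points and lines
     main = "My Graph",
     xlab = "The x-axis",
     ylab = "The y-axis",
     col  = "red",
     cex  = 2,            # points twice the default size
     pch  = 17,           # filled triangle
     lwd  = 2,            # thicker lines
     lty  = 3)            # dotted line
dev.off()
```

Switching to type = "p" makes lty irrelevant, while lwd still thickens the outline of open point symbols.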
# To draw two lines, always call plot() first and then lines()

line1 <- c(1,2,3,4,5,10)  
line2 <- c(2,5,7,8,9,10)  
  
plot(line1, type = "l", col = "blue")  
lines(line2, type="l", col = "red")

That is all for lines.

abline(lm(y~x))

  • abline draws the regression line
  • lm() (linear model) fits the regression line with a linear model

title("AAA")

Adds a title to the plot.
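
To see what abline(lm(y~x)) actually draws, it can help to look at the fitted coefficients first. The sketch below reuses x and y from the first example; the object name fit and the title text are illustrative choices, not part of the original notes.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 8, 9, 12)

fit <- lm(y ~ x)  # least-squares fit of y on x
coef(fit)         # intercept 1.8, slope 2 for this data

pdf(NULL)         # throwaway device so the example runs headless
plot(x, y)
abline(fit, col = "blue")  # draws the line y = 1.8 + 2x
title("y = 1.8 + 2x")
dev.off()
```

abline() also accepts the intercept and slope directly, as in abline(a = 1.8, b = 2).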

ggplot

library(ggplot2)

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class))

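# Difference: color outside aes() sets a fixed color, so the points are blue;
# color inside aes() maps the string "blue" as a variable, so ggplot2 picks its own color and adds a legend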
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))


p <- ggplot(data = diamonds) + geom_bar(aes(x = cut, fill = cut)) + scale_fill_brewer(palette = "Dark2")
p  # print the plot

# Save the stored plot object to a file
ggsave(filename = "mygraph.png", plot = p)

  • geom_smooth() adds a smoothed trend line to the plot
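
A minimal sketch of geom_smooth() on the mpg data used above. The choice method = "lm" (a straight-line fit) is an assumption for illustration; leaving method out lets ggplot2 pick a smoother itself.

```r
library(ggplot2)

# Scatter plot of engine displacement vs. highway mileage with a fitted line on top
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)  # se = TRUE adds a confidence band
```

Typing p at the console renders the plot; because the aesthetics are given in ggplot(), both layers share them.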

Lecture 5: Descriptive Statistics

A univariate variable falls into one of two types:

  • categorical
  • quantitative

Understanding data set

  • Columns: variables \(X = (X_1, \dots, X_p)\)
  • Rows: individuals \(x_i = (x_{i1}, \dots, x_{ip})\)
  • Population & sample
  • A probability model

Univariate variable

Categorical Var.

(1) factors (2) numeric data with discrete units

  • Plot a histogram to get a visual impression, and overlay a density curve.
  • Use count(penguins, island) %>% knitr::kable() to view the counts.
library(palmerpenguins)
# after_stat(density) replaces the older ..density.. notation
ggplot(data = penguins, aes(x = body_mass_g, y = after_stat(density), group = 1)) +
  geom_histogram(color = "black", alpha = 0.7) +
  geom_density(color = "red", fill = "red", alpha = 0.2)

Discrete probability models

A random variable is a map from a sample point to a numeric value (it maps categorical data to numbers). The p.m.f., c.d.f., expectation, and variance can then be computed.

During EDA, look at the following:

  • center: sample mean, sample median
  • spread: standard deviation, range, quantiles, IQR (interquartile range, i.e. the 75th percentile minus the 25th)
  • skewness $$\frac{\sum_{i=1}^{n}(x_i - \bar x)^3}{(n-1)\,\mathrm{sd}^3}$$
summary()
sd()

For heavily skewed variables, the median and IQR describe the data more reliably than the mean and standard deviation.

Note: base R has no skewness function.
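
Since there is no built-in function, the skewness formula above can be written out by hand. This is a sketch that follows that formula literally; the function name skewness and its na.rm argument are my own choices, and add-on packages such as e1071 and moments ship their own skewness() with slightly different conventions.

```r
# Sample skewness following the formula above:
# sum((x - mean)^3) / ((n - 1) * sd^3)
skewness <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  n <- length(x)
  sum((x - mean(x))^3) / ((n - 1) * sd(x)^3)
}

skewness(c(1, 2, 3, 4, 5))    # symmetric data: 0
skewness(c(1, 1, 1, 2, 10))   # long right tail: positive
```

For the penguins data, skewness(penguins$body_mass_g, na.rm = TRUE) would be the matching call, since that column contains missing values.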

Quantitative

Typically inspect:

  • density function: mind its support \(S_X\). \(f_X(x)\) can be greater than 1! It only needs to be nonnegative and integrate to 1.
  • distribution function
  • whether it follows a normal distribution
library(ggplot2)
# Q-Q plot: points close to the red line suggest approximate normality
ggplot(data = penguins, aes(sample = body_mass_g)) +
  stat_qq() +
  stat_qq_line(col = "red")
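
The remark that a density can exceed 1 is easy to check numerically. The two distributions below, Uniform(0, 0.5) and a normal with sd = 0.1, are examples I picked for illustration; they do not come from the lecture.

```r
# Uniform(0, 0.5): the density is 2 everywhere on its support
dunif(0.25, min = 0, max = 0.5)            # 2

# ...yet it still integrates to 1 over the support
area <- integrate(dunif, lower = 0, upper = 0.5, min = 0, max = 0.5)
area$value                                  # 1 (up to numerical error)

# A narrow normal density also peaks above 1
dnorm(0, mean = 0, sd = 0.1)                # about 3.99
```

A density value is a rate per unit of x, not a probability, which is why only the integral is constrained.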

Multivariate random variables

Two variables: \((X, Y)\)

– Joint distribution
– Marginal distribution
– Conditional distribution
– Mean
– Variance
– Covariance cov()
– Correlation cor()

Both X and Y are categorical variables

  • Draw a stacked bar chart
ggplot(data = penguins, aes(x = species, fill = sex)) +
 geom_bar()

  • Joint count
(joint_table <- penguins %>%
 xtabs(~species + sex, data = .)) %>%
addmargins()
  • Joint probability
joint_table %>%
 prop.table() %>%
 round(digits = 3) %>% # round to three decimal places
  addmargins()
  • Marginal distribution
library(magrittr)
joint_table %>%
margin.table(1) %T>% # 1: marginal of X (row totals); 2: marginal of Y (column totals)
print() %>%
prop.table()
  • Conditional distribution
joint_table %>% prop.table(margin = 1) # margin = 1 gives P(Y|X); margin = 2 gives P(X|Y)

If the conditional distributions are the same as the marginal distribution, the two variables are independent.
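
The independence criterion can be checked directly in base R. The counts below are hypothetical, invented so that X and Y come out exactly independent; with real data such as joint_table the match would only be approximate.

```r
# Hypothetical counts, chosen so that X and Y are independent
tab <- matrix(c(10, 20, 30, 60), nrow = 2,
              dimnames = list(X = c("a", "b"), Y = c("u", "v")))

marginal_y    <- prop.table(margin.table(tab, 2))  # P(Y)
conditional_y <- prop.table(tab, margin = 1)       # P(Y | X), one row per X value

marginal_y
conditional_y

# Independence: every row of P(Y | X) equals P(Y)
independent <- all(sapply(rownames(tab), function(r) {
  isTRUE(all.equal(as.vector(conditional_y[r, ]), as.vector(marginal_y)))
}))
independent
```

For sample data, a formal test such as chisq.test(tab) is the usual way to judge whether the observed deviation from independence is larger than chance would explain.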

X is categorical and Y is quantitative

  • Plot it to take a look
ggplot(data = penguins, aes(x = body_mass_g, y = after_stat(density), fill = species)) +
geom_density(color = "black", alpha = 0.5)


ggplot(data = penguins, aes(x = species, y = body_mass_g, fill = species)) +
geom_boxplot(color = "black", alpha = 0.5)

  • Call group_by() before summarize() to get per-group summaries.
penguins %>%
  group_by(species, sex) %>%
  summarize(n = length(body_mass_g),
            mean = mean(body_mass_g, na.rm = TRUE),  # na.rm = TRUE: body_mass_g has missing values
            sd = sd(body_mass_g, na.rm = TRUE))
  • Normality check (Q-Q plot)
penguins %>%
  filter(species == "Chinstrap") %>%
  ggplot(aes(sample = body_mass_g)) +
  stat_qq() +
  stat_qq_line(col = "red")

!["5-1.png"]

See pages 32 and 33.
\(\pi_s\) can be viewed as the probability that the species is Adelie.

Classification

  • Logistic regression
  • Linear Discriminant Analysis, LDA
  • Support vector machine

Both X and Y are quantitative

  • Covariance
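  • Correlation: the variables must be standardized first (subtract the mean and divide by the standard deviation) to remove the effect of units before they can be compared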
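
The link between the two bullets can be verified numerically: the correlation is the covariance of the standardized variables. The sketch reuses the x and y vectors from the plotting section; the names zx and zy and the factor of 100 are illustrative.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 8, 9, 12)

# Standardize: subtract the mean, divide by the standard deviation
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

cov(x, y)    # 5; depends on the units of x and y
cov(zx, zy)  # unit-free
cor(x, y)    # same value as cov(zx, zy)

# Rescaling x (say, metres to centimetres) changes the covariance but not the correlation
cov(100 * x, y)
cor(100 * x, y)
```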

Random Vectors

\[s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij} - \bar X_{,j})(X_{ik} - \bar X_{,k}) \]

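library(palmerpenguins)  # penguins data set
library(dplyr)           # %>%, select(), where()
library(tidyverse)       # includes tidyr, which provides drop_na()
data <- penguins
# colMeans() returns NA for any column with missing values, so drop incomplete rows first
data %>%
  select(where(is.numeric)) %>%
  drop_na() %>%
  colMeans() %>%
  knitr::kable()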
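
The sample covariance formula \(s_{jk}\) above can be computed by hand and compared with the built-in cov(). This is a sketch on the numeric penguins columns, using complete rows only; the names X, Xc and S are my own.

```r
library(palmerpenguins)

# Numeric columns only, complete rows only
num_cols <- sapply(penguins, is.numeric)
X <- as.matrix(na.omit(as.data.frame(penguins[, num_cols])))
n <- nrow(X)

# s_jk = 1/(n-1) * sum_i (X_ij - mean_j)(X_ik - mean_k)
Xc <- sweep(X, 2, colMeans(X))   # center each column
S  <- t(Xc) %*% Xc / (n - 1)

all.equal(S, cov(X))             # the built-in cov() gives the same matrix
```

The diagonal of S holds the variances of the individual columns, so sqrt(diag(S)) reproduces their standard deviations.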


Generate Random Variables

Generating values from a common distribution

  • d: dnorm(x), density function
  • p: pnorm(x), cumulative distribution function
  • q: qnorm(p), quantile function
  • r: rnorm(n), random generation
pnorm(2.1, lower.tail = FALSE)
dbinom(5,10,0.5)
pbinom(5,10,0.5)
rnorm(100, mean=0,sd=1) %>% hist()
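
How the four prefixes relate to one another can be sketched with a few identities. The specific arguments mirror the calls above; the seed value and the variable name z are arbitrary choices.

```r
# q is the inverse of p
qnorm(pnorm(2.1))                       # 2.1

# lower.tail = FALSE gives the upper tail P(X > x)
pnorm(2.1, lower.tail = FALSE)
1 - pnorm(2.1)                          # same value

# For a discrete distribution, p is the running sum of d
dbinom(5, size = 10, prob = 0.5)        # 252/1024 = 0.24609375
pbinom(5, size = 10, prob = 0.5)        # 638/1024 = 0.623046875
sum(dbinom(0:5, size = 10, prob = 0.5)) # same value

# r draws random values; set a seed to make the draw reproducible
set.seed(1)
z <- rnorm(100, mean = 0, sd = 1)
length(z)                               # 100
```

The same four prefixes work for other families, for example dunif/punif/qunif/runif or dpois/ppois/qpois/rpois.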

Generate multivariate normal random variables

library(MASS)
set.seed(1234)
mu <- c(10,20)
sigma <- matrix(c(1,0.5,0.5,1), nrow=2)
mvrnorm(100, mu, sigma) %>%
  plot()
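
To check that the draws really follow the requested mean vector and covariance matrix, the sample moments can be compared with mu and sigma. The sample size of 10000 and the names xy and xy_exact are assumptions for this sketch; empirical = TRUE is an argument of MASS::mvrnorm.

```r
library(MASS)

set.seed(1234)
mu    <- c(10, 20)
sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)

# A larger sample, so the estimates settle near the targets
xy <- mvrnorm(10000, mu, sigma)
colMeans(xy)   # close to mu
cov(xy)        # close to sigma

# empirical = TRUE rescales the sample so that its mean and covariance match exactly
xy_exact <- mvrnorm(100, mu, sigma, empirical = TRUE)
all.equal(colMeans(xy_exact), mu)
all.equal(cov(xy_exact), sigma)
```

One thing to watch: loading MASS after dplyr masks select(), so dplyr::select() has to be called explicitly afterwards.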
posted @ 2022-12-15 18:08  爱吃番茄的玛丽亚