Plotting in R

Plot

x <- c(1, 2, 3, 4, 5)  
y <- c(3, 7, 8, 9, 12)
plot(x, y)

Other parameters:

type='l'
main="My Graph"
xlab="The x-axis"
ylab="The y axis"
col="red"
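cex=2 point size; the default is 1
pch=25 point shape (values from 0 to 25)
lwd=2 line width; also takes effect when the plot is not a line plot
lty=3 line type; only takes effect when lines are drawn, e.g. type='l' (values from 0 to 6)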

0 removes the line

1 displays a solid line

2 displays a dashed line

3 displays a dotted line

4 displays a "dot dashed" line

5 displays a "long dashed" line

6 displays a "two dashed" line

# Drawing two lines: always call plot() first, then lines()

line1 <- c(1,2,3,4,5,10)  
line2 <- c(2,5,7,8,9,10)  
  
plot(line1, type = "l", col = "blue")  
lines(line2, type="l", col = "red")
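
One thing to watch with the plot-then-lines order: plot() fixes the axis limits from the first series, so a later series that falls outside that range gets clipped. With the data above both lines happen to fit, but a minimal sketch of the safer pattern, assuming we want both series fully visible, could look like this (y_range is just a name introduced here):

```r
line1 <- c(1, 2, 3, 4, 5, 10)
line2 <- c(2, 5, 7, 8, 9, 10)

# Axis limits that cover both series (y_range is an illustrative name)
y_range <- range(line1, line2)

plot(line1, type = "l", col = "blue", ylim = y_range, ylab = "value")
lines(line2, col = "red")
legend("topleft", legend = c("line1", "line2"),
       col = c("blue", "red"), lty = 1)
```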

That covers lines.

abline(lm(y~x))

abline() draws the regression line
lm() (linear model) fits the regression line with a linear model
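
To see what abline() actually receives, it can help to look at the fitted coefficients first. A small sketch using the x and y from the top of these notes (fit is a name introduced here):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 8, 9, 12)

fit <- lm(y ~ x)   # least-squares fit of y = a + b * x
coef(fit)          # intercept a and slope b

plot(x, y)
abline(fit, col = "red")  # abline() reads a and b from the fitted model
```

For this data the least-squares intercept is 1.8 and the slope is 2.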

title("AAA")

Adds a title to the plot.

ggplot

library(ggplot2)

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class))

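# The difference: color outside aes() sets every point to blue;
# inside aes(), "blue" is treated as a data value, so it is mapped to a default color with a legend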
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))


p <- ggplot(data = diamonds) + geom_bar(aes(x = cut, fill = cut)) + scale_fill_brewer(palette = "Dark2")
p  # print the plot

ggsave(filename = "mygraph.png", plot = p)  # p must be a saved plot object

geom_smooth()

Lecture 5: Descriptive Statistics

A univariate variable falls into one of two types:

categorical
quantitative

Understanding data set

Columns: variables $X = (X_1, \dots, X_p)$
Rows: individuals $x_i = (x_{i1}, \dots, x_{ip})$
Population & sample
A probability model

Univariate variable

Categorical Var.

(1) factors (2) numeric data with discrete units

Plot a histogram to get a visual impression, and overlay the density.
Use count(penguins, island) %>% knitr::kable() to view the counts.

library(palmerpenguins)
library(ggplot2)
# after_stat(density) replaces the deprecated ..density.. notation
ggplot(data = penguins, aes(x = body_mass_g, y = after_stat(density), group = 1)) +
geom_histogram(color = "black", alpha = 0.7) +
geom_density(color = "red", fill = "red", alpha = 0.2)

Discrete probability models

A random variable is a map from a sample point to a numeric value (that is, categorical data is mapped to numbers). We can then compute the p.m.f., c.d.f., expectation, and variance.
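
As a rough illustration of that mapping, here is a sketch with a made-up three-level categorical outcome. The levels, the numeric codes, and the probabilities are all hypothetical, chosen only to show the calculations:

```r
# Hypothetical random variable: categories mapped to numbers
x   <- c(low = 0, medium = 1, high = 2)        # values of X
pmf <- c(low = 0.2, medium = 0.5, high = 0.3)  # P(X = x), sums to 1

cdf <- cumsum(pmf)            # c.d.f.: P(X <= x)
ex  <- sum(x * pmf)           # expectation E[X]
vx  <- sum((x - ex)^2 * pmf)  # variance Var(X)

cdf
ex
vx
```

With these numbers E[X] = 1.1 and Var(X) = 0.49.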

When doing EDA, look at:

center: sample mean, sample median
spread: standard deviation, range, quantiles, IQR (interquartile range, i.e. the 75% quantile minus the 25% quantile)
skewness $$\frac{\sum_{i=1}^{n}(x_i - \bar x)^3}{(n-1)\,sd^3}$$

summary()
sd()

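For highly skewed variables, the median and IQR describe the data more accurately than the mean and standard deviation.

Note: base R has no skewness() function.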
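
Since base R has no built-in, the skewness formula from the list above can be written by hand. This is only a sketch: sample_skewness is a name made up here, and packages such as e1071 or moments provide their own skewness() with possibly different normalizations.

```r
# Sample skewness: sum((x - mean)^3) / ((n - 1) * sd^3)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]  # drop missing values
  n <- length(x)
  sum((x - mean(x))^3) / ((n - 1) * sd(x)^3)
}

sample_skewness(c(1, 2, 3, 4, 5))   # symmetric data, so 0
sample_skewness(c(1, 1, 1, 2, 10))  # long right tail, so positive
```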

Quantitative

We usually look at:

density function: note its support $S_X$. $f_X(x)$ can be greater than 1! It only needs to be nonnegative and integrate to 1.
distribution function
whether it follows a normal distribution

library(palmerpenguins)
library(ggplot2)
ggplot(data = penguins, aes(sample = body_mass_g)) +
  stat_qq() +
  stat_qq_line(col = "red")
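
On the point that a density can exceed 1: a quick check with a uniform distribution on [0, 0.5], picked here only as a convenient example (area is a name introduced for the sketch):

```r
# Uniform(0, 0.5): the density equals 2 everywhere on its support
dunif(0.25, min = 0, max = 0.5)

# ...yet it still integrates to 1 over the support
area <- integrate(dunif, lower = 0, upper = 0.5, min = 0, max = 0.5)$value
area
```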

Multivariate random variables

Two variables: $(X, Y)$

– Joint distribution
– Marginal distribution
– Conditional distribution
– Mean
– Variance
– Covariance cov()
– Correlation cor()

Both X and Y are categorical variables

We can draw a stacked bar chart.

ggplot(data = penguins, aes(x = species, fill = sex)) +
 geom_bar()

Joint count

(joint_table <- penguins %>%
 xtabs(~species + sex, data = .)) %>%
addmargins()

Joint probability

joint_table %>%
 prop.table() %>%
 round(digits = 3) %>% # round to three decimal places
  addmargins()

Marginal distribution

library(magrittr)
joint_table %>%
margin.table(1) %T>% # 1 gives the marginal of X (rows), 2 the marginal of Y (columns)
print() %>%
prop.table()

Conditional distribution

joint_table %>% prop.table(margin = 1)  # margin = 1 gives P(Y|X); margin = 2 gives P(X|Y)

If the conditional distribution equals the marginal distribution, the two variables are independent.
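
A small sketch of that independence check, using a made-up joint table built as the product of two hypothetical marginals (so it is independent by construction):

```r
# Hypothetical marginals for X (rows) and Y (columns)
p_x <- c(a = 0.3, b = 0.7)
p_y <- c(u = 0.2, v = 0.5, w = 0.3)

# Under independence the joint probability is the product of the marginals
joint <- outer(p_x, p_y)

cond_y_given_x <- prop.table(joint, margin = 1)  # each row is P(Y | X = x)
marg_y <- margin.table(joint, 2)                 # marginal P(Y)

cond_y_given_x  # every row matches marg_y
marg_y
```

With real data such as the penguins table, the rows will rarely match exactly, so this comparison is a descriptive check rather than a formal test.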

X is categorical and Y is quantitative

Plot it to take a look.

ggplot(data = penguins, aes(x = body_mass_g, y = after_stat(density), fill = species)) +
geom_density(color = "black", alpha = 0.5)


ggplot(data = penguins, aes(x = species, y = body_mass_g, fill = species)) +
geom_boxplot(color = "black", alpha = 0.5)

Call group_by() before summarize() to get per-group summaries.

penguins %>%
  group_by(species, sex) %>%
  summarize(n = length(body_mass_g),
            mean = mean(body_mass_g, na.rm = TRUE),  # skip missing values
            sd = sd(body_mass_g, na.rm = TRUE))

Normality check (Q-Q plot)

penguins %>%
  filter(species == "Chinstrap") %>%
  ggplot(aes(sample = body_mass_g)) +
  stat_qq() +
  stat_qq_line(col = "red")

[Figure: 5-1.png]

See pp. 32 and 33.
$\pi_s$ can be viewed as the probability that the species is Adelie.

Classification

Logistic regression
Linear Discriminant Analysis, LDA
Support vector machine

Both X and Y are quantitative

Covariance
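Correlation: the variables must first be standardized (subtract the mean and divide by the standard deviation) to remove the effect of units before they can be compared.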
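
A short sketch of that relationship, reusing the x and y from the plotting section: the correlation equals the covariance of the standardized variables (zx and zy are names introduced here).

```r
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 8, 9, 12)

# Standardize: subtract the mean, divide by the standard deviation
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

cov(x, y)    # depends on the units of x and y
cov(zx, zy)  # covariance of the standardized variables
cor(x, y)    # the same number as cov(zx, zy)
```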

Random Vectors

\[ s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij} - \bar X_{\cdot j})(X_{ik} - \bar X_{\cdot k}) \]
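
The entries $s_{jk}$ together form the sample covariance matrix. As a sketch, it can be computed by hand and compared with cov(); the data matrix X below is random and purely illustrative.

```r
set.seed(1)
X <- matrix(rnorm(20), nrow = 5, ncol = 4)  # hypothetical data: n = 5 rows, p = 4 columns
n <- nrow(X)

# Center each column at its mean, then S = Xc' Xc / (n - 1);
# the (j, k) entry of S is s_jk
Xc <- sweep(X, 2, colMeans(X))
S <- crossprod(Xc) / (n - 1)

all.equal(S, cov(X))  # the built-in cov() gives the same matrix
```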

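library(palmerpenguins)  # penguins
library(dplyr)           # %>%, select(), where()
library(tidyverse)       # drop_na()
data <- penguins
# data <- drop_na(data)  # alternative: drop rows with missing values first
data %>%
  select(where(is.numeric)) %>%
  colMeans(na.rm = TRUE) %>%  # sample mean vector, ignoring missing values
  knitr::kable()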

Generate Random Variables

Working with a common distribution:

d: dnorm(x), density
p: pnorm(x), cumulative distribution function
q: qnorm(p), quantile function
r: rnorm(n), random generation

pnorm(2.1, lower.tail = FALSE)
dbinom(5,10,0.5)
pbinom(5,10,0.5)

rnorm(100, mean=0,sd=1) %>% hist()
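
A few quick checks of how the d/p/q prefixes relate to each other, as a sketch using only base R:

```r
# p and q are inverses of each other
qnorm(pnorm(2.1))                # back to 2.1

# lower.tail = FALSE gives the upper tail, P(X > 2.1)
pnorm(2.1, lower.tail = FALSE)
1 - pnorm(2.1)                   # same value

# For a discrete distribution, p is the running sum of d
pbinom(5, 10, 0.5)
sum(dbinom(0:5, 10, 0.5))        # same value
```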

Generate multivariate normal random variables

library(MASS)
set.seed(1234)
mu <- c(10,20)
sigma <- matrix(c(1,0.5,0.5,1), nrow=2)
mvrnorm(100, mu, sigma) %>%
  plot()
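
One way to sanity-check the simulated draws is to compare their sample mean and covariance with mu and sigma. A sketch, assuming a larger sample of 5000 draws so the estimates settle down (draws is a name introduced here):

```r
library(MASS)
set.seed(1234)
mu <- c(10, 20)
sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)

draws <- mvrnorm(5000, mu, sigma)
colMeans(draws)  # should be near mu
cov(draws)       # should be near sigma
```

Note that loading MASS after dplyr masks dplyr's select(), so calling dplyr::select() explicitly may be needed in the same session.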

Posted 2022-12-15 18:08 by 爱吃番茄的玛丽亚
