R:Stamp分析(Wilcoxon秩和检验版)

 

 

参考的这篇文章,此文章给出了Wilcoxon秩和检验的方案,但是好像是错误的,无法成功运行,只能进行t.test分析,在此基础上,我做了一些修改,完美适配Wilcoxon秩和检验

t 检验

  • 假设:t 检验假设数据是从正态分布中抽取的,且两组的方差相等(对于独立双样本 t 检验)。如果你计划使用配对 t 检验,还需要假设差异的正态性。
  • 适用性:当你的数据接近正态分布,且两个独立样本的方差相似时,t 检验是合适的。如果样本量较大(通常大于30),根据中心极限定理,t 检验也可能是适当的,即使原始数据不完全符合正态分布。
  • 优点:在满足正态分布的假设下,t 检验是一种有效的方法,可以提供关于均值差异的精确估计。

Wilcoxon 秩和检验

  • 假设:Wilcoxon 秩和检验是一种非参数检验,不需要数据服从特定的分布,比如正态分布。
  • 适用性:当数据不满足正态分布假设,或者样本量较小,难以准确判断其分布情况时,Wilcoxon 秩和检验通常是更合适的选择。它也适用于有明显偏态(skewed)的数据或含有异常值的情况。
  • 优点:不需要严格的分布假设,适用范围更广。
# 清除所有变量
rm(list = ls())
# 设置工作目录
setwd("C:\\Users\\Administrator\\Desktop\\新建文件夹\\Stamp")
# 导入数据
data <- read.table("1.txt",header = TRUE,row.names = 1,sep = "\t")  #第一行包含列标题
group <- read.table("2.txt",header = FALSE,sep = "\t")
# 加载必要的包
library(tidyverse)
library(boot) #用于执行bootstrap方法,这将在后续步骤中用于估计置信区间。

data <- data*100 #放大数据值,数据框中的所有数值乘以100
data <- data %>% filter(apply(data,1,mean) > 0.5) #保留了那些平均值大于0.5的行
data <- t(data) #转置数据
data1 <- data.frame(data,group$V2) #与分组信息的第二列合并为一个新的数据框
colnames(data1) <- c(colnames(data),"Group") #添加列名
data1$Group <- as.factor(data1$Group) #将分组列转换为因子

# Wilcoxon秩和检验
diff <- data1 %>% 
  select_if(is.numeric) %>% #选择所有的数值型列
  map_df(~ broom::tidy(wilcox.test(. ~ Group,data = data1)), .id = 'var') #执行Wilcoxon秩和检验

diff$p.value <- p.adjust(diff$p.value,"bonferroni") #对p值进行Bonferroni校正
diff <- diff %>% filter(p.value < 0.05) #筛选出调整后仍然显著(p值小于0.05)的结果

# 定义一个函数来计算中位数差异
median_diff <- function(data, indices) {
  d <- data[indices, ]  # 从数据中抽样
  diff <- median(d[d$Group == levels(d$Group)[1], 'value']) - #计算两组的中位数差异
    median(d[d$Group == levels(d$Group)[2], 'value']) #计算得到的两个中位数之差(第一组中位数减去第二组中位数)
  return(diff) #返回中位数差异
}

# 为每个变量应用bootstrap
results <- list() #创建一个空列表
for (var in colnames(data1[, sapply(data1, is.numeric)])) { #遍历所有数值型列的名称
  # 将数据框准备为boot函数所需的格式
  boot_data <- data1[, c(var, 'Group'), drop = FALSE] #个变量创建一个新的数据框
  colnames(boot_data) <- c('value', 'Group') #设置列名
  
  # 应用bootstrap方法
  boot_res <- boot(data = boot_data, statistic = median_diff, R = 1000) #执行bootstrap分析,进行1000次重采样
  
  # 计算置信区间
  conf_int <- boot.ci(boot_res, type = "perc") #计算bootstrap结果的百分位数置信区间
  
  # 存储结果
  results[[var]] <- list(estimate = boot_res$t0, conf.low = conf_int$percent[4], conf.high = conf_int$percent[5])
}

# 在循环之前初始化列
diff$estimate <- NA  # 初始化estimate列为NA
diff$conf.low <- NA  # 初始化conf.low列为NA
diff$conf.high <- NA  # 初始化conf.high列为NA

# 现在遍历results列表,提取结果并合并到diff中
for (var in names(results)) {
  # 获取当前变量的bootstrap结果
  boot_res <- results[[var]]
  
  # 在diff中找到对应的变量行,并添加bootstrap估计和置信区间
  diff$estimate[diff$var == var] <- boot_res$estimate
  diff$conf.low[diff$var == var] <- boot_res$conf.low
  diff$conf.high[diff$var == var] <- boot_res$conf.high
  #通过这个循环,每个变量的bootstrap分析结果(中位数差异估计值和置信区间)被添加到diff数据框中,对应于原先通过Wilcoxon秩和检验得到的结果。
  #这样,diff数据框现在包含了完整的分析结果,既有原始的Wilcoxon秩和检验结果,也有通过bootstrap方法估计的中位数差异及其置信区间。
}

# 保存diff到文本文件
write.table(diff, file = "diff_results.txt", quote = FALSE, row.names = FALSE, col.names = TRUE, sep = "\t")

# 绘图数据构建
# 左侧条形图
abun.bar <- data1[,c(diff$var,"Group")] %>%  #选取数据
  gather(variable,value,-Group) %>% #将宽格式数据转换为长格式
  group_by(variable,Group) %>% #按变量和组别分组
  summarise(Mean = mean(value)) #计算平均值

# 右侧散点图
diff.mean <- diff[,c("var","estimate","conf.low","conf.high","p.value")] #提取数据
diff.mean$Group <- c(ifelse(diff.mean$estimate >0,levels(data1$Group)[1],
                            levels(data1$Group)[2])) #分配组别,果estimate大于0,表示第一组的中位数高于第二组,变量则被分配到data1$Group的第一级别;否则,分配到第二级别。
diff.mean <- diff.mean[order(diff.mean$estimate,decreasing = TRUE),] #根据estimate值重新排序


# 左侧条形图
library(ggplot2)
cbbPalette <- c("#8FC9E2", "#ECC97F")
abun.bar$variable <- factor(abun.bar$variable,levels = rev(diff.mean$var))
p1 <- ggplot(abun.bar,aes(variable,Mean,fill = Group)) +
  scale_x_discrete(limits = levels(diff.mean$var)) +
  coord_flip() +
  xlab("") +
  ylab("Mean proportion (%)") +
  theme(panel.background = element_rect(fill = 'transparent'),
        panel.grid = element_blank(),
        axis.ticks.length = unit(0.4,"lines"), 
        axis.ticks = element_line(color='black'),
        axis.line = element_line(colour = "black"),
        axis.title.x=element_text(colour='black', size=12,face = "bold"),
        axis.text=element_text(colour='black',size=10,face = "bold"),
        legend.title=element_blank(),
        legend.text=element_text(size=12,face = "bold",colour = "black",
                                 margin = margin(r = 20)),
        legend.position = c(-1,-0.1),
        legend.direction = "horizontal",
        legend.key.width = unit(0.8,"cm"),
        legend.key.height = unit(0.5,"cm"))

for (i in 1:(nrow(diff.mean) - 1)) 
  p1 <- p1 + annotate('rect', xmin = i+0.5, xmax = i+1.5, ymin = -Inf, ymax = Inf, 
                      fill = ifelse(i %% 2 == 0, 'white', 'gray95'))

p1 <- p1 + 
  geom_bar(stat = "identity",position = "dodge",width = 0.7,colour = "black") +
  scale_fill_manual(values=cbbPalette)

# 右侧散点图
diff.mean$var <- factor(diff.mean$var,levels = levels(abun.bar$variable))
diff.mean$p.value <- signif(diff.mean$p.value,3)
diff.mean$p.value <- as.character(diff.mean$p.value)
p2 <- ggplot(diff.mean,aes(var,estimate,fill = Group)) +
  theme(panel.background = element_rect(fill = 'transparent'),
        panel.grid = element_blank(),
        axis.ticks.length = unit(0.4,"lines"), 
        axis.ticks = element_line(color='black'),
        axis.line = element_line(colour = "black"),
        axis.title.x=element_text(colour='black', size=12,face = "bold"),
        axis.text=element_text(colour='black',size=10,face = "bold"),
        axis.text.y = element_blank(),
        legend.position = "none",
        axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.title = element_text(size = 15,face = "bold",colour = "black",hjust = 0.5)) +
  scale_x_discrete(limits = levels(diff.mean$var)) +
  coord_flip() +
  xlab("") +
  ylab("Difference in median proportions (%)") +
  labs(title="95% confidence intervals") 

for (i in 1:(nrow(diff.mean) - 1)) 
  p2 <- p2 + annotate('rect', xmin = i+0.5, xmax = i+1.5, ymin = -Inf, ymax = Inf, 
                      fill = ifelse(i %% 2 == 0, 'white', 'gray95'))

p2 <- p2 +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), 
                position = position_dodge(0.8), width = 0.5, size = 0.5) +
  geom_point(shape = 21,size = 3) +
  scale_fill_manual(values=cbbPalette) +
  geom_hline(aes(yintercept = 0), linetype = 'dashed', color = 'black')

# 将p.value列从字符型转换为数值型
diff.mean$p.value <- as.numeric(diff.mean$p.value)
# 创建一个新列'significance',基于p值的大小转换为星号表示
diff.mean$significance <- ifelse(diff.mean$p.value < 0.001, '***',
                                 ifelse(diff.mean$p.value < 0.01, '**',
                                        ifelse(diff.mean$p.value < 0.05, '*', 'ns')))

# 更新p3图表代码,用'significance'代替p值
p3 <- ggplot(diff.mean, aes(var, estimate, fill = Group)) +
  geom_text(aes(y = 0, x = var), label = diff.mean$significance,
            hjust = 0, fontface = "bold", inherit.aes = FALSE, size = 6) +
  geom_text(aes(x = nrow(diff.mean)/2 +0.5, y = 0.85), label = "Significance",
            srt = 90, fontface = "bold", size = 5) +
  coord_flip() +
  ylim(c(0,1)) +
  theme(panel.background = element_blank(),
        panel.grid = element_blank(),
        axis.line = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank(),
        axis.title = element_blank())
# 图像拼接
library(patchwork)
p <- p1 + p2 + p3 + plot_layout(widths = c(4,6,2))

# 保存图像为PNG
ggsave("stamp.png", p, width = 10, height = 8, dpi = 1200)

Wilcoxon秩和检验关注的是中位数而不是均值,所以将右侧图改为中位数了。


t 检验主要关注均值的比较。它检验两个独立样本的均值是否存在显著差异。

Wilcoxon 秩和检验关注的是中位数或分布的整体情况。它是基于秩次的比较,而不直接比较均值。

posted @ 2024-03-13 17:20  王哲MGG_AI  阅读(410)  评论(0编辑  收藏  举报