单倍型分析
1.根据GWAS结果找到一系列基因
- 推荐使用rMVP,mrMLM等R语言包,速度快,直接出图
2. 对得到的结果利用snpEff或者annovar进行注释
- 注意选择对应的基因组文件(及染色体chr1/1)
- 为了搭配后面用的candihap软件,这里使用annovar进行注释
3.选择candihap软件进行单倍型分析
- 由于R包版本始终安装包不上,这里安装了对应的windows版本,容易受到很多限制
- 由于默认进行基因上下游序列也进行提取加入单倍型分析,所以需要手动修改,这里写了一个shell脚本,可以输入目标基因,然后提取目标基因区间的hmp文件
#!/bin/bash
#author lee
echo "gene id is: $1";
grep -i "$1" gene_chr_start_end >gene_pos;
#分别提取基因对应的染色体,起始位置
chr=$(awk '{print $2}' gene_pos);
start=$(awk '{print $3}' gene_pos);
end=$(awk '{print $4}' gene_pos);
#提取区间内对应的行作为基因的hmp文件
less haplotypes.hmp | awk '{if($1=="#CHROM"){print $0} else if($1=="'${chr}'" && $2>="'${start}'" && $2<="'${end}'"){print $0}}' > ${1}.hmp;
#检查一下
echo "works done,must check"
也可利用基因列表批量提取hmp文件
#!/bin/bash
#batch gene2hmp
#author lee
#2021.9.1
#gene_chr_start_end.txt can be extracted by gff3
cat $1|while read line
do
gene=$(echo $line)
echo "$gene is extracting"
grep -i "$gene" gene_chr_start_end.txt >gene_pos;
awk '{print $2}' gene_pos >chr;
chr=$(cat chr);
awk '{print $3}' gene_pos >start;
start=$(cat start);
awk '{print $4}' gene_pos >end;
end=$(cat end);
less haplotypes.hmp | awk '{if($1=="#CHROM"){print $0} else if($1=="'${chr}'" && $2>="'${start}'" && $2<="'${end}'"){print $0}}' > ${gene}.hmp;
echo "fininshed"
done
### 4.得到hmp文件后,利用candihap进行分析
1. 使用candihap模块
2. 其余按照要求导入,无特别注意的地方
3. 点击运行后,可以得到很多结果,但是这里缺少一些想要的结果
4. 单倍型和表型的结果,可利用AI的页面设置截取前几个主要单倍型
### 5.利用candihap得到的txt文件进一步分析不同亚型之间的差异
1. 这里写了一个脚本,先提取得到单倍型大于20的水稻ID
!/bin/bash
extract hap,>20,:,find_in_region
author lee
去除文件的表头,统计单倍型数目
grep '^Hap' $1 >remove_head.hap
name=$(head -n4 $1 |awk -F ":" '{print $3}')
number=$(head -n1 $1 |awk '{print NF}')
筛选数目大于20的单倍型进行统计
awk -v number=$number '{if ($number>=20) print $0}' remove_head.hap>hap_filter
去除文件中的符号:,
sed -i 's/[:,]//g' hap_filter
从ID列到倒数第三列,提取偶数列和第一列
num1=expr $number + 1
awk -v num1=$num1 '{for (i=num1;i<=(NF-3);i+=2) print $1,$i}' hap_filter>id_filter
cp id_filter ${name}
cp $name F:/R/data/id2type
下面是提示信息
num2=expr $number - 2
echo "Total $num2 SNPs"
count=$(awk 'END{print NR}' hap_filter )
echo "Total $count haplotypes more than 20,Details in hap_filter"
count1=$(awk 'END{print NR}' id_filter)
count2=$(awk -v number=$number '{sum += $"'$number'"};END{print sum}' hap_filter)
echo "Haps in id_filter is $count1,haps in hap_filter is $count2"
echo "use less -SN id_filter to check"
下面在R语言下执行
plot haplotye >20 by types or maybe with geomap
author lee
time 2021-8-12
------------------
plot haplotye >20 by types or maybe with geomap
author lee
time 2021-8-12
------------------
setwd("F:/R/data/LOC_Os01g42380")
setwd("F:/R/data/id2type")
library(lookup)
library(ggplot2)
library(tidyverse)
windowsFonts(myFont = windowsFont("Times New Roman"))
id<-read.table("id_filter",header = FALSE)
type<-read.table("id2type.txt",header = TRUE)
提取单倍型对应个体的生态型
id$eco<-lookup(x=id$V2,key=type$ID,value = type$SUBPOPULATION)
data<-id[,c(1,3)]
画出单倍型和全部亚群的关系
p1<-ggplot(data = data,aes(x=V1,y=eco))+geom_count()+ scale_fill_grey(start=1, end=0)+
labs(x="Haplotypes",y="Subpopulation",title = "Haplotype of Subpopulation")
ggsave(p1,filename = "12subpopulation.pdf",width = 5.25,height = 4)
ggsave(p1,filename = "12subpopulation.png")
画出单倍型和籼稻和粳稻两种的关系
ind_jap<-filter(data,eco %in% c("trop","temp","subtrop","japx","ind1A","ind1B","ind2","ind3","indx"))
ind_jap$Population[ind_jap$eco"ind1A"]<-"indica"
ind_jap$Population[ind_jap$eco"ind1B"]<-"indica"
ind_jap$Population[ind_jap$eco"indx"]<-"indica"
ind_jap$Population[ind_jap$eco"ind2"]<-"indica"
ind_jap$Population[ind_jap$eco"ind3"]<-"indica"
ind_jap$Population[ind_jap$eco"trop"]<-"japonica"
ind_jap$Population[ind_jap$eco"temp"]<-"japonica"
ind_jap$Population[ind_jap$eco"subtrop"]<-"japonica"
ind_jap$Population[ind_jap$eco=="japx"]<-"japonica"
p2<-ggplot(data = ind_jap,aes(x=V1,fill=Population))+geom_bar(position="stack")+
labs(x="Haplotypes",y="Subpopulation",title = "Haplotype of Subpopulation")+
theme(legend.text = element_text(face = "italic",family = "serif"))+
theme(legend.title = element_text(face = "bold",family = "serif"))+
theme(axis.text.x = element_text(size = 10,colour = "black",face = "bold",family = "serif"))+
theme(axis.text.y = element_text(size = 10,colour = "black",face = "bold",family = "serif"))+
theme(axis.title.x = element_text(size = 12,face = "bold",family = "serif"))+
theme(axis.title.y = element_text(size = 12,face = "bold",family = "serif"))+
scale_fill_manual(values = c("gray60","gray80"))
ggsave(p2,filename = "2subpopulation.png")
ggsave(p2,filename = "2subpopulation.tiff")
family,设置字体,serif就是新罗马
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 从 HTTP 原因短语缺失研究 HTTP/2 和 HTTP/3 的设计差异
· AI与.NET技术实操系列:向量存储与相似性搜索在 .NET 中的实现
· 基于Microsoft.Extensions.AI核心库实现RAG应用
· Linux系列:如何用heaptrack跟踪.NET程序的非托管内存泄露
· 开发者必知的日志记录最佳实践
· winform 绘制太阳,地球,月球 运作规律
· AI与.NET技术实操系列(五):向量存储与相似性搜索在 .NET 中的实现
· 超详细:普通电脑也行Windows部署deepseek R1训练数据并当服务器共享给他人
· 【硬核科普】Trae如何「偷看」你的代码?零基础破解AI编程运行原理
· 上周热点回顾(3.3-3.9)