单倍型分析

1.根据GWAS结果找到一系列基因

推荐使用rMVP,mrMLM等R语言包，速度快，直接出图

2. 对得到的结果利用snpEff或者annovar进行注释

注意选择对应的基因组文件（及染色体chr1/1)
为了搭配后面用的candihap软件，这里使用annovar进行注释

3.选择candihap软件进行单倍型分析

由于R包版本始终安装包不上，这里安装了对应的windows版本，容易受到很多限制
由于默认进行基因上下游序列也进行提取加入单倍型分析，所以需要手动修改，这里写了一个shell脚本，可以输入目标基因，然后提取目标基因区间的hmp文件

#!/bin/bash
#author lee

echo "gene id is: $1";
grep -i "$1" gene_chr_start_end >gene_pos;
#分别提取基因对应的染色体，起始位置
chr=$(awk '{print $2}' gene_pos);
start=$(awk '{print $3}' gene_pos);
end=$(awk '{print $4}' gene_pos);
#提取区间内对应的行作为基因的hmp文件
less haplotypes.hmp | awk '{if($1=="#CHROM"){print $0} else if($1=="'${chr}'" && $2>="'${start}'" && $2<="'${end}'"){print $0}}' > ${1}.hmp;
#检查一下
echo "works done,must check"

也可利用基因列表批量提取hmp文件

#!/bin/bash
#batch gene2hmp
#author lee
#2021.9.1
#gene_chr_start_end.txt can be extracted by gff3
cat $1|while read line
do
	gene=$(echo $line)
  echo "$gene is extracting"
  grep -i "$gene" gene_chr_start_end.txt >gene_pos;
  awk '{print $2}' gene_pos >chr;
  chr=$(cat chr);
  awk '{print $3}' gene_pos >start;
  start=$(cat start);
  awk '{print $4}' gene_pos >end;
  end=$(cat end);
  less haplotypes.hmp | awk '{if($1=="#CHROM"){print $0} else if($1=="'${chr}'" && $2>="'${start}'" && $2<="'${end}'"){print $0}}' > ${gene}.hmp;
  echo "fininshed"
done




### 4.得到hmp文件后，利用candihap进行分析
1. 使用candihap模块
2. 其余按照要求导入，无特别注意的地方
3. 点击运行后，可以得到很多结果，但是这里缺少一些想要的结果
4. 单倍型和表型的结果，可利用AI的页面设置截取前几个主要单倍型
### 5.利用candihap得到的txt文件进一步分析不同亚型之间的差异
1. 这里写了一个脚本,先提取得到单倍型大于20的水稻ID

!/bin/bash

extract hap,>20,:,find_in_region

author lee

去除文件的表头，统计单倍型数目

grep '^Hap' $1 >remove_head.hap
name=$(head -n4 $1 |awk -F ":" '{print $3}')
number=$(head -n1 $1 |awk '{print NF}')

筛选数目大于20的单倍型进行统计

awk -v number=$number '{if ($number>=20) print $0}' remove_head.hap>hap_filter

去除文件中的符号：，

sed -i 's/[:,]//g' hap_filter

从ID列到倒数第三列，提取偶数列和第一列

num1=expr $number + 1
awk -v num1=$num1 '{for (i=num1;i<=(NF-3);i+=2) print $1,$i}' hap_filter>id_filter
cp id_filter ${name}
cp $name F:/R/data/id2type

下面是提示信息

num2=expr $number - 2
echo "Total $num2 SNPs"
count=$(awk 'END{print NR}' hap_filter )
echo "Total $count haplotypes more than 20，Details in hap_filter"
count1=$(awk 'END{print NR}' id_filter)
count2=$(awk -v number=$number '{sum += $"'$number'"};END{print sum}' hap_filter)
echo "Haps in id_filter is $count1,haps in hap_filter is $count2"
echo "use less -SN id_filter to check"

下面在R语言下执行

plot haplotye >20 by types or maybe with geomap

author lee

time 2021-8-12

------------------

plot haplotye >20 by types or maybe with geomap

author lee

time 2021-8-12

------------------

setwd("F:/R/data/LOC_Os01g42380")

setwd("F:/R/data/id2type")

library(lookup)
library(ggplot2)
library(tidyverse)

windowsFonts(myFont = windowsFont("Times New Roman"))

id<-read.table("id_filter",header = FALSE)
type<-read.table("id2type.txt",header = TRUE)

提取单倍型对应个体的生态型

id$eco<-lookup(x=id$V2,key=type$ID,value = type$SUBPOPULATION)
data<-id[,c(1,3)]

画出单倍型和全部亚群的关系

p1<-ggplot(data = data,aes(x=V1,y=eco))+geom_count()+ scale_fill_grey(start=1, end=0)+
labs(x="Haplotypes",y="Subpopulation",title = "Haplotype of Subpopulation")
ggsave(p1,filename = "12subpopulation.pdf",width = 5.25,height = 4)
ggsave(p1,filename = "12subpopulation.png")

画出单倍型和籼稻和粳稻两种的关系

ind_jap<-filter(data,eco %in% c("trop","temp","subtrop","japx","ind1A","ind1B","ind2","ind3","indx"))

ind_jap$Population[ind_jap$eco"ind1A"]<-"indica"
ind_jap$Population[ind_jap$eco"ind1B"]<-"indica"
ind_jap$Population[ind_jap$eco"indx"]<-"indica"
ind_jap$Population[ind_jap$eco"ind2"]<-"indica"
ind_jap$Population[ind_jap$eco"ind3"]<-"indica"
ind_jap$Population[ind_jap$eco"trop"]<-"japonica"
ind_jap$Population[ind_jap$eco"temp"]<-"japonica"
ind_jap$Population[ind_jap$eco"subtrop"]<-"japonica"
ind_jap$Population[ind_jap$eco=="japx"]<-"japonica"

p2<-ggplot(data = ind_jap,aes(x=V1,fill=Population))+geom_bar(position="stack")+
labs(x="Haplotypes",y="Subpopulation",title = "Haplotype of Subpopulation")+
theme(legend.text = element_text(face = "italic",family = "serif"))+
theme(legend.title = element_text(face = "bold",family = "serif"))+
theme(axis.text.x = element_text(size = 10,colour = "black",face = "bold",family = "serif"))+
theme(axis.text.y = element_text(size = 10,colour = "black",face = "bold",family = "serif"))+
theme(axis.title.x = element_text(size = 12,face = "bold",family = "serif"))+
theme(axis.title.y = element_text(size = 12,face = "bold",family = "serif"))+
scale_fill_manual(values = c("gray60","gray80"))
ggsave(p2,filename = "2subpopulation.png")
ggsave(p2,filename = "2subpopulation.tiff")

family，设置字体，serif就是新罗马

发表于 2021-08-12 19:20 萧飒阅读(2522) 评论(0) 收藏举报

从基因到单倍型-水稻GWAS

单倍型分析

1.根据GWAS结果找到一系列基因

2. 对得到的结果利用snpEff或者annovar进行注释

3.选择candihap软件进行单倍型分析

也可利用基因列表批量提取hmp文件

!/bin/bash

extract hap,>20,:,find_in_region

author lee

去除文件的表头，统计单倍型数目

筛选数目大于20的单倍型进行统计

去除文件中的符号：，

从ID列到倒数第三列，提取偶数列和第一列

下面是提示信息

下面在R语言下执行

plot haplotye >20 by types or maybe with geomap

author lee

time 2021-8-12

------------------

plot haplotye >20 by types or maybe with geomap

author lee

time 2021-8-12

------------------

setwd("F:/R/data/id2type")

windowsFonts(myFont = windowsFont("Times New Roman"))

提取单倍型对应个体的生态型

画出单倍型和全部亚群的关系

画出单倍型和籼稻和粳稻两种的关系

family，设置字体，serif就是新罗马

公告