[转载]HISAT, StringTie and Ballgown

HISAT, StringTie and Ballgown(一)

 

今天给大家分享一下nature protocol上8月11号刚发的转录组分析新流程HISAT, StringTie and Ballgown(查看原文可以看到)。至此tophat-cufflinks可以说拜拜了。

首先了解一下几个工具作用;

HISAT:比对工具了,类似于昨天讲的tophat2;

Stringtie:组装与定量工具。

Ballgown:为差异表达计算工具

 

主要流程图

 

 

 

具体步骤

 

 

1、创建index

首先利用下面脚本提取剪接信息(有参考GFF前提下,没有忽略此步):

$ extract_splice_sites.py chrX_data/genes/chrX.gtf >chrX.ss

$ extract_exons.py chrX_data/genes/chrX.gtf >chrX.exon

然后构建HISAT2 index:

$ hisat2-build --ss chrX.ss --exon chrX.exon chrX_data/genome/chrX.fa chrX_tran

The --ss and --exon options(没有第一步可以不写)。indexing requires 9 GB of RAM for chromosome X and 160 GB for the whole human genome. The amount of memory is much smaller if one omits annotation information. Indexing chromosome X using one CPU coretakes <10 min. It should take ~2 h to build an index for the whole human genome using eight CPU cores.

 

2、开始比对

各样本分别比对参考基因组

$ hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1

chrX_data/samples/ERR188044_chrX_1.fastq.gz -2

chrX_data/samples/ERR188044_chrX_2.fastq.gz -S ERR188044_chrX.sam

$ hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1

chrX_data/samples/ERR188104_chrX_1.fastq.gz -2

chrX_data/samples/ERR188104_chrX_2.fastq.gz -S ERR188104_chrX.sam

将SAM 转换为BAM:

$ samtools sort -@ 8 -o ERR188044_chrX.bam ERR188044_chrX.sam

$ samtools sort -@ 8 -o ERR188104_chrX.bam ERR188104_chrX.sam

 

3、组装转录本

$ stringtie -p 8 -G chrX_data/genes/chrX.gtf -o

ERR188044_chrX.gtf –l ERR188044 ERR188044_chrX.bam

$ stringtie -p 8 -G chrX_data/genes/chrX.gtf -o

ERR188104_chrX.gtf –l ERR188104 ERR188104_chrX.bam

 

4、合并各个样本

$ stringtie --merge -p 8 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf

chrX_data/mergelist.txt

chrX_data/mergelist.txt:各个gtf路径放在里面。

 

5、估计表达丰度

$ stringtie –e –B -p 8 -G stringtie_merged.gtf -o

ballgown/ERR188044/ERR188044_chrX.gtf ERR188044_chrX.bam

$ stringtie –e –B -p 8 -G stringtie_merged.gtf -o

ballgown/ERR188104/ERR188104_chrX.gtf ERR188104_chrX.bam

 

6、加载 Ballgown  R包

$ R

R version 3.2.2 (2015-08-14) -- "Fire Safety"

Copyright (C) 2015 The R Foundation for Statistical Computing

Platform: x86_64-apple-darwin13.4.0 (64-bit)

>library(ballgown)

>library(RSkittleBrewer)

>library(genefilter)

>library(dplyr)

>library(devtools)

 

7、 加载表型数据.

An example file called geuvadis_phenodata.csv is included with the data files for this protocol (ChrX_data). In general, you will have to create this file yourself. It contains information about your RNA-seq samples, formatted as illustrated in this csv (comma-separated values) file.

>pheno_data = read.csv("geuvadis_phenodata.csv")

 

8、加载表达丰度文件其来源于stingtie

To do this, we use the ballgown command with the following three parameters: the directory in which the data are stored (dataDir, which here is named simply ‘Ballgown’), a pattern that appears in

the sample names (samplePattern) and the phenotypic information that we loaded in the previous step (pData). Note that once a Ballgown object is created, any other Bioconductor32 package can be applied for data analysis

or data visualization.

>bg_chrX = ballgown(dataDir = "ballgown", samplePattern = "ERR", pData=pheno_data)

 

9、过滤低表达基因

>bg_chrX_filt = subset(bg_chrX,"rowVars(texpr(bg_chrX)) >1",genomesubset=TRUE)

 

10、鉴定差异转录本

>results_transcripts = stattest(bg_chrX_filt,feature="transcript",covariate="sex",adjustvars =c("population"), getFC=TRUE, meas="FPKM")

 

 

11、鉴定差异基因

>results_genes = stattest(bg_chrX_filt, feature="gene",covariate="sex", adjustvars = c("population"), getFC=TRUE,meas="FPKM")

 

http://blog.sciencenet.cn/blog-1469385-1022768.html 

posted @ 2018-06-21 17:00  萌萌哒_Youzi  阅读(551)  评论(0编辑  收藏  举报