maker3实战分析

一、前言

参考自官网流程:
https://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Example_MAKER_Annotation_Project

官网流程翻阅几遍后,在看如下文档豁然开朗(教程写得很好):
https://biohpc.cornell.edu/doc/annotation_2019_exercises1_v2.html

#The basic steps to any annotation project are as follows:

1.Collect homology evidence datasets 
2.Build RepeatMasker library (可使用RepeatModeler训练产生TE文库,此处更推荐使用EDTA建立TE文库,然后使用RepeatMasker屏蔽重复序列后将屏蔽后的序列输入给maker,运行maker跳过重复注释步骤直接做基因注释)
3.Train ab initio gene predictors
4.Computational annotation (run MAKER)
5.Review/curate annotations
6.Add functional annotations and meta-data
7.Distribute results

 

二、实战

PART1 前期准备

#处理全长转录组数据
samtools view ccs.subreads.bam | awk '{OFS="\t"; print ">" $1 "\n" $10}' - > ccs.subreads.fasta

#数据准备
基因组序列文件:/home/liuxin/input/annotation/Cbp.fasta
全长转录组EST文件:/home/liuxin/input/annotation/cbp.transcripts.fasta
使用下载的同源蛋白序列文件(拟南芥TAR10.1作参考):/home/liuxin/input/annotation/at_protein.faa

#创建重复序列库,EDTA建库更全面
BuildDatabase -name capselladb cbp.fasta
RepeatModeler -pa 48 -database capselladb -LTRStruct >& repeatmodeler.log

 

PART2 初步注释

#配置maker
cd /home/liuxin/input/annotation
maker -CTL
nano maker_opts.ctl
***************************************
genome=cbp.fasta
est=cbp.transcripts.fasta
protein=at_protein.faa

model_org=arabidopsis     #指定拟南芥作近缘种运行repeatmasker
rmlib= # fasta file of your repeat sequence from RepeatModeler. Leave blank to skip.

est2genome=1
protein2genome=1

TMP=/home/liuxin/input/annotation/tmp
***************************************
#maker第一轮运行(运行完成检查log1会看到Makerisnowfinished!!!)
mpiexec -n 48 maker -base cbp_rnd1 >& log1 &

 

PART3 使用SNAP迭代训练两轮

#Snap第一轮训练
cd /home/liuxin/input/annotation
mkdir snap1
cd snap1
gff3_merge -d ../cbp_rnd1.maker.output/cbp_rnd1_master_datastore_index.log
maker2zff -l 50 -x 0.5 cbp_rnd1.all.gff
fathom -categorize 1000 genome.ann genome.dna
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl cbp . > ../cbp1.hmm
mv cbp_rnd1.all.gff ../
cd ..
#备份round1的参数配置
cp maker_opts.ctl maker_opts.ctl_backup_rnd1
#为第一轮训练修改参数
nano maker_opts.ctl
*****************************************
maker_gff=cbp_rnd1.all.gff
est_pass=1 # use est alignment from round 1
protein_pass=1 #use protein alignment from round 1
rm_pass=1 # use repeats in the gff file
snaphmm=cbp1.hmm
est= # remove est file, do not run EST blast again
protein= # remove protein file, do not run blast again
model_org= #remove repeat mask model, so not running RM again
rmlib= # not running repeat masking again
repeat_protein= #not running repeat masking again
est2genome=0 # do not do EST evidence based gene model
protein2genome=0 # do not do protein based gene model.
pred_stats=1 #report AED stats
alt_splice=0 # 0: keep one isoform per gene; 1: identify splicing variants of the same gene
keep_preds=1 # keep genes even without evidence support, set to 0 if no
********************************************
#maker第二轮运行
mpiexec -n 48 maker -base cbp_rnd2 >& log2 &

#Snap第二轮
cd /home/liuxin/input/annotation
mkdir snap2
cd snap2
gff3_merge -d ../cbp_rnd2.maker.output/cbp_rnd2_master_datastore_index.log
maker2zff -l 50 -x 0.5 cbp_rnd2.all.gff
fathom -categorize 1000 genome.ann genome.dna
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl cbp . > ../cbp2.hmm
mv cbp_rnd2.all.gff ..
cd ..
#备份round2的参数配置
cp maker_opts.ctl maker_opts.ctl_backup_rnd2
#为第二轮训练修改参数
nano maker_opts.ctl
******************************************
maker_gff=cbp_rnd2.all.gff
snaphmm=cbp2.hmm
******************************************
#maker第三轮运行
mpiexec -n 48 maker -base cbp_rnd3 >& log3 &
#备份round3的参数配置
cp maker_opts.ctl maker_opts.ctl_backup_rnd3

 

PART4 SNAP训练第三轮,同时启用augustus预测,使用近缘种arabidopsis

#补充SNAP第三轮,生成cbp3.hmm文件
cd /home/liuxin/input/annotation
mkdir snap3
cd snap3
gff3_merge -d ../cbp_rnd3.maker.output/cbp_rnd3_master_datastore_index.log
maker2zff -l 50 -x 0.5 cbp_rnd3.all.gff
fathom -categorize 1000 genome.ann genome.dna
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl cbp . > ../cbp3.hmm
mv cbp_rnd3.all.gff ..

nano maker_opts.ctl
*******************************************
maker_gff=cbp_rnd3.all.gff
snaphmm=cbp3.hmm
augustus_species=arabidopsis
********************************************
#maker第四轮运行
mpiexec -n 48 maker -base cbp_rnd4 >& log4 &

 

PART5 提取最终需要的gff文件以及用到的蛋白和转录本fasta文件

gff3_merge -n -d cbp_rnd4.maker.output/cbp_rnd4_master_datastore_index.log > cbp_rnd4.noseq.gff
fasta_merge -d cbp_rnd4.maker.output/cbp_rnd4_master_datastore_index.log
#最终产生的文件如下
*************************************
cbp_rnd4.noseq.gff
cbp_rnd4.all.maker.proteins.fasta
Cbp_rnd4.all.maker.transcripts.fasta
*************************************

#使用如下命令简化结果文件中的基因名
maker_map_ids --prefix cbp_ --justify 8 --iterate 1 pyu_rnd4.all.gff > id_map
map_gff_ids id_map cbp_rnd4.noseq.gff
map_fasta_ids id_map cbp_rnd4.all.maker.proteins.fasta
map_fasta_ids id_map cbp_rnd4.all.maker.transcripts.fasta

 

PART6 注释基因结构域和功能

#若需注释基因结构域以及基因功能(还需要三个文件output.blastp/output.iprscan/uniprot_sprot.db,同时前两个也需要改id)
blastp -query cbp_rnd4.all.maker.proteins.fasta -db uniprot_sprot.db -evalue 1e-6 -max_hsps 1 -max_target_seqs 1 -outfmt 6 -out output.blastp
interproscan.sh -appl pfam -dp -f TSV -goterms -iprlookup -pa -t p -i cbp_rnd4.all.maker.proteins.fasta -o output.iprscan

map_data_ids id_map output.blastp
map_data_ids id_map output.iprscan

#添加domains注释
ipr_update_gff cbp_rnd4.all.gff output.iprscan > cbp_rnd4.all.2.gff
iprscan2gff3 output.iprscan cbp_rnd4.all.gff >> cbp_rnd4.all.2.gff

#添加functions注释
maker_functional_gff uniprot_sprot.db output.blastp cbp_rnd4.all.2.gff > cbp.all.gff
maker_functional_fasta uniprot_sprot.db output.blastp cbp_rnd4.all.maker.proteins.fasta > cbp.all.maker.proteins.fasta
maker_functional_fasta uniprot_sprot.db output.blastp cbp_rnd4.all.maker.transcripts.fasta > cbp.all.maker.transcripts.fasta

 

posted @ 2022-09-21 23:13  pd_liu  阅读(1437)  评论(1编辑  收藏  举报