Installing and Using Augustus
The conda build of augustus kept causing problems; installing it with apt into the system directories is less trouble.
http://bioinf.uni-greifswald.de/augustus/binaries/tutorial/training.html#meta
https://github.com/Gaius-Augustus/Augustus
https://biohpc.cornell.edu/doc/annotation_2019_exercises1_v2.html
This post uses augustus to train a gene prediction model.
1. Installation
## Method 1
Install dependencies (as root)
apt-get update
apt-get install build-essential wget git autoconf
# Install dependencies for AUGUSTUS comparative gene prediction mode (CGP)
apt-get install libgsl-dev libboost-all-dev libsuitesparse-dev liblpsolve55-dev
apt-get install libsqlite3-dev libmysql++-dev
# Install dependencies for the optional support of gzip compressed input files
apt-get install libboost-iostreams-dev zlib1g-dev
# Install dependencies for bam2hints and filterBam
apt-get install libbamtools-dev zlib1g-dev
# Install additional dependencies for bam2wig
apt-get install samtools libhts-dev
# Install additional dependencies for homGeneMapping and utrrnaseq
apt-get install libboost-all-dev
# Install additional dependencies for scripts
apt-get install cdbfasta diamond-aligner libfile-which-perl libparallel-forkmanager-perl libyaml-perl libdbd-mysql-perl
apt-get install --no-install-recommends python3-biopython
git clone https://github.com/Gaius-Augustus/Augustus.git
# git clone creates the Augustus directory; there is no tarball to unpack
cd Augustus
make augustus
## Method 2 (recommended)
sudo apt install augustus augustus-data augustus-doc
### The required executables and scripts are in the following directories
/usr/bin/
/usr/share/augustus/scripts
/usr/share/augustus/config
2. Data preparation
### The dataset comes from the maker_tutorials package; the files used here are
pyu_contig.fasta
pyu_est.fasta
sp_protein.fasta
# Run the first round of maker on the files above to get the following file, which serves as the initial training set for augustus
pyu_rnd1.all.gff
3. Environment setup
### Every run of the augustus training pipeline needs this configuration first. With a system-wide install the scripts are not found, so the environment variables have to be set.
#In the same screen session, set up the Augustus environment.
# copy the config to a writable location; training writes the species model there
cp -r /usr/share/augustus/config/ ~/output/augustustest/augustus_config
export AUGUSTUS_CONFIG_PATH=~/output/augustustest/augustus_config
export LD_LIBRARY_PATH=/programs/boost_1_62_0/lib
export LC_ALL=en_US.utf-8
export LANG=en_US.utf-8
export PATH=/usr/bin/:/usr/share/augustus/scripts:$PATH
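Before training, it can help to confirm that the settings above are actually in effect. The following is a minimal sketch and not part of the original workflow; the function name check_augustus_env is hypothetical. It only checks that AUGUSTUS_CONFIG_PATH contains a species/ directory and that the tools used in the training section are found on PATH.

```shell
# Hypothetical helper: report what is missing from the Augustus environment.
# Prints one line per problem and returns the number of problems found.
check_augustus_env() {
    problems=0
    # The copied config directory must contain a species/ subdirectory.
    if [ ! -d "${AUGUSTUS_CONFIG_PATH:-}/species" ]; then
        echo "missing: \$AUGUSTUS_CONFIG_PATH/species"
        problems=$((problems + 1))
    fi
    # Tools called in the training section.
    for tool in augustus etraining new_species.pl gff2gbSmallDNA.pl \
                randomSplit.pl optimize_augustus.pl; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            problems=$((problems + 1))
        fi
    done
    return "$problems"
}

check_augustus_env || echo "fix the items above before training"
```

If the function prints nothing, the copied config directory and the scripts are all visible to the current shell.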
4. Training with Augustus
#The following commands convert the MAKER round 1 results into input files for building an Augustus model.
mkdir augustus1
cd augustus1
gff3_merge -d ../pyu_rnd1.maker.output/pyu_rnd1_master_datastore_index.log
#After this step, you will see a new gff file pyu_rnd1.all.gff from round 1.
## filter gff file, only keep maker annotation in the filtered gff file
awk '{if ($2=="maker") print }' pyu_rnd1.all.gff > maker_rnd1.gff
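To see what this filter keeps, here is a small sketch on a made-up GFF file. The coordinates, IDs and file names are invented for illustration and are not real MAKER output; only column 2 (the source column) matters to the filter.

```shell
workdir=$(mktemp -d)

# Toy GFF3 lines with invented coordinates; column 2 is the source.
printf 'contig1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1\n' > "$workdir/toy.all.gff"
printf 'contig1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1\n' >> "$workdir/toy.all.gff"
printf 'contig1\test2genome\tmatch\t120\t880\t.\t+\t.\tID=m1\n' >> "$workdir/toy.all.gff"
printf 'contig1\trepeatmasker\tmatch\t950\t990\t.\t+\t.\tID=r1\n' >> "$workdir/toy.all.gff"

# Same filter as above, written with an explicit tab separator.
awk -F'\t' '$2 == "maker"' "$workdir/toy.all.gff" > "$workdir/toy.maker.gff"

# Summarize what was kept, per feature type (column 3).
cut -f3 "$workdir/toy.maker.gff" | sort | uniq -c
```

Only the two lines whose source is maker survive; the evidence alignments and repeat matches are dropped.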
##convert the maker gff and fasta file into a GenBank-formatted file named pyu.gb
##We keep 2000 bp up- and down-stream of each gene for training the models
gff2gbSmallDNA.pl maker_rnd1.gff pyu_contig.fasta 2000 pyu.gb
## check number of genes in training set
grep -c LOCUS pyu.gb
## train model
## first create a new Augustus species named pyu
new_species.pl --species=pyu
## initial training
etraining --species=pyu pyu.gb
## the initial model should be in the directory
ls -ort $AUGUSTUS_CONFIG_PATH/species/pyu
##create a smaller test set for evaluation before and after optimization. Name the evaluation set pyu.gb.evaluation.
randomSplit.pl pyu.gb 200
mv pyu.gb.test pyu.gb.evaluation
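A quick way to confirm the split is to count LOCUS lines in each part, the same way the full set was counted earlier. This is a sketch only: count_loci is a hypothetical helper, and the stand-in files below contain nothing but LOCUS markers so the example runs anywhere. With real data, point it at pyu.gb, pyu.gb.train and pyu.gb.evaluation.

```shell
# Hypothetical helper: count GenBank records by their LOCUS header lines.
count_loci() {
    grep -c '^LOCUS' "$1"
}

# Stand-in files (not real GenBank records).
demo=$(mktemp -d)
printf 'LOCUS a\n//\nLOCUS b\n//\nLOCUS c\n//\n' > "$demo/all.gb"
printf 'LOCUS a\n//\nLOCUS c\n//\n'              > "$demo/all.gb.train"
printf 'LOCUS b\n//\n'                           > "$demo/all.gb.evaluation"

total=$(count_loci "$demo/all.gb")
train=$(count_loci "$demo/all.gb.train")
held=$(count_loci "$demo/all.gb.evaluation")

# The two parts of a split should add up to the original file.
if [ $((train + held)) -eq "$total" ]; then
    echo "split ok: $train training + $held evaluation = $total"
else
    echo "split mismatch: $train + $held != $total"
fi
```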
# use the first model to predict the genes in the test set, and check the results
augustus --species=pyu pyu.gb.evaluation >& first_evaluate.out
grep -A 22 Evaluation first_evaluate.out
# optimize the model. This step is very time consuming and can take days. To speed things up, you can create a smaller test set.
# the following step creates a test set and a training set; the test set has 300 genes. The test set is split into 24 k-folds for optimization (kfold can be set as high as 48, with one CPU core per fold, so kfold must equal the number of CPUs). Training, prediction and evaluation run on each bucket in parallel (training on pyu.gb.train plus the other buckets, then evaluating on the held-out bucket). The default is 5 rounds of optimization; because optimization on a large genome can take days, it is set to 3 here.
randomSplit.pl pyu.gb 300
optimize_augustus.pl --species=pyu --kfold=24 --cpus=24 --rounds=3 --onlytrain=pyu.gb.train pyu.gb.test >& log &
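Because the optimization is started in the background, the next training step should only run once it has finished. The sketch below shows one way to wait for a background job and check its exit status. Here sleep 1 stands in for the long optimize_augustus.pl call, and the log path is a temporary file made up for the example.

```shell
log=$(mktemp)

# 'sleep 1' is a stand-in for the long-running optimization command.
sleep 1 > "$log" 2>&1 &
opt_pid=$!

# Block until the background job ends, then capture its exit status.
wait "$opt_pid"
opt_status=$?

if [ "$opt_status" -eq 0 ]; then
    echo "optimization finished; safe to run etraining"
else
    echo "optimization failed with status $opt_status; check $log"
fi
```

This only works from the shell that started the job. In a detached screen session, checking the log and the process list by hand serves the same purpose.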
#train again after optimization
etraining --species=pyu pyu.gb
# use the optimized model to evaluate again, and check the results
augustus --species=pyu pyu.gb.evaluation >& second_evaluate.out
grep -A 22 Evaluation second_evaluate.out
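To see what the optimization changed, the two evaluation blocks can be extracted and diffed. This is a sketch under stated assumptions: the file contents below are placeholder text, NOT real Augustus reports, and exist only so the commands run. With real data, use first_evaluate.out and second_evaluate.out from the steps above.

```shell
demo=$(mktemp -d)

# Placeholder text only; these are NOT real Augustus evaluation reports.
printf 'header\nEvaluation (placeholder)\nmetric_x 1\n' > "$demo/first_evaluate.out"
printf 'header\nEvaluation (placeholder)\nmetric_x 2\n' > "$demo/second_evaluate.out"

# Same extraction as above, saved to files so they can be compared.
grep -A 22 Evaluation "$demo/first_evaluate.out"  > "$demo/first.txt"
grep -A 22 Evaluation "$demo/second_evaluate.out" > "$demo/second.txt"

# diff exits with 1 when the files differ, which is the expected case here.
if diff "$demo/first.txt" "$demo/second.txt" > "$demo/eval.diff"; then
    echo "reports are identical; optimization changed nothing"
else
    cat "$demo/eval.diff"
fi
```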
After these steps, the species model is in the directory ~/output/augustustest/augustus_config/species/pyu and serves as the augustus input for maker.
5. Running maker
# Now modify the following values in the file: maker_opts.ctl
maker_gff= pyu_rnd1.all.gff
est_pass=1 # use est alignment from round 1
protein_pass=1 #use protein alignment from round 1
rm_pass=1 # use repeats in the gff file
augustus_species=~/output/augustustest/augustus_config/species/pyu # augustus species model you just built
est= # remove est file, do not run EST blast again
protein= # remove protein file, do not run blast again
model_org= #remove repeat mask model, so not running RM again
rmlib= # not running repeat masking again
repeat_protein= #not running repeat masking again
est2genome=0 # do not do EST evidence based gene model
protein2genome=0 # do not do protein based gene model.
pred_stats=1 #report AED stats
alt_splice=0 # 0: keep one isoform per gene; 1: identify splicing variants of the same gene
keep_preds=1 # keep gene predictions even without evidence support; set to 0 to drop them
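Editing these values by hand is fine, but the changes can also be scripted. The following is a minimal sketch: set_ctl is a hypothetical helper, not a MAKER tool, and the demo control file is a three-line stand-in with shortened comments. It assumes values do not contain the characters | or &, which are special in the sed expression.

```shell
# Hypothetical helper: set key=value in a control file, keeping any
# trailing '#comment' on that line. Usage: set_ctl KEY VALUE FILE
set_ctl() {
    key=$1; value=$2; file=$3
    sed "s|^${key}=[^#]*|${key}=${value} |" "$file" > "$file.tmp" \
        && mv "$file.tmp" "$file"
}

# Stand-in control file; a real maker_opts.ctl has many more lines.
demo=$(mktemp -d)
printf 'est_pass=0 #use ESTs in maker_gff\n'  > "$demo/maker_opts.ctl"
printf 'est2genome=1 #infer from ESTs\n'     >> "$demo/maker_opts.ctl"
printf 'keep_preds=0 #keep unsupported\n'    >> "$demo/maker_opts.ctl"

set_ctl est_pass 1   "$demo/maker_opts.ctl"
set_ctl est2genome 0 "$demo/maker_opts.ctl"
set_ctl keep_preds 1 "$demo/maker_opts.ctl"

# Show the resulting settings.
grep -E '^(est_pass|est2genome|keep_preds)=' "$demo/maker_opts.ctl"
```

Because the pattern is anchored on the key followed by =, setting est does not touch est_pass or est2genome.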
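# Run maker with the new augustus model (to use /usr/bin/augustus, augustus has to be configured and the related settings in maker_exe.ctl changed as well; the root cause is that the maker environment got mixed up, and I will reinstall it when I have time)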
conda activate maker3
maker -base pyu_rnd2 >& pyu_rnd2.log
# Create gff and fasta output files:
# Use the following command to create the final merged gff file. The “-n” option would produce a gff file without genome sequences:
gff3_merge -n -d pyu_rnd2.maker.output/pyu_rnd2_master_datastore_index.log>pyu_rnd2.noseq.gff
fasta_merge -d pyu_rnd2.maker.output/pyu_rnd2_master_datastore_index.log
# After the steps above, you have the *.noseq.gff, protein FASTA and transcript FASTA files
# Use the following commands to simplify the gene IDs
maker_map_ids --prefix pyu_ --justify 8 --iterate 1 pyu_rnd2.all.gff > id_map
map_gff_ids id_map pyu_rnd2.all.gff
map_fasta_ids id_map pyu_rnd2.all.maker.proteins.fasta
map_fasta_ids id_map pyu_rnd2.all.maker.transcripts.fasta
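After renaming, it is worth checking that no FASTA header still carries an old ID. The sketch below counts headers that do not start with the chosen prefix. count_unrenamed is a hypothetical helper, and both header formats in the toy file are assumptions made for illustration; check them against your own output.

```shell
# Hypothetical helper: count FASTA headers that do NOT start with the prefix.
# Usage: count_unrenamed FILE PREFIX
count_unrenamed() {
    grep '^>' "$1" | grep -vc "^>$2"
}

# Toy FASTA with invented sequences; the second header mimics an ID
# that was not rewritten (assumed format).
demo=$(mktemp -d)
printf '>pyu_00000001-RA\nMKT\n>maker-contig1-gene-0.1-mRNA-1\nMSS\n' \
    > "$demo/toy.proteins.fasta"

left=$(count_unrenamed "$demo/toy.proteins.fasta" pyu_)
echo "headers still carrying old IDs: $left"
```

A count of 0 on both the protein and transcript files suggests the mapping was applied everywhere.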