宏基因组实战之:基因预测
紧接前文,各个样本序列组装完毕后,下一步自然是想要弄清楚组装出来的序列中到底是些啥子东西嘛。基因预测方法和工具见这篇推文。
注释工具
常见原核生物基因注释包括 prokka、Prodigal、GeneMark等工具。
这里使用Prodigal对组装的基因组序列进行基因预测:
Prodigal软件参数
-------------------------------------
PRODIGAL v2.6.3 [February, 2016]
Univ of Tenn / Oak Ridge National Lab
Doug Hyatt, Loren Hauser, et al.
-------------------------------------
Usage: prodigal [-a trans_file] [-c] [-d nuc_file] [-f output_type]
[-g tr_table] [-h] [-i input_file] [-m] [-n] [-o output_file]
[-p mode] [-q] [-s start_file] [-t training_file] [-v]
-a: Write protein translations to the selected file.
-c: Closed ends. Do not allow genes to run off edges.
-d: Write nucleotide sequences of genes to the selected file.
-f: Select output format (gbk, gff, or sco). Default is gbk.
-g: Specify a translation table to use (default 11).
-h: Print help menu and exit.
-i: Specify FASTA/Genbank input file (default reads from stdin).
-m: Treat runs of N as masked sequence; don't build genes across them.
-n: Bypass Shine-Dalgarno trainer and force a full motif scan.
-o: Specify output file (default writes to stdout).
-p: Select procedure (single or meta). Default is single.
-q: Run quietly (suppress normal stderr output).
-s: Write all potential genes (with scores) to the selected file.
-t: Write a training file (if none exists); otherwise, read and use
the specified training file.
-v: Print version number and exit.
注释参数
prodigal -i final.contigs.fa # 输入组装后的序列文件
-a protein.fa # 输出预测的蛋白序列
-d gene.fa # 输数预测的基因序列
-p meta # 默认单菌样本(single),这里是宏基因组样本
-f gff # 输出注释文件格式为gff
-o contigs.gff # 输出注释结果文件
-s contigs.stat # 注释结果文件统计信息
结果文件
预测基因结果:
预测的蛋白结果:
预测结果gff文件:
预测结果统计信息:
这里再使用Prokka预测看下结果如何
Prokka 软件参数
软件参数,太长了收起来,点击展开
Name:
Prokka 1.12 by Torsten Seemann <torsten.seemann@gmail.com>
Synopsis:
rapid bacterial genome annotation
Usage:
prokka [options] <contigs.fasta>
General:
--help This help
--version Print version and exit
--docs Show full manual/documentation
--citation Print citation for referencing Prokka
--quiet No screen output (default OFF)
--debug Debug mode: keep all temporary files (default OFF)
Setup:
--listdb List all configured databases
--setupdb Index all installed databases
--cleandb Remove all database indices
--depends List all software dependencies
Outputs:
--outdir [X] Output folder [auto] (default '')
--force Force overwriting existing output folder (default OFF)
--prefix [X] Filename output prefix [auto] (default '')
--addgenes Add 'gene' features for each 'CDS' feature (default OFF)
--addmrna Add 'mRNA' features for each 'CDS' feature (default OFF)
--locustag [X] Locus tag prefix [auto] (default '')
--increment [N] Locus tag counter increment (default '1')
--gffver [N] GFF version (default '3')
--compliant Force Genbank/ENA/DDJB compliance: --addgenes --mincontiglen 200 --centre XXX (default OFF)
--centre [X] Sequencing centre ID. (default '')
--accver [N] Version to put in Genbank file (default '1')
Organism details:
--genus [X] Genus name (default 'Genus')
--species [X] Species name (default 'species')
--strain [X] Strain name (default 'strain')
--plasmid [X] Plasmid name or identifier (default '')
Annotations:
--kingdom [X] Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
--gcode [N] Genetic code / Translation table (set if --kingdom is set) (default '0')
--gram [X] Gram: -/neg +/pos (default '')
--usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)
--proteins [X] FASTA or GBK file to use as 1st priority (default '')
--hmms [X] Trusted HMM to first annotate from (default '')
--metagenome Improve gene predictions for highly fragmented genomes (default OFF)
--rawproduct Do not clean up /product annotation (default OFF)
--cdsrnaolap Allow [tr]RNA to overlap CDS (default OFF)
Computation:
--cpus [N] Number of CPUs to use [0=all] (default '8')
--fast Fast mode - only use basic BLASTP databases (default OFF)
--noanno For CDS just set /product="unannotated protein" (default OFF)
--mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
--evalue [n.n] Similarity e-value cut-off (default '1e-06')
--rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
--norrna Don't run rRNA search (default OFF)
--notrna Don't run tRNA search (default OFF)
--rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)
使用前还需安装些perl依赖:cpanm Bio::Perl XML::Simple
prokka --listdb 查看数据库信息
注释参数
prokka --metagenome --prefix prokka --outdir prokka_res final.contigs.fa
作者:天使不设防
本文版权归作者和博客园共有,欢迎转载,但未经作者同意必须保留此段声明,且在文章页面明显位置给出原文连接,否则保留追究法律责任的权利.