blast | diamond 输出结果选择和解析 | 比对
之前的文章:构建NCBI本地BLAST数据库 (NR NT等) | blastx/diamond使用方法 | blast构建索引 | makeblastdb
本地运行blast时,需要指定out format。
常见的网页版blast结果可以参照:Blast结果的详细解析
*** Formatting options -outfmt <String> alignment view options: 0 = Pairwise, 1 = Query-anchored showing identities, 2 = Query-anchored no identities, 3 = Flat query-anchored showing identities, 4 = Flat query-anchored no identities, 5 = BLAST XML, 6 = Tabular, 7 = Tabular with comment lines, 8 = Seqalign (Text ASN.1), 9 = Seqalign (Binary ASN.1), 10 = Comma-separated values, 11 = BLAST archive (ASN.1), 12 = Seqalign (JSON), 13 = Multiple-file BLAST JSON, 14 = Multiple-file BLAST XML2, 15 = Single-file BLAST JSON, 16 = Single-file BLAST XML2, 18 = Organism Report Options 6, 7 and 10 can be additionally configured to produce a custom format specified by space delimited format specifiers. The supported format specifiers are: qseqid means Query Seq-id qgi means Query GI qacc means Query accesion qaccver means Query accesion.version qlen means Query sequence length sseqid means Subject Seq-id sallseqid means All subject Seq-id(s), separated by a ';' sgi means Subject GI sallgi means All subject GIs sacc means Subject accession saccver means Subject accession.version sallacc means All subject accessions slen means Subject sequence length qstart means Start of alignment in query qend means End of alignment in query sstart means Start of alignment in subject send means End of alignment in subject qseq means Aligned part of query sequence sseq means Aligned part of subject sequence evalue means Expect value bitscore means Bit score score means Raw score length means Alignment length pident means Percentage of identical matches nident means Number of identical matches mismatch means Number of mismatches positive means Number of positive-scoring matches gapopen means Number of gap openings gaps means Total number of gaps ppos means Percentage of positive-scoring matches frames means Query and subject frames separated by a '/' qframe means Query frame sframe means Subject frame btop means Blast traceback operations (BTOP) staxid means Subject Taxonomy ID ssciname means Subject Scientific Name scomname means Subject Common Name sblastname means Subject Blast Name sskingdom means Subject Super Kingdom staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order) sscinames means unique Subject Scientific Name(s), separated by a ';' scomnames means unique Subject Common Name(s), separated by a ';' sblastnames means unique Subject Blast Name(s), separated by a ';' (in alphabetical order) sskingdoms means unique Subject Super Kingdom(s), separated by a ';' (in alphabetical order) stitle means Subject Title salltitles means All Subject Title(s), separated by a '<>' sstrand means Subject Strand qcovs means Query Coverage Per Subject qcovhsp means Query Coverage Per HSP qcovus means Query Coverage Per Unique Subject (blastn only) When not provided, the default value is: 'qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore', which is equivalent to the keyword 'std' Default = `0'
默认是0,也就是会输出比对的结果。
但是这样的结果显然不适合批量处理,批量处理的文件格式显然必须是dataframe。
所以网上有人推荐“outfmt 7 or 10 works perfect”,所以一般就选10吧。
Diamond的输出格式:
--outfmt (-f) output format 0 = BLAST pairwise 5 = BLAST XML 6 = BLAST tabular 100 = DIAMOND alignment archive (DAA) 101 = SAM Value 6 may be followed by a space-separated list of these keywords: qseqid means Query Seq - id qlen means Query sequence length sseqid means Subject Seq - id sallseqid means All subject Seq - id(s), separated by a ';' slen means Subject sequence length qstart means Start of alignment in query qend means End of alignment in query sstart means Start of alignment in subject send means End of alignment in subject qseq means Aligned part of query sequence sseq means Aligned part of subject sequence evalue means Expect value bitscore means Bit score score means Raw score length means Alignment length pident means Percentage of identical matches nident means Number of identical matches mismatch means Number of mismatches positive means Number of positive - scoring matches gapopen means Number of gap openings gaps means Total number of gaps ppos means Percentage of positive - scoring matches qframe means Query frame btop means Blast traceback operations(BTOP) staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order) stitle means Subject Title salltitles means All Subject Title(s), separated by a '<>' qcovhsp means Query Coverage Per HSP qtitle means Query title Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
diamond就选6吧,便于批量处理。
diamond 比对转录本到Pfam库的部分结果,可以看到,格式6非常适合做批量处理。
wgs.RNAseq_00000687 M1AFS7.1 60.9 179 65 2 337 861 102 279 8.1e-55 221.5 wgs.RNAseq_00000687 A0A0B2PIQ0.1 61.0 177 66 2 337 861 319 494 4.0e-54 219.2 wgs.RNAseq_00000687 A0A0L9UG27.1 61.2 178 65 3 337 861 315 491 4.0e-54 219.2 wgs.RNAseq_00000687 A0A0L9UG27.1 63.2 38 14 0 6 119 250 287 1.0e-04 55.1 wgs.RNAseq_00000688 A0A0D2QMP5.1 65.8 114 35 2 218 550 317 429 1.7e-34 153.3 wgs.RNAseq_00000688 A0A0D2QMP5.1 53.2 62 26 1 36 221 256 314 8.7e-10 71.2 wgs.RNAseq_00000688 A0A0D2TR34.1 65.8 114 35 2 218 550 322 434 1.7e-34 153.3 wgs.RNAseq_00000688 A0A0D2TR34.1 53.2 62 26 1 36 221 261 319 8.7e-10 71.2 wgs.RNAseq_00000688 F6H0G0.1 65.8 111 36 2 221 550 320 429 5.6e-33 148.3
每一列是什么呢? BLASTn output format 6
Column headers:qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
1. | qseqid | query (e.g., gene) sequence id |
2. | sseqid | subject (e.g., reference genome) sequence id |
3. | pident | percentage of identical matches |
4. | length | alignment length |
5. | mismatch | number of mismatches |
6. | gapopen | number of gap openings |
7. | qstart | start of alignment in query |
8. | qend | end of alignment in query |
9. | sstart | start of alignment in subject |
10. | send | end of alignment in subject |
11. | evalue | expect value |
12. | bitscore | bit score |
比对结果的第三列和第四列非常有用,尤其是在鉴别软件stringtie等预测出来的转录本是否为有效转录本时,其实预测出来的转录本大部分都是没有意义的,但是又能部分hit到蛋白上,这时我们就只能选出比对最长的那个转录本,其余的可以看作是无效的转录本。
操作比较简单,先对比对长度排个序,从长到短。
cat extract_no_N_200.fasta.diamond.nr | sort -n -r -k4 > extract_no_N_200.fasta.diamond.nr.sort
其次就是利用python的panda模块去冗余就好了。
import pandas as pd infile = "extract_no_N_200.fasta.diamond.nr.sort" #infile = "test.sort" df = pd.read_csv(infile, sep="\t", header=None) df.columns = ["l1","l2","l3","l4","l5","l6","l7","l8","l9","l10","l11","l12"] df1 = df.sort_values(['l4'],ascending=False) df2 = df1.drop_duplicates(subset=[df1.columns[1]], keep = 'first') df2.to_csv("rm_dup_protein_"+infile, sep="\t", index=False, header=False) df3 = df2.sort_values(['l3'],ascending=False) df4 = df3.drop_duplicates(subset=[df3.columns[0]], keep = 'first') df4.to_csv("unique_protein_"+infile, sep="\t", index=False, header=False)
这样我们得到的转录本就基本上是无冗余的了。
然后对每个转录本取identical match比例最高的蛋白就好了。其实信息是可以全部保留的。
blast和diamond怎么只输出最优的hit?
blastn -query transcripts.fa -out transcripts.blast.txt -task megablast -db refseq_rna -num_threads 12 -evalue 1e-10 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -outfmt 7 -perc_identity 50 -max_target_seqs 1
待续~