blast | diamond 输出结果选择和解析 | 比对

之前的文章:构建NCBI本地BLAST数据库 (NR NT等) | blastx/diamond使用方法 | blast构建索引 | makeblastdb

本地运行blast时,需要指定out format。

常见的网页版blast结果可以参照:Blast结果的详细解析

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
*** Formatting options
 -outfmt <String>
   alignment view options:
     0 = Pairwise,
     1 = Query-anchored showing identities,
     2 = Query-anchored no identities,
     3 = Flat query-anchored showing identities,
     4 = Flat query-anchored no identities,
     5 = BLAST XML,
     6 = Tabular,
     7 = Tabular with comment lines,
     8 = Seqalign (Text ASN.1),
     9 = Seqalign (Binary ASN.1),
    10 = Comma-separated values,
    11 = BLAST archive (ASN.1),
    12 = Seqalign (JSON),
    13 = Multiple-file BLAST JSON,
    14 = Multiple-file BLAST XML2,
    15 = Single-file BLAST JSON,
    16 = Single-file BLAST XML2,
    18 = Organism Report
 
   Options 6, 7 and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
        qseqid means Query Seq-id
           qgi means Query GI
          qacc means Query accesion
       qaccver means Query accesion.version
          qlen means Query sequence length
        sseqid means Subject Seq-id
     sallseqid means All subject Seq-id(s), separated by a ';'
           sgi means Subject GI
        sallgi means All subject GIs
          sacc means Subject accession
       saccver means Subject accession.version
       sallacc means All subject accessions
          slen means Subject sequence length
        qstart means Start of alignment in query
          qend means End of alignment in query
        sstart means Start of alignment in subject
          send means End of alignment in subject
          qseq means Aligned part of query sequence
          sseq means Aligned part of subject sequence
        evalue means Expect value
      bitscore means Bit score
         score means Raw score
        length means Alignment length
        pident means Percentage of identical matches
        nident means Number of identical matches
      mismatch means Number of mismatches
      positive means Number of positive-scoring matches
       gapopen means Number of gap openings
          gaps means Total number of gaps
          ppos means Percentage of positive-scoring matches
        frames means Query and subject frames separated by a '/'
        qframe means Query frame
        sframe means Subject frame
          btop means Blast traceback operations (BTOP)
        staxid means Subject Taxonomy ID
      ssciname means Subject Scientific Name
      scomname means Subject Common Name
    sblastname means Subject Blast Name
     sskingdom means Subject Super Kingdom
       staxids means unique Subject Taxonomy ID(s), separated by a ';'
             (in numerical order)
     sscinames means unique Subject Scientific Name(s), separated by a ';'
     scomnames means unique Subject Common Name(s), separated by a ';'
    sblastnames means unique Subject Blast Name(s), separated by a ';'
             (in alphabetical order)
    sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
             (in alphabetical order)
        stitle means Subject Title
    salltitles means All Subject Title(s), separated by a '<>'
       sstrand means Subject Strand
         qcovs means Query Coverage Per Subject
       qcovhsp means Query Coverage Per HSP
        qcovus means Query Coverage Per Unique Subject (blastn only)
   When not provided, the default value is:
   'qaccver saccver pident length mismatch gapopen qstart qend sstart send
   evalue bitscore', which is equivalent to the keyword 'std'
   Default = `0'

默认是0,也就是会输出比对的结果。

但是这样的结果显然不适合批量处理,批量处理的文件格式显然必须是dataframe。

所以网上有人推荐“outfmt 7 or 10 works perfect”,所以一般就选10吧。

Diamond的输出格式:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
--outfmt (-f)          output format
    0   = BLAST pairwise
    5   = BLAST XML
    6   = BLAST tabular
    100 = DIAMOND alignment archive (DAA)
    101 = SAM
 
    Value 6 may be followed by a space-separated list of these keywords:
 
    qseqid means Query Seq - id
    qlen means Query sequence length
    sseqid means Subject Seq - id
    sallseqid means All subject Seq - id(s), separated by a ';'
    slen means Subject sequence length
    qstart means Start of alignment in query
    qend means End of alignment in query
    sstart means Start of alignment in subject
    send means End of alignment in subject
    qseq means Aligned part of query sequence
    sseq means Aligned part of subject sequence
    evalue means Expect value
    bitscore means Bit score
    score means Raw score
    length means Alignment length
    pident means Percentage of identical matches
    nident means Number of identical matches
    mismatch means Number of mismatches
    positive means Number of positive - scoring matches
    gapopen means Number of gap openings
    gaps means Total number of gaps
    ppos means Percentage of positive - scoring matches
    qframe means Query frame
    btop means Blast traceback operations(BTOP)
    staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)
    stitle means Subject Title
    salltitles means All Subject Title(s), separated by a '<>'
    qcovhsp means Query Coverage Per HSP
    qtitle means Query title
 
    Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

diamond就选6吧,便于批量处理。

 

diamond 比对转录本到Pfam库的部分结果,可以看到,格式6非常适合做批量处理。

1
2
3
4
5
6
7
8
9
wgs.RNAseq_00000687     M1AFS7.1        60.9    179     65      2       337     861     102     279     8.1e-55 221.5
wgs.RNAseq_00000687     A0A0B2PIQ0.1    61.0    177     66      2       337     861     319     494     4.0e-54 219.2
wgs.RNAseq_00000687     A0A0L9UG27.1    61.2    178     65      3       337     861     315     491     4.0e-54 219.2
wgs.RNAseq_00000687     A0A0L9UG27.1    63.2    38      14      0       6       119     250     287     1.0e-04 55.1
wgs.RNAseq_00000688     A0A0D2QMP5.1    65.8    114     35      2       218     550     317     429     1.7e-34 153.3
wgs.RNAseq_00000688     A0A0D2QMP5.1    53.2    62      26      1       36      221     256     314     8.7e-10 71.2
wgs.RNAseq_00000688     A0A0D2TR34.1    65.8    114     35      2       218     550     322     434     1.7e-34 153.3
wgs.RNAseq_00000688     A0A0D2TR34.1    53.2    62      26      1       36      221     261     319     8.7e-10 71.2
wgs.RNAseq_00000688     F6H0G0.1        65.8    111     36      2       221     550     320     429     5.6e-33 148.3  

 

每一列是什么呢? BLASTn output format 6

Column headers:
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

 

 1.  qseqid  query (e.g., gene) sequence id
 2.  sseqid  subject (e.g., reference genome) sequence id
 3.  pident  percentage of identical matches
 4.  length  alignment length
 5.  mismatch  number of mismatches
 6.  gapopen  number of gap openings
 7.  qstart  start of alignment in query
 8.  qend  end of alignment in query
 9.  sstart  start of alignment in subject
 10.  send  end of alignment in subject
 11.  evalue  expect value
 12.  bitscore  bit score

 

 

比对结果的第三列和第四列非常有用,尤其是在鉴别软件stringtie等预测出来的转录本是否为有效转录本时,其实预测出来的转录本大部分都是没有意义的,但是又能部分hit到蛋白上,这时我们就只能选出比对最长的那个转录本,其余的可以看作是无效的转录本。

操作比较简单,先对比对长度排个序,从长到短。

1
cat extract_no_N_200.fasta.diamond.nr | sort -n -r -k4 > extract_no_N_200.fasta.diamond.nr.sort

其次就是利用python的panda模块去冗余就好了。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import pandas as pd
 
infile = "extract_no_N_200.fasta.diamond.nr.sort"
#infile = "test.sort"
 
df = pd.read_csv(infile, sep="\t", header=None)
df.columns = ["l1","l2","l3","l4","l5","l6","l7","l8","l9","l10","l11","l12"]
 
df1 = df.sort_values(['l4'],ascending=False)
 
df2 = df1.drop_duplicates(subset=[df1.columns[1]], keep = 'first')
 
df2.to_csv("rm_dup_protein_"+infile, sep="\t", index=False, header=False)
 
df3 = df2.sort_values(['l3'],ascending=False)
 
df4 = df3.drop_duplicates(subset=[df3.columns[0]], keep = 'first')
 
df4.to_csv("unique_protein_"+infile, sep="\t", index=False, header=False)

  

这样我们得到的转录本就基本上是无冗余的了。

 

然后对每个转录本取identical match比例最高的蛋白就好了。其实信息是可以全部保留的。

blast和diamond怎么只输出最优的hit?

 

1
blastn -query transcripts.fa -out transcripts.blast.txt -task megablast -db refseq_rna -num_threads 12 -evalue 1e-10 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -outfmt 7 -perc_identity 50 -max_target_seqs 1

  

 

待续~

posted @   Life·Intelligence  阅读(17457)  评论(0编辑  收藏  举报
(评论功能已被禁用)
编辑推荐:
· AI与.NET技术实操系列(二):开始使用ML.NET
· 记一次.NET内存居高不下排查解决与启示
· 探究高空视频全景AR技术的实现原理
· 理解Rust引用及其生命周期标识(上)
· 浏览器原生「磁吸」效果!Anchor Positioning 锚点定位神器解析
阅读排行:
· 全程不用写代码,我用AI程序员写了一个飞机大战
· DeepSeek 开源周回顾「GitHub 热点速览」
· 记一次.NET内存居高不下排查解决与启示
· MongoDB 8.0这个新功能碉堡了,比商业数据库还牛
· .NET10 - 预览版1新功能体验(一)
TOP
点击右上角即可分享
微信分享提示