blast | diamond 输出结果选择和解析 | 比对

之前的文章:构建NCBI本地BLAST数据库 (NR NT等) | blastx/diamond使用方法 | blast构建索引 | makeblastdb

本地运行blast时,需要指定out format。

常见的网页版blast结果可以参照:Blast结果的详细解析

*** Formatting options
 -outfmt <String>
   alignment view options:
     0 = Pairwise,
     1 = Query-anchored showing identities,
     2 = Query-anchored no identities,
     3 = Flat query-anchored showing identities,
     4 = Flat query-anchored no identities,
     5 = BLAST XML,
     6 = Tabular,
     7 = Tabular with comment lines,
     8 = Seqalign (Text ASN.1),
     9 = Seqalign (Binary ASN.1),
    10 = Comma-separated values,
    11 = BLAST archive (ASN.1),
    12 = Seqalign (JSON),
    13 = Multiple-file BLAST JSON,
    14 = Multiple-file BLAST XML2,
    15 = Single-file BLAST JSON,
    16 = Single-file BLAST XML2,
    18 = Organism Report

   Options 6, 7 and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
   	    qseqid means Query Seq-id
   	       qgi means Query GI
   	      qacc means Query accesion
   	   qaccver means Query accesion.version
   	      qlen means Query sequence length
   	    sseqid means Subject Seq-id
   	 sallseqid means All subject Seq-id(s), separated by a ';'
   	       sgi means Subject GI
   	    sallgi means All subject GIs
   	      sacc means Subject accession
   	   saccver means Subject accession.version
   	   sallacc means All subject accessions
   	      slen means Subject sequence length
   	    qstart means Start of alignment in query
   	      qend means End of alignment in query
   	    sstart means Start of alignment in subject
   	      send means End of alignment in subject
   	      qseq means Aligned part of query sequence
   	      sseq means Aligned part of subject sequence
   	    evalue means Expect value
   	  bitscore means Bit score
   	     score means Raw score
   	    length means Alignment length
   	    pident means Percentage of identical matches
   	    nident means Number of identical matches
   	  mismatch means Number of mismatches
   	  positive means Number of positive-scoring matches
   	   gapopen means Number of gap openings
   	      gaps means Total number of gaps
   	      ppos means Percentage of positive-scoring matches
   	    frames means Query and subject frames separated by a '/'
   	    qframe means Query frame
   	    sframe means Subject frame
   	      btop means Blast traceback operations (BTOP)
   	    staxid means Subject Taxonomy ID
   	  ssciname means Subject Scientific Name
   	  scomname means Subject Common Name
   	sblastname means Subject Blast Name
   	 sskingdom means Subject Super Kingdom
   	   staxids means unique Subject Taxonomy ID(s), separated by a ';'
   			 (in numerical order)
   	 sscinames means unique Subject Scientific Name(s), separated by a ';'
   	 scomnames means unique Subject Common Name(s), separated by a ';'
   	sblastnames means unique Subject Blast Name(s), separated by a ';'
   			 (in alphabetical order)
   	sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
   			 (in alphabetical order)
   	    stitle means Subject Title
   	salltitles means All Subject Title(s), separated by a '<>'
   	   sstrand means Subject Strand
   	     qcovs means Query Coverage Per Subject
   	   qcovhsp means Query Coverage Per HSP
   	    qcovus means Query Coverage Per Unique Subject (blastn only)
   When not provided, the default value is:
   'qaccver saccver pident length mismatch gapopen qstart qend sstart send
   evalue bitscore', which is equivalent to the keyword 'std'
   Default = `0'

默认是0,也就是会输出比对的结果。

但是这样的结果显然不适合批量处理,批量处理的文件格式显然必须是dataframe。

所以网上有人推荐“outfmt 7 or 10 works perfect”,所以一般就选10吧。

Diamond的输出格式:

--outfmt (-f)          output format
	0   = BLAST pairwise
	5   = BLAST XML
	6   = BLAST tabular
	100 = DIAMOND alignment archive (DAA)
	101 = SAM

	Value 6 may be followed by a space-separated list of these keywords:

	qseqid means Query Seq - id
	qlen means Query sequence length
	sseqid means Subject Seq - id
	sallseqid means All subject Seq - id(s), separated by a ';'
	slen means Subject sequence length
	qstart means Start of alignment in query
	qend means End of alignment in query
	sstart means Start of alignment in subject
	send means End of alignment in subject
	qseq means Aligned part of query sequence
	sseq means Aligned part of subject sequence
	evalue means Expect value
	bitscore means Bit score
	score means Raw score
	length means Alignment length
	pident means Percentage of identical matches
	nident means Number of identical matches
	mismatch means Number of mismatches
	positive means Number of positive - scoring matches
	gapopen means Number of gap openings
	gaps means Total number of gaps
	ppos means Percentage of positive - scoring matches
	qframe means Query frame
	btop means Blast traceback operations(BTOP)
	staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)
	stitle means Subject Title
	salltitles means All Subject Title(s), separated by a '<>'
	qcovhsp means Query Coverage Per HSP
	qtitle means Query title

	Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

diamond就选6吧,便于批量处理。

 

diamond 比对转录本到Pfam库的部分结果,可以看到,格式6非常适合做批量处理。

wgs.RNAseq_00000687     M1AFS7.1        60.9    179     65      2       337     861     102     279     8.1e-55 221.5
wgs.RNAseq_00000687     A0A0B2PIQ0.1    61.0    177     66      2       337     861     319     494     4.0e-54 219.2
wgs.RNAseq_00000687     A0A0L9UG27.1    61.2    178     65      3       337     861     315     491     4.0e-54 219.2
wgs.RNAseq_00000687     A0A0L9UG27.1    63.2    38      14      0       6       119     250     287     1.0e-04 55.1
wgs.RNAseq_00000688     A0A0D2QMP5.1    65.8    114     35      2       218     550     317     429     1.7e-34 153.3
wgs.RNAseq_00000688     A0A0D2QMP5.1    53.2    62      26      1       36      221     256     314     8.7e-10 71.2
wgs.RNAseq_00000688     A0A0D2TR34.1    65.8    114     35      2       218     550     322     434     1.7e-34 153.3
wgs.RNAseq_00000688     A0A0D2TR34.1    53.2    62      26      1       36      221     261     319     8.7e-10 71.2
wgs.RNAseq_00000688     F6H0G0.1        65.8    111     36      2       221     550     320     429     5.6e-33 148.3  

 

每一列是什么呢? BLASTn output format 6

Column headers:
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

 

 1.  qseqid  query (e.g., gene) sequence id
 2.  sseqid  subject (e.g., reference genome) sequence id
 3.  pident  percentage of identical matches
 4.  length  alignment length
 5.  mismatch  number of mismatches
 6.  gapopen  number of gap openings
 7.  qstart  start of alignment in query
 8.  qend  end of alignment in query
 9.  sstart  start of alignment in subject
 10.  send  end of alignment in subject
 11.  evalue  expect value
 12.  bitscore  bit score

 

 

比对结果的第三列和第四列非常有用,尤其是在鉴别软件stringtie等预测出来的转录本是否为有效转录本时,其实预测出来的转录本大部分都是没有意义的,但是又能部分hit到蛋白上,这时我们就只能选出比对最长的那个转录本,其余的可以看作是无效的转录本。

操作比较简单,先对比对长度排个序,从长到短。

cat extract_no_N_200.fasta.diamond.nr | sort -n -r -k4 > extract_no_N_200.fasta.diamond.nr.sort

其次就是利用python的panda模块去冗余就好了。

import pandas as pd

infile = "extract_no_N_200.fasta.diamond.nr.sort"
#infile = "test.sort"

df = pd.read_csv(infile, sep="\t", header=None)
df.columns = ["l1","l2","l3","l4","l5","l6","l7","l8","l9","l10","l11","l12"]

df1 = df.sort_values(['l4'],ascending=False)

df2 = df1.drop_duplicates(subset=[df1.columns[1]], keep = 'first')

df2.to_csv("rm_dup_protein_"+infile, sep="\t", index=False, header=False)

df3 = df2.sort_values(['l3'],ascending=False)

df4 = df3.drop_duplicates(subset=[df3.columns[0]], keep = 'first')

df4.to_csv("unique_protein_"+infile, sep="\t", index=False, header=False)

  

这样我们得到的转录本就基本上是无冗余的了。

 

然后对每个转录本取identical match比例最高的蛋白就好了。其实信息是可以全部保留的。

blast和diamond怎么只输出最优的hit?

 

blastn -query transcripts.fa -out transcripts.blast.txt -task megablast -db refseq_rna -num_threads 12 -evalue 1e-10 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -outfmt 7 -perc_identity 50 -max_target_seqs 1

  

 

待续~

posted @ 2018-03-19 16:53  Life·Intelligence  阅读(17252)  评论(0编辑  收藏  举报
TOP