对测序返还结果进行处理，并进行批量blast

引言

背景介绍

因为实验室用的seqHunter年代太老了，而且最近需要进行很多扩增，测序，所以考虑搭建一个简单的批量化处理流程。

软件

python
Windows下的cmd命令
ncbi的blast+

步骤

blast+的本地化搭建

在ncbi网站上下载blast+（好像blast+的运行需要JAVA环境）
安装
- 新建db文件夹作为数据库文件夹
- input和output作为输入输出文件夹
添加环境变量
- 变量名：BLASTDB ，变量值：\path\ncbi\blast-2.7.1\db
- 变量名：Path，变量值：\path\ncbi\blast-2.7.1\bin
构建数据库
- 数据准备
  - ncbi上可以下载别人上传的
  - 也可以选择把自己的几条需要变成一个fasta文件，从而作为建库文件
- 格式化数据库(这里我假装有把我所有的481同源序列变成了一个fasta文件)

cd path\blast-2.7.1\db # 把当前目录切到db下，这些命令都是在cmd下敲的。    
makeblastdb -in 481.fasta -dbtype nucl -parse_seqids -hash_index -out 481_nucl

参数说明

-in 需要建库的fasta序列  
-dbtype 如果是蛋白库则用prot，核酸库用nucl  
-out 所建数据库的名称  
-parse_seqids 和 -hash_index 一般都加上，解释如下:  
-parse_seqids => Parse Seq-ids in FASTA input         # 我也不懂这是干啥  
-hash_index => Create index of sequence hash values

对公司返还的测序数据进行整理

去除ab1结尾的文件

from path import Path
import os
a = input('input your path')
d = Path(a)
for f in d.walk():   # 遍历你选择的目录下所有的文件名
    if f.endswith('ab1'): # 如果你的文件名后缀是ab1
        os.remove(f) # 移除文件

将后缀名为seq的改为fasta（因为好像要比对的文件必须是fasta格式）

for f in d.walk():
    if f.endswith('seq'):
        portion = os.path.splitext(f) # 这一步会把文件名拆成后缀加文件名的列表
        newname = portion[0] + '.fasta' # 选择其中的文件名，然后加上fasta
        os.rename(f, newname)

利用cmd的for in循环批量进行blast

for %i in (*fasta) do blastn -query %i -db ..\\db\481_nucl -evalue 1e-3 -outfmt 7 -out ..\output\%i.txt  
# 经过实践，最好是cd 到input文件夹下，即%i产生的文件名不要带有路径，不然最后的output无法输出

参数说明

-query 输入文件名，也就是需要比对的序列文件  
-db 格式化后的数据库名称  
-evalue 设定输出结果中的e-value阈值  
-out 输出文件名  
-num_alignments 输出比对上的序列的最大值条目数  
-num_threads 线程数  
此外还有：  
-num_descriptions 对比对上序列的描述信息，一般跟tabular格式连用  
-outfmt      
 0 = pairwise,  
 1 = query-anchored showing identities,  
 2 = query-anchored no identities,  
 3 = flat query-anchored, show identities,  
 4 = flat query-anchored, no identities,  
 5 = XML Blast output,  
 6 = tabular,  
 7 = tabular with comment lines,  
 8 = Text ASN.1,  
 9 = Binary ASN.1  
 10 = Comma-separated values

利用python自动对没有比对上的输出文件进行剔除

利用文件中包含有'0 hits found'

c = Path('F:/科研常用软件/ncbi/blast-2.7.1+/output')
for f in c.walk():
    with open(f) as fh:
        a = fh.read()
        if '0 hits found' in a:
            fh.close()
            # with open 应该是等整个执行完后自动close
            # 但我们这里由于程序在未结束的情况下，就要remove文件
            # 所以会出现程序被占用，无法结束的情况
            # f.close强制关闭文件，然后执行remove
            os.remove(f)

小结

据说有个叫ncbiwww的包可以进行批量化，下次可以试试
blast产生的结果还是要好好研究下

posted @ 2017-12-08 15:13 熵负阅读(3212) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

熵负

对测序返还结果进行处理，并进行批量blast

引言

背景介绍

软件

步骤

blast+的本地化搭建

对公司返还的测序数据进行整理

利用cmd的for in循环批量进行blast

利用python自动对没有比对上的输出文件进行剔除

小结

公告