列出给出序列的crispr 引物

手动寻找cripsr 引物比较麻烦,而现在有些网站可以完成这一任务,但是,用python 去实现它也很简单。以下是脚本:

 1 #!/usr/bin/python
 2 # list all crispr-target(20 bp + NGG)
 3 
 4 import re
 5 from Bio.Seq import reverse_complement   # use Biopython module
 6 
 7 genome_seq = open("seq.txt")
 8 crispr_list = open("crispr_list.txt", "a")    # 'a' 指的是 append
 9 
10 sequence = ""
11 for line in genome_seq:
12     sequence = sequence + line.rstrip("\n")  # use str.rstrip() method
13 
14 def GC_number(seq):
15     num_G = seq.count('G')
16     num_C = seq.count('C')
17     GC_number = num_G + num_C
18     return GC_number
19 
20 for i in range(0, len(sequence)):
21     target = sequence[i:i+23]
22     
23     if re.search(r'[ATCG]{21}GG$', target):    # use regular expression
24     start = i + 1
25     end   = i + 20
26     target = "+ " + target + " from " + str(start) + " to " + str(end) +  "---" + "GC_number = " +  str(GC_number(target))
27     crispr_list.write(target + "\n")     # add '\n'
28     #print(target)
29 
30 RC_sequence = reverse_complement(sequence)   # 调用上面的 reverse_complement() method
31 
32 for i in range(0, len(RC_sequence)):
33     target = RC_sequence[i:i+23]
34     
35     if re.search(r'[ATCG]{21}GG$', target):
36     start = i + 1
37     end   = i + 20
38     target = "- " + target + " from " + str(start) + " to " + str(end) +  "---" + "GC_number = " +  str(GC_number(target))
39     crispr_list.write(target + "\n")
40     #print(target)
41 
42 genome_seq.close()
43 crispr_list.close()

理论上,这正则表达式更直接和简单,但是以下代码的结果远远少于上面方法的结果:

1 dna = sequence
2 runs = re.findall(r'[ATCG]{21}GG', dna)
3 for match in runs:
4     #print(str(match))

目前还不是很清楚原因。

 

posted on 2015-09-14 23:19  OA_maque  阅读(657)  评论(0编辑  收藏  举报

导航