统计 fasta 文件序列长度及 GC 含量
注:该脚本适用于序列不断开的情况
可用一下命令将折行的序列合并为一行
awk '/^>/ {printf("\n%s\t",$0);next;} {printf("%s",$0);} END {printf("\n");}' < assembly.fasta | egrep -v '^$' | tr "\t" "\n" > assembly.fas
运行脚本
python script.py assembly.fas
import sys
def count_bases(seq):
counts = {}
for base in seq:
if base not in counts:
counts[base] = 1
else:
counts[base] = counts[base] + 1
return counts
def gc_content(seq):
counts = count_bases(seq)
return len(seq), (counts["G"] + counts["C"]) / float(len(seq))
out_file = open(sys.argv[2], 'w')
for line in open(sys.argv[1]):
line = line.rstrip()
if line.startswith(">"):
print >> out_file, line.strip(">"),
else:
print >> out_file, "%s %s" % gc_content(line)
out_file.close()
升级版,输入文件是 fasta 格式即可。用 Bio 中的 Seq.IO 解析 fasta 文件,
用 python 的内置函数 count() 的计算速度更快。
import sys
from Bio import SeqIO
out_file = open(sys.argv[2], 'w')
for rec in SeqIO.parse(open(sys.argv[1]), 'fasta'):
print >> out_file, rec.id, len(rec.seq), (rec.seq.count("C") + rec.seq.count("G")) / float(len(rec.seq))
out_file.close()