查找fasta文件中每条染色体的gc含量

test.fa

>chr_1

ATCGTCGaaAATGAANccNNttGTA

AGGTCTNAAccAAttGggG

>chr_2

ATCGAATGATCGANNNGccTA

AGGTCTNAAAAGG

>chr_3

ATCGTCGANNNGTAATggGA

AGGTCTNAAAAGG

>chr_4

ATCGTCaaaGANNAATGANGgggTA

python:
 
from collections import OrderedDict
seq = ''
chr_d = OrderedDict()
with open("test.fa","r") as f:
    for line in f:
        #line = line.rstrip()
        line=line.strip()
        if line.startswith(">"):
            chr_d[seq] = ''
        else:
            line = line.upper()
            chr_d[seq] += line
    for name,chr_seq in chr_d.items(): #此方法返回元组对的列表
        #print (name,chr_seq)
        seqLen = len(chr_seq)
        N = chr_seq.count("N")
        GC = chr_seq.count("G") + chr_seq.count("C")
        print (name)
        print (seqLen,N/seqLen,GC/(seqLen-N))
 
result:
>chr_1
44 0.09090909090909091 0.425
>chr_2
34 0.11764705882352941 0.43333333333333335
>chr_3
33 0.12121212121212122 0.4482758620689655
>chr_4
25 0.12 0.4090909090909091
posted @ 2017-01-12 21:25  qinqinyang  阅读(747)  评论(0编辑  收藏  举报