根据list调取特定基因的fasta序列

fasta序列:

>ORF type:5prime_partial len:117 (+) 2T_between1k2k_c100001_f2p26_1105:1-351(+)
CCCCAGGACATGAAGGGTGCCTCTCGAAGCCCCGAAGACAGCAGTCCGGATGCCGCCCGC
ATCCGAGTCAAGCGCTACCGCCAGAGCATGAACAACTTCCAGGGCCTCCGGAGCTTTGGC
TGCCGCTTCGGGACGTGCACGGTGCAGAAGCTGGCACACCAGATCTACCAGTTCACAGAT
AAGGACAAGGACAACGTCGCCCCCAGGAGCAAGATCAGCCCCCAGGGCTACGGCCGCCGG
CGCCGGCGCTCCCTGCCCGAGGCCGGCCCGGGTCGGACTCTGGTGTCTTCTAAGCCACAA
GCACACGGGGCTCCAGCCCCCCCGAGTGGAAGTGCTCCCCACTTTCTTTAG

tmp3.txt:

PBfusion.1 2T_between5k10k_c30267_f1p4_5060
PBfusion.1 2T_between5k10k_c123520_f1p4_5838
PBfusion.2 2T_between5k10k_c163523_f1p1_4968
PBfusion.3 2T_between5k10k_c153833_f1p86_5871
PBfusion.4 2T_between3k6k_c71197_f1p41_3951
PBfusion.5 2T_between5k10k_c121317_f1p276_6597
PBfusion.6 2T_between3k6k_c38282_f1p39_5894
PBfusion.7 2T_between1k2k_c73839_f1p13_1888

需要将tmp3中第2列的基因的序列调取出来,并合并第一列的编号:

f2 = open('tmp3.txt','r')
f1 = open('2T_transcripts.cds.fa','r')
f3 = open('fasta_parsed.txt','w')

import re

AI_DICT = {}

for line in f2:
    id = re.split('\t',line)
    AI_DICT[id[1][0:-1]] = id[0]


skip = 0
for line in f1:
    if line[0] == '>':
        _splitline = re.split(':| ',line)
        accessorIDWithArrow = _splitline[6]
        accessorID = accessorIDWithArrow
        #print (accessorID)
        if accessorID in AI_DICT.keys():
            f3.write(AI_DICT[accessorID]+"\n")
            f3.write(line)
            skip = 0
        else:
            skip = 1
    else:
        if not skip:
            f3.write(line)

f1.close()
f2.close()
f3.close()

  

  

posted @ 2017-11-27 14:47  qinqinyang  阅读(481)  评论(0编辑  收藏  举报