Python处理多行文本问题--一个简单方法读取多行fasta文件
在处理fasta序列时,常常会遇到一条序列多行排列的现象,如下所示:
$cat test.fasta >test_1 TGGGGAATCTTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGAGCGATGAAGGCCTTAGGGTTGTAAAGCTCT TTCAGCTGGGAAGATAATGACGGTACCAGCAGAAGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGG GGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTGTTAAGTCGGGGGTGAAATCCCGGGGCTCAA CCCCGGAACTGCCTCCGATACTGGCAATCTTGAGATCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTA GATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAA CAGG >test_3 TGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTGCGGGTTGTAAAGCACT TTCAGTAGAGAAGAAATGCCCATGGTTAATACCCGTGGGTCTTGACGTAACCTACAGAAGAAGCACCGGCTAACTCCGTG CCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGGTCAG TCGGATGTGAAAGCCCTAGGCTCAACCTGGGAATGGCATTCGATACTGCCTGACTAGAGTATGGTAGAGGGAAGTGGAAT TTCCGGTGTAGCGGTGAAATGCGTAGATATCGGAAGGAACACCAGTGGCGAAGGCGACTTCCTGGGCCAATACTGACGCT GAGGTGCGAAAGCGTGGGGAGCAAACAGG >test_4 TGGGGAATTTTGGGCAATGGGCGAAAGCCTGACCCAGCAACGCCGCGTGGAGGATGAAGGCCCTCGGGTCGTAAACTCCT GTCCTAGGGGAAGAAAAAAATGACGGTACCCTTGGAGGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGG GGGGGGGGAGCGGTGTTCGGAATTACTGGGCGTAAAGGGCGCGCAGGCGGCCTGGGAAGTCTTGGGTGAAAGCCCCCAGC TCAACTGGGGAATGGCCTGAGAAACCACTAGGCTGGAGTGCTGGAGAGGGAAGCGGAATTCCCGGTGGAGCGGTGAAATG CGTAGATATCGGGAGGAACACCAGAGGCGAAGGCGGCTTCCTGGACAGACACTGACGCTGAGGCGCGAAAGCTAGGGGAG CAAACGGG >test_5 TGGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCT TTCACCGGTGAAGATAATGACGGTAACCGGAGAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGG GGCTAGCGTTGTTCGGATTTACTGGGCGTAAAGCGCACGTAGGCGGACTATTAAGTCAGGGGTGAAATCCCGGGGCTCAA CCCCGGAACTGCCTTTGATACTGGTAGTCTTGAGTTCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTA GATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAA CAGG >test_6 GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACACCGCGTGCGGGATGAAGGCCCTCGGGTTGTAAACCGCTTTC AGGAGGGACGAAAATGACGGTACCTCCAGAAGAAGGCCCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGCC AAACGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGCGGTTCAACAAGTCGATCGTGAAAGCCCGGGGCTCAACCC CGGGACGCCGGTCGAAACTGTTGTGACTAGGGTCCGGTAGAGGTGAGTGGAATTCTCGGTGTAGCGGTGGAATGCGCAGA TATCGAGAGGAACACCAGTTGCGAAGGCGGCTCACTGGGCCGGTACCGACGCTAAGGAGCGAAAGCGTGGGGAGCAAACA GG
我的一个简单处理方法是,【整体读入-->分隔符分割为列表-->字符串合并列表】,代码如下:
seq_file=open("test.fasta") seq_list=seq_file.read().split(">") for seq in seq_list : if seq : seq_name=seq.split("\n")[0] seq_fa="".join(seq.split("\n")[1:]) print ">" + seq_name + "\n" + seq_fa
打印结果为:
>test_1 TGGGGAATCTTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGAGCGATGAAGGCCTTAGGGTTGTAAAGCTCTTTCAGCTGGGAAGATAATGACGGTACCAGCAGAAGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTGTTAAGTCGGGGGTGAAATCCCGGGGCTCAACCCCGGAACTGCCTCCGATACTGGCAATCTTGAGATCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG >test_3 TGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTGCGGGTTGTAAAGCACTTTCAGTAGAGAAGAAATGCCCATGGTTAATACCCGTGGGTCTTGACGTAACCTACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGGTCAGTCGGATGTGAAAGCCCTAGGCTCAACCTGGGAATGGCATTCGATACTGCCTGACTAGAGTATGGTAGAGGGAAGTGGAATTTCCGGTGTAGCGGTGAAATGCGTAGATATCGGAAGGAACACCAGTGGCGAAGGCGACTTCCTGGGCCAATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG >test_4 TGGGGAATTTTGGGCAATGGGCGAAAGCCTGACCCAGCAACGCCGCGTGGAGGATGAAGGCCCTCGGGTCGTAAACTCCTGTCCTAGGGGAAGAAAAAAATGACGGTACCCTTGGAGGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGGGGGGGGGGAGCGGTGTTCGGAATTACTGGGCGTAAAGGGCGCGCAGGCGGCCTGGGAAGTCTTGGGTGAAAGCCCCCAGCTCAACTGGGGAATGGCCTGAGAAACCACTAGGCTGGAGTGCTGGAGAGGGAAGCGGAATTCCCGGTGGAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGAGGCGAAGGCGGCTTCCTGGACAGACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACGGG >test_5 TGGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTCACCGGTGAAGATAATGACGGTAACCGGAGAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGTTCGGATTTACTGGGCGTAAAGCGCACGTAGGCGGACTATTAAGTCAGGGGTGAAATCCCGGGGCTCAACCCCGGAACTGCCTTTGATACTGGTAGTCTTGAGTTCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG >test_6 GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACACCGCGTGCGGGATGAAGGCCCTCGGGTTGTAAACCGCTTTCAGGAGGGACGAAAATGACGGTACCTCCAGAAGAAGGCCCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGCCAAACGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGCGGTTCAACAAGTCGATCGTGAAAGCCCGGGGCTCAACCCCGGGACGCCGGTCGAAACTGTTGTGACTAGGGTCCGGTAGAGGTGAGTGGAATTCTCGGTGTAGCGGTGGAATGCGCAGATATCGAGAGGAACACCAGTTGCGAAGGCGGCTCACTGGGCCGGTACCGACGCTAAGGAGCGAAAGCGTGGGGAGCAAACAGG