10 Consensus and Profile
Problem
A matrix is a rectangular table of values divided into rows and columns. An m×nm×n matrix has mm rows and nn columns. Given a matrix AA, we write Ai,jAi,j to indicate the value found at the intersection of row ii and column jj.
Say that we have a collection of DNA strings, all having the same length nn. Their profile matrix is a 4×n4×n matrix PP in which P1,jP1,j represents the number of times that 'A' occurs in the jjth position of one of the strings, P2,jP2,j represents the number of times that C occurs in the jjth position, and so on (see below).
A consensus string cc is a string of length nn formed from our collection by taking the most common symbol at each position; the jjth symbol of cc therefore corresponds to the symbol having the maximum value in the jj-th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.
A T C C A G C T | |
G G G C A A C T | |
A T G G A T C T | |
DNA Strings | A A G C A A C C |
T T G G A A C T | |
A T G C C A T T | |
A T G G C A C T | |
A 5 1 0 0 5 5 0 0 | |
Profile | C 0 0 1 4 2 0 6 1 |
G 1 1 6 3 0 1 0 0 | |
T 1 5 0 0 0 1 1 6 | |
Consensus | A T G C A A C T |
Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)
Sample Dataset
>Rosalind_1 ATCCAGCT >Rosalind_2 GGGCAACT >Rosalind_3 ATGGATCT >Rosalind_4 AAGCAACC >Rosalind_5 TTGGAACT >Rosalind_6 ATGCCATT >Rosalind_7 ATGGCACT
Sample Output
ATGCAACT A: 5 1 0 0 5 5 0 0 C: 0 0 1 4 2 0 6 1 G: 1 1 6 3 0 1 0 0 T: 1 5 0 0 0 1 1 6
方法一:
#-*-coding:UTF-8-*- ### 10. Consensus and Profile ### from collections import Counter from collections import OrderedDict fh = open('consesus_and_profile_output2.txt', 'wt') rosalind = OrderedDict() seqLength = 0 with open('Sample Dataset.txt') as f: for line in f: line = line.rstrip() if line.startswith('>'): rosalindName = line rosalind[rosalindName] = '' continue rosalind[rosalindName] += line seqLength = len(rosalind[rosalindName]) #len(ATCCAGCT) A, C, G, T = [], [], [], [] consensusSeq = '' for i in range(seqLength): seq = '' for k in rosalind.keys(): # rosalind.keys = Rosalind_1...Rosalind_7 seq += rosalind[k][i] # 把 Rosalind_1...Rosalind_7 相同顺序的碱基加起来 A.append(seq.count('A')) C.append(seq.count('C')) G.append(seq.count('G')) T.append(seq.count('T')) counts = Counter(seq) # 为seq计数 consensusSeq += counts.most_common()[0][0] #从多到少返回,是个list啊,只返回第一个 fh.write(consensusSeq + '\n') fh.write('\n'.join(['A:\t' + '\t'.join(map(str, A)), 'C:\t' + '\t'.join(map(str, C)), 'G:\t' + '\t'.join(map(str, G)), 'T:\t' + '\t'.join(map(str, T))])) #.join(map(str,A)) 把 A=[5, 1, 0, 0, 5, 5, 0, 0] 格式转化成 A:5 1 0 0 5 5 0 0 fh.close()
方法二:
# coding=utf-8 import numpy as np import os from collections import Counter fhand = open('Sample Dataset.txt') t = [] for line in fhand: if line.startswith('>'): continue else: line = line.rstrip() line_list = list(line) #变成一个list t.append(line_list) # 再把list 放入list a = np.array(t) # 创建一个二维数组 L1, L2, L3, L4 = [], [], [], [] comsquence = '' for i in range(a.shape[1]): # shape[0],是7 行数,shape[1]是8 项目数 l = [x[i] for x in a] # 调出二维数组的每一列 L1.append(l.count('A')) L2.append(l.count('C')) L3.append(l.count('T')) L4.append(l.count('G')) comsquence = comsquence + Counter(l).most_common()[0][0] print comsquence print 'A:', L1, '\n', 'T:', L2, '\n', 'C:', L3, '\n', 'G:', L4