16 Finding a Protein Motif
Problem
To allow for the presence of its varying forms, a protein motif is represented by a shorthand as follows: [XY] means "either X or Y" and {X} means "any amino acid except X." For example, the N-glycosylation motif is written as N{P}[ST]{P}.
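The shorthand maps almost directly onto regular-expression character classes: [XY] is already a valid class, and {X} becomes the negated class [^X]. A minimal sketch of that translation follows; the helper name motif_to_regex is hypothetical and not part of the problem statement.

```python
import re

def motif_to_regex(motif):
    """Translate motif shorthand into a regex pattern string.

    [XY] is left untouched (it is already a regex character class);
    {X} is rewritten as the negated class [^X].
    """
    return re.sub(r'\{([A-Z]+)\}', r'[^\1]', motif)

pattern = motif_to_regex('N{P}[ST]{P}')
print(pattern)  # N[^P][ST][^P]

# Quick check on made-up four-letter strings.
print(bool(re.fullmatch(pattern, 'NAST')))  # True: A is not P, T is not P
print(bool(re.fullmatch(pattern, 'NPST')))  # False: P follows N
```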
To see the complete description and features of a particular protein in the UniProt database, insert its access ID "uniprot_id" into the following URL:
http://www.uniprot.org/uniprot/uniprot_id
Alternatively, you can obtain a protein sequence in FASTA format by following
http://www.uniprot.org/uniprot/uniprot_id.fasta
For example, the data for protein B5ZC00 can be found at http://www.uniprot.org/uniprot/B5ZC00.
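A FASTA record consists of one header line starting with ">" followed by the sequence wrapped over several lines. The sketch below shows one way to split such a record into header and sequence. The function name parse_fasta and the sample record are made up for illustration and are not real UniProt data.

```python
def parse_fasta(text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = text.strip().splitlines()
    header = lines[0][1:]  # drop the leading '>'
    # Join the wrapped sequence lines into one string.
    sequence = ''.join(line.strip() for line in lines[1:])
    return header, sequence

# Hypothetical record, for illustration only.
example = ">hypothetical_record made-up example\nMKNAST\nGGNPSA\n"
header, sequence = parse_fasta(example)
print(header)    # hypothetical_record made-up example
print(sequence)  # MKNASTGGNPSA
```

Taking everything after the first line as sequence avoids any assumption about which amino acid the protein starts with.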
Given: At most 15 UniProt Protein Database access IDs.
Return: For each protein possessing the N-glycosylation motif, output its given access ID followed by a list of locations in the protein string where the motif can be found.
Sample Dataset
A2Z669
B5ZC00
P07204_TRBM_HUMAN
P20840_SAG1_YEAST
Sample Output
B5ZC00
85 118 142 306 395
P07204_TRBM_HUMAN
47 115 116 382 409
P20840_SAG1_YEAST
79 109 135 248 306 348 364 402 485 501 614
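Positions 115 and 116 in the sample output are adjacent, so occurrences of the motif can overlap. A plain regex scan consumes all four characters of each match and can skip an occurrence that starts inside the previous one. Wrapping the pattern in a lookahead makes each match zero-width, so every start position is tested. The sketch below compares the two on a made-up string; motif_positions is a hypothetical helper name.

```python
import re

def motif_positions(seq):
    """1-based start positions of N{P}[ST]{P}, overlapping ones included."""
    lookahead = re.compile(r'(?=N[^P][ST][^P])')
    return [m.start() + 1 for m in lookahead.finditer(seq)]

seq = 'NNTSA'  # made-up string with two overlapping occurrences

# Without a lookahead, the first match consumes 'NNTS' and hides the second.
naive = [m.start() + 1 for m in re.finditer(r'N[^P][ST][^P]', seq)]
print(naive)                 # [1]
print(motif_positions(seq))  # [1, 2]
```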
# coding=utf-8
import re
import urllib.request

ids = ['A2Z669', 'B5ZC00', 'P07204_TRBM_HUMAN', 'P20840_SAG1_YEAST']

# N{P}[ST]{P}: only the N is consumed and the rest sits in a lookahead,
# so overlapping occurrences are all reported.
regex = re.compile(r'N(?=[^P][ST][^P])')

for one in ids:
    name = one.strip('\n')
    url = 'http://www.uniprot.org/uniprot/' + name + '.fasta'
    response = urllib.request.urlopen(url)
    the_page = response.read().decode('utf-8')
    # Drop the FASTA header (first line) and join the sequence lines.
    seq = ''.join(the_page.split('\n')[1:])
    # finditer yields every match; +1 converts 0-based offsets to 1-based positions.
    out = [m.start() + 1 for m in regex.finditer(seq)]
    if out:
        print(name)
        print(' '.join(str(i) for i in out))