1 Problem

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

2 Approach

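My first idea was to store all s.length() - 10 + 1 substrings and compare them to find duplicates: a HashMap to hold every 10-letter sequence, and a list to hold the ones that repeat. The judge reported that the memory limit was exceeded.

I then replaced the HashMap with another list, and this time the time limit was exceeded, since every lookup in a list is a linear scan.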
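For reference, the string-based baseline described above might look roughly like the sketch below. The original attempt is not shown in this post, so this is a reconstruction under assumptions: the class name `NaiveRepeatedDna` and method name `findRepeated` are made up for illustration, and sets are used in place of the HashMap and list. It gives correct answers, but it keeps one String object per window, which is the memory cost the rest of this section tries to avoid.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class name, for illustration only.
class NaiveRepeatedDna {

    // Baseline: remember every 10-letter window as a full String.
    static List<String> findRepeated(String s) {
        final int L = 10;
        Set<String> seen = new HashSet<>();            // every window seen so far
        Set<String> repeated = new LinkedHashSet<>();  // windows seen at least twice, in discovery order
        for (int i = 0; i + L <= s.length(); i++) {
            String window = s.substring(i, i + L);
            if (!seen.add(window)) {  // add() returns false when the window was already present
                repeated.add(window);
            }
        }
        return new ArrayList<>(repeated);
    }
}
```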

 

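A quick search online showed that, because the string contains only A, C, G, and T, each letter can be mapped to 0, 1, 2, or 3, which takes just two bits. A 10-letter sequence then needs 10 * 2 = 20 bits, instead of 10 characters at 16 bits each (20 bytes of raw character data for Java chars). That saves memory.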
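As a small illustration of the 2-bit mapping, the sketch below packs a sequence into a single int. The class name `DnaEncoding` and the method names `code` and `encode` are hypothetical, chosen only for this example. Unlike the solution code, it rejects characters other than A, C, G, and T, which is an assumption about how invalid input should be handled.

```java
// Hypothetical class name, for illustration only.
class DnaEncoding {

    // Map one nucleotide to its 2-bit code: A=0, C=1, G=2, T=3.
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Pack a sequence of at most 16 letters into one int, 2 bits per letter.
    static int encode(String seq) {
        int n = 0;
        for (int i = 0; i < seq.length(); i++) {
            n = (n << 2) | code(seq.charAt(i));  // make room for 2 bits, then store the code
        }
        return n;
    }

    public static void main(String[] args) {
        // "ACGT" becomes the bits 00 01 10 11, which is 27.
        System.out.println(encode("ACGT"));
        // Five A's contribute only zero bits, so this is 01 01 01 01 01 = 341.
        System.out.println(encode("AAAAACCCCC"));
    }
}
```

Because every 10-letter sequence maps to a distinct 20-bit number, comparing two sequences reduces to comparing two ints.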

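I mainly followed http://blog.csdn.net/dddongdong/article/details/43758603, which also explains the idea in more detail. Personally, I think the actual saving is smaller than that post claims, and that it depends on the language, the compiler, and the hardware architecture, but it certainly does save space.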

3 Code

import java.util.HashMap;
import java.util.LinkedList;

public class Solution {

    // Packs a sequence into an int, 2 bits per letter (A=0, C=1, G=2, T=3).
    // A Java int has 4 * 8 = 32 bits, more than the 10 * 2 = 20 bits needed here,
    // so it cannot overflow. Sequences longer than 16 letters would not fit.
    private int myHash(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            n <<= 2; // shift left by two bits to make room for the next letter
            char c = s.charAt(i);
            if (c == 'C') {
                n += 1;
            } else if (c == 'G') {
                n += 2;
            } else if (c == 'T') {
                n += 3;
            }
        }
        return n;
    }

    public LinkedList<String> findRepeatedDnaSequences(String s) {
        int L = 10;
        LinkedList<String> repeatedDnaSequences = new LinkedList<String>();
        if (s == null || s.length() < L) return repeatedDnaSequences;

        // Keys are the encoded 10-letter sequences seen so far.
        HashMap<Integer, Boolean> tenLettersHashMap = new HashMap<Integer, Boolean>();
        for (int i = 0; i <= s.length() - L; i++) {
            String string = s.substring(i, i + L);
            int dnaSequences = myHash(string);
            if (tenLettersHashMap.containsKey(dnaSequences)) {
                if (!repeatedDnaSequences.contains(string)) { // avoid adding the same string twice
                    repeatedDnaSequences.add(string);
                }
            } else {
                // The value (true or false) does not matter; only key lookup is used.
                tenLettersHashMap.put(dnaSequences, true);
            }
        }
        System.out.println(repeatedDnaSequences); // debug output
        return repeatedDnaSequences;
    }
}
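
The code above still builds a substring and re-encodes all 10 letters for every window. One possible variant, sketched below, keeps a rolling 20-bit value instead: shift in the next letter and mask off the oldest one. This is not the approach from the referenced post as far as I can tell, only an alternative under the same 2-bit encoding. The names `RollingRepeatedDna`, `code`, and `findRepeated` are hypothetical, and any character other than C, G, or T is treated as A, as in the code above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name, for illustration only.
class RollingRepeatedDna {

    // 2-bit code per letter; anything that is not C, G, or T counts as A (0).
    static int code(char c) {
        switch (c) {
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return 0;
        }
    }

    static List<String> findRepeated(String s) {
        final int L = 10;
        final int MASK = (1 << (2 * L)) - 1;  // keeps only the lowest 20 bits
        List<String> result = new ArrayList<>();
        if (s == null || s.length() < L) return result;

        Map<Integer, Integer> counts = new HashMap<>();  // encoded window -> times seen
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            // Append the new letter and drop the letter that left the window.
            hash = ((hash << 2) | code(s.charAt(i))) & MASK;
            if (i >= L - 1) {  // a full window ends at index i
                int count = counts.merge(hash, 1, Integer::sum);
                if (count == 2) {  // report each repeated window exactly once
                    result.add(s.substring(i - L + 1, i + 1));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [AAAAACCCCC, CCCCCAAAAA]
        System.out.println(findRepeated("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
    }
}
```

Counting occurrences in the map also removes the need for the linear `contains` check on the result list, because a sequence is added only when its count reaches 2.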