1 Problem
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
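2 Approach
My first idea was to store all s.length() - 9 ten-letter substrings and compare them for duplicates: a HashMap holding every 10-letter sequence seen so far, plus a list for the ones that repeat. That submission failed with a memory limit exceeded error.
I then swapped the HashMap for a second list, and the submission exceeded the time limit instead, since every lookup in a list is a linear scan.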
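For reference, the first attempt can be sketched roughly as follows. This is only an illustration: it assumes a HashSet of substrings in place of the HashMap mentioned above, and the class and method names (NaiveRepeatedDna, findRepeated) are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class NaiveRepeatedDna {
    // Hypothetical helper: keeps every 10-letter window as a full String.
    static List<String> findRepeated(String s) {
        final int L = 10;
        Set<String> seen = new HashSet<>();           // every window seen so far
        Set<String> repeated = new LinkedHashSet<>(); // windows seen at least twice, in discovery order
        if (s == null || s.length() < L) {
            return new ArrayList<>(repeated);
        }
        for (int i = 0; i <= s.length() - L; i++) {
            String window = s.substring(i, i + L);
            // Set.add returns false when the element is already present.
            if (!seen.add(window)) {
                repeated.add(window);
            }
        }
        return new ArrayList<>(repeated);
    }
}
```

Each stored window is a separate String object, which is where the memory goes; the encoding discussed next replaces each of them with a single int.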
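A quick search online showed that, because the string contains only A, C, G, and T, each letter can be mapped to 0, 1, 2, or 3, which takes just two bits. A 10-letter sequence then shrinks from 10 × 16 bits of character data (a Java char is 16 bits wide) to 10 × 2 = 20 bits, which fits in a single int and saves memory.
I followed http://blog.csdn.net/dddongdong/article/details/43758603, which explains the idea in more detail. Personally I don't think the saving is as large as that post claims, since it depends on the language, the compiler, and the hardware, but it certainly does save space.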
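To make the two-bit encoding concrete, here is a small sketch. The class and method names (DnaEncodingSketch, code, encode) are hypothetical, and rejecting letters other than A, C, G, T with an exception is my own assumption rather than something the problem requires.

```java
final class DnaEncodingSketch {
    // Map one nucleotide to its two-bit code: A=0, C=1, G=2, T=3.
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Pack a sequence of at most 16 letters into one int, two bits per letter,
    // with the first letter ending up in the highest-order bits used.
    static int encode(String seq) {
        int n = 0;
        for (int i = 0; i < seq.length(); i++) {
            n = (n << 2) | code(seq.charAt(i));
        }
        return n;
    }

    public static void main(String[] args) {
        // "ACGT" becomes the bit pattern 00 01 10 11, i.e. 27.
        System.out.println(encode("ACGT"));
        System.out.println(Integer.toBinaryString(encode("AAAAACCCCC")));
    }
}
```

Two different 10-letter sequences always produce different ints, so the int can stand in for the string as a key.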
3 Code
import java.util.HashMap;
import java.util.LinkedList;

public class Solution {

    // A Java int has 4 * 8 = 32 bits, more than the 10 * 2 = 20 bits needed here,
    // so nothing overflows. If L were greater than 16 this encoding would break.
    private int myHash(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            n <<= 2; // shift left by two bits; each letter takes two bits
            char c = s.charAt(i);
            if (c == 'C') {
                n += 1;
            } else if (c == 'G') {
                n += 2;
            } else if (c == 'T') {
                n += 3;
            } // 'A' maps to 0, so nothing is added
        }
        return n;
    }

    public LinkedList<String> findRepeatedDnaSequences(String s) {
        int L = 10;
        LinkedList<String> repeatedDnaSequences = new LinkedList<String>();
        if (s == null || s.length() < L) return repeatedDnaSequences;

        HashMap<Integer, Boolean> tenLettlesHashMap = new HashMap<Integer, Boolean>();
        for (int i = 0; i <= s.length() - L; i++) {
            String string = s.substring(i, i + L);
            int dnaSequences = myHash(string);
            if (tenLettlesHashMap.containsKey(dnaSequences)) {
                // avoid adding the same string twice
                if (!repeatedDnaSequences.contains(string)) {
                    repeatedDnaSequences.add(string);
                }
            } else {
                // the value (true or false) is irrelevant; only key presence is used
                tenLettlesHashMap.put(dnaSequences, true);
            }
        }
        System.out.println(repeatedDnaSequences);
        return repeatedDnaSequences;
    }
}
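One possible refinement, not part of the solution above, is to avoid re-encoding every window from scratch: shift the previous code left by two bits, add the new letter, and mask back to 20 bits. The sketch below assumes that rolling update and uses two HashSet<Integer> instances instead of the HashMap and the list lookup; all names (RollingDnaSketch, MASK, findRepeated) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RollingDnaSketch {
    private static final int L = 10;
    // Keeps only the lowest 2 * L = 20 bits, i.e. the last L letters.
    private static final int MASK = (1 << (2 * L)) - 1;

    private static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    static List<String> findRepeated(String s) {
        List<String> result = new ArrayList<>();
        if (s == null || s.length() < L) {
            return result;
        }
        Set<Integer> seen = new HashSet<>();     // codes of all windows seen so far
        Set<Integer> reported = new HashSet<>(); // codes already added to the result
        int window = 0;
        for (int i = 0; i < s.length(); i++) {
            // Slide the window: drop the oldest letter, append the newest.
            window = ((window << 2) | code(s.charAt(i))) & MASK;
            if (i < L - 1) {
                continue; // fewer than L letters read so far
            }
            // add() returns false if the code was already present.
            if (!seen.add(window) && reported.add(window)) {
                result.add(s.substring(i - L + 1, i + 1));
            }
        }
        return result;
    }
}
```

With this variant a substring is only materialized when a repeat is actually reported, and the duplicate check on the result no longer scans a list.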