187. Repeated DNA Sequences
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
Meaning of the problem: find every run of 10 consecutive letters that occurs more than once in the given string.
Method 1:
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        List<String> res = new ArrayList<>();
        Hashtable<String, Integer> temp = new Hashtable<>();
        // Visit every substring of length 10. A substring that has not been seen yet
        // is stored in the hashtable with count 1; one that has been seen exactly
        // once so far is added to the result.
        for (int i = 0; i < s.length() - 9; i++) {
            String subString = s.substring(i, i + 10);
            if (temp.containsKey(subString)) {
                int count = temp.get(subString);
                // If the count is 1, add the substring to the result; otherwise keep scanning.
                if (count == 1) {
                    temp.put(subString, 2); // put overwrites the old count
                    res.add(subString);
                }
            } else {
                temp.put(subString, 1);
            }
        }
        return res;
    }
}
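As a quick sanity check of the hash-based idea, here is a minimal sketch that runs it on the example from the problem statement. It is a variant, not the code above: it uses two `HashSet`s instead of a count table, and the class name `RepeatedDnaTwoSets` and method name `findRepeated` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RepeatedDnaTwoSets {
    // Hypothetical helper (not from the original post): same sliding-window
    // idea as Method 1, but with two sets instead of a count table.
    static List<String> findRepeated(String s) {
        Set<String> seen = new HashSet<>();      // every window seen so far
        Set<String> repeated = new HashSet<>();  // windows already reported
        List<String> res = new ArrayList<>();
        for (int i = 0; i + 10 <= s.length(); i++) {
            String window = s.substring(i, i + 10);
            // Set.add returns false when the element is already present,
            // so this branch runs only on the second occurrence of a window.
            if (!seen.add(window) && repeated.add(window)) {
                res.add(window);
            }
        }
        return res;
    }

    public static void main(String[] args) {
        // Example input from the problem statement.
        System.out.println(findRepeated("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
        // prints [AAAAACCCCC, CCCCCAAAAA]
    }
}
```

Because the result is a list filled in scan order, the output order follows the position of each sequence's second occurrence.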
Method 2:
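// Goes inside the Solution class; needs java.util.ArrayList, HashSet, List and Set.
public List<String> findRepeatedDnaSequences(String s) {
    // http://www.cnblogs.com/grandyang/p/4284205.html
    // ASCII codes: A: 0100 0001, C: 0100 0011, G: 0100 0111, T: 0101 0100.
    // The goal is to tell the characters apart by their bits, using as few bits as
    // possible. The low three bits of the four characters are all different, so the
    // last three bits are enough to distinguish them.
    // The problem asks for sequences of 10 characters; at three bits per character,
    // 10 characters need 30 bits, which fits in a 32-bit int.
    // To stay within those 30 bits we need a mask, 0x7ffffff: it keeps the low 27 bits
    // (the last nine characters), and shifting left by three makes room for the next one.
    // The idea: once the tenth character has been read, store the encoded window in a set.
    // After that, each step shifts left by three bits to swap in one character and looks
    // up the new cur in the set. If it has appeared before, the current substring
    // (from i-10 to i) is added to the result; if it has never appeared, cur is added
    // to the set.
    // res is a Set so that a sequence occurring three or more times is reported once;
    // as a consequence, the order of the returned list is unspecified.
    Set<String> res = new HashSet<>();
    if (s.length() <= 10) return new ArrayList<>();
    int mask = 0x7ffffff;
    Set<Integer> set = new HashSet<>();
    int cur = 0, i = 0;
    // Preload the first nine characters.
    while (i < 9) {
        cur = (cur << 3) | (s.charAt(i++) & 7);
    }
    while (i < s.length()) {
        // Drop the oldest character, then append the next one.
        cur = ((cur & mask) << 3) | (s.charAt(i++) & 7);
        if (set.contains(cur)) {
            res.add(s.substring(i - 10, i));
        } else {
            set.add(cur);
        }
    }
    return new ArrayList<>(res);
}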
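The three-bit encoding described in the comments can be checked in isolation. The sketch below prints the low three bits of each nucleotide and confirms that one mask-and-shift step gives the same value as encoding the next window from scratch. The class `DnaBitEncodingDemo` and the helpers `encode` and `slide` are hypothetical names introduced only for this illustration.

```java
class DnaBitEncodingDemo {
    // Hypothetical helper: packs a window of at most 10 nucleotides into an int,
    // using the low three bits of each character's ASCII code.
    static int encode(String window) {
        int code = 0;
        for (int i = 0; i < window.length(); i++) {
            code = (code << 3) | (window.charAt(i) & 7);
        }
        return code;
    }

    // Hypothetical helper: drops the oldest character of a 10-character window
    // and appends `next`, mirroring the update step in Method 2.
    static int slide(int code, char next) {
        int mask = 0x7ffffff; // low 27 bits = the last nine characters
        return ((code & mask) << 3) | (next & 7);
    }

    public static void main(String[] args) {
        for (char c : "ACGT".toCharArray()) {
            System.out.println(c + " -> " + (c & 7));
        }
        // prints A -> 1, C -> 3, G -> 7, T -> 4 (one per line)

        String s = "AAAAACCCCCA";
        int first = encode(s.substring(0, 10));
        int second = slide(first, s.charAt(10));
        System.out.println(second == encode(s.substring(1, 11))); // prints true
    }
}
```

Since only four symbols occur, two bits per character would also be enough with an explicit lookup; the three-bit version has the convenience that `c & 7` works directly on the ASCII code.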