Leetcode: Repeated DNA Sequence

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

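Method 2: A further approach is to use a HashSet. Traverse the string once (O(N) windows), taking each substring of length 10; if it has been seen before, add it to the result. This needs O(N) space, more precisely O(N * 10 bytes); in Java a char is 2 bytes, so O(N * 20 bytes). Once the input string gets large this hits MLE (memory limit exceeded).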
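A minimal sketch of Method 2, assuming the input contains only the letters A, C, G and T. The class and method names (SubstringSetSketch, findRepeated) are illustrative, not taken from the original post.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SubstringSetSketch {
    // Returns every 10-letter substring that occurs more than once in s.
    static List<String> findRepeated(String s) {
        Set<String> seen = new HashSet<>();      // windows seen so far
        Set<String> repeated = new HashSet<>();  // windows seen at least twice
        for (int i = 0; i + 10 <= s.length(); i++) {
            String window = s.substring(i, i + 10);
            if (!seen.add(window)) {             // add() returns false if already present
                repeated.add(window);
            }
        }
        return new ArrayList<>(repeated);
    }
}
```

Every window is stored as its own String, which is where the memory cost described above comes from.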

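Best solution: build on Method 2 with bit operations. The idea is to map each string to an integer, then use shifts and bitwise AND on that integer to obtain the value of the next substring. Each update is a handful of integer operations instead of creating and hashing a new 10-character String, which saves running time.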

First, encode A, C, G and T in binary:

A -> 00

C -> 01

G -> 10

T -> 11

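With this encoding, every 10-character string corresponds to one number, and a 10-character string takes 20 bits. An int is typically 4 bytes (32 bits), so a single int can represent a 10-character string. For example: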

ACGTACGTAC -> 00011011000110110001

AAAAAAAAAA -> 00000000000000000000
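The encoding above could be written roughly as follows. This is only a sketch; the names DnaEncodingSketch, code and encode are made up for illustration.

```java
final class DnaEncodingSketch {
    // 2-bit code for one nucleotide: A=00, C=01, G=10, T=11.
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Packs up to 16 nucleotides into an int, first letter in the highest bits.
    static int encode(String seq) {
        int value = 0;
        for (int i = 0; i < seq.length(); i++) {
            value = (value << 2) | code(seq.charAt(i));
        }
        return value;
    }
}
```

With these definitions, encode("ACGTACGTAC") equals the binary literal 0b00011011000110110001 and encode("AAAAAAAAAA") equals 0, matching the two examples.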

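Sliding the window one character to the right corresponds to shifting the string's int value left by 2 bits, setting its lowest 2 bits to the code of the new character, and finally clearing the 2 bits that moved above the low 20 bits (for example by ANDing with (1 << 20) - 1).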
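The window update can be sketched as a single expression. The names RollingCodeSketch, MASK and slide are hypothetical, and the sketch assumes the input contains only A, C, G and T (indexOf would return -1 for anything else).

```java
final class RollingCodeSketch {
    static final int MASK = (1 << 20) - 1; // low 20 bits = 10 nucleotides

    // A=0, C=1, G=2, T=3; assumes c is one of these four letters.
    static int code(char c) {
        return "ACGT".indexOf(c);
    }

    // Shift left by 2 bits, put the new code in the low 2 bits,
    // then drop the 2 bits of the character that left the window.
    static int slide(int window, char next) {
        return ((window << 2) | code(next)) & MASK;
    }
}
```

For example, sliding the code of "ACGTACGTAC" with the next character 'G' gives the code of "CGTACGTACG", which is 0b01101100011011000110.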

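Cost analysis:

Time complexity is O(N). Each window update is a constant number of shift and mask operations rather than a 10-character substring copy and hash, so this method also saves running time in practice.

It saves space: 10 chars originally take 10 bytes (20 bytes in Java), while the encoded form of 10 chars carries only 20 bits and fits in one int, for O(N * 20 bits) of key data in total.

Space complexity: a 20-bit binary number has at most 2^20 distinct values, so the HashSet holds at most 2^20 = 1024 * 1024 entries. That is a constant bound, so O(1).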
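Because the codes are bounded by 2^20, the HashSet could be swapped for a fixed-size bit table. The following is a sketch of that alternative using java.util.BitSet, not the approach used in this post; the class name BitSetSketch is made up, and the input is assumed to contain only A, C, G and T.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class BitSetSketch {
    static List<String> findRepeated(String s) {
        final int mask = (1 << 20) - 1;
        BitSet seen = new BitSet(1 << 20);      // one bit per possible window code
        BitSet reported = new BitSet(1 << 20);  // codes already added to the result
        List<String> res = new ArrayList<>();
        int window = 0;
        for (int i = 0; i < s.length(); i++) {
            window = ((window << 2) | "ACGT".indexOf(s.charAt(i))) & mask;
            if (i < 9) continue;                // first full window ends at index 9
            if (!seen.get(window)) {
                seen.set(window);
            } else if (!reported.get(window)) { // report each repeat only once
                reported.set(window);
                res.add(s.substring(i - 9, i + 1));
            }
        }
        return res;
    }
}
```

Each table holds 2^20 bits (128 KiB) regardless of the input length, which makes the constant space bound concrete.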

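Follow-up: if the results are needed in sorted order, use radix sort.

Follow-up: if the input is read with a Scanner:

Scanner scanner = new Scanner(System.in);
char a = scanner.next().charAt(0); // Scanner has no nextCharacter(); take the first char of the next token

or String a = scanner.next(); // note: the method is next(), not nextString()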

A reference radix sort, least significant digit first (this snippet is C#, not Java):

public static int[] RadixSort(int[] ArrayToSort, int digit)
{
    //low to high digit
    for (int k = 1; k <= digit; k++)
    {
        //temp array to store the sort result inside digit
        int[] tmpArray = new int[ArrayToSort.Length];
 
        //temp array for countingsort
        int[] tmpCountingSortArray = new int[10]{0,0,0,0,0,0,0,0,0,0};
 
        //CountingSort
        for (int i = 0; i < ArrayToSort.Length; i++)
        {
            //split the specified digit from the element
            int tmpSplitDigit = ArrayToSort[i]/(int)Math.Pow(10,k-1) - (ArrayToSort[i]/(int)Math.Pow(10,k))*10;
            tmpCountingSortArray[tmpSplitDigit] += 1; 
        }
 
        for (int m = 1; m < 10; m++)
        {
            tmpCountingSortArray[m] += tmpCountingSortArray[m - 1];
        }
 
        //output the value to result
        for (int n = ArrayToSort.Length - 1; n >= 0; n--)
        {
            int tmpSplitDigit = ArrayToSort[n] / (int)Math.Pow(10,k - 1) - (ArrayToSort[n]/(int)Math.Pow(10,k)) * 10;
            tmpArray[tmpCountingSortArray[tmpSplitDigit]-1] = ArrayToSort[n];
            tmpCountingSortArray[tmpSplitDigit] -= 1;
        }
 
        //copy the digit-inside sort result to source array
        for (int p = 0; p < ArrayToSort.Length; p++)
        {
            ArrayToSort[p] = tmpArray[p];
        }
    }
 
    return ArrayToSort;
}

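Because the alphabet consists of only 4 letters, every 10-letter window maps to a distinct 20-bit integer, so there are no collisions to worry about. The hash of the current window is computed in constant time from the previous one by dropping the former first character (here, by masking off the high bits) and appending the new character.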

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;

public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        List<String> res = new ArrayList<String>();
        // a string of length 10 has only one window, so nothing can repeat
        if (s == null || s.length() <= 10) return res;
        // 2-bit code for each nucleotide
        HashMap<Character, Integer> dict = new HashMap<Character, Integer>();
        dict.put('A', 0);
        dict.put('C', 1);
        dict.put('G', 2);
        dict.put('T', 3);
        HashSet<Integer> set = new HashSet<Integer>();   // window codes seen so far
        HashSet<String> result = new HashSet<String>();  // a set, so a sequence seen 3+ times is reported once
        int hashcode = 0;
        for (int i = 0; i < s.length(); i++) {
            // shift in the new character, then keep only the low 20 bits (10 characters)
            hashcode = (hashcode << 2) + dict.get(s.charAt(i));
            hashcode &= (1 << 20) - 1;
            if (i < 9) continue; // the first full window ends at index 9
            if (!set.add(hashcode)) {
                // code seen before: the window s[i-9..i] is a repeat
                result.add(s.substring(i - 9, i + 1));
            }
        }
        res.addAll(result);
        return res;
    }
}
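Since each 20-bit code identifies its window uniquely, a repeated code could also be decoded back into the string instead of calling substring. A possible sketch follows; the names DnaDecodingSketch and decode are illustrative and do not appear in the solution above.

```java
final class DnaDecodingSketch {
    // Turns a 20-bit window code back into its 10-letter sequence,
    // using the mapping 00=A, 01=C, 10=G, 11=T.
    static String decode(int code) {
        char[] letters = new char[10];
        for (int i = 9; i >= 0; i--) {
            letters[i] = "ACGT".charAt(code & 3); // lowest 2 bits hold the last letter
            code >>>= 2;
        }
        return new String(letters);
    }
}
```

Decoding costs 10 steps per reported sequence, the same order of work as substring, so this is mainly useful when the original string is no longer available (for example when reading from a stream).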

posted @ 2017-12-02 04:48 apanda009
