LeetCode [187] Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].
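To make the expected output concrete, here is a minimal baseline sketch that counts every 10-letter window with a hash map of strings. The function name findRepeatedBaseline is hypothetical and used only for illustration. It is the plain string-hashing idea, not the bit-packed solution given in this post.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper name, not part of the original post.
// Counts every 10-character window and reports a window the
// second time it is seen, so each repeat appears exactly once.
std::vector<std::string> findRepeatedBaseline(const std::string& s) {
    std::vector<std::string> res;
    if (s.size() < 10) return res;  // no 10-letter window exists
    std::unordered_map<std::string, int> count;
    for (std::size_t i = 0; i + 10 <= s.size(); ++i) {
        std::string window = s.substr(i, 10);
        if (++count[window] == 2) res.push_back(window);
    }
    return res;
}
```

Storing every window as a full string is simple, but it costs 10 bytes per window plus container overhead, which is what motivates a more compact key.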

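#include <string>
#include <vector>
#include <unordered_map>
#include <unordered_set>
using namespace std;

class Solution {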
public:
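/**
 * All DNA is composed of a series of nucleotides, A, C, G and T. The task is to find every substring of length 10 that occurs more than once in the input string.
 * Approach:
 *     1. Brute-force enumeration will certainly exceed the time limit.
 *     2. Hashing
 *        1) Store the length-10 substrings in unordered_set<string> repeated. Scan the string and look up the window S[i]~S[i+9] in repeated:
 *           if it is not found, insert it into repeated; if it is found, it is a repeat, so append it to vector<string> res.
 *        2) For very long inputs, however, unordered_set<string> consumes a large amount of memory.
 *           Improvement: compress the string. A 10-char substring needs 8 bits * 10 = 80 bits, whereas the four characters A, C, G, T need only a 2-bit code each (00, 01, 10, 11), so 10 characters need 2 bits * 10 = 20 bits, which fits in one 32-bit int.
 *        3) Duplicate answers in res must also be handled: pushing a window into res every time it is found in repeated would add the same substring more than once.
 *           Improvement: keep a second set, unordered_set<unsigned int> check, holding the strInt values of the repeated substrings already stored in res.
 * 
*/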
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> res;
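        if(s.size() < 10) return res; // no 10-letter window exists
        unordered_map<char, unsigned int> smap = {{'A', 0},{'C', 1},{'G', 2},{'T', 3}}; // 2-bit code per nucleotide
        unordered_set<unsigned int> repeated, check; // windows seen so far, windows already reported
        unsigned int strInt = 0;
        // Pack the first 10 characters into the low 20 bits of strInt.
        for(int i = 0; i < 10; i++){
            strInt = (strInt<<2) + smap[s[i]];
        }
        repeated.insert(strInt);
        for(size_t i = 10; i < s.size(); i++ ){
            // Keep the low 18 bits (the last 9 characters), shift left by 2, and append the new character.
            strInt = ((strInt & 0x3ffff)<<2)+smap[s[i]];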
            if(repeated.find(strInt)==repeated.end()){
                repeated.insert(strInt);
            }else{
                if(check.find(strInt) == check.end()){
                    res.push_back(s.substr(i-9,10));
                    check.insert(strInt);
                }
            }
        }
        return res;
    }
};
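The 2-bit packing used by the solution can be isolated into a few small helpers, which makes the mask 0x3ffff easier to check by hand. This is a minimal sketch: the names nucleotideCode, packWindow, slideWindow and unpackWindow are hypothetical, and the code assumes the input contains only A, C, G and T.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper names for illustration; not part of the original post.

// Maps a nucleotide to its 2-bit code: A=0, C=1, G=2, T=3.
// Assumption: the input contains only these four characters.
inline std::uint32_t nucleotideCode(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // treated as 'T'
    }
}

// Packs the 10 characters starting at pos into the low 20 bits of the key.
std::uint32_t packWindow(const std::string& s, std::size_t pos) {
    std::uint32_t key = 0;
    for (std::size_t i = 0; i < 10; ++i) {
        key = (key << 2) | nucleotideCode(s[pos + i]);
    }
    return key;
}

// Slides the window one character to the right: keep the low 18 bits
// (the last 9 characters), shift left by 2, append the new code.
std::uint32_t slideWindow(std::uint32_t key, char next) {
    return ((key & 0x3ffffu) << 2) | nucleotideCode(next);
}

// Recovers the 10-character string from a packed key.
std::string unpackWindow(std::uint32_t key) {
    static const char letters[] = "ACGT";
    std::string out(10, 'A');
    for (int i = 9; i >= 0; --i) {
        out[i] = letters[key & 3u];
        key >>= 2;
    }
    return out;
}
```

For example, "AAAAACCCCC" packs to 0x155 (five 00 pairs followed by five 01 pairs), and sliding that key by the next character gives the same value as packing the window that starts one position later. Being able to unpack a key also means the result strings could be rebuilt from the keys alone rather than with substr.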

posted @ 2015-08-30 19:26 Vae永Silence
