LeetCode [187] Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].
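
As a baseline for the problem stated above, here is a minimal sketch that stores each 10-letter window as a string in a hash set. The function name `findRepeatedBaseline` and the set names `seen` and `reported` are hypothetical and chosen only for illustration. It assumes the input contains only A, C, G, and T.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper for illustration: string-keyed baseline.
std::vector<std::string> findRepeatedBaseline(const std::string& s) {
    const std::size_t kLen = 10;
    std::vector<std::string> res;
    if (s.size() < kLen) return res;

    std::unordered_set<std::string> seen;      // windows seen at least once
    std::unordered_set<std::string> reported;  // windows already added to res
    for (std::size_t i = 0; i + kLen <= s.size(); ++i) {
        std::string window = s.substr(i, kLen);
        // insert().second is false when the window was already present.
        if (!seen.insert(window).second && reported.insert(window).second) {
            res.push_back(window);
        }
    }
    return res;
}
```

This version is easy to check by hand, but every window costs a 10-byte string copy plus hashing, which is the overhead a 2-bit encoding avoids.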
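#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
using namespace std;

class Solution {
public:
    /**
     * All DNA is made up of a series of bases: A, C, G, and T. The task is to find every
     * substring of length 10 that occurs more than once in the input string.
     * Approach:
     *     1. Brute-force enumeration will certainly time out.
     *     2. Hashing
     *        1) Use an unordered_set<string> repeated to store the length-10 substrings.
     *           Traverse the string and look up the substring S[i]~S[i+9] in repeated:
     *           if it is not found, add it to repeated; if it is found, it is a repeat,
     *           so add it to vector<string> res.
     *        2) However, for a very long input, unordered_set<string> consumes a lot of memory.
     *           Improvement: compress the substring. A 10-char substring needs
     *           8 bit * 10 = 80 bit, whereas the four characters A C G T can be encoded in
     *           two bits each (00 01 10 11), so 10 characters need 2 bit * 10 = 20 bit,
     *           which fits in one int (32 bit).
     *        3) Duplicate answers in res must also be handled: if a substring were pushed
     *           to res every time it is found in repeated, a substring occurring three or
     *           more times would be added more than once.
     *           Improvement: keep a second unordered_set<unsigned int> check that stores the
     *           strInt values of the repeated substrings already placed in res.
     */
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> res;
        if (s.size() < 10) return res;
        unordered_map<char, unsigned int> smap = {{'A', 0}, {'C', 1}, {'G', 2}, {'T', 3}};
        unordered_set<unsigned int> repeated, check;
        unsigned int strInt = 0;
        // Encode the first 10-character window into 20 bits.
        for (int i = 0; i < 10; i++) {
            strInt = (strInt << 2) + smap[s[i]];
        }
        repeated.insert(strInt);
        // Slide the window: keep the low 18 bits (9 characters), shift left by 2,
        // and append the code of the new character.
        for (size_t i = 10; i < s.size(); i++) {
            strInt = ((strInt & 0x3ffff) << 2) + smap[s[i]];
            if (repeated.find(strInt) == repeated.end()) {
                repeated.insert(strInt);
            } else {
                if (check.find(strInt) == check.end()) {
                    res.push_back(s.substr(i - 9, 10));
                    check.insert(strInt);
                }
            }
        }
        return res;
    }
};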
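
The compression step from the comment in the solution (two bits per nucleotide, 20 bits per 10-letter window) can be tried in isolation. The sketch below is only an illustration: the helper names `baseCode`, `encode10`, `roll`, and `decode10` are hypothetical and do not appear in the solution, and the input is assumed to contain only A, C, G, and T.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helpers for illustration; names are not from the solution above.

// Map one nucleotide to its 2-bit code: A=00, C=01, G=10, T=11.
inline std::uint32_t baseCode(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // 'T' (assumes the input holds only A, C, G, T)
    }
}

// Pack exactly 10 nucleotides into the low 20 bits of an integer.
inline std::uint32_t encode10(const std::string& kmer) {
    assert(kmer.size() == 10);
    std::uint32_t key = 0;
    for (char c : kmer) key = (key << 2) | baseCode(c);
    return key;
}

// Slide the window by one: keep the low 18 bits (the 9 newest bases),
// shift them left by 2, and append the code of the next base.
inline std::uint32_t roll(std::uint32_t key, char next) {
    return ((key & 0x3FFFFu) << 2) | baseCode(next);
}

// Unpack a 20-bit key back into its 10-letter string.
inline std::string decode10(std::uint32_t key) {
    static const char kBases[] = {'A', 'C', 'G', 'T'};
    std::string out(10, 'A');
    for (int i = 9; i >= 0; --i) {
        out[i] = kBases[key & 3u];
        key >>= 2;
    }
    return out;
}
```

For example, `encode10("AAAAACCCCC")` gives `0x155`, and `roll(encode10("AAAAACCCCC"), 'A')` equals `encode10("AAAACCCCCA")`, which is the update the sliding loop performs. Because the key can be decoded back to its string, the result vector could also be built from the keys instead of calling `substr`.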

 

posted @ 2015-08-30 19:26  Vae永Silence