The String Containment Problem - RunningSnail


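Problem description:
Suppose there is a string A made up of various letters, and another string B that contains fewer letters. What is the fastest way to check whether every letter in the short string B also appears in the long string A?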

For example, given the following two strings:
String 1: ABCDEFGHLMNOPQRS
String 2: DCGSRQPO
the answer is true: every letter in string 2 also appears in string 1.
  
Given the following two strings instead:
String 1: ABCDEFGHLMNOPQRS   
String 2: DCGSRQPZ  
the answer is false, because the letter Z in the second string does not appear in the first string.

Source: 程序员编程艺术：第二章、字符串是否包含问题 (The Art of Programming for Programmers, Chapter 2: The String Containment Problem)

1) Brute-force scan

Check whether every character of string2 appears in string1:
String 1: ABCDEFGHLMNOPQRS
String 2: DCGSRQPO

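To decide whether the letters of one string are all contained in another, the simplest and most direct idea is to take each character of the second string, string2, and compare it one by one against every character of the first string, string1, to see whether it occurs there.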

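#include <iostream>
#include <string>
using namespace std;

// Prints the result and returns 1 if every character of ShortString
// occurs in LongString, otherwise returns 0.
int CompareString(string LongString,string ShortString)
{
    int i,j;
    for (i=0; i<ShortString.length(); i++)
    {
        for (j=0; j<LongString.length(); j++)  // O(n*m) overall
        {
            if (LongString[j] == ShortString[i])  // compare one by one
            {
                break;
            }

        }
        if (j==LongString.length())  // inner loop finished without a match
        {
            cout << "false" << endl;
            return 0;
        }
    }
    cout << "true" << endl;
    return 1;
}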

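Let n be the length of string1 and m the length of string2. This algorithm then needs O(n*m) operations; for the example above, the worst case is 16*8 = 128 operations. That is clearly too expensive, so we need a better approach.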
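For comparison, the same brute-force idea can be written with `std::string::find` and a boolean return value instead of printing. This is only a sketch; the name `containsAllBrute` is made up for illustration and does not come from the original article.

```cpp
#include <string>

// Hypothetical helper (not from the article): returns true when every
// character of shortStr occurs somewhere in longStr. It is still O(n*m)
// in the worst case, because find() scans longStr linearly for each
// character of shortStr.
bool containsAllBrute(const std::string& longStr, const std::string& shortStr)
{
    for (char c : shortStr)
    {
        if (longStr.find(c) == std::string::npos)
            return false;   // c is missing from longStr
    }
    return true;            // every character was found
}
```

Calling `containsAllBrute("ABCDEFGHLMNOPQRS", "DCGSRQPO")` yields true, while replacing the final O with Z yields false.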

2) Sorting

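First sort the letters of both strings, then scan the two strings in step. Sorting both strings takes (in the typical case) O(m log m) + O(n log n) operations, and the linear scan afterwards takes O(m+n) operations.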

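Taking the same strings as an example, sorting needs about 16*4 + 8*3 = 88 operations (since log2 16 = 4 and log2 8 = 3), plus 16 + 8 = 24 operations for the linear scan of the two strings.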

int partition(string &str,int lo,int hi)
{
    int key = str[hi];        // use the last element, str[hi], as the pivot
    int i = lo - 1;
    for(int j = lo; j < hi; j++)    // note: j runs from lo to hi - 1, not to hi
    {
        if(str[j] <= key)
        {
            i++;
            swap(str[i], str[j]);
        }
    }
    swap(str[i+1], str[hi]);    // must swap with str[hi] itself, not with the local copy key
    return i + 1;
}

// Call the partition routine above recursively to complete the sort.
void quicksort(string &str, int lo, int hi)
{
    if (lo < hi)
    {
        int k = partition(str, lo, hi);
        quicksort(str, lo, k - 1);
        quicksort(str, k + 1, hi);
    }
}

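// Comparison: the sorting above costs O(m log m) + O(n log n), the scan below O(m+n),
// so the total time complexity is O(m log m) + O(n log n) + O(m+n).
// Both str1 and str2 must already be sorted.
void compare(string str1,string str2)
{
    int posOne = 0;
    int posTwo = 0;
    while (posTwo < str2.length() && posOne < str1.length())
    {
        while (str1[posOne] < str2[posTwo] && posOne < str1.length() - 1)
            posOne++;
        // posOne only advances while str1[posOne] is smaller than str2[posTwo];
        // once they are equal it must stay where it is.

        if (str1[posOne] != str2[posTwo])
            break;

        // posOne++;
        // During the merge, when str1[posOne] == str2[posTwo], only posTwo may advance.
        // posOne must not, so that repeated letters in str2 still match.

        posTwo++;
    }

    if (posTwo == str2.length())
        cout << "true" << endl;
    else
        cout << "false" << endl;
}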
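The sort-then-scan idea can also be sketched with standard library algorithms instead of a hand-written quicksort and merge loop. The function name `containsAllSorted` is an assumption made for this example. Because `std::includes` compares sorted ranges as multisets, repeated letters are removed from the short string first so that only the set of letters matters.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper (not from the article). Both strings are taken by
// value so the caller's copies are left unsorted.
bool containsAllSorted(std::string longStr, std::string shortStr)
{
    std::sort(longStr.begin(), longStr.end());      // O(n log n)
    std::sort(shortStr.begin(), shortStr.end());    // O(m log m)

    // Drop adjacent duplicates: "AAB" in the short string should still
    // match a long string that contains a single 'A'.
    shortStr.erase(std::unique(shortStr.begin(), shortStr.end()),
                   shortStr.end());

    // Linear merge-style scan over both sorted ranges, O(n + m).
    return std::includes(longStr.begin(), longStr.end(),
                         shortStr.begin(), shortStr.end());
}
```

The overall cost is the same as in the text above: the two sorts dominate, and the final scan is linear.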

3) Counting sort

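Compared with the previous approach, this one sorts with counting sort, which runs in linear time. Sorting costs O(n+m) and the linear scan costs O(n+m), so the total time complexity is O(n+m) + O(n+m) = O(n+m).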

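// Sorts str (letters 'A'..'Z' only) into help_str using counting sort.
void CounterSort(string str, string &help_str)
{
    // Make sure the output string has room for every character
    help_str.resize(str.length());

    // Auxiliary counting array
    int help[26] = {0};

    // help[index] holds the number of elements equal to index + 'A'
    for (int i = 0; i < str.length(); i++)
    {
        int index = str[i] - 'A';
        help[index]++;
    }

    // Prefix sums: help[j] becomes the number of elements <= j + 'A'
    for (int j = 1; j < 26; j++)
        help[j] += help[j-1];

    // Place each element at its final position
    for (int k = str.length() - 1; k >= 0; k--)
    {
        int index = str[k] - 'A';
        int pos = help[index] - 1;
        help_str[pos] = str[k];
        help[index]--;
    }
}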

// Linear scan, O(n+m). Both strings must already be sorted.
void Compare(string long_str,string short_str)
{
    int pos_long = 0;
    int pos_short = 0;
    while (pos_short < short_str.length() && pos_long < long_str.length())
    {
        // Advance pos_long until long_str[pos_long] >= short_str[pos_short]
        while (long_str[pos_long] < short_str[pos_short] && pos_long < long_str.length() - 1)
            pos_long++;

        // If short_str has consecutive repeated characters, skip to the last of them
        while (pos_short + 1 < short_str.length() && short_str[pos_short] == short_str[pos_short+1])
            pos_short++;

        if (long_str[pos_long] != short_str[pos_short])
            break;

        pos_long++;
        pos_short++;
    }

    if (pos_short == short_str.length())
        cout << "true" << endl;
    else
        cout << "false" << endl;
}
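When the characters carry no extra data, counting sort can be reduced to counting each letter and then writing the letters back in order, with no prefix-sum step. The sketch below shows that simplified variant. It assumes the input contains only 'A'..'Z', and the name `countingSorted` is invented for this example.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the article): returns a sorted copy of s.
// Assumes every character of s is in the range 'A'..'Z'.
std::string countingSorted(const std::string& s)
{
    std::array<std::size_t, 26> count{};   // zero-initialised counters

    // Pass 1: count how often each letter occurs, O(length of s).
    for (char c : s)
        ++count[static_cast<std::size_t>(c - 'A')];

    // Pass 2: emit each letter as many times as it was counted.
    std::string out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < count.size(); ++i)
        out.append(count[i], static_cast<char>('A' + i));
    return out;
}
```

For instance, `countingSorted("DCGSRQPO")` gives `"CDGOPQRS"`, which can then be fed to a linear scan.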

4) The hash table method

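Put every letter of the short string into a hash table (we always let m be the length of the short string, so this step costs O(m), or 8 operations here). Then scan the long string and look up each of its characters in the hash table, checking off the letters of the short string as they are found. If some letter of the short string is never found, the match fails. Scanning the long string costs 16 operations, so the two steps together take only 8+16=24 operations.
In the ideal case, where the short string is a prefix of the long string, the scan takes only 8 operations, for a total of 8+8=16.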

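Start with hash[26] cleared to zero, then scan the short string and set the entry for each letter that occurs to 1.
Count the number of 1s in hash[26] and call it m.
Scan each character a of the long string: if hash[a] == 1, set hash[a] = 0 and decrement m; if hash[a] == 0, do nothing.
Leave the loop when m == 0 or the scan is finished.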

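#include <iostream>
#include <string>
using namespace std;

int main()
{
    string str1="ABCDEFGHLMNOPQRS";
    string str2="DCGSRQPOM";

    // Allocate an auxiliary array and clear it
    int hash[26] = {0};

    // num is the number of distinct letters recorded in the auxiliary array
    int num = 0;

    // Scan the short string
    for (int j = 0; j < str2.length(); j++)
    {
        // Convert the character to its index in the auxiliary array
        int index = str2[j] - 'A';

        // If the entry at that index is 0, set it to 1 and increment num
        if (hash[index] == 0)
        {
            hash[index] = 1;
            num++;
        }
    }

    // Scan the long string
    for (int k = 0; k < str1.length(); k++)
    {
        int index = str1[k] - 'A';

        // If the entry at that index is 1, clear it and decrement num;
        // if it is 0, do nothing.
        if(hash[index] ==1)
        {
            hash[index] = 0;
            num--;
            if(num == 0)    // all letters found, leave the loop
                break;
        }
    }

    // num == 0 means the long string contains every character of the short string
    if (num == 0)
        cout << "true" << endl;
    else
        cout << "false" << endl;
    return 0;
}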
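The fixed-size array works because the input is limited to 'A'..'Z'. As a sketch of the same idea with a real hash container, the version below uses `std::unordered_set<char>`, which accepts any character. The name `containsAllHashed` is an assumption for illustration.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical helper (not from the article).
bool containsAllHashed(const std::string& longStr, const std::string& shortStr)
{
    // Distinct letters of the short string that have not yet been seen
    // in the long string. Building the set costs O(m) on average.
    std::unordered_set<char> pending(shortStr.begin(), shortStr.end());

    // Scan the long string, O(n) on average, stopping early once
    // every pending letter has been found.
    for (char c : longStr)
    {
        if (pending.empty())
            break;
        pending.erase(c);   // no effect when c is not pending
    }
    return pending.empty();
}
```

The size of `pending` plays the role of the counter in the steps above: it shrinks by one each time a new letter of the short string is found.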

5) The prime-number method, O(n) to O(n+m)

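Suppose our strings are built from a fixed set of letters. I assign each letter a prime number, starting from 2 and going up: A is 2, B is 3, C is 5, and so on. Now I walk through the first string and multiply together the primes that its letters stand for. You end up with a very large integer, right?
Then I walk through the second string and divide that integer by the prime of each letter. If a division leaves a remainder, that letter has no match. If no division in the whole pass leaves a remainder, the letters of the second string form a subset of the letters of the first.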

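The idea can be summarized as follows:
1. Map the 26 smallest primes to the characters 'A' through 'Z'.
2. Walk through the long string and compute the product of the primes of its characters.
3. Walk through the short string and check whether the product is divisible by the prime of each of its characters.
4. Output the result.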

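As described above, this algorithm takes O(m+n) operations. The best case is O(n), where n is the length of the long string: once the product over the long string is known, the first letter of the short string may already leave a remainder, so the program can stop and return false. The space complexity is O(1) if the product is counted as a single number. In terms of the number of operations, this is as far as the optimization goes. Keep in mind, though, that each multiplication and remainder is counted as one step here, while the product quickly outgrows a machine word and needs a big-integer type, so the real cost of each step grows with the length of the long string.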

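// Array of primes, one per letter 'A'..'Z'
int primeNumber[26] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59,
                        61, 67, 71, 73, 79, 83, 89, 97, 101};

int main()
{
    string strOne = "ABCDEFGHLMNOPQRS";
    string strTwo = "DCGSRQPOM";

    // A big integer is needed here.
    // CBigInt is a big-integer class assumed to be defined elsewhere; its code is not shown here.
    CBigInt product = 1;

    // Walk through the long string and multiply the primes of its characters
    for (int i = 0; i < strOne.length(); i++)
    {
        int index = strOne[i] - 'A';
        product = product * primeNumber[index];
    }

    // Walk through the short string; j is declared outside the loop
    // because it is checked again afterwards
    int j;
    for (j = 0; j < strTwo.length(); j++)
    {
        int index = strTwo[j] - 'A';

        // A non-zero remainder means this character is not in the long string, so stop
        if (product % primeNumber[index] != 0)
            break;
    }

    // Output "true" if the product is divisible by the prime of every character
    // of the short string, otherwise "false".
    if (strTwo.length() == j)
        cout << "true" << endl;
    else
        cout << "false" << endl;
    return 0;
}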
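To experiment with the prime idea without a big-integer class, the sketch below keeps the product in a 64-bit unsigned integer and refuses to continue if it would overflow. This is a deliberately limited illustration, not the article's implementation: the names `kPrimes` and `containsAllPrimes` are invented, the input is assumed to be 'A'..'Z' only, and each distinct letter is multiplied in only once to keep the product small.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// The 26 smallest primes, one per letter 'A'..'Z'.
static const std::uint64_t kPrimes[26] = {
    2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
    43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101};

// Hypothetical helper (not from the article). Throws std::overflow_error
// when the product of primes no longer fits in 64 bits.
bool containsAllPrimes(const std::string& longStr, const std::string& shortStr)
{
    std::uint64_t product = 1;
    for (char c : longStr)
    {
        const std::uint64_t p = kPrimes[c - 'A'];
        if (product % p == 0)
            continue;   // this letter is already part of the product
        if (product > std::numeric_limits<std::uint64_t>::max() / p)
            throw std::overflow_error("product does not fit in 64 bits");
        product *= p;
    }

    // A remainder means the corresponding letter never occurred in longStr.
    for (char c : shortStr)
    {
        if (product % kPrimes[c - 'A'] != 0)
            return false;
    }
    return true;
}
```

With `"ABCDEFGH"` as the long string the product is 2*3*5*7*11*13*17*19 = 9699690, so `"DCG"` is accepted and `"DCZ"` is rejected. The article's 16-letter example already produces a product of roughly 3.8 * 10^20, which exceeds 2^64, so this sketch would throw there; that is why the original code relies on a big-integer type.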

6) Using the low 26 bits of a 32-bit integer

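// Returns true if every letter of B also occurs in A (letters 'A'..'Z' only).
bool AcontainsB(char *A,char *B) {
    int have = 0;
    while (*A) {
        have |= 1 << (*(A++) - 'A');   // map A..Z to bits 0..25 and record the letters of A
    }
    while (*B) {
        if ((have & (1 << (*(B++) - 'A'))) == 0) {
            return false;   // this letter of B is missing from A
        }
    }
    return true;
}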
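The bit trick is limited to 26 letters because it relies on a single `int`. As a sketch of how the same idea might be extended to any 8-bit character, the version below uses `std::bitset<256>`. The name `containsAllBits` is an assumption for this example.

```cpp
#include <bitset>
#include <string>

// Hypothetical helper (not from the article): one bit per possible
// 8-bit character value instead of one bit per letter 'A'..'Z'.
bool containsAllBits(const std::string& a, const std::string& b)
{
    std::bitset<256> have;

    // Record which characters occur in the long string a.
    for (unsigned char c : a)
        have.set(c);

    // Every character of the short string b must have its bit set.
    for (unsigned char c : b)
    {
        if (!have.test(c))
            return false;
    }
    return true;
}
```

The loop variables are `unsigned char` so that characters with the high bit set do not turn into negative indices.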

posted on 2016-03-04 10:27 RunningSnail


