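Searching for a custom substring in a very long string (implemented with multiple threads)
Problem:
Today's high-level programming languages (Java, Perl, and so on) offer increasingly powerful support for string handling, but string searching itself is still a topic worth exploring.
When I saw this problem today, I wondered whether it could be done with multiple threads: one thread searches for candidate positions while the main thread does the comparison. Two things would then be happening at the same time, which should improve efficiency. Let's look at the code: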
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include <windows.h>
using namespace std;

#define MAXSIZE 1690
#define Search_Size 3
// The macros above set the buffer size and the pattern length (including the terminating '\0'); the pattern itself is defined below
char SH[Search_Size] = "ab";
// The text to search; MAXSIZE is the size of its buffer
char A[MAXSIZE] = "gbjaqpcrpsabwojubabfaksurthawpgbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvewoorcuglknwbvewoorcuglknwbvewoorcuglknwbveiqkkmivntbsccamcioudltwoorcuglknwbbnehabfmpwjiuaabdnmqwlhmsrnkpgikwnbalabrneiskakevincbpcbabfopheerqlacqbtgilpakihtneabeveqannqnugonqabvrtasnhiqfacntwwq";
// Positions found by the search thread.
// volatile so that the polling loop in main re-reads the value each time;
// this is a stopgap for the demo, not a portable substitute for real synchronization
volatile int posit[MAXSIZE];
// Thread state: 0 = not started yet, 1 = still running, 2 = finished
volatile short int thread_state = 0;
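// Fill A with random letters in the range 'a'..'w'
void Initialize_A()
{
    srand((unsigned)time(0));
    for (int i = 0; i <= MAXSIZE-2; i++)
    {
        A[i] = rand()%23 + 97;
    }
}

// Print the text
void Print_A()
{
    cout<<A<<endl;
}

// Check whether the pattern matches at a; the first character is assumed
// to have been compared already, so only the remaining ones are checked
bool check(const char *a)
{
    for (int i = 1; i <= Search_Size-2; i++)
    {
        if (a[i] != SH[i])
        {
            return false;
        }
    }
    return true;
}

// Thread: collects every position in the text whose character equals the first character of the pattern
DWORD WINAPI psearch_fun(LPVOID l)
{
    thread_state = 1; // mark the thread as running
    int count_posit = 0;
    for (int i = 0; i <= MAXSIZE-1; i++)
    {
        if (A[i] == SH[0])
        {
            posit[count_posit] = i;
            count_posit++;
        }
    }
    thread_state = 2;
    return 0;
}

int main()
{
    int i;
    // Set every element to -1, i.e. an invalid position
    for (i = 0; i <= MAXSIZE-1; i++)
    {
        posit[i] = -1;
    }
    //Initialize_A();
    Print_A();
    // Timing
    clock_t begin, end;
    begin = clock();
    HANDLE psearch = CreateThread(NULL, 0, psearch_fun, NULL, 0, NULL);
    if (psearch == NULL)
    {
        return 1; // the thread could not be created
    }
    i = 0;
    // An earlier version had to wait here, otherwise the loop condition failed and the loop exited immediately
    //Sleep(100);
    // No longer needed: an int now represents three thread states instead of the previous two
    while (i <= MAXSIZE-1 && !(posit[i] == -1 && thread_state == 2))
    {
        // If the search thread is still running or has not started (state 0 or 1)
        // and the current slot is still -1 (invalid), pause the main thread and wait for it
        while (posit[i] == -1 && thread_state != 2)
        {
            Sleep(100);
        }
        // The search thread finished without filling this slot: nothing left to check
        if (posit[i] == -1)
        {
            break;
        }
        // Test the current candidate position
        if (check(A+posit[i]))
        {
            cout<<posit[i]<<" matches"<<endl;
        }
        i++;
    }
    end = clock();
    WaitForSingleObject(psearch, INFINITE);
    CloseHandle(psearch);
    cout<<"Total time: "<<(double)(end - begin) / CLOCKS_PER_SEC<<" s"<<endl;
    cout<<"Positions found by the search thread:"<<endl;
    i = 0;
    while (i <= MAXSIZE-1 && posit[i] != -1)
    {
        cout<<posit[i]<<" ";
        i++;
    }
    return 0;
}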
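Thread synchronization here is done through global variables: thread_state marks the state of the search thread, and the thread writes directly into the posit array. (The address of posit could instead be passed through the fourth argument given when the thread is created, much like passing the this pointer when a static member function of a class is used as the thread entry point.)
That is the main idea. The thing to watch out for is synchronization between the main thread and the worker thread, which is explained in the code comments.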
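As a side note on the synchronization point, the same producer/consumer idea can be written without globals. The following is only a minimal sketch under my own assumptions, not the code measured in this post: it uses the portable std::thread and std::atomic from C++11 instead of CreateThread and a plain flag, and SearchContext, scan_first_char and find_all_two_threads are hypothetical names. With CreateThread, a pointer to such a context struct is what would travel through the fourth argument.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical context object shared by the two threads (replaces the globals).
struct SearchContext {
    const char* text = nullptr;
    std::size_t text_len = 0;
    const char* pattern = nullptr;
    std::size_t pattern_len = 0;
    std::vector<int> candidates;            // positions whose first character matches
    std::atomic<std::size_t> published{0};  // number of valid entries in candidates
    std::atomic<bool> done{false};          // set once the scan has finished
};

// Worker: records every position where the first pattern character occurs.
void scan_first_char(SearchContext* ctx) {
    for (std::size_t i = 0; i < ctx->text_len; ++i) {
        if (ctx->text[i] == ctx->pattern[0]) {
            const std::size_t n = ctx->published.load(std::memory_order_relaxed);
            ctx->candidates[n] = static_cast<int>(i);
            // Publish the slot only after it has been written.
            ctx->published.store(n + 1, std::memory_order_release);
        }
    }
    ctx->done.store(true, std::memory_order_release);
}

// Main-thread side: verifies candidates as the worker publishes them.
std::vector<int> find_all_two_threads(const char* text, const char* pattern) {
    SearchContext ctx;
    ctx.text = text;
    ctx.text_len = std::strlen(text);
    ctx.pattern = pattern;
    ctx.pattern_len = std::strlen(pattern);
    assert(ctx.pattern_len > 0);
    // Preallocate so the vector never reallocates while the worker writes.
    ctx.candidates.resize(ctx.text_len);

    std::thread worker(scan_first_char, &ctx);
    std::vector<int> matches;
    std::size_t next = 0;
    for (;;) {
        // Read done BEFORE published: if done is seen, the final count is visible too.
        const bool finished = ctx.done.load(std::memory_order_acquire);
        const std::size_t avail = ctx.published.load(std::memory_order_acquire);
        while (next < avail) {
            const std::size_t pos = static_cast<std::size_t>(ctx.candidates[next++]);
            if (pos + ctx.pattern_len <= ctx.text_len &&
                std::memcmp(text + pos, pattern, ctx.pattern_len) == 0) {
                matches.push_back(static_cast<int>(pos));
            }
        }
        if (finished) break;
        std::this_thread::yield();  // give the worker a chance instead of sleeping 100 ms
    }
    worker.join();
    return matches;
}
```

With the arrays from the listing above, the call would be find_all_two_threads(A, SH). On GCC or Clang this usually needs the -pthread flag.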
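Here is the result for the 1690-character text: it took 0.125 seconds. The performance is not great, and we will analyze the cause next (this is still far behind the Boyer-Moore algorithm).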
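For readers who have not met Boyer-Moore, here is a rough sketch of the idea behind the comparison. It is not full Boyer-Moore but the simpler Boyer-Moore-Horspool variant, which keeps only the bad-character shift table; horspool_find_all is a hypothetical name and the code is not part of the programs timed in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Boyer-Moore-Horspool: a simplified Boyer-Moore that uses only the
// bad-character rule. Returns the start index of every match.
std::vector<int> horspool_find_all(const char* text, const char* pattern) {
    assert(text != nullptr && pattern != nullptr);
    const std::size_t n = std::strlen(text);
    const std::size_t m = std::strlen(pattern);
    std::vector<int> matches;
    if (m == 0 || m > n) return matches;

    // shift[c]: how far the window may move when its last character is c.
    std::size_t shift[256];
    for (std::size_t c = 0; c < 256; ++c) shift[c] = m;
    for (std::size_t j = 0; j + 1 < m; ++j) {
        shift[static_cast<unsigned char>(pattern[j])] = m - 1 - j;
    }

    std::size_t pos = 0;
    while (pos + m <= n) {
        if (std::memcmp(text + pos, pattern, m) == 0) {
            matches.push_back(static_cast<int>(pos));
        }
        // Skip ahead based on the character under the end of the window.
        pos += shift[static_cast<unsigned char>(text[pos + m - 1])];
    }
    return matches;
}
```

The gain comes from the skips: characters that do not occur in the pattern let the window jump by the full pattern length, so longer patterns tend to benefit more than the two-character "ab" used here.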
Now let's look at the single-threaded implementation:
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include <windows.h>
using namespace std;

#define MAXSIZE 1690
#define Search_Size 3
// The macros above set the buffer size and the pattern length (including the terminating '\0'); the pattern itself is defined below
char SH[Search_Size] = "ab";
// The text to search; MAXSIZE is the size of its buffer
char A[MAXSIZE] = "gbjaqpcrpsabwojubabfaksurthawpgbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvegbjaqpcrpsabwojubabfaksurthawpogktabbbhpfaktejekmdgrkqktlquojauaceabffnphfgshaklnjeiqkkmivnjlqtbsccamcioudltwoorcuglknwbvewoorcuglknwbvewoorcuglknwbvewoorcuglknwbveiqkkmivntbsccamcioudltwoorcuglknwbbnehabfmpwjiuaabdnmqwlhmsrnkpgikwnbalabrneiskakevincbpcbabfopheerqlacqbtgilpakihtneabeveqannqnugonqabvrtasnhiqfacntwwq";
// Positions found by the search thread (left over from the multithreaded version)
int posit[MAXSIZE];
// Thread state: 0 = not started yet, 1 = still running, 2 = finished
short int thread_state = 0;
// Fill A with random letters in the range 'a'..'w'
void Initialize_A()
{
    srand((unsigned)time(0));
    for (int i = 0; i <= MAXSIZE-2; i++)
    {
        A[i] = rand()%23 + 97;
    }
}

// Print the text
void Print_A()
{
    cout<<A<<endl;
}

// Check whether the pattern matches at a; the first character is assumed
// to have been compared already, so only the remaining ones are checked
bool check(const char *a)
{
    for (int i = 1; i <= Search_Size-2; i++)
    {
        if (a[i] != SH[i])
        {
            return false;
        }
    }
    return true;
}

// Thread: collects every position in the text whose character equals the first character of the pattern
// (never started in this single-threaded version)
DWORD WINAPI psearch_fun(LPVOID l)
{
    thread_state = 1; // mark the thread as running
    int count_posit = 0;
    for (int i = 0; i <= MAXSIZE-1; i++)
    {
        if (A[i] == SH[0])
        {
            posit[count_posit] = i;
            count_posit++;
        }
    }
    thread_state = 2;
    return 0;
}

int main()
{
    int i;
    // Set every element to -1, i.e. an invalid position
    for (i = 0; i <= MAXSIZE-1; i++)
    {
        posit[i] = -1;
    }
    //Initialize_A();
    Print_A();
    // Timing
    clock_t begin, end;
    begin = clock();
    // Scan and compare in a single pass on the main thread
    for (i = 0; i <= MAXSIZE-2; i++)
    {
        if (A[i] == SH[0] && check(A+i))
        {
            cout<<i<<" ";
        }
    }
    end = clock();
    cout<<"match"<<endl;
    cout<<"Total time: "<<(double)(end - begin) / CLOCKS_PER_SEC<<" s"<<endl;
    cout<<"Positions found by the search thread:"<<endl;
    i = 0;
    while (i <= MAXSIZE-1 && posit[i] != -1)
    {
        cout<<posit[i]<<" ";
        i++;
    }
    return 0;
}
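This is just a modification of the multithreaded version, so it still contains a lot of dead code. That does not matter; the point is to compare the single-threaded and multithreaded versions. Here is the result:
We find that this takes even less time than the version above. Our guess is that the culprit is the Sleep call made while the main thread waits in the multithreaded version: a single Sleep(100) blocks for about 0.1 s, which would account for most of the 0.125 s measured. So we comment out the Sleep call inside
while(posit[i] == -1 && thread_state != 2)
{
    //Sleep(100);
}
and look at the result again: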
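Ha, that cut the time down a lot, didn't it? Because assigning the character array by hand is tedious (a few thousand characters is already more than I can manage to type in), this is all the data for now. Still, I think the difference between the single-threaded and multithreaded versions is that the main thread no longer spends time walking over the non-candidate positions in between, so for checking large amounts of data the improvement should be fairly noticeable. At this scale it is not very obvious yet. Next, we will create two threads to check the positions already collected in posit, which should make the performance gain much more apparent. That's all for today!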
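The two-thread check over posit is left for a later post. Purely as a guess at what it might look like, here is a small sketch that splits an already collected candidate list in half and verifies each half on its own thread. It assumes C++11 std::thread rather than CreateThread, and verify_range and verify_with_two_threads are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Verifies candidates[begin, end) and appends the confirmed positions to *out.
void verify_range(const char* text, std::size_t text_len,
                  const char* pattern, std::size_t pattern_len,
                  const std::vector<int>& candidates,
                  std::size_t begin, std::size_t end,
                  std::vector<int>* out) {
    for (std::size_t i = begin; i < end; ++i) {
        assert(candidates[i] >= 0);
        const std::size_t pos = static_cast<std::size_t>(candidates[i]);
        if (pos + pattern_len <= text_len &&
            std::memcmp(text + pos, pattern, pattern_len) == 0) {
            out->push_back(candidates[i]);
        }
    }
}

// Splits the candidate list in two and checks each half on its own thread.
std::vector<int> verify_with_two_threads(const char* text, const char* pattern,
                                         const std::vector<int>& candidates) {
    const std::size_t text_len = std::strlen(text);
    const std::size_t pattern_len = std::strlen(pattern);
    const std::size_t mid = candidates.size() / 2;
    // One private result buffer per thread, so no locking is needed.
    std::vector<int> first, second;

    std::thread t1([&] {
        verify_range(text, text_len, pattern, pattern_len, candidates, 0, mid, &first);
    });
    std::thread t2([&] {
        verify_range(text, text_len, pattern, pattern_len, candidates, mid,
                     candidates.size(), &second);
    });
    t1.join();
    t2.join();

    // Concatenating the halves keeps the matches in ascending order.
    first.insert(first.end(), second.begin(), second.end());
    return first;
}
```

Whether this pays off depends on the data: for a two-character pattern each check is a single comparison, so the cost of starting threads may well outweigh the work being split.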