DNA sequence (mapping + BFS)
Problem Description
The twenty-first century is a century of developing biotechnology. We know that a gene is made of DNA. The nucleotide bases from which DNA is built are A (adenine), C (cytosine), G (guanine), and T (thymine). Finding the longest common subsequence between DNA/protein sequences is one of the basic problems in modern computational molecular biology. But this problem is a little different. Given several DNA sequences, you are asked to make a shortest sequence from them so that each of the given sequences is a subsequence of it.
For example, given "ACGT", "ATGC", "CGTT" and "CAGT", you can make a sequence in the following way. It is the shortest but may not be the only one.
Input
The first line is the number of test cases t. Then t test cases follow. In each case, the first line is an integer n (1 <= n <= 8), the number of DNA sequences. The following n lines contain the n sequences, one per line. The length of every sequence is between 1 and 5.
Output
For each test case, print a line containing the length of the shortest sequence that can be made from these sequences.
Sample Input
1
4
ACGT
ATGC
CGTT
CAGT
Sample Output
8
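The task: given several DNA sequences, find the shortest sequence such that every given sequence is a subsequence of it (not necessarily contiguous), and print its length.
A direct search gets MLE, TLE, or RE, so brute force is out. The usual trick for sequence problems like this is to map the sequences to integers, or to something else that is easy to handle.
The problem also says that no DNA sequence is longer than 5, so we can treat each sequence as one digit and pack all of them into a single integer. And since we only have to print the length of the shortest sequence, we don't need to map the characters themselves; mapping lengths (how many characters of each sequence have been matched so far) is enough.
There are at most 8 sequences, each of length 1 to 5, so the state space has 6^8 values. Why 6^8 and not 5^8? I won't explain that here; the explanation is in the code comments.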
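To make the mapping concrete, here is a minimal sketch of packing the per-sequence matched counts into one base-6 integer and reading a digit back. The names `kBase`, `encode`, and `digit` are made up for this illustration and do not appear in the solution below, which does the same arithmetic inline with a table of powers of 6.

```cpp
#include <vector>

// Hypothetical helpers for illustration only; not part of the original solution.
constexpr int kBase = 6;  // each digit holds a matched count from 0 to 5

// Pack the matched counts into one integer, sequence 0 in the lowest digit.
int encode(const std::vector<int>& matched) {
    int state = 0, weight = 1;
    for (int m : matched) {
        state += m * weight;
        weight *= kBase;
    }
    return state;
}

// Read back the matched count of sequence i from a packed state.
int digit(int state, int i) {
    for (int k = 0; k < i; ++k) state /= kBase;
    return state % kBase;
}
```

For the sample, all four sequences have length 4, so the goal state would be `encode({4, 4, 4, 4})`, which is 4 + 24 + 144 + 864 = 1036.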
Code:
#include <iostream>
#include <string>
#include <cstdio>
#include <cstdlib>
#include <sstream>
#include <iomanip>
#include <map>
#include <stack>
#include <deque>
#include <queue>
#include <vector>
#include <set>
#include <list>
#include <cstring>
#include <cctype>
#include <algorithm>
#include <iterator>
#include <cmath>
#include <bitset>
#include <ctime>
#include <fstream>
#include <limits.h>
#include <numeric>

using namespace std;

#define F first
#define S second
#define mian main
#define ture true

#define MAXN 1000000+5
#define MOD 1000000007
#define PI (acos(-1.0))
#define EPS 1e-6
#define MMT(s) memset(s, 0, sizeof s)
typedef unsigned long long ull;
typedef long long ll;
typedef double db;
typedef long double ldb;
typedef stringstream sstm;
const int INF = 0x3f3f3f3f;

int t,n;
map<int,int>vis;
char s[10][10];  // the sequences (1-indexed, characters also start at index 1)
int len[10];     // length of each sequence
int p[10] = {1,6,36,216,1296,7776,46656,279936,1679616,10077696};  // powers of 6
char temp[4]={'A','C','G','T'};

struct node{
    int step;  // length of the sequence built so far
    int st;    // the encoded state
    node(){}
    node(int _step, int _st):step(_step),st(_st){}
};

int bfs(int res){
    vis.clear();
    queue<node>q;
    q.push(node(0,0));
    vis[0] = 1;
    while(!q.empty()){
        node nxt,k = q.front();
        q.pop();
        if(k.st == res){  // the state equals the target: return the length
            return k.step;
        }
        for(int i = 0; i < 4; i++){  // try appending each of A, C, G, T
            nxt.st = 0;
            nxt.step = k.step+1;
            int tp = k.st;
            for(int j = 1; j <= n; j++){
                int x = tp%6;  // matched count of sequence j
                tp /= 6;
                if(x == len[j] || s[j][x+1] != temp[i]){  // finished, or next character does not match
                    nxt.st += x*p[j-1];
                }
                else{
                    nxt.st += (x+1)*p[j-1];
                }
            }
            if(vis[nxt.st] == 0){  // skip states that were already visited
                q.push(nxt);
                vis[nxt.st] = 1;
            }
        }
    }
    return -1;  // not reached for valid input; avoids falling off the end of the function
}

int main(){
    ios_base::sync_with_stdio(false);
    cout.tie(0);
    cin.tie(0);
    cin>>t;
    while(t--){
        cin>>n;
        int res = 0;
        for(int i = 1; i <= n; i++){
            // arrays are 0-indexed, but the mapping and everything after it work on positions, so start from 1
            cin>>s[i]+1;  // likewise, characters start at index 1
            len[i] = strlen(s[i]+1);
            res += len[i]*p[i-1];  // this is why it is 6^8: counting from 1 gives the five states 1..5, plus 0
        }
        cout << bfs(res) <<endl;
    }
    return 0;
}
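So if you insist on working from position 0 and squeezing the states into 5^8, that can be made to work, but a fully matched sequence then needs special handling and the code gets much more fiddly. It is simpler to allow one extra value per digit.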
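The hard part of this problem is knowing where to start. Even if you realize that some kind of transformation is needed, it is not obvious what to transform into or how to search afterwards. Here we avoid searching over characters and search directly over matched lengths.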
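The same idea can be restated as a standalone function, which may make the "search over matched lengths" step easier to see. This is only a sketch under the problem's limits (at most 8 sequences, each at most 5 long, alphabet ACGT); the function name `shortestSupersequenceLength` is hypothetical, the strings are 0-indexed here, and a flat distance array stands in for the `map` used above.

```cpp
#include <queue>
#include <string>
#include <vector>

// Hypothetical standalone restatement of the BFS over matched lengths.
// Assumes at most 8 sequences, each at most 5 characters, over the alphabet ACGT.
int shortestSupersequenceLength(const std::vector<std::string>& seqs) {
    const int n = static_cast<int>(seqs.size());
    int total = 1;  // 6^n possible states
    int goal = 0;   // state in which every sequence is fully matched
    for (int i = 0; i < n; ++i) {
        goal += static_cast<int>(seqs[i].size()) * total;
        total *= 6;
    }

    std::vector<int> dist(total, -1);  // -1 marks an unvisited state
    std::queue<int> q;
    dist[0] = 0;
    q.push(0);
    const std::string letters = "ACGT";
    while (!q.empty()) {
        int cur = q.front();
        q.pop();
        if (cur == goal) return dist[cur];
        for (char c : letters) {
            // Appending c advances every sequence whose next unmatched character is c.
            int next = 0, rest = cur, weight = 1;
            for (int i = 0; i < n; ++i) {
                int matched = rest % 6;
                rest /= 6;
                if (matched < static_cast<int>(seqs[i].size()) && seqs[i][matched] == c) {
                    ++matched;
                }
                next += matched * weight;
                weight *= 6;
            }
            if (dist[next] == -1) {
                dist[next] = dist[cur] + 1;
                q.push(next);
            }
        }
    }
    return -1;  // only reachable if some sequence contains a letter outside ACGT
}
```

Called on the sample sequences "ACGT", "ATGC", "CGTT", "CAGT", it should return 8, matching the sample output.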
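One more thing worth mentioning: I asked my teammates, and they said there are many ways to solve this problem. You can use IDA* or another heuristic search, or even skip searching altogether and use an AC automaton plus matrices. Those approaches all search over characters, though. I wouldn't say one is better than the other; the way of thinking is just different. Many problems have more than one solution, and thinking about the alternatives is really useful. As for the other approaches, I couldn't be bothered to write them (actually I just don't know how, lol).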
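For comparison, here is a rough sketch of what an IDA* version might look like. It is my own illustration, not my teammates' code. The heuristic (the longest unmatched remainder among the sequences) and the names `remainingBound`, `dfs`, and `idaStarLength` are assumptions made for this example, and the input is assumed to contain only the letters ACGT.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Lower bound on the characters still needed: the longest unmatched remainder.
int remainingBound(const std::vector<std::string>& seqs, const std::vector<int>& pos) {
    int bound = 0;
    for (std::size_t i = 0; i < seqs.size(); ++i) {
        bound = std::max(bound, static_cast<int>(seqs[i].size()) - pos[i]);
    }
    return bound;
}

// Depth-limited search: can everything be matched within `limit` characters in total?
bool dfs(const std::vector<std::string>& seqs, std::vector<int>& pos, int depth, int limit) {
    int bound = remainingBound(seqs, pos);
    if (bound == 0) return true;               // every sequence is fully matched
    if (depth + bound > limit) return false;   // prune: cannot finish within the limit
    const std::string letters = "ACGT";
    for (char c : letters) {
        std::vector<int> saved = pos;
        bool advanced = false;
        for (std::size_t i = 0; i < seqs.size(); ++i) {
            if (pos[i] < static_cast<int>(seqs[i].size()) && seqs[i][pos[i]] == c) {
                ++pos[i];
                advanced = true;
            }
        }
        if (advanced && dfs(seqs, pos, depth + 1, limit)) return true;
        pos = saved;  // undo and try the next letter
    }
    return false;
}

// Iterative deepening: raise the limit until the depth-limited search succeeds.
int idaStarLength(const std::vector<std::string>& seqs) {
    int maxLimit = 0;  // concatenating all sequences always works, so this is an upper bound
    for (const std::string& s : seqs) maxLimit += static_cast<int>(s.size());
    std::vector<int> pos(seqs.size(), 0);
    for (int limit = remainingBound(seqs, pos); limit <= maxLimit; ++limit) {
        if (dfs(seqs, pos, 0, limit)) return limit;
    }
    return -1;  // only reachable if some sequence contains a letter outside ACGT
}
```

The first limit that succeeds is the answer, because every smaller limit has already been ruled out. On the sample input this should agree with the BFS and give 8.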