Data Structures: Boyer Moore String Matching
Pattern Searching | Set 7 (Boyer Moore Algorithm – Bad Character Heuristic)
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.
Examples:
1) Input:
txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output:
Pattern found at index 10
2) Input:
txt[] = "AABAACAADAABAAABAA"
pat[] = "AABA"
Output:
Pattern found at index 0
Pattern found at index 9
Pattern found at index 13
Pattern searching is an important problem in computer science. When we search for a string in a Notepad or Word file, a browser, or a database, pattern searching algorithms are used to show the search results.
We have discussed the following algorithms in the previous posts:
Naive Algorithm
KMP Algorithm
Rabin Karp Algorithm
Finite Automata based Algorithm
In this post, we will discuss the Boyer Moore pattern searching algorithm. Like the KMP and Finite Automata algorithms, the Boyer Moore algorithm also preprocesses the pattern.
Boyer Moore is a combination of the following two approaches.
1) Bad Character Heuristic
2) Good Suffix Heuristic
Both of the above heuristics can also be used independently to search for a pattern in a text. Let us first understand how the two independent approaches work together in the Boyer Moore algorithm.
If we take a look at the Naive algorithm, it slides the pattern over the text one position at a time. The KMP algorithm preprocesses the pattern so that the pattern can be shifted by more than one position. The Boyer Moore algorithm does preprocessing for the same reason. It preprocesses the pattern and creates a separate array for each of the two heuristics. At every step, it slides the pattern by the maximum of the slides suggested by the two heuristics, so it uses the better of the two heuristics at every step. Unlike the previous pattern searching algorithms, the Boyer Moore algorithm starts matching from the last character of the pattern.
In this post, we will discuss the bad character heuristic, and we will discuss the Good Suffix heuristic in the next post.
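The idea of the bad character heuristic is simple. The character of the text that doesn’t match the current character of the pattern is called the Bad Character. Whenever a character doesn’t match, we slide the pattern so that the bad character in the text is aligned with its last occurrence in the pattern. If that last occurrence lies to the right of the mismatch position, the suggested slide would not be positive, so the pattern is moved by one position instead.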
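As a rough sketch of this rule, the following C fragment computes the slide for a single mismatch. The helper names last_occurrence and bad_char_shift are made up for illustration and are not part of the program given in this post. The last occurrence is found here by a plain scan rather than by a preprocessed table, only to keep the example short.

```c
#include <string.h>

/* Hypothetical helper: index of the last occurrence of c in pat[0..m-1],
   or -1 if c does not occur in the pattern at all. */
int last_occurrence(const char *pat, int m, char c)
{
    for (int i = m - 1; i >= 0; i--)
        if (pat[i] == c)
            return i;
    return -1;
}

/* Hypothetical helper: slide suggested by the bad character rule when
   pat[j] mismatches the text character `bad`. The result is never less
   than 1, so the pattern always moves forward. */
int bad_char_shift(const char *pat, int m, int j, char bad)
{
    int shift = j - last_occurrence(pat, m, bad);
    return shift > 1 ? shift : 1;
}
```

With pat = "ABC" and a mismatch at j = 2, a bad character 'A' gives a slide of 2 (its last occurrence is at index 0), and a bad character 'D', which is absent from the pattern, gives a slide of 3. A mismatch at j = 0 on 'C' gives a slide of 1, because the last occurrence of 'C' lies to the right of the mismatch.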
We preprocess the pattern and store the last occurrence of every possible character in an array of size equal to the alphabet size. If the bad character is not present in the pattern at all, the pattern can be moved completely past it, which may result in a shift by m (the length of the pattern). Therefore, the bad character heuristic takes O(n/m) time in the best case.
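To make the best-case claim concrete, here is a small sketch that counts character comparisons in a search that uses only the bad character rule. The function name count_comparisons and the constant ALPHABET_SIZE are assumptions introduced for this illustration. The function reports the number of comparisons rather than the matches.

```c
#include <string.h>

#define ALPHABET_SIZE 256

/* Hypothetical helper: number of pattern/text character comparisons made
   by a search that uses only the bad character rule. */
long count_comparisons(const char *txt, const char *pat)
{
    int n = (int) strlen(txt);
    int m = (int) strlen(pat);
    int last[ALPHABET_SIZE];
    long comparisons = 0;

    /* last[c] = index of the last occurrence of c in pat, or -1 */
    for (int c = 0; c < ALPHABET_SIZE; c++)
        last[c] = -1;
    for (int i = 0; i < m; i++)
        last[(unsigned char) pat[i]] = i;

    int s = 0;
    while (s <= n - m) {
        int j = m - 1;
        /* compare from the last character of the pattern backwards */
        while (j >= 0) {
            comparisons++;
            if (pat[j] != txt[s + j])
                break;
            j--;
        }
        if (j < 0) {
            /* full match: align the character just past the window */
            s += (s + m < n) ? m - last[(unsigned char) txt[s + m]] : 1;
        } else {
            int shift = j - last[(unsigned char) txt[s + j]];
            s += shift > 1 ? shift : 1;
        }
    }
    return comparisons;
}
```

For a text of sixteen 'A' characters and the pattern "BBBB", every alignment fails on its first comparison and the pattern slides by m = 4, so only 4 comparisons are made, which is n/m. By contrast, the text "AAAAAAAA" with the pattern "AAAA" needs 20 comparisons (5 alignments of 4 comparisons each), which shows that the bad character rule on its own does not give this speed-up for every input.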
#include <limits.h>
#include <string.h>
#include <stdio.h>

#define NO_OF_CHARS 256

// A utility function to get the maximum of two integers
int max(int a, int b) { return (a > b) ? a : b; }

// The preprocessing function for Boyer Moore's bad character heuristic
void badCharHeuristic(char *str, int size, int badchar[NO_OF_CHARS])
{
    int i;

    // Initialize all occurrences as -1
    for (i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;

    // Fill the actual value of the last occurrence of a character.
    // The cast to unsigned char keeps the index non-negative even
    // when plain char is signed.
    for (i = 0; i < size; i++)
        badchar[(unsigned char) str[i]] = i;
}

void Boyer_Moore_Matcher(char *txt, char *pat)
{
    int m = strlen(pat);
    int n = strlen(txt);
    int badchar[NO_OF_CHARS];

    badCharHeuristic(pat, m, badchar);

    int s = 0; // s is the shift of the pattern with respect to the text
    while (s <= (n - m))
    {
        int j = m - 1;

        // Compare from the last character of the pattern backwards
        while (j >= 0 && pat[j] == txt[s + j])
            j--;

        if (j < 0)
        {
            printf("\n pattern occurs at shift = %d", s);
            // Align the text character just past the window with its last
            // occurrence in the pattern; at the end of the text, move by 1
            s += (s + m < n) ? m - badchar[(unsigned char) txt[s + m]] : 1;
        }
        else
            // Align the bad character with its last occurrence in the
            // pattern; never shift by less than 1
            s += max(1, j - badchar[(unsigned char) txt[s + j]]);
    }
}

int main()
{
    char txt[] = "ABAAABCD";
    char pat[] = "ABC";
    Boyer_Moore_Matcher(txt, pat);
    return 0;
}
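The program above prints each shift as it is found. To check the two examples from the start of this post, it can help to collect the indices instead. The following is a minimal sketch of such a variant. The function name find_all and its out/cap parameters are assumptions made for this illustration, and the search logic is the same bad character rule as above.

```c
#include <string.h>

#define NO_OF_CHARS 256

/* Hypothetical variant of the matcher: stores up to cap match indices in
   out[] and returns the total number of matches found. */
int find_all(const char *txt, const char *pat, int out[], int cap)
{
    int m = (int) strlen(pat);
    int n = (int) strlen(txt);
    int badchar[NO_OF_CHARS];
    int count = 0;

    if (m == 0 || m > n)
        return 0;

    /* last occurrence of every character in the pattern, -1 if absent */
    for (int i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;
    for (int i = 0; i < m; i++)
        badchar[(unsigned char) pat[i]] = i;

    int s = 0; /* shift of the pattern with respect to the text */
    while (s <= n - m) {
        int j = m - 1;
        while (j >= 0 && pat[j] == txt[s + j])
            j--;

        if (j < 0) {
            if (count < cap)
                out[count] = s;
            count++;
            s += (s + m < n) ? m - badchar[(unsigned char) txt[s + m]] : 1;
        } else {
            int shift = j - badchar[(unsigned char) txt[s + j]];
            s += shift > 1 ? shift : 1;
        }
    }
    return count;
}
```

Called with the text "THIS IS A TEST TEXT" and the pattern "TEST", this sketch reports one match at index 10. With "AABAACAADAABAAABAA" and "AABA" it reports three matches at indices 0, 9 and 13, and with the text and pattern used in main() above ("ABAAABCD" and "ABC") it reports one match at index 4.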