[hyperscan] Example walkthrough: simplegrep

Example location: <hyperscan source>/examples/simplegrep.c
Reference: http://01org.github.io/hyperscan/dev-reference/api_files.html

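1. Overview

This example implements a simplified version of grep: given a regular expression and a file, it prints the position of each match in order.

It does not, however, support reading data from stdin, nor does it support grep's rich set of command-line options.

simplegrep demonstrates the following hyperscan concepts: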

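Compiling a single pattern
Uses the simplest interface, hs_compile, which accepts only one regular expression. The API for compiling several expressions together is hs_compile_multi.
Block-mode pattern matching
Searches a single block of data. The more complex alternative is matching on a stream, which can find matches that span data blocks.
Allocating and using temporary data (scratch)
hyperscan needs a block of temporary data (call it D) while scanning. The caller must ensure that at any given moment only one hs_scan call is using a particular D, but consecutive hs_scan calls are not required to use the same D. Because allocating D is expensive, for performance it is best to allocate D before running and reuse it at run time.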

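2. Source walkthrough

The example is very simple. Only the two parts that compile the expression and perform the scan are covered here; code such as reading the data file is skipped.

2.1 Compiling the regular expression (compile)

Before scanning, the regular expression must first be compiled to produce an hs_database_t.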

    hs_database_t *database;
    hs_compile_error_t *compile_err;
    if (hs_compile(pattern, HS_FLAG_DOTALL, HS_MODE_BLOCK, NULL, &database,
                   &compile_err) != HS_SUCCESS) {
        fprintf(stderr, "ERROR: Unable to compile pattern \"%s\": %s\n",
                pattern, compile_err->message);
        hs_free_compile_error(compile_err);
        return -1;
    }

The prototype of hs_compile is

hs_error_t hs_compile(const char * expression, 
                      unsigned int flags, 
                      unsigned int mode, 
                      const hs_platform_info_t * platform, 
                      hs_database_t ** db, 
                      hs_compile_error_t ** error)

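Here, expression is the regular expression string; flags controls the behavior of the pattern, for example case-insensitive matching (HS_FLAG_CASELESS) or letting . match newlines (HS_FLAG_DOTALL); mode determines the format of the generated database, chiefly one of BLOCK, STREAM and VECTORED, and a database built for one mode can only be used by the corresponding scan interface; platform specifies the target platform of the database (mainly CPU features), and NULL means the target platform is the same as the current one; db receives the compiled database; error receives error information.

2.2 Scanning (scan)

First, allocate the temporary data (scratch) that every scan needs.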

    hs_scratch_t *scratch = NULL;
    if (hs_alloc_scratch(database, &scratch) != HS_SUCCESS) {
        fprintf(stderr, "ERROR: Unable to allocate scratch space. Exiting.\n");
        free(inputData);
        hs_free_database(database);
        return -1;
    }

Next, perform the scan.

    if (hs_scan(database, inputData, length, 0, scratch, eventHandler,
                pattern) != HS_SUCCESS) {
        fprintf(stderr, "ERROR: Unable to scan input buffer. Exiting.\n");
        hs_free_scratch(scratch);
        free(inputData);
        hs_free_database(database);
        return -1;
    }

The prototype of hs_scan is

hs_error_t hs_scan(const hs_database_t * db, 
                   const char * data, 
                   unsigned int length, 
                   unsigned int flags, 
                   hs_scratch_t * scratch, 
                   match_event_handler onEvent, 
                   void * context)

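Here, db is the database compiled in the previous step; data and length are the data to scan and its length; flags is reserved for controlling the function's behavior in future versions and is currently unused; scratch is the temporary data used while scanning, allocated earlier; onEvent is the key parameter, namely the user-supplied callback invoked on each match; context is a user-defined pointer.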
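
Because length is an unsigned int, a buffer larger than UINT_MAX bytes cannot be handed to a single hs_scan call. A minimal sketch of clamping a file size to that limit follows; the helper name scan_length is hypothetical and is not part of Hyperscan, and any bytes beyond the limit would simply not be scanned.

```c
#include <limits.h>

/* Hypothetical helper (not a Hyperscan API): the largest length that can be
 * passed to one hs_scan call for a file of the given size. */
static unsigned int scan_length(unsigned long long file_size) {
    /* hs_scan takes its length as an unsigned int, so clamp to UINT_MAX. */
    return file_size > UINT_MAX ? UINT_MAX : (unsigned int)file_size;
}
```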

The prototype of the match callback is

typedef int (* match_event_handler)(unsigned int id,
                                    unsigned long long from,
                                    unsigned long long to,
                                    unsigned int flags,
                                    void *context);

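Here, id is the ID of the expression that matched; for the single expression compiled with hs_compile this value is 0. from is the start offset of the match if start-of-match reporting was requested at compile time (the HS_FLAG_SOM_LEFTMOST flag in the flags parameter of hs_compile; streaming databases additionally need an HS_MODE_SOM_HORIZON_* value in mode); otherwise it is set to 0. to is the offset of the byte following the last matched byte; flags is currently unused; context is the user-defined pointer that was passed to hs_scan.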

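A non-zero return value stops the scan; zero lets it continue. During a scan, the callback is invoked synchronously for every match until the scan ends.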
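
The return-value rule and the context pointer can be sketched together as follows. This is a minimal illustration, not code from simplegrep: the struct match_limit and the handler name limitedHandler are hypothetical, and only the parameter list is taken from the callback prototype.

```c
#include <stdio.h>

/* Hypothetical context type: not part of simplegrep or of the Hyperscan API. */
struct match_limit {
    unsigned int seen;   /* matches reported so far */
    unsigned int limit;  /* stop once this many matches have been seen */
};

/* Same parameter list as match_event_handler, so it could be given to hs_scan. */
static int limitedHandler(unsigned int id, unsigned long long from,
                          unsigned long long to, unsigned int flags,
                          void *ctx) {
    struct match_limit *ml = (struct match_limit *)ctx;
    (void)id; (void)from; (void)flags;   /* unused in this sketch */
    ml->seen++;
    printf("match %u ends at offset %llu\n", ml->seen, to);
    /* Non-zero asks the scanner to stop; zero lets it continue. */
    return ml->seen >= ml->limit ? 1 : 0;
}
```

Assuming the database and scratch from this example, it would be passed as hs_scan(database, inputData, length, 0, scratch, limitedHandler, &ml) with ml initialised to {0, N}. When a callback stops the scan, hs_scan returns HS_SCAN_TERMINATED rather than HS_SUCCESS, so a != HS_SUCCESS check like the one in simplegrep would need to allow for that value.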

The callback used in this example is

static int eventHandler(unsigned int id, unsigned long long from,
                        unsigned long long to, unsigned int flags, void *ctx) {
    printf("Match for pattern \"%s\" at offset %llu\n", (char *)ctx, to);
    return 0;
}

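It prints the regular expression and the position of the match (the offset, within the data, of the byte following the last matched byte).

2.3 Cleaning up

When the program finishes, it should clean up its data and free the memory.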
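
Since the reported position is an end offset rather than a line, a grep -n style line number has to be derived from it. One possible sketch, assuming a hypothetical helper line_of_match_end that is not part of simplegrep, counts the newlines that precede the last matched byte:

```c
#include <stddef.h>

/* Hypothetical helper: 1-based line number of the last byte of a match.
 * `end` is the `to` value given to the callback, i.e. the offset of the
 * byte after the last matched byte, so the last matched byte is end - 1. */
static unsigned long long line_of_match_end(const char *data, size_t length,
                                            unsigned long long end) {
    unsigned long long line = 1;
    size_t i;
    /* Count newlines strictly before the last matched byte. */
    for (i = 0; i + 1 < end && i < length; i++) {
        if (data[i] == '\n') {
            line++;
        }
    }
    return line;
}
```

A callback could call this with inputData and length carried in its context. Rescanning from the start on every match is linear in the offset, so for large inputs a running line counter kept in the context would be the cheaper design.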

    hs_free_scratch(scratch);
    free(inputData);
    hs_free_database(database);

3. Building and running

Before compiling, I had already installed the hyperscan headers and static library under /usr/local with make install.

gcc -o simplegrep simplegrep.c -lhs -lstdc++ -lm

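Note the link against the stdc++ and math libraries (-lstdc++ -lm). When linking against the shared library, -lstdc++ -lm are not needed.

Run it, matching /[f|F]ile/ against another example source file, pcapscan.cc. (Inside a character class | is a literal, so [f|F] also matches the character |; [fF] would be the stricter form.)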

./simplegrep '[f|F]ile' pcapscan.cc   
Scanning 22859 bytes with Hyperscan
Match for pattern "[f|F]ile" at offset 1692
..... (omitted; 45 matches in total)

Verify the result with the grep command

grep -o '[f|F]ile' pcapscan.cc | wc -l
45

OK, that is 45 matches as well.

posted @ 2015-10-23 13:35 赵子清
