Repost: Learning Boost.Regex

Reposted from a post by an expert in the cnblogs community.

1: Installation

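Download Boost and unpack it (I used Boost 1.32.0) into a directory such as boost_1_32_0.
Then run the following commands in order:
cd boost_1_32_0/tools/build/jam_src   # enter the bjam source directory; installing Boost requires bjam
sh ./build.sh                         # build bjam
cp bin.linuxx86/bjam /bin             # copy bjam into a directory on PATH
                                      # (I used /bin; any other directory works,
                                      # as long as bjam can be invoked from anywhere)
cd ../../..                           # return to the boost_1_32_0 directory
bjam "-sTOOLS=gcc" "--includedir=/usr/include" "--libdir=/usr/lib/boost" install
    # build and install Boost; this takes a long time and needs no intervention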

After that, just wait for the build to finish.

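On some systems you can run ./build.sh directly, without sh to name the shell interpreter, but that did not work on my machine. I am not familiar with shell scripting, so I simply added sh.
If any reader knows why, please let me know.
Running ./build.sh may fail with a "No such file or directory" error, or with a complaint that the script is not executable. In that case, change the permissions of build.sh,
for example: chmod 0777 build.sh (chmod +x build.sh is enough to make it executable)
If the script fails inside the function echo_run(), change this statement in build.sh
BOOST_JAM_TOOLSET=
to
BOOST_JAM_TOOLSET=""
and run sh ./build.sh again. I am not sure of the exact cause. If any reader knows why, please let me know.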

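About the arguments passed to bjam:
-sTOOLS=gcc selects GCC as the compiler
--includedir=/usr/include/ sets the directory where the headers are installed; I used /usr/include. If the installation succeeds, a directory boost_1_32 is created under /usr/include/, and it holds the Boost headers
--libdir=/usr/lib/boost sets where the Boost library files go; the generated .a and .so files are placed in this directory
install builds and installs Boost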

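2: Learning regular expressions
http://www.cppblog.com/Files/shootingstars/deelx_zh.rar
This is good study material for regular expressions. While I am at it, I also recommend:
http://www.regexlab.com/
I once exchanged a letter with the author of that site (about my article "P2P之UDP穿透NAT的原理与实现（附源代码）", on the principle and implementation of UDP NAT traversal for P2P, with source code). His regular expression library has been well received on CodeProject.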

3: A simple example
    std::string regstr = "a+";
    boost::regex expression(regstr);
    std::string testString = "aaa";

    // Matches a string made of one or more 'a' characters
    if( boost::regex_match(testString, expression) )
    {
        std::cout<< "Match" << std::endl;
    }
    else
    {
        std::cout<< "Not Match" << std::endl;
    }
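To make the "whole string" requirement of regex_match concrete, here is a minimal sketch. It uses std::regex from the C++11 standard library in place of boost::regex so that it builds without Boost installed; that substitution is my assumption, and the helper name matches_whole is hypothetical. boost::regex_match is called with the same arguments.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only when the pattern matches the whole of text.
bool matches_whole(const std::string& text, const std::string& pattern)
{
    const std::regex expression(pattern);
    return std::regex_match(text, expression);
}
```

With this helper, matches_whole("aaa", "a+") is true, while matches_whole("aaab", "a+") is false because the trailing b is not covered by the pattern.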

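4: Learning regex_match through example code
1 We often need to check whether a string is a valid IP address. A valid IP address has the following form:
xxx.xxx.xxx.xxx, where each xxx is an integer no greater than 255
Finding strings of this shape with a regular expression is easy; checking that xxx does not exceed 255 is harder, because regular expressions work on text, not numbers.
OK, let's first handle a single number, xxx: we need an expression that matches it and guarantees that it does not exceed 255.
Case 1: x, a single digit from 0 to 9, written \d
Case 2: xx, two digits from 00 to 99, written \d\d
Case 3: xxx, which splits into two sub-cases. One is 1xx, written 1\d\d
The other is 2xx, which again splits into two: 2[0-4]\d for 200 to 249
and 25[0-5] for 250 to 255
Combining these gives
1?\d{1,2}|2[0-4]\d|25[0-5]
which matches a digit string whose value is no greater than 255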
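A quick way to gain confidence in a pattern like this is to run it against every candidate value. The sketch below does that with std::regex (a stand-in for boost::regex, chosen so it builds without Boost). It uses the character classes 2[0-4] and 25[0-5], so that 200 to 209 and 250 are covered; the helper names is_octet and count_accepted are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when text is a decimal string in the range 0..255.
// Two-digit forms with a leading zero, such as "07", are also accepted.
bool is_octet(const std::string& text)
{
    static const std::regex octet("1?\\d{1,2}|2[0-4]\\d|25[0-5]");
    return std::regex_match(text, octet);
}

// Hypothetical self-check: counts the values in 0..999 that is_octet accepts.
int count_accepted()
{
    int accepted = 0;
    for (int value = 0; value < 1000; ++value)
    {
        if (is_octet(std::to_string(value)))
            ++accepted;
    }
    return accepted;
}
```

If the pattern is right, count_accepted() returns 256, one for each value from 0 to 255.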

Now we only need to repeat this pattern four times:
(1?\d{1,2}|2[0-4]\d|25[0-5])\.(1?\d{1,2}|2[0-4]\d|25[0-5])\.(1?\d{1,2}|2[0-4]\d|25[0-5])\.(1?\d{1,2}|2[0-4]\d|25[0-5])

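It is rather long. I tried to shorten it with the backreferences that Boost supports, but this did not work. Readers who know Boost regular expressions well are welcome to advise:
(1?\d{1,2}|2[0-4]\d|25[0-5])\.\1$\.\1$\.\1$
(See the section on back references: http://www.boost.org/libs/regex/doc/syntax_perl.html
It appears that a backreference matches only a string identical to the text the first group captured, not the group's pattern again, which is not what we need here.)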

Example:
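One way to avoid typing the octet pattern four times is to build the full pattern by string concatenation in the program. The sketch below shows that, along with a small check of how a backreference behaves. It uses std::regex as a stand-in for boost::regex so that it builds without Boost, and all three function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: builds the four-field pattern from one octet pattern,
// so the octet is written only once in the source.
std::string make_ip_pattern()
{
    const std::string octet = "(1?\\d{1,2}|2[0-4]\\d|25[0-5])";
    return octet + "\\." + octet + "\\." + octet + "\\." + octet;
}

// Hypothetical helper: true when text has the dotted four-field form.
bool is_ip_address(const std::string& text)
{
    static const std::regex expression(make_ip_pattern());
    return std::regex_match(text, expression);
}

// A backreference repeats the captured text, not the pattern:
// "(\d+)\.\1" accepts "12.12" but rejects "12.34".
bool same_field_twice(const std::string& text)
{
    static const std::regex expression("(\\d+)\\.\\1");
    return std::regex_match(text, expression);
}
```

The regular expression the library sees is as long as before, but the source stays short and the four groups can still be read back individually.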

2 Now let's look at further prototypes of regex_match
template <class ST, class SA, class Allocator, class charT, class traits>
    bool regex_match(const basic_string<charT, ST, SA>& s,
    match_results<typename basic_string<charT, ST, SA>::const_iterator, Allocator>& m,
    const basic_regex <charT, traits>& e, match_flag_type flags = match_default);
template <class BidirectionalIterator, class Allocator, class charT, class traits>
    bool regex_match(BidirectionalIterator first, BidirectionalIterator last,
    match_results<BidirectionalIterator , Allocator>& m,
    const basic_regex <charT, traits>& e, match_flag_type flags = match_default);

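Note the parameter m. If the function returns false, m is undefined. If it returns true, m is set as shown below.

(The same applies to boost::smatch what in the code further down: what[0] is the whole matched string, and what[n] is the string matched by the n-th pair of parentheses.)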

Element	Value
m.size()	e.mark_count()
m.empty()	false
m.prefix().first	first
m.prefix().second	first
m.prefix().matched	false
m.suffix().first	last
m.suffix().second	last
m.suffix().matched	false
m[0].first	first
m[0].second	last
m[0].matched	`true` if a full match was found, and `false` if it was a partial match (found as a result of the `match_partial` flag being set).
m[n].first	For all integers n < m.size(), the start of the sequence that matched sub-expression n. Alternatively, if sub-expression n did not participate in the match, then last.
m[n].second	For all integers n < m.size(), the end of the sequence that matched sub-expression n. Alternatively, if sub-expression n did not participate in the match, then last.
m[n].matched	For all integers n < m.size(), true if sub-expression n participated in the match, false otherwise.
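The matched field is easiest to understand with a pattern in which one group cannot take part in the match. The sketch below uses std::regex and std::smatch as stand-ins for their Boost counterparts (my assumption, so that it builds without Boost); the function name participation is hypothetical. With std::regex, m.size() is the number of groups plus one for m[0].

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: matches "(a)|(b)" against text and reports, as a
// string of '1' and '0', whether m[0], m[1] and m[2] took part in the match.
std::string participation(const std::string& text)
{
    static const std::regex expression("(a)|(b)");
    std::smatch m;
    if (!std::regex_match(text, m, expression))
        return "no match";

    std::string flags;
    for (std::size_t n = 0; n < m.size(); ++n)
        flags += m[n].matched ? '1' : '0';
    return flags;
}
```

For the input "b" the result is "101": the whole match and the second group matched, while the first group did not participate.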

Example:

std::string regstr =
"(1?\\d{1,2}|2[0-4]\\d|25[0-5])\\.(1?\\d{1,2}|2[0-4]\\d|25[0-5])\\.(1?\\d{1,2}|2[0-4]\\d|25[0-5])\\.(1?\\d{1,2}|2[0-4]\\d|25[0-5])";
boost::regex expression(regstr);
std::string testString = "192.168.4.1";
boost::smatch what;
if( boost::regex_match(testString, what, expression) )
{
    std::cout<< "This is ip address" << std::endl;
    for(int i = 1;i <= 4;i++)
    {
        std::string msg(what[i].first, what[i].second);
        std::cout<< i << "：" << msg.c_str() << std::endl;
    }
}
else
{
    std::cout<< "This is not ip address" << std::endl;
}

This example prints each numeric field of the IP address:
This is ip address
1：192
2：168
3：4
4：1
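5: Learning regex_search
regex_search is much like regex_match, except that it does not require the whole input to match; a match on part of the input (a find) is enough.
That is, regex_match requires the regular expression to match the entire line, whereas regex_search checks whether the line contains a match for the expression anywhere.
A simple example: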

std::string regstr = "(\\d+)";
boost::regex expression(regstr);
std::string testString = "192.168.4.1";
boost::smatch what;
if( boost::regex_search(testString, expression) )
{
    std::cout<< "Have digit" << std::endl;
}

The example above checks whether the given string contains any digits.
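The difference between the two functions shows up clearly when the same pattern is passed to both. This is a small sketch using std::regex as a stand-in for boost::regex (my assumption, so that it builds without Boost); the function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the whole of text consists of digits.
bool is_all_digits(const std::string& text)
{
    static const std::regex expression("\\d+");
    return std::regex_match(text, expression);
}

// Hypothetical helper: true when text contains a run of digits anywhere.
bool has_digit(const std::string& text)
{
    static const std::regex expression("\\d+");
    return std::regex_search(text, expression);
}
```

For "192.168.4.1", has_digit returns true but is_all_digits returns false, because the dots are not covered by \d+.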

Now one more example, which prints every number in the string:

std::string regstr = "(\\d+)";
boost::regex expression(regstr);
std::string testString = "192.168.4.1";
boost::smatch what;
std::string::const_iterator start = testString.begin();
std::string::const_iterator end = testString.end();
while( boost::regex_search(start, end, what, expression) )
{
    std::cout<< "Have digit：" ;
    std::string msg(what[1].first, what[1].second);
    std::cout<< msg.c_str() << std::endl;
    start = what[0].second;
}

Output:
Have digit：192
Have digit：168
Have digit：4
Have digit：1
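The same traversal can also be written with a regex iterator instead of advancing start by hand. The sketch below uses std::sregex_iterator; Boost provides boost::sregex_iterator with the same usage, but std::regex is used here (my assumption) so that the code builds without Boost. The function name find_numbers is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every run of digits found in text, in order.
std::vector<std::string> find_numbers(const std::string& text)
{
    static const std::regex expression("(\\d+)");
    std::vector<std::string> numbers;
    const std::sregex_iterator last;  // a default-constructed iterator marks the end
    for (std::sregex_iterator it(text.begin(), text.end(), expression); it != last; ++it)
        numbers.push_back((*it)[1].str());  // group 1 holds the digits
    return numbers;
}
```

Calling find_numbers("192.168.4.1") yields the four strings 192, 168, 4 and 1.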
6: Greedy repetition
Let's start with an example:

std::string regstr = "(.*)(age)(.*)(\\d{2})";
boost::regex expression(regstr);
std::string testString = "My age is 28 His age is 27";
boost::smatch what;
std::string::const_iterator start = testString.begin();
std::string::const_iterator end = testString.end();
while( boost::regex_search(start, end, what, expression) )
{

    std::string name(what[1].first, what[1].second);
    std::string age(what[4].first, what[4].second);
    std::cout<< "Name:" << name.c_str() << std::endl;
    std::cout<< "Age:" <<age.c_str() << std::endl;
    start = what[0].second;
}

We wanted it to print each person's name followed by the age, but the result is disappointing:
Name:My age is 28 His
Age:27

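The cause: this is a side effect of repetition operators such as "+" and "*". They consume as much input as possible, which makes them "greedy". In other words, (.*) matches the longest string that still lets the overall match succeed, not the shortest.
To make a repetition operator non-greedy, append "?" to it.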

std::string regstr = "(.*?)(age)(.*?)(\\d{2})";
boost::regex expression(regstr);
std::string testString = "My age is 28 His age is 27";
boost::smatch what;
std::string::const_iterator start = testString.begin();
std::string::const_iterator end = testString.end();
while( boost::regex_search(start, end, what, expression) )
{

    std::string name(what[1].first, what[1].second);
    std::string age(what[4].first, what[4].second);
    std::cout<< "Name:" << name.c_str() << std::endl;
    std::cout<< "Age:" <<age.c_str() << std::endl;
    start = what[0].second;
}

Output:
Name:My
Age:28
Name: His
Age:27
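To compare the greedy and non-greedy patterns side by side, the search loop can be wrapped in a function that takes the pattern as a parameter. This sketch uses std::regex as a stand-in for boost::regex (my assumption, so that it builds without Boost); the function name extract is hypothetical.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: runs the search loop with the given pattern and
// returns each (group 1, group 4) pair it finds.
std::vector<std::pair<std::string, std::string>>
extract(const std::string& text, const std::string& pattern)
{
    const std::regex expression(pattern);
    std::vector<std::pair<std::string, std::string>> found;
    std::smatch what;
    std::string::const_iterator start = text.begin();
    const std::string::const_iterator end = text.end();
    while (std::regex_search(start, end, what, expression))
    {
        found.emplace_back(what[1].str(), what[4].str());
        start = what[0].second;  // continue after the current match
    }
    return found;
}
```

With "(.*)(age)(.*)(\\d{2})" the function returns a single pair whose first element is "My age is 28 His "; with "(.*?)(age)(.*?)(\\d{2})" it returns two pairs, one per person. Note that the captured names keep their surrounding spaces.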
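7: Learning regex_replace
Here is a regular expression that strips invalid characters (spaces, carriage returns, tabs) from the left side of a string.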

std::string testString = " \r\n Hello World ! GoodBye World\r\n";
std::string TrimLeft = "([\\s\\r\\n\\t]*)(\\w*.*)";
boost::regex expression(TrimLeft);
testString = boost::regex_replace( testString, expression, "$2" );
std::cout<< "TrimLeft:" << testString <<std::endl;

Output:
TrimLeft:Hello World ! GoodBye World

And one that strips invalid characters from the right side:
std::string s = "Hello World ! GoodBye World\r\n \r\n ";
boost::regex reg("((\\w*.*)*)\\b([\\s\\r\\n\\t])*");
std::cout << "regular expression is" << reg << std::endl;
std::cout << "the string before replace:" << s << std::endl;
s = boost::regex_replace(s, reg, "$1");
std::cout << "the string after replace:" << s << std::endl;
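As a simpler alternative to the two patterns above (not the original author's approach), trimming can also be done by anchoring a whitespace run to the start or end of the string and replacing it with nothing. The sketch below uses std::regex as a stand-in for boost::regex so that it builds without Boost; the function names trim_left and trim_right are hypothetical, and the behaviour of ^ and $ around embedded line breaks can differ between regex implementations.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes whitespace (spaces, tabs, CR, LF) from the
// start of text by replacing the leading run with an empty string.
std::string trim_left(const std::string& text)
{
    static const std::regex leading("^\\s+");
    return std::regex_replace(text, leading, "");
}

// Hypothetical helper: removes whitespace from the end of text.
std::string trim_right(const std::string& text)
{
    static const std::regex trailing("\\s+$");
    return std::regex_replace(text, trailing, "");
}
```

Because \s already covers spaces, tabs, carriage returns and line feeds, the character class does not need to list \r, \n and \t separately.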

posted @ 2008-03-04 10:52 owomo
