GNU libc Regular Expression Matching: Usage Documentation

Regular Expression Matching

The GNU C Library supports two interfaces for matching regular expressions. One is the standard POSIX.2 interface, and the other is the interface the GNU C Library has provided for many years.

Both interfaces are declared in the header file regex.h. If you define _POSIX_C_SOURCE, then only the POSIX.2 functions, structures, and constants are declared.

 


 

POSIX Regular Expression Compilation

Before you can actually match a regular expression, you must compile it. This is not true compilation: it produces a special data structure, not machine instructions. But it is like ordinary compilation in that its purpose is to enable you to “execute” the pattern fast. (See Matching a Compiled POSIX Regular Expression, for how to use the compiled regular expression for matching.)

 

There is a special data type for compiled regular expressions:

 

— Data Type: regex_t

This type of object holds a compiled regular expression. It is actually a structure. It has just one field that your programs should look at:

re_nsub
This field holds the number of parenthetical subexpressions in the regular expression that was compiled.

There are several other fields, but we don't describe them here, because only the functions in the library should use them.

 

After you create a regex_t object, you can compile a regular expression into it by calling regcomp.

 

— Function: int regcomp (regex_t *restrict compiled, const char *restrict pattern, int cflags)

The function regcomp “compiles” a regular expression into a data structure that you can use with regexec to match against a string. The compiled regular expression format is designed for efficient matching. regcomp stores it into *compiled.

 

It's up to you to allocate an object of type regex_t and pass its address to regcomp.

 

The argument cflags lets you specify various options that control the syntax and semantics of regular expressions. See Flags for POSIX Regular Expressions.

 

If you use the flag REG_NOSUB, then regcomp omits from the compiled regular expression the information necessary to record how subexpressions actually match. In this case, you might as well pass 0 for nmatch and NULL for matchptr when you call regexec.

 

If you don't use REG_NOSUB, then the compiled regular expression does have the capacity to record how subexpressions match. Also, regcomp tells you how many subexpressions pattern has, by storing the number in compiled->re_nsub. You can use that value to decide how long an array to allocate to hold information about subexpression matches. Since element 0 of that array describes the match of the whole expression, the array needs compiled->re_nsub + 1 elements to cover every subexpression.
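As a minimal sketch of sizing the match array from re_nsub, the following helper compiles a pattern and allocates one element for the whole match plus one per subexpression. The name alloc_match_array and its calling convention are assumptions made for illustration; they are not part of the library.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: compile PATTERN into *COMPILED and allocate an
   array with room for the whole match plus every subexpression.
   On success, stores the element count in *NMATCH and returns the
   array; the caller frees it and calls regfree on *COMPILED.
   Returns NULL if compilation or allocation fails.  */
regmatch_t *
alloc_match_array (regex_t *compiled, const char *pattern, int cflags,
                   size_t *nmatch)
{
  assert (compiled != NULL && pattern != NULL && nmatch != NULL);

  if (regcomp (compiled, pattern, cflags) != 0)
    return NULL;

  /* Element 0 describes the whole match, so add 1 to re_nsub.  */
  *nmatch = compiled->re_nsub + 1;
  regmatch_t *matches = malloc (*nmatch * sizeof *matches);
  if (matches == NULL)
    regfree (compiled);
  return matches;
}
```

For the extended pattern ‘(a)(b(c))’, which has three parenthetical subexpressions, this sketch would allocate four elements.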

 

regcomp returns 0 if it succeeds in compiling the regular expression; otherwise, it returns a nonzero error code (see the table below). You can use regerror to produce an error message string describing the reason for a nonzero value; see POSIX Regexp Matching Cleanup.

 

Here are the possible nonzero values that regcomp can return:

 

REG_BADBR

  There was an invalid ‘\{...\}’ construct in the regular expression. A valid ‘\{...\}’ construct must contain either a single number, or two numbers in increasing order separated by a comma.

REG_BADPAT

  There was a syntax error in the regular expression.

REG_BADRPT

  A repetition operator such as ‘?’ or ‘*’ appeared in a bad position (with no preceding subexpression to act on).

REG_ECOLLATE

  The regular expression referred to an invalid collating element (one not defined in the current locale for string collation). See Locale Categories.

REG_ECTYPE

  The regular expression referred to an invalid character class name.

REG_EESCAPE

  The regular expression ended with ‘\’.

REG_ESUBREG

  There was an invalid number in the ‘\digit’ construct.

REG_EBRACK

  There were unbalanced square brackets in the regular expression.

REG_EPAREN

  An extended regular expression had unbalanced parentheses, or a basic regular expression had unbalanced ‘\(’ and ‘\)’.

REG_EBRACE

  The regular expression had unbalanced ‘\{’ and ‘\}’.

REG_ERANGE

  One of the endpoints in a range expression was invalid.

REG_ESPACE

  regcomp ran out of memory.
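To see these codes in practice, here is a small sketch that compiles a pattern, returns the status from regcomp, and prints the corresponding regerror message on failure. The helper name try_compile and the 128-byte message buffer are assumptions chosen for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: try to compile PATTERN and return regcomp's
   status code (0 on success).  On failure, print the message that
   regerror produces for that code to stderr.  */
int
try_compile (const char *pattern, int cflags)
{
  assert (pattern != NULL);

  regex_t compiled;
  int status = regcomp (&compiled, pattern, cflags);
  if (status != 0)
    {
      char message[128];   /* arbitrary size; long messages are truncated */
      regerror (status, &compiled, message, sizeof message);
      fprintf (stderr, "regcomp(\"%s\"): %s\n", pattern, message);
      return status;
    }

  regfree (&compiled);
  return 0;
}
```

With the GNU C Library, a pattern such as ‘[abc’ compiled with REG_EXTENDED should yield REG_EBRACK, and ‘(abc’ should yield REG_EPAREN. Which code an implementation picks for a given malformed pattern can vary, so programs usually test only for a nonzero result and rely on regerror for the text.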

 


 

Flags for POSIX Regular Expressions

These are the bit flags that you can use in the cflags operand when compiling a regular expression with regcomp.

 

REG_EXTENDED

Treat the pattern as an extended regular expression, rather than as a basic regular expression.

REG_ICASE

Ignore case when matching letters.

REG_NOSUB

Don't bother storing the contents of the matchptr array.

REG_NEWLINE

Treat a newline in string as dividing string into multiple lines, so that ‘$’ can match before the newline and ‘^’ can match after. Also, don't permit ‘.’ to match a newline, and don't permit ‘[^...]’ to match a newline. Otherwise, newline acts like any other ordinary character.

 


Matching a Compiled POSIX Regular Expression

Once you have compiled a regular expression, as described in POSIX Regular Expression Compilation, you can match it against strings using regexec. A match anywhere inside the string counts as success, unless the regular expression contains anchor characters (‘^’ or ‘$’).

 

— Function: int regexec (const regex_t *restrict compiled, const char *restrict string, size_t nmatch, regmatch_t matchptr[restrict], int eflags)

This function tries to match the compiled regular expression *compiled against string.

 

regexec returns 0 if the regular expression matches; otherwise, it returns a nonzero value. See the table below for what nonzero values mean. You can use regerror to produce an error message string describing the reason for a nonzero value; see POSIX Regexp Matching Cleanup.

 

The argument eflags is a word of bit flags that enable various options.

 

If you want to get information about what part of string actually matched the regular expression or its subexpressions, use the arguments matchptr and nmatch. Otherwise, pass 0 for nmatch, and NULL for matchptr. See Match Results with Subexpressions.

 

You must match the regular expression with the same set of current locales that were in effect when you compiled the regular expression.

 

The function regexec accepts the following flags in the eflags argument:

REG_NOTBOL

  Do not regard the beginning of the specified string as the beginning of a line; more generally, don't make any assumptions about what text might precede it.

REG_NOTEOL

  Do not regard the end of the specified string as the end of a line; more generally, don't make any assumptions about what text might follow it.

 

Here are the possible nonzero values that regexec can return:

REG_NOMATCH

  The pattern didn't match the string. This isn't really an error.

REG_ESPACE

  regexec ran out of memory.

 


Match Results with Subexpressions

When regexec matches parenthetical subexpressions of pattern, it records which parts of string they match. It returns that information by storing the offsets into an array whose elements are structures of type regmatch_t. The first element of the array (index 0) records the part of the string that matched the entire regular expression. Each other element of the array records the beginning and end of the part that matched a single parenthetical subexpression.

 

— Data Type: regmatch_t

This is the data type of the matchptr array that you pass to regexec. It contains two structure fields, as follows:

rm_so
The offset in string of the beginning of a substring. Add this value to string to get the address of that part.

rm_eo
The offset in string of the end of the substring, that is, the offset of the first character after it.

— Data Type: regoff_t

regoff_t is an alias for a signed integer type. The fields of regmatch_t have type regoff_t.

The regmatch_t elements correspond to subexpressions positionally; the element at index 1 records where the first subexpression matched, the element at index 2 records the second subexpression, and so on. The order of the subexpressions is the order in which they begin.

 

When you call regexec, you specify how long the matchptr array is, with the nmatch argument. This tells regexec how many elements to store. Because element 0 holds the match of the whole expression, an array of nmatch elements has room for the first nmatch - 1 subexpressions; if the regular expression has more than that, you won't get offset information about the rest of them. But this doesn't alter whether the pattern matches a particular string or not.

 

If you don't want regexec to return any information about where the subexpressions matched, you can either supply 0 for nmatch, or use the flag REG_NOSUB when you compile the pattern with regcomp.
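The offsets in a regmatch_t element are enough to copy the matched text out of string. The following sketch does that for one subexpression. The helper copy_submatch, its use of REG_EXTENDED, and its NULL-on-failure convention are assumptions made for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: match PATTERN (extended syntax) against TEXT and
   return a malloc'd copy of the part matched by subexpression INDEX
   (0 means the whole match).  Returns NULL if the pattern fails to
   compile, does not match, INDEX is out of range, the subexpression
   took no part in the match, or memory runs out.  Caller frees.  */
char *
copy_submatch (const char *pattern, const char *text, size_t index)
{
  assert (pattern != NULL && text != NULL);

  regex_t compiled;
  if (regcomp (&compiled, pattern, REG_EXTENDED) != 0)
    return NULL;

  char *result = NULL;
  size_t nmatch = compiled.re_nsub + 1;   /* element 0 is the whole match */
  regmatch_t *matches = malloc (nmatch * sizeof *matches);

  if (matches != NULL && index < nmatch
      && regexec (&compiled, text, nmatch, matches, 0) == 0
      && matches[index].rm_so != -1)
    {
      /* rm_so and rm_eo are offsets into TEXT; rm_eo is one past the end.  */
      size_t length = (size_t) (matches[index].rm_eo - matches[index].rm_so);
      result = malloc (length + 1);
      if (result != NULL)
        {
          memcpy (result, text + matches[index].rm_so, length);
          result[length] = '\0';
        }
    }

  free (matches);
  regfree (&compiled);
  return result;
}
```

For the pattern ‘([a-z]+)@([a-z.]+)’ and the string ‘mail bob@example.org now’, index 0 selects ‘bob@example.org’, index 1 selects ‘bob’, and index 2 selects ‘example.org’.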

 


 

Complications in Subexpression Matching

Sometimes a subexpression matches a substring of no characters. This happens when ‘f\(o*\)’ matches the string ‘fum’. (It really matches just the ‘f’.) In this case, both of the offsets identify the point in the string where the null substring was found. In this example, the offsets are both 1.

 

Sometimes the entire regular expression can match without using some of its subexpressions at all. For example, when ‘ba\(na\)*’ matches the string ‘ba’, the parenthetical subexpression is not used. When this happens, regexec stores -1 in both fields of the element for that subexpression.

 

Sometimes matching the entire regular expression can match a particular subexpression more than once. For example, when ‘ba\(na\)*’ matches the string ‘bananana’, the parenthetical subexpression matches three times. When this happens, regexec usually stores the offsets of the last part of the string that matched the subexpression. In the case of ‘bananana’, these offsets are 6 and 8.

 

But the last match is not always the one that is chosen. It's more accurate to say that the last opportunity to match is the one that takes precedence. What this means is that when one subexpression appears within another, then the results reported for the inner subexpression reflect whatever happened on the last match of the outer subexpression. For an example, consider ‘\(ba\(na\)*s \)*’ matching the string ‘bananas bas ’. The last time the inner expression actually matches is near the end of the first word. But it is considered again in the second word, and fails to match there. regexec reports nonuse of the “na” subexpression.

 

Another place where this rule applies is when the regular expression

     \(ba\(na\)*s \|nefer\(ti\)* \)*

matches ‘bananas nefertiti ’ (note the trailing space, which the pattern requires after each word). The “na” subexpression does match in the first word, but it doesn't match in the second word because the other alternative is used there. Once again, the second repetition of the outer subexpression overrides the first, and within that second repetition, the “na” subexpression is not used. So regexec reports nonuse of the “na” subexpression.

 


 

POSIX Regexp Matching Cleanup

When you are finished using a compiled regular expression, you can free the storage it uses by calling regfree.

 

— Function: void regfree (regex_t *compiled)

Calling regfree frees all the storage that *compiled points to. This includes various internal fields of the regex_t structure that aren't documented in this manual.

regfree does not free the object *compiled itself.

 

You should always free the space in a regex_t structure with regfree before using the structure to compile another regular expression.

 

When regcomp or regexec reports an error, you can use the function regerror to turn it into an error message string.

 

— Function: size_t regerror (int errcode, const regex_t *restrict compiled, char *restrict buffer, size_t length)

This function produces an error message string for the error code errcode, and stores the string in length bytes of memory starting at buffer. For the compiled argument, supply the same compiled regular expression structure that regcomp or regexec was working with when it got the error. Alternatively, you can supply NULL for compiled; you will still get a meaningful error message, but it might not be as detailed.

If the error message can't fit in length bytes (including a terminating null character), then regerror truncates it. The string that regerror stores is always null-terminated even if it has been truncated.

The return value of regerror is the minimum length needed to store the entire error message, including the terminating null character. If this is no greater than length, then the error message was not truncated, and you can use it. Otherwise, you should call regerror again with a larger buffer.

Here is a function which uses regerror, but always dynamically allocates a buffer for the error message:

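          #include <regex.h>
          #include <stdlib.h>

          /* Return a malloc'd error message for errcode, or NULL if
             memory cannot be allocated.  The caller frees the result.  */
          char *get_regerror (int errcode, const regex_t *compiled)
          {
            /* A zero-length call only reports the size that is needed.  */
            size_t length = regerror (errcode, compiled, NULL, 0);
            char *buffer = malloc (length);
            if (buffer != NULL)
              (void) regerror (errcode, compiled, buffer, length);
            return buffer;
          }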
posted @ 2013-03-07 16:02  deaconx