Programming with GNU Regex Library (Chinese translation) - clq

http://www.newsmth.net/pc/pccon.php?id=4954&nid=118147

--------------------------------------------------

Programming with GNU Regex Library
version 1.0
Technical Report: 96004
Date: Jun. 14, 1996

ASPAC Project
Computing Center of Academia Sinica
Workstation Lab.

E-mail: aspac@phi.sinica.edu.tw

--------------------------------------------------------------------------------

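ASPAC Project Copyright Notice

ASPAC (Academia Sinica PACkage) is a project of the Computing Center of Academia Sinica on "Software Tools" and "Problem Solving". All software and documents developed under this project belong to the Computing Center of Academia Sinica. All officially released material in electronic form (software and documents) may be obtained free of charge and used freely, subject to the following terms:
Rights to use the software:
The GNU General Public License, version 2 (June 1991), of the FSF (Free Software Foundation) applies.
Rights to use the documents:
Documents may be freely copied and quoted, but not for profit, except for the collection of necessary handling fees.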

--------------------------------------------------------------------------------
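Contents:
Background:
  What is a Regular Expression?
  What is the GNU Regex library, and what can it do?
  How do you obtain and build the GNU Regex library?
Regular Expression syntax:
  Syntax bits
  Predefined syntaxes
  Using the special character \
Regular Expression operators:
  Common
  GNU
  Emacs
Programming with the GNU Regex library:
  Functions in the GNU Regex library
    GNU-specific interface
      Compiling a Regular Expression
      Matching
      Searching
      Matching in two strings
      Searching in two strings
      Compiling a fastmap for a Regular Expression
    POSIX-compatible interface
      Compiling a Regular Expression
      Searching
      Reporting errors
      Freeing a compiled Regular Expression buffer
    BSD-compatible interface
      Compiling a Regular Expression
      Searching
  Sample programs
    GNU-specific interface
    POSIX-compatible interface
    BSD-compatible interface
Appendix:
References: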

--------------------------------------------------------------------------------

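1. Background
Regular expressions can describe strings that are complex and hard to enumerate yet follow specific rules, so many UNIX tools support them, for example ex, vi, sed, awk, grep, and emacs. Besides these ready-made tools, there are also libraries that let programmers add regular expression support to their own programs with little effort. The Regex library published by GNU is one of them. This document explains how to use the GNU Regex library to give your own programs regular expression capabilities.
Before programming with the GNU Regex library, you need to know:

What is a Regular Expression?
What is the GNU Regex library, and what can it do?
How do you obtain and build the GNU Regex library?
The following sections briefly introduce the GNU Regex library on each of these points.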

--------------------------------------------------------------------------------
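1.1 What is a Regular Expression?
A Regular Expression is a text string that denotes the set of all strings "following some particular rule". For example, the Regular Expression "fo*" denotes the set of strings "f", "fo", "foo", "fooo", and so on. If a string A belongs to the set denoted by the Regular Expression "fo*", we say that "fo*" matches string A. For a detailed introduction to Regular Expressions, see the Regular Expression Introduction [2] from the ASPAC project of the Computing Center of Academia Sinica.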
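The idea of "a pattern denotes a set of strings" can be tried out with a few lines of C. The sketch below is only an illustration: it uses the POSIX functions regcomp() and regexec() from the system <regex.h> rather than the regex-0.12 sources discussed later, and the helper name full_match is made up for this example.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the whole of text matches pattern (POSIX extended
   syntax), 0 if it does not, and -1 if the pattern fails to compile. */
int full_match(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int ok;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    /* regexec() searches anywhere in text, so check that the match
       it found runs from the first character to the last one. */
    ok = regexec(&re, text, 1, &m, 0) == 0
         && m.rm_so == 0 && text[m.rm_eo] == '\0';
    regfree(&re);
    return ok;
}
```

With this helper, full_match("fo*", "fooo") reports a member of the set, while "foox" and "xfo" are rejected because part of the string falls outside the match.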

--------------------------------------------------------------------------------
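1.2 What is the GNU Regex library, and what can it do?
The GNU Regex library is a library developed by GNU for matching text strings against Regular Expressions. With it you can:

Match: test whether a string matches a Regular Expression.
Search: find, within a string, a substring that matches a Regular Expression.
The GNU Regex library consists mainly of two files, regex.c and regex.h. regex.c provides three groups of functions:
GNU-specific functions: the most powerful, but the interface was designed by GNU and is not compatible with the other two groups.
POSIX-compatible functions: less powerful; the interface is the same as POSIX.
BSD-compatible functions: the fewest features; the interface is the same as Berkeley UNIX.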

--------------------------------------------------------------------------------
1.3 How do you obtain and build the GNU Regex library?
The GNU Regex library can be downloaded from public ftp servers, for example:

ftp://phi.sinica.edu.tw/pub/GNU/gnu/regex-0.12.tar.gz or
ftp://prep.ai.mit.edu/pub/gnu/regex-0.12.tar.gz

Note that GNU also has a separate Rx library, a newer regular expression library with the POSIX.2 standard interface. It can likewise be downloaded from public ftp servers, for example:

ftp://phi.sinica.edu.tw/pub/GNU/gnu/rx-1.0.tar.gz or
ftp://prep.ai.mit.edu/pub/gnu/rx-1.0.tar.gz

The GNU Rx functions and the POSIX-compatible functions in GNU Regex have the same names, the same calling interface, and the same number of functions, but they work very differently internally. Since this document is about using the GNU Regex library, the GNU Rx library is not covered.

Once you have the GNU Regex library you can build it. The build takes five steps:

1. Get regex-0.12.tar.gz and unpack it
   (zcat regex-0.12.tar.gz | tar -xvf - ).
2. Run the configure script to set up the build environment.
3. Run make to compile the library.
4. Run make check to verify that the library was compiled correctly.
5. Run make install to install the library and its documentation on the system.
After the GNU Regex library is installed correctly, you can start programming with it.
--------------------------------------------------------------------------------
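2. Regular Expression syntax
Before using the GNU Regex library it helps to understand how Regular Expressions work, and for that you need to know their syntax. This chapter gives a brief introduction to the syntax and to the ways it can be controlled.

As mentioned earlier, a Regular Expression is a text string that denotes a set of strings "following some particular rule"; this text string is called the "pattern string". The contents of a pattern string fall into two broad categories:

Characters: ordinary characters, which mean exactly what they literally are.
Operators: special characters, which stand for some particular rule.
For example, the pattern string "foo" denotes the string "foo", while the pattern string "[Ff]oo" denotes the string "Foo" or the string "foo". Whether a special character is taken literally depends on two things:
The syntax settings: the program controls how the GNU Regex functions behave by setting the syntax bits of the syntax variable.
The combination of characters in the pattern string: for example, when the special character * is preceded by the special character \ , the * is taken literally.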

--------------------------------------------------------------------------------
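2.1 Syntax bits
The syntax variable controls how the GNU Regex functions treat the special characters in a pattern string, so it is a setting made inside the program. To use it, set the variable re_syntax_options to the desired combination of syntax bits before calling the GNU Regex functions. For a usage example, see the function examples in section 4.1.1.

Because the syntax settings change what the special characters do and how they are written, they are worth understanding well. The following syntax bits of the GNU Regex library control the behavior of the special characters: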

RE_BACKSLASH_ESCAPE_IN_LISTS: when on, \ is an escape character inside a list; otherwise it is an ordinary character there. For example, when on, "[\.]" means "[.]"; when off, "[\.]" is a list containing both \ and . .
RE_BK_PLUS_QM: when on, \+ and \? are the operators for "one or more repetitions" and "zero or one repetition"; otherwise + and ? are.
RE_CHAR_CLASSES: when on, character classes can be used inside a list; otherwise they cannot.
RE_CONTEXT_INDEP_ANCHORS: when on, ^ and $ are always special characters outside a list.
RE_CONTEXT_INDEP_OPS: when on, * , + and ? are always special characters outside a list.
RE_CONTEXT_INVALID_OPS: when on, the repetition operators ( * , + , ? ) are invalid in certain positions, such as at the very start of the expression, right after the ^ anchor, or right after an open-group operator.
RE_DOT_NEWLINE: when on, the match-any-character operator . can match the newline character.
RE_DOT_NOT_NULL: when on, the match-any-character operator . cannot match the null character (NUL).
RE_INTERVALS: when on, the interval operators can be used, for example "fo{4}" for an exact repetition count or "fo{2,4}" for a range of repetition counts.
RE_LIMITED_OPS: when on, the "one or more" operator + and the "zero or one" operator ? are not available.
RE_NEWLINE_ALT: when on, a newline character in the regular expression is the "or" (alternation) operator.
RE_NO_BK_BRACES: when on, { and } are the interval operators; otherwise \{ and \} are.
RE_NO_BK_PARENS: when on, ( and ) are the group operators; otherwise \( and \) are.
RE_NO_BK_REFS: when on, the back-reference operator "\digit" is not available.
RE_NO_BK_VBAR: when on, | is the "or" operator; otherwise \| is.
RE_NO_EMPTY_RANGES: when on, a range whose end point is below its start point is invalid.
RE_UNMATCHED_RIGHT_PAREN_ORD: when on, a close-group operator ( ")" or "\)" depending on RE_NO_BK_PARENS ) that has no matching open-group operator is treated as an ordinary character that matches ")".
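To see a syntax bit change the meaning of a pattern, the following sketch compiles the same pattern under two different settings. It rests on several assumptions: it targets glibc, whose <regex.h> exposes the GNU interface when _GNU_SOURCE is defined (with the regex-0.12 sources you would include that package's regex.h instead); it calls re_set_syntax(), which stores its argument in re_syntax_options; and the helper name match_with_syntax and its -3 error value are invented for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* must come before any #include on glibc */
#endif
#include <regex.h>
#include <string.h>

/* Compiles pattern under the given syntax bits and matches it at the
   start of text. Returns the number of characters matched, -1 if
   there is no match, and -3 (a value chosen for this sketch) if the
   pattern does not compile. */
int match_with_syntax(reg_syntax_t syntax, const char *pattern,
                      const char *text)
{
    struct re_pattern_buffer buf;
    int n;

    memset(&buf, 0, sizeof buf);  /* buffer, allocated, fastmap, translate = 0 */
    re_set_syntax(syntax);        /* same effect as assigning re_syntax_options */
    if (re_compile_pattern(pattern, strlen(pattern), &buf) != NULL)
        return -3;
    n = re_match(&buf, text, (int)strlen(text), 0, NULL);
    regfree(&buf);
    return n;
}
```

With no bits set, match_with_syntax(0, "fo+", "fooo") treats + as an operator; with RE_BK_PLUS_QM the same pattern only matches the literal text "fo+", and the operator has to be written as "fo\+" (typed as "fo\\+" in C source).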

--------------------------------------------------------------------------------
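2.2 Predefined syntaxes
Based on the syntax bits above, the GNU Regex header file regex.h (see the appendix) defines a number of predefined, composite syntax values that programmers can use directly. These predefined values also show which special characters of the Regular Expression syntax various GNU tools accept.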

--------------------------------------------------------------------------------
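2.3 Using the special character \
The special character \ plays an important role in regular expressions. It can have four meanings:

Itself: for example, when RE_BACKSLASH_ESCAPE_IN_LISTS is not set, [\] means \ .
Quoting the next special character: for example, \. means the character . itself, so . no longer means "match any character".
Introducing another (GNU) operator: for example \b , \B , \< , \> , \w , \W , \` , and \' . See the GNU operators in section 3.2.
Nothing at all: in most other cases \ is simply ignored, for example \n means n .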
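The first two meanings are easy to check in code. The sketch below is an illustration only: it uses the POSIX regcomp() and regexec() with the basic syntax, in which \ is an ordinary character inside a list, and the helper name found is made up. Remember that a \ in the pattern must be doubled inside a C string literal.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if pattern (POSIX basic syntax) is found somewhere in
   text, and 0 if it is not found or the pattern does not compile. */
int found(const char *pattern, const char *text)
{
    regex_t re;
    int ok;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return 0;
    ok = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

Here found("a.b", "axb") succeeds because . matches any character, while found("a\\.b", "axb") fails and found("a\\.b", "a.b") succeeds because \. stands for a literal dot.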

--------------------------------------------------------------------------------
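3. Regular Expression operators
A Regular Expression works through the ordinary and special characters in its pattern string, so this chapter briefly introduces these building blocks. In the GNU Regex library they fall into three classes:

Common: general operators.
GNU: operators specific to GNU Regex.
Emacs: operators specific to Emacs.
The following sections briefly introduce each of the three classes.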

--------------------------------------------------------------------------------
3.1 Common
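This class is mainly used by programs that follow the POSIX standard for Regular Expressions, but it can also be used in GNU Regex programs. Most operators in this class can be written in two ways:

As a single special symbol.
As a single special symbol preceded by a \ .
Whether the \ is needed depends on how the syntax bits are set; see section 2.1 for how to set them. The operators in this class are:
Ordinary characters that match themselves: all characters other than . , * , ? , + , ^ , $ , | , \ , [ , ] , { , } , ( , ) , < , > . For example, "here" means "here".
Match any character: . . For example, "foo." means "foo" followed by any one character.
Match zero or more repetitions: * . For example, "foo*" denotes the set "fo", "foo", "fooo", ...
Match one or more repetitions: + or \+ . For example, "foo+" denotes the set "foo", "fooo", "foooo", ...
Match zero or one repetition: ? or \? . For example, "foo?" denotes the set "fo", "foo".
Match a given repetition count or range of counts: {count} , {min,} , or {min,max} . For example, "foo{2}" denotes "fooo".
Alternation ("or"): | or \| . For example, "foo|FOO" means "foo" or "FOO".
List: [...] or [^...] . For example, "[fo]" means "f" or "o", and "[^fo]" means any character that is neither "f" nor "o".
Character class: [: ... :] , used inside a list as in "[[:digit:]]". The classes are:
[:alnum:] letters or digits, that is a-z , A-Z , 0-9
[:alpha:] letters, that is a-z , A-Z
[:blank:] space or tab
[:cntrl:] control characters; in ASCII, the characters with codes below 040 (octal) and the character with code 0177
[:digit:] digits, that is 0-9
[:graph:] printable characters other than space
[:lower:] lowercase letters, that is a-z
[:print:] printable characters
[:punct:] characters that are neither [:cntrl:] nor [:alnum:]
[:space:] space, tab, newline, carriage return, vertical tab, or form feed
[:upper:] uppercase letters, that is A-Z
[:xdigit:] hexadecimal digits, that is 0-9 , a-f , A-F
Range: - . For example, "[2-4]" means "2", "3", or "4".
Group: (...) or \(...\) . For example, "(ab)+" means "ab", "abab", "ababab", ...
Back-reference to an earlier group: \digit . For example, "(a)b\1c\1" means "abaca".
Match the beginning of a line: ^ . For example, "^foo" means a "foo" at the beginning of a line.
Match the end of a line: $ . For example, "foo$" means a "foo" at the end of a line.
For a detailed treatment of Regular Expressions, see the Regular Expression Introduction [2] from the ASPAC project of the Computing Center of Academia Sinica.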
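The operators above can be exercised with a small helper that reports how many characters the leftmost match covers. This is a sketch under assumptions: it uses the POSIX regcomp() and regexec() from the system <regex.h>, where the extended syntax (REG_EXTENDED) writes the operators without \ and the basic syntax (cflags 0) writes groups as \( \); the helper name match_len is invented for this example.

```c
#include <regex.h>
#include <stddef.h>

/* Returns the length of the leftmost match of pattern in text, or -1
   if there is no match or the pattern does not compile. cflags is
   passed to regcomp(): REG_EXTENDED, or 0 for the basic syntax. */
int match_len(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    regmatch_t m;
    int len = -1;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    if (regexec(&re, text, 1, &m, 0) == 0)
        len = (int)(m.rm_eo - m.rm_so);
    regfree(&re);
    return len;
}
```

For instance, match_len("foo?", REG_EXTENDED, "fo") covers 2 characters, match_len("foo+", REG_EXTENDED, "fo") finds nothing, and the back-reference example is written in the basic syntax as match_len("\\(a\\)b\\1c\\1", 0, "abaca").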

--------------------------------------------------------------------------------
3.2 GNU
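This class is mainly used in GNU Regex programs; programs that follow only the POSIX standard cannot use it.

Match the empty string at a word boundary: \b . For example, searching the string "there and here" with "\bhere\b" finds the whole word "here", not the "here" inside "there".
Match the empty string inside a word: \B . For example, searching the string "there and here" with "\Bhere" finds the "here" inside "there", not the whole word "here".
Match the empty string at the beginning of a word: \< . For example, searching the string "there and here" with "\<here" finds only the whole word "here", not the "here" inside "there".
Match the empty string at the end of a word: \> . For example, searching the string "there and here" with "here\>" first finds the "here" inside "there".
Match any word-constituent character: \w . For example, searching the string "?here and there" with "\where" finds "there".
Match any character that is not word-constituent: \W . For example, searching the string "?here and there" with "\Where" finds "?here".
Match the beginning of the buffer: \` . For example, searching the strings "there and here" and "here and there" with "\`there" finds only the "there" at the start of "there and here".
Match the end of the buffer: \' . For example, searching the strings "there and here" and "here and there" with "there\'" finds only the "there" at the end of "here and there".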
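The examples in this list can be checked by asking where the first match starts. The following sketch assumes glibc, whose regcomp() accepts these GNU operators as an extension even through the POSIX interface; other C libraries may reject or ignore them. The helper name find_start is made up for this example.

```c
#include <regex.h>
#include <stddef.h>

/* Returns the offset at which the leftmost match of pattern starts in
   text, or -1 if there is no match or the pattern does not compile. */
int find_start(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int start = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, &m, 0) == 0)
        start = (int)m.rm_so;
    regfree(&re);
    return start;
}
```

In "there and here" the whole word "here" starts at offset 10 and the "here" inside "there" at offset 1, so find_start("\\bhere\\b", "there and here") and find_start("\\Bhere", "there and here") are expected to give different offsets.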

--------------------------------------------------------------------------------
3.3 Emacs
This class can be used only in GNU Regex programs, not in programs that follow only the POSIX standard. In addition, the preprocessor symbol emacs must be defined when the GNU Regex library is built; the operators exist only in a library that was rebuilt with this symbol defined for the C compiler. To define it, edit the GNU Regex Makefile and append -Demacs to the end of the DEFS line.

To use the operators in this class, the program must also set re_syntax_table to an Emacs syntax table. Emacs syntax is much more complicated than Regex syntax, so interested readers should consult the section on syntax in the "GNU Emacs User's Manual"; it is not covered here.

--------------------------------------------------------------------------------
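4. Programming with the GNU Regex library
The GNU Regex library offers three groups of interface functions: the GNU interface, the POSIX-compatible interface, and the BSD-compatible interface. Whichever group you use, the main programming flow has three steps:

Prepare the Regular Expression pattern string
Compile the pattern string into a pattern buffer
Match or search with the pattern buffer
In pseudo code, this main flow looks like: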
  Program Regex:
  {
    Setup regular expression string;
    Initialize regular expression pattern buffer;

    do ( compile regular expression into pattern buffer );
    if ( compiling successfully ) {
      do ( match or search string with pattern buffer );
      if ( matching or searching successfully ) {
        report matching or searching succeeded;
      }
      else {
        report matching or searching failed;
      }
    }
    else {
      report compiling error;
    }
  }
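The flow above can be sketched in C with the GNU interface. The sketch makes several assumptions: it targets glibc, whose <regex.h> exposes the GNU functions when _GNU_SOURCE is defined (with the regex-0.12 sources you would include that package's regex.h); it sets the syntax with re_set_syntax(); and the helper name search_start and its -3 error value are invented here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* must come before any #include on glibc */
#endif
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Compile pattern, search text, and return the offset of the first
   match, -1 if nothing matches, or -3 (chosen for this sketch) if the
   pattern does not compile. */
int search_start(const char *pattern, const char *text)
{
    struct re_pattern_buffer buf;
    const char *err;
    int len = (int)strlen(text);
    int n;

    /* Initialize the pattern buffer and choose the syntax. */
    memset(&buf, 0, sizeof buf);
    re_set_syntax(RE_SYNTAX_EGREP);

    /* Compile the regular expression into the pattern buffer. */
    err = re_compile_pattern(pattern, strlen(pattern), &buf);
    if (err != NULL) {
        fprintf(stderr, "compile error: %s\n", err);
        return -3;
    }

    /* Search the whole string, starting at offset 0. */
    n = re_search(&buf, text, len, 0, len, NULL);
    regfree(&buf);
    return n;
}
```

A call such as search_start("[Ff]oo", "a Foo here") reports the offset of "Foo", while an unterminated list like "[Ff" takes the compile-error branch.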
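Section 4.1 below describes what each of the three interfaces does and how to use it, with a short usage example for every function. Section 4.2 then lists complete sample programs and test results for the three interfaces.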

--------------------------------------------------------------------------------
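4.1 Functions in the GNU Regex library
The GNU Regex library has three interfaces: one designed specifically by GNU, one compatible with POSIX, and one compatible with Berkeley UNIX. They are described in three parts below. Section 1.2 only touched on their strengths and weaknesses, so a fuller comparison follows.


[GNU-specific functions:]
The GNU-specific group has the most functions, and its functionality is divided most finely. Its advantage is that a search reports not only whether it succeeded but also where the pattern starts in the searched string. It can also match or search across two strings at once, which helps find a pattern that spans two lines of a text. Its drawbacks are that the functions take more parameters and need more setup, so beginners find it harder to get started, and because the interface is specific to GNU it is the least portable of the three.
[POSIX-compatible functions:]
The POSIX-compatible group has fewer functions than the GNU-specific group. Unlike the GNU search functions, its search function returns only a success or failure code rather than the starting position of the match; the position is available only through the optional pmatch array of regexec(). Because the interface follows the POSIX standard, it is the most portable of the three.
[BSD-compatible functions:]
The BSD-compatible group has the fewest functions of the three, only two. Its search function reports only success or failure and cannot tell where the pattern starts in the searched string. On the other hand, its functions take the fewest parameters, need little setup, and use an internal buffer that the user never has to manage, so beginners find the BSD-compatible group the easiest to use.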

--------------------------------------------------------------------------------
4.1.1 GNU-specific interface
The GNU interface has six Regex functions:

Compiling a Regular Expression: re_compile_pattern()
Matching: re_match()
Searching: re_search()
Matching in two strings: re_match_2()
Searching in two strings: re_search_2()
Compiling a fastmap for a Regular Expression: re_compile_fastmap()
These functions are described below:
4.1.1.1. Compiling a Regular Expression

The first step in using the GNU interface of the regex library is to compile the Regular Expression string into a pattern buffer that the regex functions can use. The pattern buffer structure is defined in regex.h; see the appendix.

[Interface:]
char *re_compile_pattern(const char *regex,
                         const int regex_size,
                         struct re_pattern_buffer *pattern_buffer)
[Parameters:]
regex is the address of the regular expression to compile, regex_size is the length of regex, and pattern_buffer is the address of the buffer that will receive the compiled Regular Expression.
[Return value:]
"0" (a null pointer) means the compilation succeeded; a non-empty string is the error message of a failed compilation.
[Notes:]
Before compiling a Regular Expression you must at least initialize the translate, fastmap, buffer, and allocated fields of the pattern buffer structure.
[Example:]
The following example compiles the regular expression "[Ff]oo":
/*
   regex_pattern  : the Regular Expression string
   pattern_buffer : pattern buffer structure used by the regex functions
   errcode        : result string of the compilation
*/
  const char regex_pattern[7] = "[Ff]oo";
  struct re_pattern_buffer pattern_buffer;
  const char *errcode;

/* initialize the pattern buffer structure */
  pattern_buffer.allocated = 0;
  pattern_buffer.buffer = 0;
  pattern_buffer.fastmap = 0;
  pattern_buffer.translate = 0;
/* choose the syntax */
  re_syntax_options = RE_SYNTAX_EGREP;

/* compile the Regular Expression */

errcode = re_compile_pattern( regex_pattern,
                              strlen(regex_pattern),
                              &pattern_buffer);
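4.1.1.2. Matching
Matching means applying a regular expression to a string starting at a given position and seeing how long a substring matches from there.

[Interface:]
int re_match(struct re_pattern_buffer *pattern_buffer,
             const char *string,
             const int size,
             const int start,
             struct re_registers *regs)
[Parameters:]
pattern_buffer is the address of the compiled Regular Expression buffer, string is the string to match against, size is the length of that string, start is the position in the string at which matching begins, and regs receives the match information recorded during matching.
[Return value:]
"-1" means the match failed, "-2" means an internal error, and a non-negative integer is the number of characters matched.
[Notes:]
The string may contain newline and null characters. If start is not between 0 and size, the match always fails.
[Example:]
The following example matches with the regular expression "[Ff]oo":
/*
   n              : result of the match
   textstring     : the string to be matched against "[Ff]oo"
   regs           : match information recorded during matching
   pattern_buffer : the compiled pattern buffer structure for "[Ff]oo"
*/
  int n;
  const char *textstring;
  struct re_registers regs;
  struct re_pattern_buffer pattern_buffer;

/* match starting at the beginning of textstring */

n = re_match( &pattern_buffer,
              textstring,
              strlen(textstring),
              0,
              &regs);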
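4.1.1.3. Searching
Searching means applying a regular expression to a string starting at a given position and looking for a substring that matches; if one is found, its starting position is returned.

[Interface:]
int re_search(struct re_pattern_buffer *pattern_buffer,
              const char *string,
              const int size,
              const int start,
              const int range,
              struct re_registers *regs)
[Parameters:]
pattern_buffer is the address of the compiled Regular Expression buffer, string is the string to search, size is the length of that string, start is the position in the string at which the search begins, range is the extent of the search, and regs receives the match information recorded during the search.
[Return value:]
"-1" means the search failed, "-2" means an internal error, and a non-negative integer is the starting position of the matching substring.
[Notes:]
range may be negative, which means the search runs backward from start. If start is not between 0 and size, the search always fails.
[Example:]
The following example searches with the regular expression "[Ff]oo":
/*
   n              : result of the search
   textstring     : the string to be searched for "[Ff]oo"
   regs           : match information recorded during the search
   pattern_buffer : the compiled pattern buffer structure for "[Ff]oo"
*/
  int n;
  const char *textstring;
  struct re_registers regs;
  struct re_pattern_buffer pattern_buffer;

/* search from the beginning of textstring to its end */

n = re_search( &pattern_buffer,
               textstring,
               strlen(textstring),
               0,
               strlen(textstring),
               &regs);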
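4.1.1.4. Matching in two strings
Matching in two strings is like the matching described above, except that it works on two strings at once, as if they were joined end to end.

[Interface:]
int re_match_2(struct re_pattern_buffer *pattern_buffer,
               const char *string1,
               const int size1,
               const char *string2,
               const int size2,
               const int start,
               struct re_registers *regs,
               const int stop)
[Parameters:]
pattern_buffer is the address of the compiled Regular Expression buffer, string1 is the first string to match against, size1 is the length of the first string, string2 is the second string, size2 is the length of the second string, start is the position at which matching begins, regs receives the match information recorded during matching, and stop is the position at which matching ends.
[Return value:]
"-1" means the match failed, "-2" means an internal error, and a non-negative integer is the number of characters matched.
[Notes:]
Same as in section 4.1.1.2.
[Example:]
The following example matches with the regular expression "[Ff]oo":
/*
   n              : result of the match
   string1        : the first string to be matched against "[Ff]oo"
   string2        : the second string to be matched against "[Ff]oo"
   regs           : match information recorded during matching
   pattern_buffer : the compiled pattern buffer structure for "[Ff]oo"
*/
  int n;
  const char *string1;
  const char *string2;
  struct re_registers regs;
  struct re_pattern_buffer pattern_buffer;

/* match from the beginning of string1 to the end of string2 */

n = re_match_2( &pattern_buffer,
                string1,
                strlen(string1),
                string2,
                strlen(string2),
                0,
                &regs,
                strlen(string1)+strlen(string2));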
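4.1.1.5. Searching in two strings
Searching in two strings is like the searching described above, except that it works on two strings at once, as if they were joined end to end.

[Interface:]
int re_search_2(struct re_pattern_buffer *pattern_buffer,
                const char *string1,
                const int size1,
                const char *string2,
                const int size2,
                const int start,
                const int range,
                struct re_registers *regs,
                const int stop)
[Parameters:]
pattern_buffer is the address of the compiled Regular Expression buffer, string1 is the first string to search, size1 is the length of the first string, string2 is the second string to search, size2 is the length of the second string, start is the position at which the search begins, range is the extent of the search, regs receives the match information recorded during the search, and stop is the position at which matching ends.
[Return value:]
"-1" means the search failed, "-2" means an internal error, and a non-negative integer is the starting position of the matching substring.
[Notes:]
Same as in section 4.1.1.3.
[Example:]
The following example searches with the regular expression "[Ff]oo":
/*
   n              : result of the search
   string1        : the first string to be searched for "[Ff]oo"
   string2        : the second string to be searched for "[Ff]oo"
   regs           : match information recorded during the search
   pattern_buffer : the compiled pattern buffer structure for "[Ff]oo"
*/
  int n;
  const char *string1;
  const char *string2;
  struct re_registers regs;
  struct re_pattern_buffer pattern_buffer;

/* search from the beginning of string1 to the end of string2 */

n = re_search_2( &pattern_buffer,
                 string1,
                 strlen(string1),
                 string2,
                 strlen(string2),
                 0,
                 strlen(string1)+strlen(string2),
                 &regs,
                 strlen(string1)+strlen(string2));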
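A typical use of the two-string functions is a pattern that straddles the boundary between two buffers. The sketch below illustrates this under assumptions: it targets glibc with _GNU_SOURCE defined (with regex-0.12 you would include that package's regex.h), uses re_set_syntax() to pick the syntax, and the helper name search_two and its -3 error value are invented for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* must come before any #include on glibc */
#endif
#include <regex.h>
#include <string.h>

/* Searches s1 followed by s2 as if they were one string. Returns the
   offset of the first match counted from the start of s1, -1 if
   nothing matches, or -3 (chosen for this sketch) if the pattern does
   not compile. */
int search_two(const char *pattern, const char *s1, const char *s2)
{
    struct re_pattern_buffer buf;
    int len1 = (int)strlen(s1);
    int len2 = (int)strlen(s2);
    int n;

    memset(&buf, 0, sizeof buf);
    re_set_syntax(RE_SYNTAX_EGREP);
    if (re_compile_pattern(pattern, strlen(pattern), &buf) != NULL)
        return -3;
    /* start at 0, try every position, stop at the end of s2 */
    n = re_search_2(&buf, s1, len1, s2, len2,
                    0, len1 + len2, NULL, len1 + len2);
    regfree(&buf);
    return n;
}
```

With s1 = "xx foo" and s2 = " bar yy", the pattern "foo bar" is found even though neither string contains it on its own; an offset smaller than strlen(s1) means the match begins in the first string.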
4.1.1.6. Compiling a fastmap for a Regular Expression
When searching a very long string, it is best to compile a fastmap for the Regular Expression; otherwise the search will be slow.

[Interface:]
int re_compile_fastmap(struct re_pattern_buffer *pattern_buffer)
[Parameters:]
pattern_buffer is the address of the compiled Regular Expression buffer.
[Return value:]
"0" means the fastmap was compiled successfully; "-2" means an internal error.
[Notes:]
The fastmap of a pattern buffer needs to be initialized only once.
[Example:]
The following example compiles the regular expression "[Ff]oo" with a fastmap:
/*
   regex_pattern  : the Regular Expression string
   pattern_buffer : pattern buffer structure used by the regex functions
   errcode        : result string of the compilation
   fastmap        : storage used by the fastmap
   n              : result of compiling the fastmap
*/
  const char regex_pattern[7] = "[Ff]oo";
  struct re_pattern_buffer pattern_buffer;
  const char *errcode;
  char fastmap[1 << 8];
  int n;

/* initialize the pattern buffer structure */
  pattern_buffer.allocated = 0;
  pattern_buffer.buffer = 0;
  pattern_buffer.fastmap = fastmap;
  pattern_buffer.translate = 0;

/* choose the syntax */
  re_syntax_options = RE_SYNTAX_EGREP;

/* compile the Regular Expression */

errcode = re_compile_pattern( regex_pattern,
                              strlen(regex_pattern),
                              &pattern_buffer);
/* compile the fastmap */
n = re_compile_fastmap( &pattern_buffer );
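The fragment above stops after building the fastmap; a complete round trip also has to search and clean up. The sketch below shows one way to do that under assumptions: it targets glibc with _GNU_SOURCE defined, and the helper name search_with_fastmap and its -3 error value are invented. One detail deserves care: regfree() releases the fastmap with free(), so a fastmap that lives on the stack has to be detached from the pattern buffer first.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* must come before any #include on glibc */
#endif
#include <regex.h>
#include <string.h>

/* Searches text for pattern using a fastmap. Returns the offset of the
   first match, -1 if nothing matches, -2 on an internal error, or -3
   (chosen for this sketch) if the pattern does not compile. */
int search_with_fastmap(const char *pattern, const char *text)
{
    struct re_pattern_buffer buf;
    char fastmap[1 << 8];          /* one entry per possible byte value */
    int len = (int)strlen(text);
    int n;

    memset(&buf, 0, sizeof buf);
    buf.fastmap = fastmap;
    re_set_syntax(RE_SYNTAX_EGREP);
    if (re_compile_pattern(pattern, strlen(pattern), &buf) != NULL)
        return -3;

    n = re_compile_fastmap(&buf);
    if (n == 0)
        n = re_search(&buf, text, len, 0, len, NULL);

    buf.fastmap = NULL;   /* stack storage: keep regfree() from freeing it */
    regfree(&buf);
    return n;
}
```

The fastmap lets the search skip starting positions whose first character cannot begin a match, which is where the speed-up on long strings comes from.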

--------------------------------------------------------------------------------
4.1.2 POSIX-compatible interface
The POSIX-compatible interface of GNU Regex has four functions:

Compiling a Regular Expression: regcomp()
Searching: regexec()
Reporting errors: regerror()
Freeing a compiled Regular Expression buffer: regfree()
These functions are described below:
4.1.2.1. Compiling a Regular Expression

As with the GNU interface, the first step in using the POSIX-compatible Regex functions is to compile the Regular Expression. The pattern buffer structure used here, regex_t, is identical to the structure re_pattern_buffer of the GNU interface.

[Interface:]
int regcomp(regex_t *preg,
            const char *regex,
            int cflags)
[Parameters:]
preg is the address of the pattern buffer that will receive the compiled regular expression, regex is the address of the regular expression to compile, and cflags holds the compilation flags.
[Return value:]
"0" means the compilation succeeded; a non-zero value is the error code of a failed compilation.
[Notes:]
cflags can combine the following flags:
REG_EXTENDED: if set, the POSIX Extended Regular Expression syntax is used; otherwise the POSIX Basic Regular Expression syntax is used.
REG_ICASE: if set, differences in case are ignored.
REG_NOSUB: if set, substring matches are not recorded.
REG_NEWLINE: if set, . does not match newline, ^ matches at the position right after a newline, and $ matches at the position right before a newline.
[Example:]
The following example compiles the regular expression "[Ff]oo":
/*
   pattern_buffer : pattern buffer structure used by the regex functions
   regex          : the Regular Expression string
   cflags         : compilation flags
   errcode        : result code of the compilation
*/
  regex_t pattern_buffer;
  char regex[7] = "[Ff]oo";
  int cflags;
  int errcode;

/* set the compilation flag to REG_NEWLINE */
  cflags = REG_NEWLINE;
/* compile the regular expression */

errcode = regcomp( &pattern_buffer,
                   regex,
                   cflags);
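4.1.2.2. Searching
Once the Regular Expression has been compiled, you can search for the pattern. The POSIX-compatible search function offers far fewer features than the GNU search functions: you cannot ask it to start at a given position in the string, since it always starts at the beginning. Its return value says only whether a matching substring exists; the position of the match is not returned directly and has to be read from the pmatch array.

[Interface:]
int regexec(const regex_t *preg,
            const char *string,
            size_t nmatch,
            regmatch_t pmatch[],
            int eflags)
[Parameters:]
preg is the address of the compiled Regular Expression buffer, string is the string to search, nmatch is the number of elements in pmatch, pmatch is the array that receives the match information recorded during the search, and eflags holds the execution flags.
[Return value:]
"0" means the search succeeded; a non-zero value is the error code of a failed search.
[Notes:]
If the compilation flag REG_NOSUB was set, pmatch is ignored. There are two execution flags: REG_NOTBOL means the start of the string is not treated as the beginning of a line, so ^ does not match there; REG_NOTEOL means the end of the string is not treated as the end of a line, so $ does not match there.
[Example:]
The following example searches with the regular expression "[Ff]oo":
/*
   pattern_buffer : the compiled pattern buffer structure for "[Ff]oo"
   text           : the string to be searched for "[Ff]oo"
   eflag          : execution flags
   n              : result code of the search
*/
  regex_t pattern_buffer;
  char *text;
  int eflag;
  int n;

/* set no execution flags */
  eflag = 0;
/* search without recording any match information */

n = regexec(&pattern_buffer,
            text,
            0,
            0,
            eflag);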
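The example above passes 0 for nmatch and pmatch and therefore learns only whether a match exists. When the pmatch array is supplied, regexec() also reports where the whole match and each parenthesized group lie. The following sketch shows this with the POSIX functions from the system <regex.h>; the helper name group_span and the limit MAX_GROUPS are made up for this example.

```c
#include <regex.h>
#include <stddef.h>

#define MAX_GROUPS 10   /* arbitrary limit chosen for this sketch */

/* Returns the start and end offsets of the given group of the first
   match of pattern (POSIX extended syntax) in text. Group 0 is the
   whole match. Both offsets are -1 if there is no match, the group did
   not take part in it, or the pattern does not compile. */
regmatch_t group_span(const char *pattern, const char *text, size_t group)
{
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    regmatch_t none;

    none.rm_so = -1;
    none.rm_eo = -1;
    if (group >= MAX_GROUPS)
        return none;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return none;
    if (regexec(&re, text, MAX_GROUPS, m, 0) != 0) {
        regfree(&re);
        return none;
    }
    regfree(&re);
    return m[group];   /* unused entries are filled with -1 by regexec() */
}
```

For the text "mail bob@example.org" and the pattern "([a-z]+)@([a-z.]+)", group 1 covers "bob" and group 2 covers "example.org"; rm_so is the offset of the first character and rm_eo the offset just past the last one.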
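4.1.2.3. Reporting errors
If an error occurs while compiling a Regular Expression or while searching, all you get is an error code, not a message string. To obtain the message string for an error code, call the error reporting function.

[Interface:]
size_t regerror(int errcode,
                const regex_t *preg,
                char *errbuf,
                size_t errbuf_size)
[Parameters:]
errcode is the error code, preg is the address of the Regular Expression buffer that caused the error, errbuf is the buffer that receives the error message, and errbuf_size is the size of that buffer.
[Return value:]
The size of the buffer needed to hold the error message.
[Notes:]
If you do not know how large the error buffer should be, pass a null errbuf and errbuf_size = 0 to the function; the return value then tells you the size needed.
[Example:]
The following example checks whether an error occurred while compiling the regular expression "[Ff]oo":
/*
   pattern_buffer : pattern buffer structure used by the regex functions
   regex          : the Regular Expression string
   cflags         : compilation flags
   errcode        : result code of the compilation
   buf            : error message of the compilation
*/
  regex_t pattern_buffer;
  int errcode;
  char buf[256];
  char regex[7] = "[Ff]oo";
  int cflags;

/* set the compilation flag to REG_NEWLINE */
  cflags = REG_NEWLINE;
/* compile the regular expression */

errcode = regcomp( &pattern_buffer,
                   regex,
                   cflags);
/* check whether the compilation failed */
if ( errcode != 0 ) {
  regerror( errcode,
            &pattern_buffer,
            buf,
            sizeof(buf));
  printf(" error : %s\n", buf);
}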
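The note about passing a null errbuf and a size of 0 leads to a two-call pattern: ask for the size first, then fetch the message into a buffer of exactly that size. The sketch below illustrates it with the POSIX functions from the system <regex.h>; the helper name compile_error is invented for this example.

```c
#include <regex.h>
#include <stdlib.h>

/* Tries to compile pattern. Returns NULL if it compiles (or if memory
   runs out), otherwise a malloc()ed error message that the caller
   must free(). */
char *compile_error(const char *pattern, int cflags)
{
    regex_t re;
    char *msg;
    size_t need;
    int code = regcomp(&re, pattern, cflags);

    if (code == 0) {
        regfree(&re);
        return NULL;
    }
    /* First call: no buffer, just ask how many bytes are needed,
       including the terminating null character. */
    need = regerror(code, &re, NULL, 0);
    msg = malloc(need);
    if (msg != NULL)
        regerror(code, &re, msg, need);   /* second call fills the buffer */
    return msg;
}
```

A valid pattern such as "[Ff]oo" yields NULL, while an unterminated list such as "[Ff" yields a message whose exact wording depends on the C library.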
4.1.2.4. Freeing a compiled Regular Expression buffer
When you have finished with the POSIX-compatible Regex functions and no longer need the pattern, call this function to release the memory used by the pattern buffer. The GNU interface functions can use it as well, because the POSIX pattern buffer structure regex_t and the GNU pattern buffer structure re_pattern_buffer are the same.

[Interface:]
void regfree(regex_t *preg)
[Parameters:]
preg is the address of the compiled Regular Expression buffer to free.
[Return value:]
None.
[Notes:]
If you want to search again after calling this function, you must compile the Regular Expression again.
[Example:]
The following example frees the compiled regular expression "[Ff]oo":
/*
  pattern_buffer : the compiled pattern buffer structure for "[Ff]oo"
*/
  regex_t pattern_buffer;

/* free the compiled regular expression pattern buffer */
  regfree(&pattern_buffer);

--------------------------------------------------------------------------------
4.1.3 BSD-compatible interface
The BSD-compatible interface has only two Regex functions and is very simple:

Compiling a Regular Expression: re_comp()
Searching: re_exec()
These two functions are described below:
4.1.3.1. Compiling a Regular Expression

Here too, the first step is to compile the Regular Expression string into a pattern buffer that the BSD-compatible regex functions can use. Because the BSD interface uses an internal pattern buffer, you do not have to set one up; simply pass the pattern string to the compile function.

[Interface:]
char *re_comp(char *regex)
[Parameters:]
regex is the address of the regular expression to compile.
[Return value:]
"0" (a null pointer) means the compilation succeeded; a non-empty string is the error message of a failed compilation.
[Notes:]
You can control the syntax used for compilation by setting re_syntax_options. Because this function uses an internal pattern buffer, only the most recently compiled pattern is kept. If the argument regex is a null string, re_comp leaves the pattern buffer unchanged.
[Example:]
The following example compiles the regular expression "[Ff]oo":
/*
   regex          : the Regular Expression string
   err            : result string of the compilation
*/
  char regex[7] = "[Ff]oo";
  const char *err;

/* set the syntax to RE_SYNTAX_GREP */
  re_syntax_options = RE_SYNTAX_GREP;
/* compile the regular expression */
  err = re_comp( regex );
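4.1.3.2. Searching
Once the Regular Expression has been compiled, you can search for the pattern. Because the BSD-compatible interface keeps its settings internally, you simply pass the string to search to the search function. The price of this simplicity is that you learn only whether the search succeeded, not where the matching substring starts.

[Interface:]
int re_exec(char *string)
[Parameters:]
string is the string to search.
[Return value:]
"1" means the search succeeded; "0" means it failed.
[Notes:]
This function automatically uses the GNU fastmap feature internally.
[Example:]
The following example searches with the regular expression "[Ff]oo":
/*
   text           : the string to be searched for "[Ff]oo"
   n              : result of the search
*/
  char *text;
  int n;

/* search */
  n = re_exec( text );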

--------------------------------------------------------------------------------
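4.2 Sample programs
This section gives a small sample program for each of the three interfaces, based on the functions described above. The test file (testfile) used by the sample programs contains: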

there and here
where ?
here and there
?here and there
The regular expression patterns tested are "here as a whole word at the end of a line" and "a space followed by ?". Besides matching the pattern, each test program also prints the lines that match.
4.2.1 GNU-specific interface

The GNU sample program takes the regular expression as a command-line argument, opens the test file, and tests re_match and re_search on each line, and re_match_2 and re_search_2 on each pair of consecutive lines. The source of the sample program gnu_regex_test.c is:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "regex.h"

int gnu_regex(char *regex_pattern, char *line1, char *line2)
{
  struct re_pattern_buffer pattern_buffer;
  struct re_registers regs;
  int n, len1, len2;
  const char *id;

  len1 = strlen(line1);
  len2 = strlen(line2);

  /* choose the regular expression syntax */
  re_syntax_options = RE_SYNTAX_EGREP | RE_INTERVALS | RE_BACKSLASH_ESCAPE_IN_LISTS;
  /* initialize the regular expression pattern buffer */
  pattern_buffer.allocated = 0;
  pattern_buffer.buffer = 0;
  pattern_buffer.fastmap = 0;
  pattern_buffer.translate = 0;
  /* compile the regular expression */
  id = re_compile_pattern( regex_pattern, strlen(regex_pattern), &pattern_buffer);
  /* check for a compilation error */
  if (id != NULL) {
     printf(" error on compiling regex1. code = %s\n", id);
     exit(1);
  }
  /* match in line1 and print the return value */
  n = re_match( &pattern_buffer, line1, len1, 0, &regs);
  printf(" re_match return = %d\n",n);
  /* match in line1 followed by line2 and print the return value */
  n = re_match_2( &pattern_buffer, line1, len1, line2, len2, 0, &regs, len1+len2);
  printf(" re_match_2 return = %d\n",n);
  /* search in line1 and print the return value */
  n = re_search( &pattern_buffer, line1, len1, 0, len1, &regs);
  printf(" re_search return = %d\n",n);
  if (n >= 0) printf(" re_search string = %s\n",line1);
  /* search in line1 followed by line2 and print the return value */
  n = re_search_2( &pattern_buffer, line1, len1, line2, len2, 0, len1+len2, &regs, len1+len2);
  if (n >= 0) {
    printf(" re_search_2 return = %d\n",n);
    if (n < len1) printf(" re_search_2 string = %s\n",line1);
    else printf(" re_search_2 string = %s\n",line2);
    return 1;
  }
  printf(" re_search_2 return = %d\n",n);
  return n;
}

int main(int argc, char **argv)
{
  FILE *fp;
  char line[2][1024];
  int i, n, k, j;

  /* check the number of arguments */
  if (argc != 3) {
    printf("Usage: %s pattern file\n",argv[0]);
    exit(1);
  }
  /* open the test file */
  fp = fopen(argv[2],"r");
  if (fp == NULL) {
    fprintf(stderr, "Can't open %s.\n", argv[2]);
    exit(1);
  }
  /* read lines from the test file and run the GNU interface tests;
     line[k] holds the previous line and line[n] the current one */
  j = 1;
  if (fgets(line[0], 1024, fp) == NULL) {
    /* empty file: nothing to test */
    fclose(fp);
    exit(1);
  }
  i = strlen(line[0]) - 1;
  if (i >= 0 && line[0][i] == '\n') { line[0][i] = '\0';}
  while (1) {
    n = j & 1;
    k = n ^ 1;
    if (fgets(line[n], 1024, fp) == NULL) {break;}
    j++;
    i = strlen(line[n]) - 1;
    if (i >= 0 && line[n][i] == '\n') { line[n][i] = '\0';}
    gnu_regex(argv[1], line[k], line[n]);
  }
  /* test the last line together with an empty second string */
  line[n][0] = '\0';
  gnu_regex(argv[1], line[k], line[n]);
  /* close the test file */
  fclose(fp);
  return 0;
}
Result of matching "here as a whole word at the end of a line":
% gnu_regex_test '\bhere$' testfile
re_match return = -1
re_match_2 return = -1
re_search return = 10
re_search string = there and here
re_search_2 return = -1
re_match return = -1
re_match_2 return = -1
re_search return = -1
re_search_2 return = -1
re_match return = -1
re_match_2 return = -1
re_search return = -1
re_search_2 return = -1
re_match return = -1
re_match_2 return = -1
re_search return = -1
re_search_2 return = -1
Result of matching "a space followed by ?":
% gnu_regex_test '[[:space:]]\?' testfile
re_match return = -1
re_match_2 return = -1
re_search return = -1
re_search_2 return = 19
re_search_2 string = where ?
re_match return = -1
re_match_2 return = -1
re_search return = 5
re_search string = where ?
re_search_2 return = 5
re_search_2 string = where ?
re_match return = -1
re_match_2 return = -1
re_search return = -1
re_search_2 return = -1
re_match return = -1
re_match_2 return = -1
re_search return = -1
re_search_2 return = -1
4.2.2 POSIX-compatible interface
The POSIX-compatible sample program takes the regular expression as a command-line argument, opens the test file, and tests regexec on each line. The source of the sample program posix_regex_test.c is:

#include <stdio.h>
#include <stdlib.h>
#include "regex.h"

/* initialize the pattern buffer */
void init_pattern_buffer(regex_t *pattern_buffer)
{
  pattern_buffer->buffer = NULL;
  pattern_buffer->allocated = 0;
  pattern_buffer->used = 0;
  pattern_buffer->fastmap = NULL;
  pattern_buffer->fastmap_accurate = 0;
  pattern_buffer->translate = NULL;
  pattern_buffer->can_be_null = 0;
  pattern_buffer->re_nsub = 0;
  pattern_buffer->no_sub = 0;
  pattern_buffer->not_bol = 0;
  pattern_buffer->not_eol = 0;
}

int test_posix(regex_t *pattern_buffer, char *regex, char *text)
{
  int cflags, eflag;
  int n;
  int id;
  char buf[256];

  /* initialize the regular expression pattern buffer */
  init_pattern_buffer(pattern_buffer);

  /* choose the regular expression syntax */
  cflags = REG_NEWLINE | REG_EXTENDED;
  /* compile the regular expression */
  id = regcomp( pattern_buffer, regex, cflags);
  /* check for a compilation error */
  if (id != 0) {
    printf(" error on compiling regex. code = %d\n", id);
    regerror( id, pattern_buffer, buf, sizeof(buf));
    printf(" error : %s\n", buf);
    exit(1);
  }
  /* set no execution flags */
  eflag = 0;
  /* search in text and print the line if it matches */
  n = regexec(pattern_buffer, text, 0, 0, eflag);
  if (n == 0) {
    printf(" regexec match string = %s\n",text);
  }
  /* free the pattern buffer, since it is compiled again for every line */
  regfree(pattern_buffer);
  return n;
}

int main(int argc, char **argv)
{
  FILE *fp;
  char line[1024];
  regex_t pattern_buffer;

  /* check the number of arguments */
  if (argc != 3) {
    printf("Usage: %s pattern file\n",argv[0]);
    exit(1);
  }
  /* open the test file */
  fp = fopen(argv[2],"r");
  if (fp == NULL) {
    fprintf(stderr, "Can't open %s.\n", argv[2]);
    exit(1);
  }

  /* read lines from the test file and run the POSIX interface test */
  while (fgets(line, 1024, fp) != NULL) {
     test_posix(&pattern_buffer,argv[1],line);
  }

  /* close the test file */
  fclose(fp);
  return 0;
}
Result of matching "here as a whole word at the end of a line":
% posix_regex_test '[[:space:]]here$' testfile
regexec match string = there and here
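Example run that matches "a space followed by ?":
% posix_regex_test '[[:space:]]?' testfile
regexec match string = where ?
Note that because the program compiles with REG_EXTENDED, an unescaped ? is the zero-or-one operator, so this pattern can also match lines that contain no question mark. Writing the pattern as '[[:space:]]\?' matches a literal ?.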
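4.2.3 BSD-compatible interface functions
This sample program for the BSD-compatible interface takes a regular expression as a command-line argument, opens the test file, and then runs re_exec against the file one line at a time to test for a match. The source of the sample program bsd_regex_test.c is as follows: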

#include <stdio.h>
#include <stdlib.h>
#include "regex.h"

int test_bsd(const char *regex, const char *text)
{
  int n;
  const char *id;

  /* Select grep-style syntax for the pattern. */
  re_syntax_options = RE_SYNTAX_GREP;
  /* Compile the regular expression. */
  id = re_comp(regex);
  /* re_comp returns NULL on success and an error message otherwise. */
  if (id != NULL) {
    printf(" error on compiling regex: %s\n", id);
    exit(1);
  }

  /* Search the string text and print it when it matches. */
  n = re_exec(text);
  if (n == 1) {
    printf(" re_exec match string = %s\n", text);
  }
  return n;
}

int main(int argc, char **argv)
{
  FILE *fp;
  char line[1024];

  /* Check the number of arguments. */
  if (argc != 3) {
    printf("Usage: %s pattern file\n", argv[0]);
    exit(1);
  }
  /* Open the test file. */
  fp = fopen(argv[2], "r");
  if (fp == NULL) {
    fprintf(stderr, "Can't open %s.\n", argv[2]);
    exit(1);
  }

  /* Read each line of the test file and test it through the BSD-compatible interface. */
  while (fgets(line, sizeof(line), fp) != NULL) {
    test_bsd(argv[1], line);
  }
  /* Close the test file. */
  fclose(fp);
  return 0;
}
Example run that matches "here as a separate word at the end of a line":
% bsd_regex_test '[[:space:]]here$' testfile
re_exec match string = there and here
Example run that matches "a space followed by ?":
% bsd_regex_test '[[:space:]]?' testfile
re_exec match string = where ?

--------------------------------------------------------------------------------
5. Appendix
regex.h

/* Definitions for data structures and routines for the regular
   expression library, version 0.12.

   Copyright (C) 1985, 1989, 1990, 1991, 1992, 1993 Free Software Foundation, Inc.

   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2, or (at your option)
   any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program; if not, write to the Free Software
   Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.  */

#ifndef __REGEXP_LIBRARY_H__
#define __REGEXP_LIBRARY_H__

/* POSIX says that <sys/types.h> must be included (by the caller) before
   <regex.h>.  */

#ifdef VMS
/* VMS doesn't have `size_t' in <sys/types.h>, even though POSIX says it
   should be there.  */
#include <stddef.h>
#endif

/* The following bits are used to determine the regexp syntax we
   recognize.  The set/not-set meanings are chosen so that Emacs syntax
   remains the value 0.  The bits are given in alphabetical order, and
   the definitions shifted by one from the previous bit; thus, when we
   add or remove a bit, only one other definition need change.  */
typedef unsigned reg_syntax_t;

/* If this bit is not set, then \ inside a bracket expression is literal.
   If set, then such a \ quotes the following character.  */
#define RE_BACKSLASH_ESCAPE_IN_LISTS (1)

/* If this bit is not set, then + and ? are operators, and \+ and \? are
     literals.
   If set, then \+ and \? are operators and + and ? are literals.  */
#define RE_BK_PLUS_QM (RE_BACKSLASH_ESCAPE_IN_LISTS << 1)

/* If this bit is set, then character classes are supported.  They are:
     [:alpha:], [:upper:], [:lower:],  [:digit:], [:alnum:], [:xdigit:],
     [:space:], [:print:], [:punct:], [:graph:], and [:cntrl:].
   If not set, then character classes are not supported.  */
#define RE_CHAR_CLASSES (RE_BK_PLUS_QM << 1)

/* If this bit is set, then ^ and $ are always anchors (outside bracket
     expressions, of course).
   If this bit is not set, then it depends:
        ^  is an anchor if it is at the beginning of a regular
           expression or after an open-group or an alternation operator;
        $  is an anchor if it is at the end of a regular expression, or
           before a close-group or an alternation operator.

   This bit could be (re)combined with RE_CONTEXT_INDEP_OPS, because
   POSIX draft 11.2 says that * etc. in leading positions is undefined.
   We already implemented a previous draft which made those constructs
   invalid, though, so we haven't changed the code back.  */
#define RE_CONTEXT_INDEP_ANCHORS (RE_CHAR_CLASSES << 1)

/* If this bit is set, then special characters are always special
     regardless of where they are in the pattern.
   If this bit is not set, then special characters are special only in
     some contexts; otherwise they are ordinary.  Specifically,
     * + ? and intervals are only special when not after the beginning,
     open-group, or alternation operator.  */
#define RE_CONTEXT_INDEP_OPS (RE_CONTEXT_INDEP_ANCHORS << 1)

/* If this bit is set, then *, +, ?, and { cannot be first in an re or
     immediately after an alternation or begin-group operator.  */
#define RE_CONTEXT_INVALID_OPS (RE_CONTEXT_INDEP_OPS << 1)

/* If this bit is set, then . matches newline.
   If not set, then it doesn't.  */
#define RE_DOT_NEWLINE (RE_CONTEXT_INVALID_OPS << 1)

/* If this bit is set, then . doesn't match NUL.
   If not set, then it does.  */
#define RE_DOT_NOT_NULL (RE_DOT_NEWLINE << 1)

/* If this bit is set, nonmatching lists [^...] do not match newline.
   If not set, they do.  */
#define RE_HAT_LISTS_NOT_NEWLINE (RE_DOT_NOT_NULL << 1)

/* If this bit is set, either \{...\} or {...} defines an
     interval, depending on RE_NO_BK_BRACES.
   If not set, \{, \}, {, and } are literals.  */
#define RE_INTERVALS (RE_HAT_LISTS_NOT_NEWLINE << 1)

/* If this bit is set, +, ? and | aren't recognized as operators.
   If not set, they are.  */
#define RE_LIMITED_OPS (RE_INTERVALS << 1)

/* If this bit is set, newline is an alternation operator.
   If not set, newline is literal.  */
#define RE_NEWLINE_ALT (RE_LIMITED_OPS << 1)

/* If this bit is set, then `{...}' defines an interval, and \{ and \}
     are literals.
  If not set, then `\{...\}' defines an interval.  */
#define RE_NO_BK_BRACES (RE_NEWLINE_ALT << 1)

/* If this bit is set, (...) defines a group, and \( and \) are literals.
   If not set, \(...\) defines a group, and ( and ) are literals.  */
#define RE_NO_BK_PARENS (RE_NO_BK_BRACES << 1)

/* If this bit is set, then \<digit> matches <digit>.
   If not set, then \<digit> is a back-reference.  */
#define RE_NO_BK_REFS (RE_NO_BK_PARENS << 1)

/* If this bit is set, then | is an alternation operator, and \| is literal.
   If not set, then \| is an alternation operator, and | is literal.  */
#define RE_NO_BK_VBAR (RE_NO_BK_REFS << 1)

/* If this bit is set, then an ending range point collating higher
     than the starting range point, as in [z-a], is invalid.
   If not set, then when ending range point collates higher than the
     starting range point, the range is ignored.  */
#define RE_NO_EMPTY_RANGES (RE_NO_BK_VBAR << 1)

/* If this bit is set, then an unmatched ) is ordinary.
   If not set, then an unmatched ) is invalid.  */
#define RE_UNMATCHED_RIGHT_PAREN_ORD (RE_NO_EMPTY_RANGES << 1)

/* This global variable defines the particular regexp syntax to use (for
   some interfaces).  When a regexp is compiled, the syntax used is
   stored in the pattern buffer, so changing this does not affect
   already-compiled regexps.  */
extern reg_syntax_t re_syntax_options;

/* Define combinations of the above bits for the standard possibilities.
   (The [[[ comments delimit what gets put into the Texinfo file, so
   don't delete them!)  */
/* [[[begin syntaxes]]] */
#define RE_SYNTAX_EMACS 0

#define RE_SYNTAX_AWK                                                   \
  (RE_BACKSLASH_ESCAPE_IN_LISTS | RE_DOT_NOT_NULL                       \
   | RE_NO_BK_PARENS            | RE_NO_BK_REFS                         \
   | RE_NO_BK_VBAR               | RE_NO_EMPTY_RANGES                   \
   | RE_UNMATCHED_RIGHT_PAREN_ORD)

#define RE_SYNTAX_POSIX_AWK                                             \
  (RE_SYNTAX_POSIX_EXTENDED | RE_BACKSLASH_ESCAPE_IN_LISTS)

#define RE_SYNTAX_GREP                                                  \
  (RE_BK_PLUS_QM              | RE_CHAR_CLASSES                         \
   | RE_HAT_LISTS_NOT_NEWLINE | RE_INTERVALS                            \
   | RE_NEWLINE_ALT)

#define RE_SYNTAX_EGREP                                                 \
  (RE_CHAR_CLASSES        | RE_CONTEXT_INDEP_ANCHORS                    \
   | RE_CONTEXT_INDEP_OPS | RE_HAT_LISTS_NOT_NEWLINE                    \
   | RE_NEWLINE_ALT       | RE_NO_BK_PARENS                             \
   | RE_NO_BK_VBAR)

#define RE_SYNTAX_POSIX_EGREP                                           \
  (RE_SYNTAX_EGREP | RE_INTERVALS | RE_NO_BK_BRACES)

/* P1003.2/D11.2, section 4.20.7.1, lines 5078ff.  */
#define RE_SYNTAX_ED RE_SYNTAX_POSIX_BASIC

#define RE_SYNTAX_SED RE_SYNTAX_POSIX_BASIC

/* Syntax bits common to both basic and extended POSIX regex syntax.  */
#define _RE_SYNTAX_POSIX_COMMON                                         \
  (RE_CHAR_CLASSES | RE_DOT_NEWLINE      | RE_DOT_NOT_NULL              \
   | RE_INTERVALS  | RE_NO_EMPTY_RANGES)

#define RE_SYNTAX_POSIX_BASIC                                           \
  (_RE_SYNTAX_POSIX_COMMON | RE_BK_PLUS_QM)

/* Differs from ..._POSIX_BASIC only in that RE_BK_PLUS_QM becomes
   RE_LIMITED_OPS, i.e., \? \+ \| are not recognized.  Actually, this
   isn't minimal, since other operators, such as \`, aren't disabled.  */
#define RE_SYNTAX_POSIX_MINIMAL_BASIC                                   \
  (_RE_SYNTAX_POSIX_COMMON | RE_LIMITED_OPS)

#define RE_SYNTAX_POSIX_EXTENDED                                        \
  (_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS                   \
   | RE_CONTEXT_INDEP_OPS  | RE_NO_BK_BRACES                            \
   | RE_NO_BK_PARENS       | RE_NO_BK_VBAR                              \
   | RE_UNMATCHED_RIGHT_PAREN_ORD)

/* Differs from ..._POSIX_EXTENDED in that RE_CONTEXT_INVALID_OPS
   replaces RE_CONTEXT_INDEP_OPS and RE_NO_BK_REFS is added.  */
#define RE_SYNTAX_POSIX_MINIMAL_EXTENDED                                \
  (_RE_SYNTAX_POSIX_COMMON  | RE_CONTEXT_INDEP_ANCHORS                  \
   | RE_CONTEXT_INVALID_OPS | RE_NO_BK_BRACES                           \
   | RE_NO_BK_PARENS        | RE_NO_BK_REFS                             \
   | RE_NO_BK_VBAR          | RE_UNMATCHED_RIGHT_PAREN_ORD)
/* [[[end syntaxes]]] */

/* Maximum number of duplicates an interval can allow.  Some systems
   (erroneously) define this in other header files, but we want our
   value, so remove any previous define.  */
#ifdef RE_DUP_MAX
#undef RE_DUP_MAX
#endif
#define RE_DUP_MAX ((1 << 15) - 1)

/* POSIX `cflags' bits (i.e., information for `regcomp').  */

/* If this bit is set, then use extended regular expression syntax.
   If not set, then use basic regular expression syntax.  */
#define REG_EXTENDED 1

/* If this bit is set, then ignore case when matching.
   If not set, then case is significant.  */
#define REG_ICASE (REG_EXTENDED << 1)

/* If this bit is set, then anchors do not match at newline
     characters in the string.
   If not set, then anchors do match at newlines.  */
#define REG_NEWLINE (REG_ICASE << 1)

/* If this bit is set, then report only success or fail in regexec.
   If not set, then returns differ between not matching and errors.  */
#define REG_NOSUB (REG_NEWLINE << 1)

/* POSIX `eflags' bits (i.e., information for regexec).  */

/* If this bit is set, then the beginning-of-line operator doesn't match
     the beginning of the string (presumably because it's not the
     beginning of a line).
   If not set, then the beginning-of-line operator does match the
     beginning of the string.  */
#define REG_NOTBOL 1

/* Like REG_NOTBOL, except for the end-of-line.  */
#define REG_NOTEOL (1 << 1)

/* If any error codes are removed, changed, or added, update the
   `re_error_msg' table in regex.c.  */
typedef enum
{
  REG_NOERROR = 0,      /* Success.  */
  REG_NOMATCH,          /* Didn't find a match (for regexec).  */

  /* POSIX regcomp return error codes.  (In the order listed in the
     standard.)  */
  REG_BADPAT,           /* Invalid pattern.  */
  REG_ECOLLATE,         /* Not implemented.  */
  REG_ECTYPE,           /* Invalid character class name.  */
  REG_EESCAPE,          /* Trailing backslash.  */
  REG_ESUBREG,          /* Invalid back reference.  */
  REG_EBRACK,           /* Unmatched left bracket.  */
  REG_EPAREN,           /* Parenthesis imbalance.  */
  REG_EBRACE,           /* Unmatched \{.  */
  REG_BADBR,            /* Invalid contents of \{\}.  */
  REG_ERANGE,           /* Invalid range end.  */
  REG_ESPACE,           /* Ran out of memory.  */
  REG_BADRPT,           /* No preceding re for repetition op.  */

  /* Error codes we've added.  */
  REG_EEND,             /* Premature end.  */
  REG_ESIZE,            /* Compiled pattern bigger than 2^16 bytes.  */
  REG_ERPAREN           /* Unmatched ) or \); not returned from regcomp.  */
} reg_errcode_t;

/* This data structure represents a compiled pattern.  Before calling
   the pattern compiler, the fields `buffer', `allocated', `fastmap',
   `translate', and `no_sub' can be set.  After the pattern has been
   compiled, the `re_nsub' field is available.  All other fields are
   private to the regex routines.  */

struct re_pattern_buffer
{
/* [[[begin pattern_buffer]]] */
        /* Space that holds the compiled pattern.  It is declared as
          `unsigned char *' because its elements are
           sometimes used as array indexes.  */
  unsigned char *buffer;

        /* Number of bytes to which `buffer' points.  */
  unsigned long allocated;

        /* Number of bytes actually used in `buffer'.  */
  unsigned long used;

        /* Syntax setting with which the pattern was compiled.  */
  reg_syntax_t syntax;

        /* Pointer to a fastmap, if any, otherwise zero.  re_search uses
           the fastmap, if there is one, to skip over impossible
           starting points for matches.  */
  char *fastmap;

        /* Either a translate table to apply to all characters before
           comparing them, or zero for no translation.  The translation
           is applied to a pattern when it is compiled and to a string
           when it is matched.  */
  char *translate;

        /* Number of subexpressions found by the compiler.  */
  size_t re_nsub;

        /* Zero if this pattern cannot match the empty string, one else.
           Well, in truth it's used only in `re_search_2', to see
           whether or not we should use the fastmap, so we don't set
           this absolutely perfectly; see `re_compile_fastmap' (the
           `duplicate' case).  */
  unsigned can_be_null : 1;

        /* If REGS_UNALLOCATED, allocate space in the `regs' structure
             for `max (RE_NREGS, re_nsub + 1)' groups.
           If REGS_REALLOCATE, reallocate space if necessary.
           If REGS_FIXED, use what's there.  */
#define REGS_UNALLOCATED 0
#define REGS_REALLOCATE 1
#define REGS_FIXED 2
  unsigned regs_allocated : 2;

        /* Set to zero when `regex_compile' compiles a pattern; set to one
           by `re_compile_fastmap' if it updates the fastmap.  */
  unsigned fastmap_accurate : 1;

        /* If set, `re_match_2' does not return information about
           subexpressions.  */
  unsigned no_sub : 1;

        /* If set, a beginning-of-line anchor doesn't match at the
           beginning of the string.  */
  unsigned not_bol : 1;

        /* Similarly for an end-of-line anchor.  */
  unsigned not_eol : 1;

        /* If true, an anchor at a newline matches.  */
  unsigned newline_anchor : 1;

/* [[[end pattern_buffer]]] */
};

typedef struct re_pattern_buffer regex_t;

/* search.c (search_buffer) in Emacs needs this one opcode value.  It is
   defined both in `regex.c' and here.  */
#define RE_EXACTN_VALUE 1

/* Type for byte offsets within the string.  POSIX mandates this.  */
typedef int regoff_t;

/* This is the structure we store register match data in.  See
   regex.texinfo for a full description of what registers match.  */
struct re_registers
{
  unsigned num_regs;
  regoff_t *start;
  regoff_t *end;
};

/* If `regs_allocated' is REGS_UNALLOCATED in the pattern buffer,
   `re_match_2' returns information about at least this many registers
   the first time a `regs' structure is passed.  */
#ifndef RE_NREGS
#define RE_NREGS 30
#endif

/* POSIX specification for registers.  Aside from the different names than
   `re_registers', POSIX uses an array of structures, instead of a
   structure of arrays.  */
typedef struct
{
  regoff_t rm_so;  /* Byte offset from string's start to substring's start.  */
  regoff_t rm_eo;  /* Byte offset from string's start to substring's end.  */
} regmatch_t;

/* Declarations for routines.  */

/* To avoid duplicating every routine declaration -- once with a
   prototype (if we are ANSI), and once without (if we aren't) -- we
   use the following macro to declare argument types.  This
   unfortunately clutters up the declarations a bit, but I think it's
   worth it.  */

#if __STDC__

#define _RE_ARGS(args) args

#else /* not __STDC__ */

#define _RE_ARGS(args) ()

#endif /* not __STDC__ */

/* Sets the current default syntax to SYNTAX, and return the old syntax.
   You can also simply assign to the `re_syntax_options' variable.  */
extern reg_syntax_t re_set_syntax _RE_ARGS ((reg_syntax_t syntax));

/* Compile the regular expression PATTERN, with length LENGTH
   and syntax given by the global `re_syntax_options', into the buffer
   BUFFER.  Return NULL if successful, and an error string if not.  */
extern const char *re_compile_pattern
  _RE_ARGS ((const char *pattern, int length,
             struct re_pattern_buffer *buffer));

/* Compile a fastmap for the compiled pattern in BUFFER; used to
   accelerate searches.  Return 0 if successful and -2 if was an
   internal error.  */
extern int re_compile_fastmap _RE_ARGS ((struct re_pattern_buffer *buffer));

/* Search in the string STRING (with length LENGTH) for the pattern
   compiled into BUFFER.  Start searching at position START, for RANGE
   characters.  Return the starting position of the match, -1 for no
   match, or -2 for an internal error.  Also return register
   information in REGS (if REGS and BUFFER->no_sub are nonzero).  */
extern int re_search
  _RE_ARGS ((struct re_pattern_buffer *buffer, const char *string,
            int length, int start, int range, struct re_registers *regs));

/* Like `re_search', but search in the concatenation of STRING1 and
   STRING2.  Also, stop searching at index START + STOP.  */
extern int re_search_2
  _RE_ARGS ((struct re_pattern_buffer *buffer, const char *string1,
             int length1, const char *string2, int length2,
             int start, int range, struct re_registers *regs, int stop));

/* Like `re_search', but return how many characters in STRING the regexp
   in BUFFER matched, starting at position START.  */
extern int re_match
  _RE_ARGS ((struct re_pattern_buffer *buffer, const char *string,
             int length, int start, struct re_registers *regs));

/* Relates to `re_match' as `re_search_2' relates to `re_search'.  */
extern int re_match_2
  _RE_ARGS ((struct re_pattern_buffer *buffer, const char *string1,
             int length1, const char *string2, int length2,
             int start, struct re_registers *regs, int stop));

/* Set REGS to hold NUM_REGS registers, storing them in STARTS and
   ENDS.  Subsequent matches using BUFFER and REGS will use this memory
   for recording register information.  STARTS and ENDS must be
   allocated with malloc, and must each be at least `NUM_REGS * sizeof
   (regoff_t)' bytes long.

   If NUM_REGS == 0, then subsequent matches should allocate their own
   register data.

   Unless this function is called, the first search or match using
   PATTERN_BUFFER will allocate its own register data, without
   freeing the old data.  */
extern void re_set_registers
  _RE_ARGS ((struct re_pattern_buffer *buffer, struct re_registers *regs,
             unsigned num_regs, regoff_t *starts, regoff_t *ends));

/* 4.2 bsd compatibility.  */
extern char *re_comp _RE_ARGS ((const char *));
extern int re_exec _RE_ARGS ((const char *));

/* POSIX compatibility.  */
extern int regcomp _RE_ARGS ((regex_t *preg, const char *pattern, int cflags));
extern int regexec
  _RE_ARGS ((const regex_t *preg, const char *string, size_t nmatch,
             regmatch_t pmatch[], int eflags));
extern size_t regerror
  _RE_ARGS ((int errcode, const regex_t *preg, char *errbuf,
             size_t errbuf_size));
extern void regfree _RE_ARGS ((regex_t *preg));

#endif /* not __REGEXP_LIBRARY_H__ */

/*
Local variables:
make-backup-files: t
version-control: t
trim-versions-without-asking: nil
End:
*/
--------------------------------------------------------------------------------

6. References
``GNU Regex Document'', Free Software Foundation, Inc., September 1992.
``Regular Expression Introduction'', Computing Center of Academia Sinica, ASPAC project, 1995.

posted on 2012-03-19 16:36 clq