.NET Framework Class Library: Regex Class

Excerpted from: MSDN, .NET Framework Class Library, Regex Class http://msdn.microsoft.com/zh-cn/library/system.text.regularexpressions.regex

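Static and Instance Methods

Performing Regular Expression Operations

Examples

1. Using a regular expression to check for repeated words in a string

2. Using a regular expression to check whether a string represents a currency value or has the correct format to represent one

3. Extracting information about repeated words in a string (written by the author)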

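The Regex class represents the .NET Framework's regular expression engine. It can be used to quickly parse large amounts of text to find specific character patterns; to extract, edit, replace, or delete text substrings; or to add the extracted strings to a collection in order to generate a report.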

Static and Instance Methods

After you define a regular expression pattern, you can provide it to the regular expression engine in either of two ways:

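By instantiating a Regex object that represents the regular expression. To do this, pass the regular expression pattern to a Regex constructor. A Regex object is immutable; once you instantiate a Regex object with a regular expression, that object's regular expression cannot be changed.
By supplying both the regular expression and the text to search to a static (Shared in Visual Basic) Regex method. This lets you use a regular expression without explicitly creating a Regex object.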

All Regex pattern-matching methods include both static and instance overloads.
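The two approaches described above can be sketched side by side. This is a minimal illustration, not code from the MSDN page: the class name StaticVersusInstance, the helper methods, and the \d+ pattern are assumptions chosen only to show that the same check can go through either an instance or a static overload.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StaticVersusInstance
{
    // Hypothetical pattern for illustration: one or more decimal digits.
    private const string DigitsPattern = @"\d+";

    // Instance approach: construct the Regex once, then reuse the object.
    private static readonly Regex Digits = new Regex(DigitsPattern);

    public static bool ContainsDigitsInstance(string input)
    {
        return Digits.IsMatch(input);
    }

    // Static approach: pass the input and the pattern together on each call.
    public static bool ContainsDigitsStatic(string input)
    {
        return Regex.IsMatch(input, DigitsPattern);
    }

    public static void Main()
    {
        string[] inputs = { "order 66", "no numbers here" };
        foreach (string input in inputs)
        {
            Console.WriteLine("{0}: instance={1}, static={2}",
                              input,
                              ContainsDigitsInstance(input),
                              ContainsDigitsStatic(input));
        }
    }
}
```

Both helpers should report the same result for any input; which one reads better depends mostly on whether the pattern is reused in many places.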

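The regular expression engine must compile a particular pattern before the pattern can be used. Because Regex objects are immutable, this is a one-time procedure that occurs when a Regex class constructor or a static method is called. To avoid compiling a single regular expression repeatedly, the regular expression engine caches the compiled regular expressions used in static method calls. As a result, regular expression pattern-matching methods offer comparable performance for static and instance methods.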

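Important
In the .NET Framework versions 1.0 and 1.1, all compiled regular expressions were cached, whether they were used in instance or static method calls. Starting with the .NET Framework 2.0, only regular expressions used in static method calls are cached.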

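However, the caching system implemented by the regular expression engine can adversely affect performance in the following two cases:

When you make static method calls with a large number of regular expressions. By default, the regular expression engine caches the 15 most recently used static regular expressions. If your application uses more than 15 static regular expressions, some of them must be recompiled. To prevent this recompilation, you can increase the Regex.CacheSize property to an appropriate value.
When your application instantiates new Regex objects with regular expressions that have previously been compiled. Because regular expressions used in instance methods are not cached, each new object compiles its pattern again.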
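A minimal sketch of how both cases might be handled follows. The class name RegexCacheSketch, the helper names, the value 40, and the word-counting pattern are all assumptions made for illustration; only Regex.CacheSize and the Regex constructor come from the class itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexCacheSketch
{
    // Case 1: if the application uses many distinct patterns through
    // static calls, raise the cache size so they are not recompiled.
    public static int EnsureCacheSize(int distinctStaticPatterns)
    {
        if (Regex.CacheSize < distinctStaticPatterns)
        {
            Regex.CacheSize = distinctStaticPatterns;
        }
        return Regex.CacheSize;
    }

    // Case 2: construct the Regex once and reuse it, instead of creating
    // a new object (and compiling the pattern again) on every call.
    private static readonly Regex Word = new Regex(@"\b\w+\b");

    public static int CountWords(string text)
    {
        return Word.Matches(text).Count;
    }

    public static void Main()
    {
        // 40 is an arbitrary, hypothetical number of static patterns.
        Console.WriteLine("Cache size: {0}", EnsureCacheSize(40));

        foreach (string line in new[] { "one two", "three four five" })
        {
            Console.WriteLine("'{0}' has {1} words", line, CountWords(line));
        }
    }
}
```

Holding the Regex in a static readonly field is one simple way to avoid repeated instantiation; whether a larger cache is worth its memory depends on how many patterns the application actually cycles through.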

Performing Regular Expression Operations

Whether you instantiate a Regex object and call its methods or call static methods, the Regex class offers the following pattern-matching functionality:

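Validation of a match. You call the IsMatch method to determine whether a match is present.
Retrieval of a single match. You call the Match method to retrieve a Match object that represents the first match in a string or in part of a string. Subsequent matches can be retrieved by calling the Match.NextMatch method.
Retrieval of all matches. You call the Matches method to retrieve a System.Text.RegularExpressions.MatchCollection object that represents all the matches found in a string or in part of a string.
Replacement of matched text. You call the Replace method to replace matched text. The replacement text can also be a replacement pattern that uses substitutions such as $1 or ${name} to refer to captured groups. In addition, some Replace overloads include a MatchEvaluator parameter, which lets you define the replacement text programmatically.
Creation of a string array that is formed from parts of an input string. You call the Split method to split an input string at the positions defined by the regular expression.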
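The operations listed above can be tried together in one short program. This is only a sketch under stated assumptions: the class RegexOperationsSketch, its helper methods, the [0-9]+ pattern, and the sample sentence are invented for illustration, while IsMatch, Match, NextMatch, Matches, Replace, and Split are the Regex members named in the list.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class RegexOperationsSketch
{
    // Hypothetical pattern for illustration: a run of ASCII digits.
    private static readonly Regex Number = new Regex(@"[0-9]+");

    // Validation: is there at least one match?
    public static bool HasNumber(string input)
    {
        return Number.IsMatch(input);
    }

    // Single matches: start with Match, then walk forward with NextMatch.
    public static List<string> NumbersOneByOne(string input)
    {
        List<string> found = new List<string>();
        for (Match m = Number.Match(input); m.Success; m = m.NextMatch())
        {
            found.Add(m.Value);
        }
        return found;
    }

    // All matches at once.
    public static int CountNumbers(string input)
    {
        return Number.Matches(input).Count;
    }

    // Replacement pattern: $0 stands for the whole match.
    public static string BracketNumbers(string input)
    {
        return Number.Replace(input, "[$0]");
    }

    // MatchEvaluator: compute each replacement in code.
    public static string DoubleNumbers(string input)
    {
        return Number.Replace(input, m =>
            (int.Parse(m.Value, CultureInfo.InvariantCulture) * 2)
                .ToString(CultureInfo.InvariantCulture));
    }

    // Split at commas with optional surrounding white space.
    public static string[] SplitItems(string input)
    {
        return Regex.Split(input, @"\s*,\s*");
    }

    public static void Main()
    {
        string text = "3 apples, 12 pears, 7 plums";

        Console.WriteLine(HasNumber(text));                          // True
        Console.WriteLine(string.Join("|", NumbersOneByOne(text)));  // 3|12|7
        Console.WriteLine(CountNumbers(text));                       // 3
        Console.WriteLine(BracketNumbers(text));  // [3] apples, [12] pears, [7] plums
        Console.WriteLine(DoubleNumbers(text));   // 6 apples, 24 pears, 14 plums
        Console.WriteLine(string.Join("|", SplitItems(text)));  // 3 apples|12 pears|7 plums
    }
}
```

The MatchEvaluator overload is the one to reach for when the replacement cannot be expressed as a substitution string, as with the arithmetic here.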

In addition to its pattern-matching methods, the Regex class includes several special-purpose methods:

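The Escape method escapes any characters that may be interpreted as regular expression operators in a regular expression or input string.
The Unescape method removes these escape characters.
The CompileToAssembly method creates an assembly that contains predefined regular expressions. The .NET Framework contains examples of these special-purpose assemblies in the System.Web.RegularExpressions namespace.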
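A brief sketch of Escape and Unescape follows; CompileToAssembly is left out. The class name EscapeSketch, the helper methods, and the sample strings are assumptions for illustration. The idea is that text containing operator characters, such as a period, should be escaped before it is used as a pattern when it is meant literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeSketch
{
    // Treat 'literal' as plain text rather than as a pattern.
    public static bool ContainsLiteral(string input, string literal)
    {
        return Regex.IsMatch(input, Regex.Escape(literal));
    }

    // Escape followed by Unescape should give back the original text.
    public static string RoundTrip(string literal)
    {
        return Regex.Unescape(Regex.Escape(literal));
    }

    public static void Main()
    {
        // Hypothetical search text that contains an operator character.
        string literal = "a.c";

        // Unescaped, the period is a wildcard, so "abc" matches.
        Console.WriteLine(Regex.IsMatch("abc", literal));       // True
        // Escaped, only the literal text "a.c" matches.
        Console.WriteLine(ContainsLiteral("abc", literal));     // False
        Console.WriteLine(ContainsLiteral("xa.cx", literal));   // True

        string original = "1+1=2 (approx.)";
        Console.WriteLine(RoundTrip(original) == original);     // True
    }
}
```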

Examples

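1. The following example uses a regular expression to check for repeated occurrences of words in a string. The regular expression \b(?<word>\w+)\s+(\k<word>)\b can be interpreted as shown in the following table.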

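\b                Start the match at a word boundary.
(?<word>\w+)      Match one or more word characters, up to a word boundary. Name this captured group word.
\s+               Match one or more white-space characters.
(\k<word>)        Match the text captured by the group named word.
\b                Match a word boundary.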

using System;
using System.Text.RegularExpressions;

public class Test
{

    public static void Main ()
    {

        // Define a regular expression for repeated words.
        Regex rx = new Regex(@"\b(?<word>\w+)\s+(\k<word>)\b",
          RegexOptions.Compiled | RegexOptions.IgnoreCase);

        // Define a test string.        
        string text = "The the quick brown fox  fox jumped over the lazy dog dog.";

        // Find matches.
        MatchCollection matches = rx.Matches(text);

        // Report the number of matches found.
        Console.WriteLine("{0} matches found in:\n   {1}", 
                          matches.Count, 
                          text);

        // Report on each match.
        foreach (Match match in matches)
        {
            GroupCollection groups = match.Groups;
            Console.WriteLine("'{0}' repeated at positions {1} and {2}",  
                              groups["word"].Value, 
                              groups[0].Index, 
                              groups[1].Index);
        }

    }
	
}
// The example produces the following output to the console:
//       3 matches found in:
//          The the quick brown fox  fox jumped over the lazy dog dog.
//       'The' repeated at positions 0 and 4
//       'fox' repeated at positions 20 and 25
//       'dog' repeated at positions 50 and 54

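2. The following example uses a regular expression to check whether a string either represents a currency value or has the correct format to represent a currency value. In this case, the regular expression is built dynamically from the NumberFormatInfo.CurrencyDecimalSeparator, NumberFormatInfo.CurrencyDecimalDigits, NumberFormatInfo.CurrencySymbol, NumberFormatInfo.NegativeSign, and NumberFormatInfo.PositiveSign properties of the user's current culture. If the system's current culture is en-US, the resulting regular expression is ^\s*[\+-]?\s?\$?\s?(\d*\.?\d{2}?){1}$. This regular expression can be interpreted as shown in the following table.

^                      Start at the beginning of the string.
\s*                    Match zero or more white-space characters.
[\+-]?                 Match zero or one occurrence of either the positive sign or the negative sign.
\s?                    Match zero or one white-space character.
\$?                    Match zero or one occurrence of the dollar sign.
\s?                    Match zero or one white-space character.
\d*                    Match zero or more decimal digits.
\.?                    Match zero or one decimal point symbol.
\d{2}?                 Match exactly two decimal digits. (The trailing ? only makes the quantifier lazy, which has no effect on a fixed count.)
(\d*\.?\d{2}?){1}      Match the pattern of integral and fractional digits separated by a decimal point symbol exactly once.
$                      Match the end of the string.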

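In this case, the regular expression assumes that a valid currency string does not contain group separator symbols, and that it has either no fractional digits or the number of fractional digits defined by the current culture's CurrencyDecimalDigits property.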

using System;
using System.Globalization;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      // Get the current NumberFormatInfo object to build the regular 
      // expression pattern dynamically.
      NumberFormatInfo nfi = NumberFormatInfo.CurrentInfo;

      // Define the regular expression pattern.
      string pattern; 
      // \s matches white space (\w would match word characters instead).
      pattern = @"^\s*[";
      // Get the positive and negative sign symbols.
      pattern += Regex.Escape(nfi.PositiveSign + nfi.NegativeSign) + @"]?\s?";
      // Get the currency symbol.
      pattern += Regex.Escape(nfi.CurrencySymbol) + @"?\s?";
      // Add integral digits to the pattern.
      pattern += @"(\d*";
      // Add the decimal separator.
      pattern += Regex.Escape(nfi.CurrencyDecimalSeparator) + "?";
      // Add the fractional digits.
      pattern += @"\d{";
      // Determine the number of fractional digits in currency values.
      pattern += nfi.CurrencyDecimalDigits.ToString() + "}?){1}$";

      Regex rgx = new Regex(pattern);

      // Define some test strings.
      string[] tests = { "-42", "19.99", "0.001", "100 USD", 
                         ".34", "0.34", "1,052.21", "$10.62", 
                         "+1.43", "-$0.23" };

      // Check each test string against the regular expression.
      foreach (string test in tests)
      {
         if (rgx.IsMatch(test))
            Console.WriteLine("{0} is a currency value.", test);
         else
            Console.WriteLine("{0} is not a currency value.", test);
      }
   }
}
// The example displays the following output:
//       -42 is a currency value.
//       19.99 is a currency value.
//       0.001 is not a currency value.
//       100 USD is not a currency value.
//       .34 is a currency value.
//       0.34 is a currency value.
//       1,052.21 is not a currency value.
//       $10.62 is a currency value.
//       +1.43 is a currency value.
//       -$0.23 is a currency value.

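Because the regular expression in this example is built dynamically, we do not know at design time whether the current culture's currency symbol, decimal sign, or positive and negative signs might be misinterpreted by the regular expression engine as regular expression language operators. To prevent any misinterpretation, the example passes each dynamically generated string to the Escape method.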
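To see why the escaping matters, here is a small sketch that builds a simplified amount pattern with and without Regex.Escape. The class CurrencySymbolSketch, the BuildPattern helper, and the simplified pattern are assumptions made for illustration, and the currency symbol is hard-coded to "$" rather than read from NumberFormatInfo so that the result does not depend on the machine's culture.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CurrencySymbolSketch
{
    // Builds a simplified pattern: an optional currency symbol followed by digits.
    // The non-capturing group makes ? apply to the whole symbol, which matters
    // when a culture's symbol is longer than one character.
    public static string BuildPattern(string currencySymbol, bool escape)
    {
        string symbol = escape ? Regex.Escape(currencySymbol) : currencySymbol;
        return "^(?:" + symbol + @")?[0-9]+$";
    }

    public static void Main()
    {
        // Hard-coded for illustration; a real program would read
        // NumberFormatInfo.CurrentInfo.CurrencySymbol instead.
        string symbol = "$";

        // Escaped: the dollar sign is matched literally.
        Console.WriteLine(Regex.IsMatch("$10", BuildPattern(symbol, true)));   // True

        // Unescaped: the dollar sign is read as the end-of-string anchor,
        // so a value that starts with "$" is rejected.
        Console.WriteLine(Regex.IsMatch("$10", BuildPattern(symbol, false)));  // False
    }
}
```

Grouping the symbol is a design choice of this sketch only; it keeps the optional quantifier from binding to just the last character of a multi-character symbol.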

3. Extracting information about repeated words in a string (written by the author)

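string text = "The the quick brown fox  fox over the lazy dog dog t2he 111 t2he 中文 2e 中文 _ef _Ef.";
Response.Write(text + "<br/>");

// 1. Find words that appear again later in the string.
//    \b around the backreference keeps it from matching inside a longer word.
Regex rx = new Regex(@"(?<word>\b\w+\b).*?\b(\k<word>)\b",
  RegexOptions.Compiled | RegexOptions.IgnoreCase);
MatchCollection matches = rx.Matches(text);
foreach (Match match in matches)
{
    string strResponse = "";
    string strRepeat = match.Groups["word"].Value;

    // 2. Search for the captured word as a whole word and collect its positions.
    //    Regex.Escape ensures the word is treated as literal text.
    Regex rx2 = new Regex(@"\b" + Regex.Escape(strRepeat) + @"\b", RegexOptions.IgnoreCase);
    MatchCollection matches2 = rx2.Matches(text);
    strResponse += string.Format("The word {0} occurs {1} times, at positions: "
                                , strRepeat
                                , matches2.Count);
    foreach (Match match2 in matches2)
    {
        strResponse += match2.Index + " ";
    }

    Response.Write("<br/>" + strResponse + "<br/>");
}

/*
 * Notes:
 \w      Matches any word character.
 \b      Matches a word boundary.
 .       Wildcard: matches any single character except \n.
 *?      Repeats any number of times, but as few times as possible (lazy matching).

 * Output:
 The the quick brown fox fox over the lazy dog dog t2he 111 t2he 中文 2e 中文 _ef _Ef.

The word The occurs 3 times, at positions: 0 4 34 

The word fox occurs 2 times, at positions: 20 25 

The word dog occurs 2 times, at positions: 43 47 

The word t2he occurs 2 times, at positions: 51 60 

The word 中文 occurs 2 times, at positions: 65 71 

The word _ef occurs 2 times, at positions: 74 78 
 */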

Posted on 2010-08-14 11:36 by 黄小二