lmgsanm

Learn a little every day, improve a little every day... Tomorrow is another beautiful day


Python module: re

#
# Secret Labs' Regular Expression Engine
#
# re-compatible interface for the sre matching engine
#
# Copyright (c) 1998-2001 by Secret Labs AB.  All rights reserved.
#
# This version of the SRE library can be redistributed under CNRI's
# Python 1.6 license.  For any other use, please contact Secret Labs
# AB (info@pythonware.com).
#
# Portions of this engine have been developed in cooperation with
# CNRI.  Hewlett-Packard provided funding for 1.6 integration and
# other compatibility work.
#

r"""Support for regular expressions (RE).

This module provides regular expression matching operations similar to
those found in Perl.  It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves.  You can
concatenate ordinary characters, so last matches the string 'last'.

The special characters are:
    "."      Matches any character except a newline.
    "^"      Matches the start of the string.
    "$"      Matches the end of the string or just before the newline at
             the end of the string.
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
             Greedy means that it will match as many repetitions as possible.
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    "?"      Matches 0 or 1 (greedy) of the preceding RE.
    *?,+?,?? Non-greedy versions of the previous three special characters.
    {m,n}    Matches from m to n repetitions of the preceding RE.
    {m,n}?   Non-greedy version of the above.
    "\\"     Either escapes special characters or signals a special sequence.
    []       Indicates a set of characters.
             A "^" as the first character indicates a complementing set.
    "|"      A|B, creates an RE that will match either A or B.
    (...)    Matches the RE inside the parentheses.
             The contents can be retrieved or matched later in the string.
    (?aiLmsux) Set the A, I, L, M, S, U, or X flag for the RE (see below).
    (?:...)  Non-grouping version of regular parentheses.
    (?P<name>...) The substring matched by the group is accessible by name.
    (?P=name)     Matches the text matched earlier by the group named name.
    (?#...)  A comment; ignored.
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    (?!...)  Matches if ... doesn't match next.
    (?<=...) Matches if preceded by ... (must be fixed length).
    (?<!...) Matches if not preceded by ... (must be fixed length).
    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                       the (optional) no pattern otherwise.

The special sequences consist of "\\" and a character from the list
below.  If the ordinary character is not on the list, then the
resulting RE will match the second character.
    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
    \d       Matches any decimal digit; equivalent to the set [0-9] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode digits.
    \D       Matches any non-digit character; equivalent to [^\d].
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode whitespace characters.
    \S       Matches any non-whitespace character; equivalent to [^\s].
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
             in bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the
             range of Unicode alphanumeric characters (letters plus digits
             plus underscore).
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.

This module exports the following functions:
    match     Match a regular expression pattern to the beginning of a string.
    fullmatch Match a regular expression pattern to all of a string.
    search    Search a string for the presence of a pattern.
    sub       Substitute occurrences of a pattern found in a string.
    subn      Same as sub, but also return the number of substitutions made.
    split     Split a string by the occurrences of a pattern.
    findall   Find all occurrences of a pattern in a string.
    finditer  Return an iterator yielding a match object for each match.
    compile   Compile a pattern into a RegexObject.
    purge     Clear the regular expression cache.
    escape    Backslash all non-alphanumerics in a string.

Some of the functions in this module take flags as optional parameters:
    A  ASCII       For string patterns, make \w, \W, \b, \B, \d, \D
                   match the corresponding ASCII character categories
                   (rather than the whole Unicode categories, which is the
                   default).
                   For bytes patterns, this flag is the only available
                   behaviour and needn't be specified.
    I  IGNORECASE  Perform case-insensitive matching.
    L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
    M  MULTILINE   "^" matches the beginning of lines (after a newline)
                   as well as the string.
                   "$" matches the end of lines (before a newline) as well
                   as the end of the string.
    S  DOTALL      "." matches any character at all, including the newline.
    X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
    U  UNICODE     For compatibility only. Ignored for string patterns (it
                   is the default), and forbidden for bytes patterns.

This module also defines an exception 'error'.

"""

import enum
import sre_compile
import sre_parse
import functools
try:
    import _locale
except ImportError:
    _locale = None

# public symbols
__all__ = [
    "match", "fullmatch", "search", "sub", "subn", "split",
    "findall", "finditer", "compile", "purge", "template", "escape",
    "error", "A", "I", "L", "M", "S", "X", "U",
    "ASCII", "IGNORECASE", "LOCALE", "MULTILINE", "DOTALL", "VERBOSE",
    "UNICODE",
]

__version__ = "2.2.1"

class RegexFlag(enum.IntFlag):
    ASCII = sre_compile.SRE_FLAG_ASCII # assume ascii "locale"
    IGNORECASE = sre_compile.SRE_FLAG_IGNORECASE # ignore case
    LOCALE = sre_compile.SRE_FLAG_LOCALE # assume current 8-bit locale
    UNICODE = sre_compile.SRE_FLAG_UNICODE # assume unicode "locale"
    MULTILINE = sre_compile.SRE_FLAG_MULTILINE # make anchors look for newline
    DOTALL = sre_compile.SRE_FLAG_DOTALL # make dot match newline
    VERBOSE = sre_compile.SRE_FLAG_VERBOSE # ignore whitespace and comments
    A = ASCII
    I = IGNORECASE
    L = LOCALE
    U = UNICODE
    M = MULTILINE
    S = DOTALL
    X = VERBOSE
    # sre extensions (experimental, don't rely on these)
    TEMPLATE = sre_compile.SRE_FLAG_TEMPLATE # disable backtracking
    T = TEMPLATE
    DEBUG = sre_compile.SRE_FLAG_DEBUG # dump pattern after compilation
globals().update(RegexFlag.__members__)

# sre exception
error = sre_compile.error

# --------------------------------------------------------------------
# public interface

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

def fullmatch(pattern, string, flags=0):
    """Try to apply the pattern to all of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).fullmatch(string)

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)

def subn(pattern, repl, string, count=0, flags=0):
    """Return a 2-tuple containing (new_string, number).
    new_string is the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in the source
    string by the replacement repl.  number is the number of
    substitutions that were made. repl can be either a string or a
    callable; if a string, backslash escapes in it are processed.
    If it is a callable, it's passed the match object and must
    return a replacement string to be used."""
    return _compile(pattern, flags).subn(repl, string, count)

def split(pattern, string, maxsplit=0, flags=0):
    """Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list."""
    return _compile(pattern, flags).split(string, maxsplit)

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

def finditer(pattern, string, flags=0):
    """Return an iterator over all non-overlapping matches in the
    string.  For each match, the iterator returns a match object.

    Empty matches are included in the result."""
    return _compile(pattern, flags).finditer(string)

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)

def purge():
    "Clear the regular expression caches"
    _cache.clear()
    _compile_repl.cache_clear()

def template(pattern, flags=0):
    "Compile a template pattern, returning a pattern object"
    return _compile(pattern, flags|T)

_alphanum_str = frozenset(
    "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890")
_alphanum_bytes = frozenset(
    b"_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890")

def escape(pattern):
    """
    Escape all the characters in pattern except ASCII letters, numbers and '_'.
    """
    if isinstance(pattern, str):
        alphanum = _alphanum_str
        s = list(pattern)
        for i, c in enumerate(pattern):
            if c not in alphanum:
                if c == "\000":
                    s[i] = "\\000"
                else:
                    s[i] = "\\" + c
        return "".join(s)
    else:
        alphanum = _alphanum_bytes
        s = []
        esc = ord(b"\\")
        for c in pattern:
            if c in alphanum:
                s.append(c)
            else:
                if c == 0:
                    s.extend(b"\\000")
                else:
                    s.append(esc)
                    s.append(c)
        return bytes(s)

# --------------------------------------------------------------------
# internals

_cache = {}

_pattern_type = type(sre_compile.compile("", 0))

_MAXCACHE = 512
def _compile(pattern, flags):
    # internal: compile pattern
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
        if p.flags & LOCALE:
            if not _locale:
                return p
            loc = _locale.setlocale(_locale.LC_CTYPE)
        else:
            loc = None
        _cache[type(pattern), pattern, flags] = p, loc
    return p

@functools.lru_cache(_MAXCACHE)
def _compile_repl(repl, pattern):
    # internal: compile replacement pattern
    return sre_parse.parse_template(repl, pattern)

def _expand(pattern, match, template):
    # internal: match.expand implementation hook
    template = sre_parse.parse_template(template, pattern)
    return sre_parse.expand_template(template, match)

def _subx(pattern, template):
    # internal: pattern.sub/subn implementation helper
    template = _compile_repl(template, pattern)
    if not template[0] and len(template[1]) == 1:
        # literal replacement
        return template[1][0]
    def filter(match, template=template):
        return sre_parse.expand_template(template, match)
    return filter

# register myself for pickling

import copyreg

def _pickle(p):
    return _compile, (p.pattern, p.flags)

copyreg.pickle(_pattern_type, _pickle, _compile)

# --------------------------------------------------------------------
# experimental stuff (see python-dev discussions for details)

class Scanner:
    def __init__(self, lexicon, flags=0):
        from sre_constants import BRANCH, SUBPATTERN
        self.lexicon = lexicon
        # combine phrases into a compound pattern
        p = []
        s = sre_parse.Pattern()
        s.flags = flags
        for phrase, action in lexicon:
            gid = s.opengroup()
            p.append(sre_parse.SubPattern(s, [
                (SUBPATTERN, (gid, 0, 0, sre_parse.parse(phrase, flags))),
                ]))
            s.closegroup(gid, p[-1])
        p = sre_parse.SubPattern(s, [(BRANCH, (None, p))])
        self.scanner = sre_compile.compile(p)
    def scan(self, string):
        result = []
        append = result.append
        match = self.scanner.scanner(string).match
        i = 0
        while True:
            m = match()
            if not m:
                break
            j = m.end()
            if i == j:
                break
            action = self.lexicon[m.lastindex-1][1]
            if callable(action):
                self.match = m
                action = action(self, m.group())
            if action is not None:
                append(action)
            i = j
        return result, string[i:]
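
The Scanner class that closes the listing is marked experimental in the source, so the following is only a sketch of how it appears to be used, read off the code above: each lexicon entry pairs a pattern with an action, a callable action receives the scanner and the matched text, and an action of None drops the match. The token names (INT, NAME, OP) and the sample input are assumptions chosen for illustration.

```python
import re

# Each lexicon entry is (pattern, action). Per the scan() loop in the listing,
# a callable action is called as action(scanner, matched_text) and its return
# value is appended to the result; an action of None means "skip this match".
scanner = re.Scanner([
    (r"\d+", lambda s, tok: ("INT", int(tok))),
    (r"[a-zA-Z_]\w*", lambda s, tok: ("NAME", tok)),
    (r"[+\-*/=]", lambda s, tok: ("OP", tok)),
    (r"\s+", None),  # whitespace is matched but not reported
])

# scan() returns the collected results plus whatever text could not be matched.
tokens, remainder = scanner.scan("x = 42 + y ?")

print(tokens)
print(repr(remainder))
```

Scanning stops at the first position where no lexicon pattern matches, which is why the trailing "?" comes back as the unscanned remainder instead of raising an error.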
python:re
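
The docstring in the listing describes match, fullmatch, and search, greedy versus non-greedy repetition, named groups, and the MULTILINE flag only in prose. A small sketch of how these behave through the public `re` interface may help; the sample strings and variable names are made up for illustration.

```python
import re

text = "id=42;"

# match anchors at the start, search scans forward, fullmatch must consume
# the entire string.
at_start = re.match(r"\d+", text)        # None: the string starts with "id"
anywhere = re.search(r"\d+", text)       # finds the digits later in the string
whole = re.fullmatch(r"\w+=\d+;", text)  # the pattern covers the whole string

# Greedy "*" takes as much as possible; "*?" takes as little as possible.
html = "<a><b>"
greedy = re.search(r"<.*>", html).group()
lazy = re.search(r"<.*?>", html).group()

# (?P<name>...) makes a group accessible by name; \1 refers back to group 1.
pair = re.search(r"(?P<key>\w+)=(?P<value>\d+)", text)
doubled_letter = re.search(r"(\w)\1", "hello").group()

# (?=...) checks what follows without consuming it.
labels = re.findall(r"\w+(?=:)", "a: 1, bb: 2")

# Without MULTILINE, "^" matches only at the start of the string.
lines = "one\ntwo"
first_only = re.findall(r"^\w+", lines)
every_line = re.findall(r"^\w+", lines, re.M)

print(at_start, anywhere.group(), whole.group())
print(greedy, lazy, pair.group("key"), pair.group("value"), doubled_letter)
print(labels, first_only, every_line)
```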

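
The docstrings for sub, subn, split, findall, and escape in the listing mention several behaviours that are easy to miss: a callable replacement, the substitution count, capturing groups leaking into split results, and tuples from findall. The following is a minimal sketch of those cases with invented sample data, not an exhaustive reference.

```python
import re

# repl may be a callable: it receives the match object and returns the
# replacement string.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")

# subn returns (new_string, number_of_substitutions).
collapsed, count = re.subn(r"\s+", " ", "a  b \t c")

# With capturing parentheses, split also returns the separators.
plain = re.split(r",", "a,b,c")
with_separators = re.split(r"(,)", "a,b,c")
limited = re.split(r",", "a,b,c", maxsplit=1)

# With more than one group, findall returns a list of tuples.
pairs = re.findall(r"(\w+)=(\d+)", "x=1, y=22")

# escape makes a literal string safe to embed in a pattern: the "." no
# longer matches arbitrary characters.
literal = re.escape("a.b")
matches_literal = re.fullmatch(literal, "a.b") is not None
matches_other = re.fullmatch(literal, "axb") is not None

print(doubled, collapsed, count)
print(plain, with_separators, limited, pairs)
print(matches_literal, matches_other)
```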
 

posted on 2018-01-17 21:30  lmgsanm