代码改变世界

oracle的正则表达式

2011-06-15 14:22  Tracy.  阅读(673)  评论(0编辑  收藏  举报

必须明白的是,oracle的正则表达式和其它的语言的并不是完全一样。
oracle支持以下标准的正则表达式:
  • IEEE Portable Operating System Interface (POSIX) standard draft 1003.2/D11.2
  • Unicode Regular Expression Guidelines of the Unicode Consortium
  • Extends the matching capabilities for multilingual data beyond what is specified in the POSIX standard(即posix的多语言支持).
  • Adds support for the common Perl regular expression extensions that are not
included in the POSIX standard but do not conflict with it. Oracle Database
provides built-in support for some of the most heavily used Perl regular
expression operators, for example, character class shortcuts, the non-greedy
modifier, and so on(即部分perl标准,不包含在posix标准,但是又不与之冲突的)


---2008-09-08 做了个实验,关键的要点是三个,重复次数{},+,*,?
---以及匹配模式[...] ,[^...],(...),(...)
--以及转义符号 \d,\D,\\等。
------------------------------------------------------------------------------------------------------------------
ID VALUE
--------------------------------------- ----------------------------------------
9 lzf is a chinese
10 1231231231lzf
11 perfect
1 13625061300
2 059183383335
3 13959029697
4 13950323157
5 who is your daddy?
6 who are you?
6 what is your name?
6 I am 20 years old!
7 股市下挫到2100点左右
8 lzf ' is very happy 136ok
8 do you know lzf,ok?
8 ok,lzf is my brother
------------------------------------------------------------------------------------------------------------------

select * from test_reg where regexp_like(value,'^1(.+)7$') --1开头,7结尾的内容
select * from test_reg where regexp_like(value,'\d+') --一个或者多个数字
select * from test_reg where regexp_like(value,'^[a-zA-Z]{1,}[a-zA-Z]+$') --全部字符的.[:digit:]
select * from test_reg where regexp_like(value,'^[0-9]{1,}[0-9]+$') --全部数字的.
select * from test_reg where regexp_like(value,'[y][o][u]') --包含you
select * from test_reg where regexp_like(value,'(you)') --包含you
select * from test_reg where regexp_like(value,'[you]') --包含y o u的,不分顺序
select * from test_reg where regexp_like(value,'(lzf){1,}(.){0,}(ok)') --包含lzf 和ok的. 但是也仅仅限制与lzf 在ok前
以上试验是在看了下文以及参考了OReilly.Oracle.PL.SQL.Programming.4th.Edition.Aug.2005.chs后作出的.
--------------
2010/08/04 增加以下例子

---
--以'POSIX('开头的串,以下几个表达式的结果都是一样的。
select 1 from dual where regexp_like('POSIX(ADDDDDD,Y,13)','^(POSIX\()([A-Z]+)(,)([RYN]{1})(,)\d{1,}?\)')

select 1 from dual where regexp_like('POSIX(ADDDDDD,Y,13)','^(POSIX\()([A-Z]+)(,)([RYN]{1})(,)\d+\)')
select 1 from dual where regexp_like('POSIX(ADDDDDD,Y,13)','^(POSIX\()([A-Z]+)(,)([RYN]{1}),\d+\)')
select 1 from dual where regexp_like('POSIX(ADDDDDD,Y,13)','^(POSIX\()([A-Z]+),([RYN]{1}),\d+\)')
select 1 from dual where regexp_like('POSIX(ADDDDDD,Y,13)','^(POSIX\()([A-Z]+),([RYN]+),\d+\)')

select 1 from dual where regexp_like('POSIX(ADDDDDD,Y,13)','^(POSIX\()([A-Z]+),[RYN]+,\d+\)')
select 1 from dual where regexp_like('POSIX(A,N,13)','^(POSIX\()([A-Z]+),[RYN]+,\d+\)')
----------
--位于字符串中的串
select 1 from dual where regexp_like('POSIX(A,N,13)','^(POSIX\()([A-Z]+),[RYN]+,\d+\)')
select 1 from dual where regexp_like('P_OSI_X(JYL_GP_R,N,13)','^[A-Z_]+\(([A-Z_]+),[RYN]+,\d+\)')

--测试regexp_substr,基本的还是regexp_like
SQL> select regexp_substr('P_OSI_X(JYL_GP_R,N,13)+AVG(JYL_JJ_R,Y,3)','^[A-Z_]+\(([A-Z_]+),[RYN]+,\d+\)') FROM DUAL
2 /

REGEXP_SUBSTR('P_OSI_X(JYL_GP_
------------------------------
P_OSI_X(JYL_GP_R,N,13)

SQL> select regexp_substr('P_OSI_X(JYL_GP_R,N,13)+AVG(JYL_JJ_R,Y,3)','[A-Z_]+\(([A-Z_]+),[RYN]+,\d+\)',1,2) FROM DUAL
2 /

REGEXP_SUBSTR('P_OSI_X(JYL_GP_
------------------------------
AVG(JYL_JJ_R,Y,3)

2010/08/26 新增例子
-- delete the first ()
select regexp_replace('dad(disc_id=50003118)add','(^([A-Z]{1,}\()|(\)[A-Z]{1,})$)','',1,0,'i') FROM DUAL
--
match pattern
using classes
select regexp_substr(upper('dad(ssdsd89\9 dd== (sddf(d)),sdsd)add'),'^([A-Z]{1,}\()[[:space:][:alnum:]\,\_\(\)\=]{1,}\)[A-Z]{1,}$',1,1,'i') FROM DUAL



OReilly. 的参考内容
Section A.1. Metacharacters

A.1. Metacharacters

The R2 column in Tables A-1 through A-3 indicates which metacharacters were introduced in Oracle Database 10g Release 2. All others were introduced in Release 1.

Table A-1. Character-matching metacharacters

Syntax

R2?

Description

.

 

Matches any single character except for newline. Will match newline when the n flag is set. On Windows, Linux, and Unix platforms, chr(10) is recognized as the newline.

[ ... ]

 

Defines a matching list that matches any character listed between the brackets. You may specify ranges of characters, as in a-z. These ranges are interpreted based on the NLS_SORT setting.

A dash (-) is a literal when it occurs first or last in the list (e.g., [abc-]). A closing-bracket (]) is a literal when it occurs first in the list (e.g., []abc]). A caret (^) in the first position makes the list a nonmatching list (see the next entry).

[^ ... ]

 

Matches any character not listed between the brackets. Referred to as a "nonmatching list."

[:class:]

 

Matches any character that belongs to the specified character class. May only be used within a matching list: [[:class:]abc] is a valid expression, [:class:]abc is not. Table A-5 lists the valid character class names.

[.coll.]

 

Matches the specified collation element, which may be one or more characters. May only be used within a matching list. For example, the expression [[.ch.]] matches the Spanish letter "ch". Table A-4 lists the valid collation elements.

[=char=]

 

Matches all characters that share the same base character as char. May be used only within a matching list. For example, [[=e=]] matches any of: "eé?è???E".

\d


Matches any digit. Equivalent to [[:digit:]].

\D


Matches any nondigit. Equivalent to [^[:digit:]]

\w


Matches any "word character." Word characters are defined to be alphabetic characters, numeric characters, and the underscore.

\W


Matches any nonword character.

\s


Matches any whitespace character. Equivalent to [[:space:]].

\S


Matches nonwhitespace characters. Equivalent to [^[:space:]].


Table A-2. Quantifiers

Syntax

R2?

Description

+

 

One or more

?

 

Zero or one

*

 

Zero or more

{m}

 

Exactly m occurrences

{m,}

 

At least m occurrences

{m,n}

 

At least m, and at most n occurrences

+?


One or more, but nongreedy

??


Zero or one, but nongreedy

{m}?


The same as {m}

{m,}?


At least m occurrences, but nongreedy and stops as soon as m occurrences are reached

{m,n}?


At least m, and at most n occurrences, but nongreedy; when possible, m occurrences are matched


Table A-3. Other metacharacters

Syntax

R2?

Description

|

 

Specifies an alternation. An alternation within a subexpression doesn't extend beyond the subexpression.

( ...)

Defines a subexpresson.

\n

References the text matched by the nth subexpression. Backreferences may range from \1 through \9.

\

When not followed by a digit, the \ is an escape character. For example, use the pattern \\1 to look for a single backslash followed by the digit 1, use \( to look for an opening-parentheses (rather than begin a subexpression), etc.

^

Anchors an expression to the beginning of the string (in multiline mode, to the beginning of a line).

$

Anchors an expression to the end of the string (in multiline mode, to the end of a line).

\A


Anchors an expression to the beginning of the string regardless of whether multiline mode is specified.

\Z


Anchors an expression to the end of the string, or the a newline that happens to be ending a string, regardless of whether multiline mode is specified.

\z


Anchors an expression to the end of the string regardless of whether multiline mode is specified.


Table A-4. Collation elements

NLS_SORT

Multicharacter collation elements

XDANISH

aa

oe

AA

OE

Aa

Oe

XSPANISH

ch

ll

CH

LL

Ch

Ll

XHUNGARIAN

cs

gy

ly

ny

sz

ty

zs

CS

GY

LY

NY

SZ

TY

ZS

Cs

Gy

Ly

Ny

Sz

Ty

Zs

XCZECH

ch

CH

Ch

XCZECH_PUNCTUATION

ch

CH

Ch

XSLOVAK

dz

d_

ch

DZ

D_

CH

Dz

D_

Ch

XCROATIAN

d_

lj

nj

D_

LJ

Nj

D_

Lj

NJ


Table A-5. Supported character classes

Class

Description

[:alnum:]

Alphanumeric characters (same as [:alpha:] + [:digit:])

[:alpha:]

Alphabetic characters only

[:blank:]

Blank space characters, such as space and tab

[:cntrl:]

Nonprinting, or control characters

[:digit:]

Numeric digits

[:graph:]

Graphical characters (same as [:punct:] + [:upper:] + [:lower:] + [:digit:])

[:lower:]

Lowercase letters

[:print:]

Printable characters

[:punct:]

Punctuation characters

[:space:]

Whitespace characters such as space, formfeed, newline, carriage return, horizontal tab, and vertical tab

[:upper:]

Uppercase letters

[:xdigit:]

Hexadecimal characters


对于类的使用,要特别注意点,因为基本上要包含在一对中括号中。
例一:检验非汉字的函数
例如可以定义一个检验非汉字的函数,不考虑其它UNICODE
CREATE OR REPLACE Function func_existNormalChar(chars In Varchar2) Return Pls_Integer
Is
vResult Pls_Integer;
Begin
Select 1
Into vresult From dual
Where regexp_like(chars,'[!-~ ,。、?’“;:『』【】+=-——~!◎#¥%……※×()]+')
Or regexp_like(chars,'[[:space:]+]')
Or regexp_like(chars,'[[:blank:]+]')
Or regexp_like(chars,'[[:cntrl:]+]');
Return vresult;
Exception
When Others Then
Return 0;
End;

例二:出现次数的检验
序号 例子 结果 说明
1 regexp_replace('a12c','[[:digit:]]+','*') a*c 测试"+",即出现一次或者更多次的.由于采取的是连续的匹配,所以结果上是把多个数字合并为一个替代字符串,为了理解连续匹配的意思,看序号9更能够明白.这是总体偏少的模式
2 regexp_replace('a12c','[[:digit:]]?','*') *a***c* 测试"?",即0或者一次.这个符号,会在任意一个其它字符边上插入一个假设出现的匹配类型符号,如例子,a的左边本来是没有,但是表达式认为数字字符在a的左边出现了0次(这个是成立的).然后还要逐个替换过去,这是总体偏多的模式,过于匹配和逐个匹配结合
3 regexp_replace('a12c','[[:digit:]]*','*')
*a**c* 测试"*",即0或者更多次数,这个模式类似与?类似,采取连续匹配+多余匹配
4 regexp_replace('a12c','[[:digit:]]+?','*')
a**c 测试"+?",匹配在前面一次或者多次的串,采取的是逐个匹配的策略(如果有连续的情况出现),这个是通常最希望使用的模式.
5 regexp_replace('a12c','[[:digit:]]*?','*') *a*1*2*c* 测试"*?",匹配0或者更多次,难于理解! 但可以简单地理解为为在各个字符中以及两边插入替代串.
6 regexp_replace('a12c','[[:digit:]]??','*') *a*1*2*c* 测试"??",匹配0或者1次,难于理解啊! 但可以简单地理解为为在各个字符中以及两边插入替代串.
7 regexp_replace('a12c','[[:digit:]]{1,}','*')
a*c 测试{1,},同+,结果已经说明
8
9 regexp_replace('a1x2c','[[:digit:]]+','*') a*x*c 测试"+",这个例子说明,如果被检测的字符串中特定的匹配串如果不是连续的(多次),那么替代的还是属于正常的范围.


其它的次数模式暂时没有测试,对于 *,?,+ ,*?,+?,??,基本有其对应的模式 {n,m}? 或者是{n,}? ,除了*,*?之外,因为此二者表示的都是0或者多次(大于1),无法用{n,m}的方式表示.

例三:多个类的并用
很多时候使用类还是很方便的,例如[:digit:]表示数字.[:space:]表示空格,换行,回车等等.
如果希望把特定串中的数字和回车空格等都替换为'-',则可以使用 "|"(替代操作符,其实就是类似一般高级语言中的或操作符号).
select regexp_replace('adfa jkjk jk
jk','([[:digit:]]+?)|([[:space:]]+?)','-') from dual;
结果如下:
adfa-jkjk-jk-jk
已经把空格和回车符号都替换为'-'了.
又例如:
Select regexp_replace('a1b c1d jk
jk','([[:digit:]]+?)|([[:space:]]+?)','-') From dual
结果如下:
a-b-c-d-jk-jk

例四:迭代替换
例如希望把以下串替换掉
'<xxxx <often> <common> <fat> >'


关于范围操作符号 -

发音:横杆或者减号

在列表匹配操作情况下,如果没有特别的设置,就是表示匹配一个范围:

例子:把字符串中非数字+-()的符号替换为空

这是一个不成功的匹配

SQL> select regexp_replace('1+[2c- 3([ad]km,4','[^0-9\+\-\(\)]','',1) from dual;

REGEXP_REPLACE('1+[2C-3([AD]KM
------------------------------
1+23(4

这也是一个错误的匹配,因为减号这个时候表示一个范围
SQL> select regexp_replace('1+[2c- 3([ad]km,4','[^0-9\+-\(\)]','',1) from dual;

REGEXP_REPLACE('1+[2C-3([AD]KM
------------------------------
1+[2-3([,4

二者的区别在于单独的一个减号,如果前面没有跟上斜杠\,那么就是表示范围,否则就是减号本身。

但即使如此,依然没有达到目的,第一个把减号还是给去除了,虽然排除了其它的需要排除的符号,

第二个,则是当作范围符号,大部分的没有排除掉。

但是不太明白的是,按照oracle文档的约定,在范围匹配中形如[\-]表示匹配一个减号,为何依然匹配补上了,尤其是当夹着其它需要匹配的模式的时候。

现在只能尝试着把这个模式放在最后:

SQL> select regexp_replace('1+[2c- 3([ad4','[^0-9\(+\)\-]','',1) from dual
2 /

REGEXP_REPLACE('1+[2C-3([AD4',
------------------------------
1+2-3(4

结果是可以,因为模式"\-"被放到了范围中的尾巴。

那是文档不全,还是理解不透彻了,

看来以后遇到这样的问题还是先把减号放到尾巴去吧。