PostgreSQL正则表达式

基础:

Operator Description Example
~ Matches regular expression, case sensitive 'thomas' ~ '.*thomas.*'
~* Matches regular expression, case insensitive 'thomas' ~* '.*Thomas.*'
!~ Does not match regular expression, case sensitive 'thomas' !~ '.*Thomas.*'
!~* Does not match regular expression, case insensitive 'thomas' !~* '.*vadim.*'

注意:~相当于like

Some examples:

 

'abc' ~ 'abc'    true
'abc' ~ '^a'     true
'abc' ~ '(b|d)'  true
'abc' ~ '^(b|c)' false

 

 

substring('foobar' from 'o.b')     oob

 substring('foobar' from 'o(.)b') o

 

--注意:如果有标志位g表示全部替换,否则只在第一次匹配进行替换

regexp_replace('foobarbaz', 'b..', 'X')
                                   fooXbaz
regexp_replace('foobarbaz', 'b..', 'X', 'g')
                                   fooXX
regexp_replace('foobarbaz', 'b(..)', E'X\\1Y', 'g')
                                   fooXarYXazY

 

 

再比如;

SELECT regexp_matches('foobarbequebaz', '(bar)(beque)');
 regexp_matches 
----------------
 {bar,beque}
(1 row)

SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)', 'g');
 regexp_matches 
----------------
 {bar,beque}
 {bazil,barf}
(2 rows)

SELECT regexp_matches('foobarbequebaz', 'barbeque');
 regexp_matches 
----------------
 {barbeque}
(1 row)

还有:

SELECT foo FROM regexp_split_to_table('the quick brown fox jumped over the lazy dog', E'\\s+') AS foo;
  foo  
--------
the   
quick 
brown 
fox   
jumped
over  
the   
lazy  
dog   
(9 rows)

SELECT regexp_split_to_array('the quick brown fox jumped over the lazy dog', E'\\s+');
              regexp_split_to_array            
------------------------------------------------
{the,quick,brown,fox,jumped,over,the,lazy,dog}
(1 row)

SELECT foo FROM regexp_split_to_table('the quick brown fox', E'\\s*') AS foo;
foo
-----
t        
h        
e        
q        
u        
i        
c        
k        
b        
r        
o        
w        
n        
f        
o        
x        
(16 rows)

Common usages

We are going to go backwards a bit. We will start with demonstrations of PostgreSQL SQL statements that find and replace things and list things out with regular expressions. For these exercises we will be using a contrived table called notes, which you can create with the following code.

CREATE TABLE notes(note_id serial primary key, description text); INSERT INTO notes(description) VALUES('John''s email address is johnny@johnnydoessql.com. Priscilla manages the http://www.johnnydoessql.com site. She also manages the site http://jilldoessql.com and can be reached at 345.678.9999 She can be reached at (123) 456-7890 and her email address is prissy@johnnydoessql.com or prissy@jilldoessql.com.'); INSERT INTO notes(description) VALUES('I like ` # marks and other stuff that annoys militantdba@johnnydoessql.com. Militant if you have issues, give someone who gives a damn a call at (999) 666-6666.');

Regular Expressions in PostgreSQL

PostgreSQL has a rich set of functions and operators for working with regular expressions. The ones we commonly use are ~, regexp_replace, and regexp_matches.

We use the PostgreSQL g flag in our use more often than not. The g flag is the greedy flag that returns, replaces all occurrences of the pattern. If you leave the flag out, only the first occurrence is replaced or returned in the regexp_replace, regexp_matches constructs. The ~ operator is like the LIKE statement, but for regular expressions.

Destroying information

The power of databases is not only do they allow you to store/retrieve information quickly, but they allow you to destroy information just as quickly. Every database programmer should be versed in the art of information destruction.

 -- remove email addresses if description has email address 
 UPDATE notes SET description = regexp_replace(description, 
       E'[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]{2,4}', 
        '---', 'g') 
        WHERE description 
        ~ E'[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]{2,4}'; 
        
 -- remove website urls if description has website urls 
 -- matches things like http://www.something.com or http://something.com

 UPDATE notes SET description = regexp_replace(description, 
       E'http://[[[:alnum:]]+.]*[[:alnum:]]+[.][[:alnum:]]+', 
        E'--', 'g') 
        WHERE description 
        ~ E'http://[[[:alnum:]]+.]*[[:alnum:]]+[.][[:alnum:]]+'; 

-- remove phone numbers if description 
-- has phone numbers e.g. (123) 456-7890 or 456-7890 or 123.456.7890
UPDATE notes SET description = regexp_replace(description, 
      E'[\(]{0,1}[0-9]{3}[\).-]{0,1}[[:space:]]*[0-9]{3}[.-]{0,1}[0-9]{4}',
        '---', 'g') 
        WHERE description 
        ~ E'[\(]{0,1}[0-9]{3}[\).-]{0,1}[[:space:]]*[0-9]{3}[.-]{0,1}[0-9]{4}'; 

-- set anything to single space that is not not a \ ( ) & * /,;. > < space or alpha numeric        
UPDATE notes set description = regexp_replace(description, E'[^\(\)\&\/,;\*\:.\>\<[:space:]a-zA-Z0-9-]', ' ') 
  WHERE description ~ E'[^\(\)\&\/,;\*\:.\<\>[:space:]a-zA-Z0-9-]';
  
-- replace high byte characters with space 
-- this is useful if you have your database in utf8 and you often need to use latin1 encoding
-- and you have a table that shouldn't have high byte characters 
-- such as junk you get from scraping websites
-- high byte characters don't convert down to latin1
UPDATE notes SET description = regexp_replace(description,E'[^\x01-\x7E]', ' ', 'g')
WHERE description ~ E'[^\x01-\x7E]';

Getting list of matches

These examples use similar to our destroy but show us in a table, a list of stuff that match. Here we use our favorite PostgreSQL regexp_matches function.



-- list first email address from each note --
SELECT note_id, 
  (regexp_matches(description, 
    E'[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]+'))[1]  As email
FROM notes
WHERE description ~ E'[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]+'
ORDER By note_id, email;

-- result
 note_id |             email
---------+-------------------------------
       1 | johnny@johnnydoessql.com
       2 | militantdba@johnnydoessql.com



--list all email addresses
-- note this uses the PostgreSQL 8.4 unnest construct 
-- to convert the array returned to a table
SELECT note_id, 
  unnest(
    regexp_matches(description, 
    E'[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]+', 'g')
  ) As email
FROM notes
WHERE description ~ E'[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]+'
ORDER By note_id, email;


-- returns
 note_id |             email
---------+-------------------------------
       1 | johnny@johnnydoessql.com
       1 | prissy@jilldoessql.com
       1 | prissy@johnnydoessql.com
       2 | militantdba@johnnydoessql.com

Parts of a Regular Expresssion

Here we will just cover what we consider the parts you need to know if you don't have the patience or memory to remember more. Regular expressions are much richer than our simplified view of things. There is the great backreferencing feature we won't get into which allows you to reference an expression and use it as part of your replacement expression.

Part Example
Classes [] Regular expression classes are a set of characters that you can treat as interchangeable. They are formed by enclosing the characters in a bracket. They can also have nested classes. For example [A-Za-z] will match any letter between A-Z and a-z. [A-Za-z[:space:]] will match those plus white spaces in PostgreSQL. If you need to match a regular expression character such as ( then you escape it with \. so [A-Za-z\(\)] will match A thru Z a thru z and ( ). Classes can contain other classes and expressions as members.
. The famous . matches any character. So the infamous .* means one or more of anything.
Quantity {} + * You denote quantities with {}, +, * + means 1 or more. * means 0 or more and {} to denote allowed quantity ranges. [A-Za-z]{1,5} means you can have between 1 and 5 alpha characters in and expression for it to be a match.
() This is how you denote a subexpression. A subexpression can be composed of multiple classes etc and can be backreferenced. They define a specific sequence of characters.
[^members here] This is the NOT operator in regular expressions so for example [^A-Za-z] will match any character that is not in the alphabet.
Special classes [[:alnum:]] any alphanumeric, [[:space:]] any white space character. There are others, but those are the most commonly used.

 

 

 

Atom Description
(re) (where re is any regular expression) matches a match for re, with the match noted for possible reporting
(?:re) as above, but the match is not noted for reporting (a "non-capturing" set of parentheses) (AREs only)
. matches any single character
[chars] a bracket expression, matching any one of the chars (see Section 9.7.3.2 for more detail)
\k (where k is a non-alphanumeric character) matches that character taken as an ordinary character, e.g., \\ matches a backslash character
\c where c is alphanumeric (possibly followed by other characters) is an escape, see Section 9.7.3.3 (AREs only; in EREs and BREs, this matches c)
{ when followed by a character other than a digit, matches the left-brace character {; when followed by a digit, it is the beginning of a bound (see below)
x where x is a single character with no other significance, matches that character

 

Quantifier Matches
* a sequence of 0 or more matches of the atom
+ a sequence of 1 or more matches of the atom
? a sequence of 0 or 1 matches of the atom
{m} a sequence of exactly m matches of the atom
{m,} a sequence of m or more matches of the atom
{m,n} a sequence of m through n (inclusive) matches of the atom; m cannot exceed n
*? non-greedy version of *
+? non-greedy version of +
?? non-greedy version of ?
{m}? non-greedy version of {m}
{m,}? non-greedy version of {m,}
{m,n}? non-greedy version of {m,n}

 

Escape Description
\d [[:digit:]]
\s [[:space:]]
\w [[:alnum:]_] (note underscore is included)
\D [^[:digit:]]
\S [^[:space:]]
\W [^[:alnum:]_] (note underscore is included)

来源:http://www.postgresonline.com/journal/archives/152-Regular-Expressions-in-PostgreSQL.html

 

posted on 2010-12-06 23:47  小司  阅读(3159)  评论(0编辑  收藏  举报