14. Linux Shell Notes (7): Regular Expressions

Regular Expressions

A regular expression is a pattern template you define that a Linux utility (such as sed or gawk) uses to filter text.

Regular expressions are implemented by a regular expression engine.

In the Linux world, there are two popular regular expression engines:

The POSIX Basic Regular Expression (BRE) engine

The POSIX Extended Regular Expression (ERE) engine

Defining BRE Patterns

1. Plain text

$ echo "This is a test" | sed -n '/test/p'

This is a test

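A regular expression doesn't care where in the data stream the pattern occurs; all that matters is that the pattern matches text somewhere in the stream. Regular expression patterns are case sensitive, and spaces are treated like any other character.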
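The three points above (position doesn't matter, case does, spaces count) can be checked with a short sketch. The sample line and the variable names are illustrative and not part of the original notes.

```shell
# sed -n '/pattern/p' prints the line only when the pattern matches.
line="This is a test"

hit_mid=$(echo "$line" | sed -n '/is a/p')     # pattern sits mid-line; the space is matched literally
hit_case=$(echo "$line" | sed -n '/this/p')    # lower-case "this" never occurs, so nothing prints
hit_space=$(echo "$line" | sed -n '/is  a/p')  # two spaces in the pattern, one in the text: no match

echo "mid-line match: [$hit_mid]"
echo "wrong case:     [$hit_case]"
echo "extra space:    [$hit_space]"
```

Only the first pattern should print the line; the other two leave the brackets empty.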

2. Special characters

The special characters recognized by regular expressions are:

.*[]^${}\+?|()

Don't use these characters on their own in a text pattern. To treat one of them as an ordinary character, escape it with a backslash (\).
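A minimal sketch of escaping, using made-up sample strings. It compares an escaped and an unescaped dot, and matches a literal dollar sign and a literal backslash. printf is used instead of echo only so that the backslash in the sample text is printed unchanged by every shell.

```shell
# \$ matches a literal dollar sign instead of acting as the end-of-line anchor.
dollar=$(printf '%s\n' 'The cost is $4.00' | sed -n '/\$/p')

# \\ matches one literal backslash.
backslash=$(printf '%s\n' '\ is a special character' | sed -n '/\\/p')

# An unescaped dot matches any single character, so "4x00" passes.
loose=$(printf '%s\n' 'code 4x00' | sed -n '/4.00/p')

# An escaped dot requires a real period, so "4x00" no longer passes.
strict=$(printf '%s\n' 'code 4x00' | sed -n '/4\.00/p')

echo "dollar:    [$dollar]"
echo "backslash: [$backslash]"
echo "loose:     [$loose]"
echo "strict:    [$strict]"
```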

3. Anchor characters

1) The caret character (^) defines a pattern that starts at the beginning of a line of text in the data stream. If the pattern is located anywhere other than the start of the line of text, the regular expression pattern fails.

$ echo "Books are great" | sed -n '/^Book/p'

Books are great

2) The opposite of looking for a pattern at the start of a line is looking for it at the end of a line. The dollar sign ($) special character defines the end anchor. Add this special character after a text pattern to indicate that the line of data must end with the text pattern:

$ echo "This is a good book" | sed -n '/book$/p'

This is a good book
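The two anchors can also be combined in one pattern. The sketch below, with made-up sample lines, uses both anchors to match a line that consists of exactly the given text, and uses ^$ to match (and delete) empty lines.

```shell
lines=$(printf '%s\n' \
  'this is a test of using both anchors' \
  'I said this is a test' \
  'this is a test')

# Only the line that is exactly "this is a test" satisfies both anchors.
exact=$(printf '%s\n' "$lines" | sed -n '/^this is a test$/p')

# ^$ matches a line with nothing between its start and its end; d deletes it.
compact=$(printf 'line one\n\nline two\n' | sed '/^$/d')

echo "$exact"
echo "$compact"
```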

3) The dot special character is used to match any single character except a newline character. The dot must match a character, though; if there's no character in the place of the dot, the pattern fails.
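A small sketch of the dot, assuming three sample lines chosen for illustration. A space counts as a character, so " at" satisfies .at, while a line that begins with "at" has nothing in front of it for the dot to match.

```shell
matches=$(printf '%s\n' \
  'The cat is sleeping.' \
  'This test is at line two.' \
  "at ten o'clock we'll go home." | sed -n '/.at/p')

# The first two lines print ("cat" and " at"); the third does not.
echo "$matches"
```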

4) Character classes

Use square brackets to define a character class.

$ sed -n '/[ch]at/p' data6

The cat is sleeping.

That is a very nice hat.
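A common use of character classes is accepting either case for a letter. The sketch below is illustrative (the word "yes" and the variable names are assumptions, not from the notes). One class covers only the first letter; three classes cover every letter.

```shell
first_upper=$(echo "Yes" | sed -n '/[Yy]es/p')          # matches: Y is in the class
first_lower=$(echo "yes" | sed -n '/[Yy]es/p')          # matches: y is in the class
mixed_fail=$(echo "yEs" | sed -n '/[Yy]es/p')           # no match: E is not covered
mixed_ok=$(echo "yEs" | sed -n '/[Yy][Ee][Ss]/p')       # matches: each position has its own class

echo "[$first_upper] [$first_lower] [$mixed_fail] [$mixed_ok]"
```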

5) Negating character classes

$ sed -n '/[^ch]at/p' data6

This test is at line two.

6) Using ranges

You can use a range of characters within a character class by using the dash symbol.

Just specify the first character in the range, a dash, and then the last character in the range. The regular expression includes any character that's within the specified character range.

$ sed -n '/^[0-9][0-9][0-9][0-9][0-9]$/p' data8

60633

46201

45902
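A single character class may also hold more than one range. The sketch below assumes three sample lines and accepts any letter from a to c or from h to m in front of "at".

```shell
matches=$(printf '%s\n' \
  'The cat is sleeping.' \
  'That is a very nice hat.' \
  'This test is at line two.' | sed -n '/[a-ch-m]at/p')

# "cat" (c is in a-c) and "hat" (h is in h-m) match; " at" does not.
echo "$matches"
```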

7) Special character classes

BRE Special Character Classes

Class          Description
[[:alpha:]]    Match any alphabetical character, either upper or lower case.
[[:alnum:]]    Match any alphanumeric character: 0–9, A–Z, or a–z.
[[:blank:]]    Match a space or Tab character.
[[:digit:]]    Match a numerical digit from 0 through 9.
[[:lower:]]    Match any lower-case alphabetical character a–z.
[[:print:]]    Match any printable character.
[[:punct:]]    Match a punctuation character.
[[:space:]]    Match any whitespace character: space, Tab, NL, FF, VT, CR.
[[:upper:]]    Match any upper-case alphabetical character A–Z.
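The special classes are used like any other character class. A brief sketch with made-up input follows: [[:digit:]] selects lines containing a digit, and [[:punct:]] selects lines containing punctuation.

```shell
# Lines that contain at least one digit.
with_digit=$(printf '%s\n' 'abc' 'abc123' '123' | sed -n '/[[:digit:]]/p')

# Lines that contain at least one punctuation character (the comma here).
with_punct=$(printf '%s\n' 'This is a test' 'This is, a test' | sed -n '/[[:punct:]]/p')

echo "$with_digit"
echo "$with_punct"
```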

8) The asterisk

Placing an asterisk after a character signifies that the character must appear zero or more times in the text to match the pattern:

$ echo "ik" | sed -n '/ie*k/p'

ik
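The asterisk can follow a character class or a dot as well as a single character. The sketch below uses illustrative sample words: [ae]* allows any mix of a and e, and .* allows any run of characters between two pieces of text.

```shell
none=$(echo "bt" | sed -n '/b[ae]*t/p')              # zero characters from the class: matches
many=$(echo "baaeeet" | sed -n '/b[ae]*t/p')         # several a's and e's: matches
blocked=$(echo "baakeeet" | sed -n '/b[ae]*t/p')     # the k is outside the class: no match

# .* matches whatever sits between "regular" and "expression".
between=$(echo "this is a regular pattern expression" | sed -n '/regular.*expression/p')

echo "[$none] [$many] [$blocked]"
echo "$between"
```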

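Extended Regular Expressions

gawk supports ERE patterns. sed uses BRE by default (GNU sed accepts ERE patterns only when it is run with the -E or -r option).

1) The question mark

The question mark is similar to the asterisk, but with a slight twist. The question mark indicates that the preceding character can appear zero or one time, but that's all. It doesn't match repeating occurrences of the character:

$ echo "bt" | gawk '/be?t/{print $0}'

bt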
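To see the "zero or one" limit in action, the sketch below feeds three words through the same pattern. It calls plain awk rather than gawk, on the assumption that gawk may not be installed everywhere; for this pattern gawk behaves the same way.

```shell
# "bt" (no e) and "bet" (one e) pass; "beet" (two e's) does not.
result=$(printf '%s\n' bt bet beet | awk '/be?t/{print $0}')

echo "$result"
```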

2) The plus sign

The plus sign is another pattern symbol that's similar to the asterisk, but with a different twist than the question mark. The plus sign indicates that the preceding character can appear one or more times, but must be present at least once. The pattern doesn't match if the character is not present:

$ echo "beeet" | gawk '/be+t/{print $0}'

beeet
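A short sketch of the "at least once" rule, again with plain awk standing in for gawk (an assumption made for portability) and with made-up sample words. The second pattern applies the plus sign to a character class.

```shell
# "bt" fails because no e is present; "bet" and "beeet" pass.
single=$(printf '%s\n' bt bet beeet | awk '/be+t/{print $0}')

# With a class, any mix of a and e works as long as there is at least one.
class=$(printf '%s\n' bt bat beat | awk '/b[ae]+t/{print $0}')

echo "$single"
echo "$class"
```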

3) Braces

Curly braces are available in ERE to allow you to specify a limit on a repeatable regular expression. This is often referred to as an interval. You can express the interval in two formats:

m: The regular expression appears exactly m times.

m,n: The regular expression appears at least m times, but no more than n times.

This feature allows you to fine-tune exactly how many times you allow a character (or character class) to appear in a pattern.
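Both interval formats can be sketched as follows. grep -E is used here instead of gawk as an assumption: it accepts ERE intervals without any extra option, so the example does not depend on the gawk version. The sample words are made up.

```shell
words=$(printf '%s\n' bt bet beet beeet)

# {1}: exactly one e between b and t.
exactly_one=$(printf '%s\n' "$words" | grep -E 'be{1}t')

# {1,2}: one or two e's between b and t.
one_or_two=$(printf '%s\n' "$words" | grep -E 'be{1,2}t')

echo "$exactly_one"
echo "$one_or_two"
```

With gawk, the same patterns would go between slashes, for example /be{1,2}t/.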

4) The pipe symbol

The pipe symbol allows you to specify two or more patterns that the regular expression engine uses in a logical OR formula when examining the data stream. If any of the patterns match the data stream text, the text passes. If none of the patterns match, the data stream text fails.

The format for using the pipe symbol is:

expr1|expr2|...

$ echo "The cat is asleep" | gawk '/cat|dog/{print $0}'

The cat is asleep
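A sketch of the OR behavior on several lines at once, using plain awk as a stand-in for gawk (an assumption) and illustrative sample text. The whole pattern is quoted so that the shell does not treat the pipe symbol as a pipeline.

```shell
# Lines mentioning "cat" or "dog" pass; the "sheep" line matches neither.
matches=$(printf '%s\n' \
  'The cat is asleep' \
  'The dog is asleep' \
  'The sheep is asleep' | awk '/cat|dog/{print $0}')

echo "$matches"
```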

5) Grouping expressions

Regular expression patterns can also be grouped by using parentheses. When you group a regular expression pattern, the group is treated like a standard character. You can apply a special character to the group just as you would to a regular character.

$ echo "Sat" | gawk '/Sat(urday)?/{print $0}'

Sat

$ echo "Saturday" | gawk '/Sat(urday)?/{print $0}'

Saturday

$
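Groups are often combined with the pipe symbol and with anchors. The sketch below uses plain awk in place of gawk (an assumption) and made-up sample words. The first pattern builds words from two alternations. The second anchors the optional group so that only "Sat" and "Saturday" pass; without the anchors, any text containing "Sat" would match.

```shell
# (c|b)a(b|t): c or b, then a, then b or t.
combos=$(printf '%s\n' cat cab bat bab tab tac | awk '/(c|b)a(b|t)/{print $0}')

# Anchors make the optional group meaningful: "Saturn" is rejected.
days=$(printf '%s\n' Sat Saturday Saturn | awk '/^Sat(urday)?$/{print $0}')

echo "$combos"
echo "$days"
```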

A Few Examples

1) Counting directory files

$ cat countfiles

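#!/bin/bash
# count the number of files in each directory in your PATH
# replace each colon in PATH with a space so the for loop can split it
mypath=$(echo "$PATH" | sed 's/:/ /g')
count=0
for directory in $mypath
do
    check=$(ls "$directory")
    for item in $check
    do
        count=$(( count + 1 ))
    done
    echo "$directory - $count"
    # reset the counter before the next directory
    count=0
done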

$ ./countfiles

/usr/local/bin - 79

/bin - 86

/usr/bin - 1502

/usr/X11R6/bin - 175

/usr/games - 2

/usr/java/j2sdk1.4.1_01/bin - 27

$

2) Validating a phone number

$ cat isphone

#!/bin/bash

# script to filter out bad phone numbers

gawk --re-interval '/^\(?[2-9][0-9]{2}\)?(| |-|\.)[0-9]{3}( |-|\.)[0-9]{4}/{print $0}'

$

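Older versions of gawk (before 4.0) don't recognize regular expression intervals by default. On those versions you must specify the --re-interval command line option for gawk to recognize regular expression intervals.

The pattern is written to accept phone numbers in the following formats (the digits here are placeholders; the pattern itself requires the first digit to be 2 through 9):

(123)456-7890

(123) 456-7890

123-456-7890

123.456.7890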
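The pattern can be tried against a handful of numbers. This is a sketch under two assumptions: grep -E stands in for gawk, and the empty alternative (| |-|\.) is rewritten as the optional group ( |-|\.)?, which should be equivalent and is accepted by more regex implementations. The sample numbers and the variable name phone_re are illustrative.

```shell
phone_re='^\(?[2-9][0-9]{2}\)?( |-|\.)?[0-9]{3}( |-|\.)[0-9]{4}'

valid=$(printf '%s\n' \
  '317-555-1234' \
  '(317)555-1234' \
  '(317) 555-1234' \
  '317.555.1234' \
  '000-000-0000' \
  '123-456-7890' | grep -E "$phone_re")

# The first four numbers pass; the last two start with 0 or 1 and are filtered out.
echo "$valid"
```

The pattern has no end anchor, so extra text after the last four digits would still pass.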

3) Parsing an e-mail address

^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
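A sketch of the e-mail pattern in use, with grep -E standing in for gawk (an assumption). POSIX bracket expressions treat a backslash as a literal character, so the classes are rewritten here without backslashes and with the hyphen placed last; the set of accepted characters is intended to be the same letters, digits, underscore, hyphen, period, and plus sign. The sample addresses and the variable name email_re are illustrative.

```shell
email_re='^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$'

valid=$(printf '%s\n' \
  'rich@here.now' \
  'rich.blum@here.now' \
  'rich@here.n' \
  'rich@here-now' \
  'rich/blum@here.now' | grep -E "$email_re")

# Only the first two addresses pass: the others have a one-letter top-level
# domain, no period in the host part, or a slash in the user name.
echo "$valid"
```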

Reference:

1. Linux命令行和SHELL脚本编程 (Linux Command Line and Shell Scripting)

posted @ 2010-10-17 16:13  浪里飞