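Chapter 12: Regular Expressions and File Formatting
Basic Regular Expressions
How the Locale Affects Regular Expressions
Different locales may sort characters in a different order.
LANG=C: 0 1 2 ... A B C ... a b c ...
LANG=zh_CN: 0 1 2 ... a A b B ...
As a result, a range such as [A-Z] can match different characters depending on the locale.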
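One way to rule out locale surprises is to pin the locale for a single command. The sketch below assumes a POSIX grep and uses made-up sample lines; with LC_ALL=C, [A-Z] means exactly the ASCII uppercase letters.

```shell
# Force the C locale for this one command so [A-Z] means ASCII A..Z only.
# The three input lines are illustrative sample data, not from the notes.
upper_only=$(printf 'apple\nBanana\ncherry\n' | LC_ALL=C grep '[A-Z]')
echo "$upper_only"
```

Only the line containing an uppercase letter is selected.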
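Special symbol | Meaning |
[:alnum:] | Letters and digits: 0-9, A-Z, a-z |
[:alpha:] | Letters, upper and lower case |
[:blank:] | Space and Tab |
[:cntrl:] | Control characters such as CR, LF, Tab, and Del |
[:digit:] | Digits, 0-9 |
[:graph:] | Every visible character, that is, everything printable except the space |
[:lower:] | Lowercase letters |
[:print:] | Every printable character, including the space |
[:punct:] | Punctuation symbols such as " ' ? ; : # $ |
[:upper:] | Uppercase letters |
[:space:] | Any whitespace character: space, Tab, CR, LF, and so on |
[:xdigit:] | Hexadecimal digits: 0-9, A-F, a-f |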
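These class names only work inside a bracket expression, so they are written with double brackets, for example [[:digit:]]. A small sketch with made-up sample lines (LC_ALL=C is set only to keep the result independent of the locale):

```shell
# Illustrative sample data: four short lines.
sample=$(printf 'abc\nABC\n123\na1 b2\n')
# Count the lines that contain at least one digit.
with_digit=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[[:digit:]]')
# Select the lines made up only of uppercase letters.
all_upper=$(printf '%s\n' "$sample" | LC_ALL=C grep '^[[:upper:]]*$')
# Select the lines that contain a space or a Tab.
with_blank=$(printf '%s\n' "$sample" | LC_ALL=C grep '[[:blank:]]')
echo "$with_digit"
echo "$all_upper"
echo "$with_blank"
```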
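Advanced grep Options
Besides the basic usage covered in the previous chapter, grep has a few more advanced options.
grep [-A n] [-B n] [--color=auto] 'search string' filename
Options:
-A: followed by a number n; A stands for "after". Besides the matching line, the n lines after it are also printed
-B: followed by a number n; B stands for "before". Besides the matching line, the n lines before it are also printed
--color=auto: highlight the matched text in color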
//-n shows line numbers
[root@localhost 桌面]# dmesg | grep -n --color=auto 'eth'
1730:[ 10.210383] e1000 0000:02:01.0 eth0: (PCI:66MHz:32-bit) 00:0c:29:7f:dd:91
1731:[ 10.210404] e1000 0000:02:01.0 eth0: Intel(R) PRO/1000 Network Connection
Note: grep always prints the whole line that contains a match.
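The dmesg session uses only -n and --color. A minimal sketch of -A and -B, run on a made-up five-line log written to a temporary directory (the file name and contents are assumptions for illustration):

```shell
# Build a small illustrative log in a temporary directory.
tmpdir=$(mktemp -d)
printf 'boot start\nloading driver\neth0 link up\nassigning address\nboot done\n' > "$tmpdir/log.txt"
# Print the matching line plus one line before (-B1) and one line after (-A1).
context=$(grep -n -B1 -A1 'eth0' "$tmpdir/log.txt")
echo "$context"
rm -r "$tmpdir"
```

With -n, the matching line is marked with a colon after its number and the context lines with a hyphen.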
Basic Regular Expression Exercises
The practice text is shown below.
[root@localhost 桌面]# cat regular_express.txt
"Open Source" is a good mechanism to develop programs.
apple is my favorite food.
Football game is not use feet only.
this dress doesn't fit me.
However, this dress is about $ 3183 dollars.
GNU is free air not free beer.
Her hair is very beauty.
I can't finish the test.
Oh! The soup taste good.
motorcycle is cheap than car.
This window is clear.
the symbol '*' is represented as start.
Oh! My god!
The gd software is a library for drafting programs.
You are the best is mean you are the no. 1.
The world <Happy> is the same with "glad".
I like dog.
google is the best tools for search keyword.
goooooogle yes!
go! go! Let's go.
# I am VBird

[root@localhost 桌面]#
Exercise 1: Finding a specific string
//find the lines that contain "the"
[root@localhost 桌面]# grep -n 'the' regular_express.txt
8:I can't finish the test.
12:the symbol '*' is represented as start.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
18:google is the best tools for search keyword.
//find the lines that do not contain "the"
[root@localhost 桌面]# grep -vn 'the' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
4:this dress doesn't fit me.
5:However, this dress is about $ 3183 dollars.
6:GNU is free air not free beer.
7:Her hair is very beauty.
9:Oh! The soup taste good.
10:motorcycle is cheap than car.
11:This window is clear.
13:Oh! My god!
14:The gd software is a library for drafting programs.
17:I like dog.
19:goooooogle yes!
20:go! go! Let's go.
21:# I am VBird
22:
[root@localhost 桌面]#
Exercise 2: Matching a set of characters with brackets []
//find the strings "tast" or "test"
[root@localhost 桌面]# grep -n 't[ae]st' regular_express.txt
8:I can't finish the test.
9:Oh! The soup taste good.
//find "oo" that is not preceded by g
[root@localhost 桌面]# grep -n '[^g]oo' regular_express.txt
2:apple is my favorite food.
3:Football game is not use feet only.
18:google is the best tools for search keyword.
19:goooooogle yes!
//find digits
[root@localhost 桌面]# grep -n '[0-9]' regular_express.txt
5:However, this dress is about $ 3183 dollars.
15:You are the best is mean you are the no. 1.
//find "oo" that is not preceded by a lowercase letter
[root@localhost 桌面]# grep -n '[^[:lower:]]oo' regular_express.txt
3:Football game is not use feet only.
[root@localhost 桌面]#
Exercise 3: Line start and line end, ^ and $
//lines that start with "the"
[root@localhost 桌面]# grep -n '^the' regular_express.txt
12:the symbol '*' is represented as start.
//lines that start with a lowercase letter
[root@localhost 桌面]# grep -n '^[a-z]' regular_express.txt
2:apple is my favorite food.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
12:the symbol '*' is represented as start.
18:google is the best tools for search keyword.
19:goooooogle yes!
20:go! go! Let's go.
//lines that end with a period (the dot must be escaped)
[root@localhost 桌面]# grep -n '\.$' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
11:This window is clear.
12:the symbol '*' is represented as start.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
17:I like dog.
18:google is the best tools for search keyword.
20:go! go! Let's go.
//find empty lines
[root@localhost 桌面]# grep -n '^$' regular_express.txt
22:
[root@localhost 桌面]#
Exercise 4: Any character (.) and repetition (*)
. (dot): matches exactly one character, whatever it is
*: repeats the preceding character zero or more times
//find strings that start with g, end with d, and have exactly two characters in between
[root@localhost 桌面]# grep -n 'g..d' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
9:Oh! The soup taste good.
16:The world <Happy> is the same with "glad".
//find strings with at least two o's (two o's followed by zero or more o's)
[root@localhost 桌面]# grep -n 'ooo*' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!
[root@localhost 桌面]#
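Because * also allows zero repetitions, a pattern such as 'o*' matches every line, which is why the exercise writes 'ooo*' to require two o's. A short sketch on three made-up words:

```shell
# Illustrative sample data: one word per line.
lines=$(printf 'sky\ngod\ngood\n')
# 'o*' allows zero o's, so it matches every line, including "sky".
star_only=$(printf '%s\n' "$lines" | grep -c 'o*')
# 'oo*' needs one o followed by zero or more o's, that is, at least one o.
one_or_more=$(printf '%s\n' "$lines" | grep -c 'oo*')
# 'g.*d' is g, then any run of characters (possibly empty), then d.
g_any_d=$(printf '%s\n' "$lines" | grep -c 'g.*d')
echo "$star_only $one_or_more $g_any_d"
```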
Exercise 5: Limiting the repetition count with {}
In basic regular expressions the braces must be escaped as \{ and \}.
//find two consecutive o's
[root@localhost 桌面]# grep -n 'o\{2\}' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!
//find two to five consecutive o's
[root@localhost 桌面]# grep -n 'o\{2,5\}' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!
//find g, then two or more o's, then g
[root@localhost 桌面]# grep -n 'go\{2,\}g' regular_express.txt
18:google is the best tools for search keyword.
19:goooooogle yes!
[root@localhost 桌面]#
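For comparison, extended regular expressions (grep -E) take the braces without backslashes. The sketch below runs the same pattern both ways on three made-up words; the words are assumptions for illustration.

```shell
# Illustrative sample data: g, one to three o's, d.
words=$(printf 'god\ngood\ngoood\n')
# Basic RE: the braces need backslashes.
bre=$(printf '%s\n' "$words" | grep 'go\{2,3\}d')
# Extended RE (grep -E): the braces are written bare.
ere=$(printf '%s\n' "$words" | grep -E 'go{2,3}d')
echo "$bre"
if [ "$bre" = "$ere" ]; then echo "both forms select the same lines"; fi
```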
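Basic Regular Expression Characters
The five exercises above can be summarized as follows:
RE character | Meaning |
^word | The string to find is at the beginning of a line |
word$ | The string to find is at the end of a line |
. | Exactly one arbitrary character |
\ | Escape character; removes the special meaning of the next character |
* | Zero or more repetitions of the preceding character |
[list] | Any one character from the listed set |
[n1-n2] | Any one character in the range n1 to n2 |
[^list] | Any one character that is not in the listed set |
\{n,m\} | n to m repetitions of the preceding character |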
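As a combined example, the anchors from the table can strip empty lines and comment lines from a configuration file. This is a sketch on a made-up file (app.conf and its contents are assumptions):

```shell
# Write an illustrative config file into a temporary directory.
tmpdir=$(mktemp -d)
printf '# settings\n\nname=demo\n# end\nmode=test\n' > "$tmpdir/app.conf"
# Drop empty lines ('^$'), then drop lines that start with '#' ('^#').
active=$(grep -v '^$' "$tmpdir/app.conf" | grep -v '^#')
echo "$active"
rm -r "$tmpdir"
```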
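The sed Tool
sed is itself a pipe command: it reads data from standard input, and it can also replace, delete, insert, and select specific lines.
sed [-nefri] [action]
Options:
-n: quiet mode. By default every line from STDIN is printed to the screen; with -n, only the lines that a sed action explicitly handles are printed
-e: give the sed action directly on the command line
-f: read the sed actions from a file; -f filename runs the actions stored in filename
-r: use extended regular expressions in the action (basic regular expressions are the default)
-i: modify the file itself instead of printing to the screen
Action format: [n1[,n2]] function
n1 and n2 are optional; they usually give the line numbers the action applies to.
function is one of the following:
a: append; the string after a appears on a new line below the current line
c: change; the string after c replaces lines n1 through n2
d: delete
i: insert; the string after i appears on a new line above the current line
p: print
s: substitute; usually combined with regular expressions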
//the original text
[root@localhost 桌面]# cat -n test.txt
     1  this a test text!
     2  i like linux !
     3  today is monday!
     4  my name is fw.
     5
//delete lines 2-3
[root@localhost 桌面]# cat -n test.txt | sed '2,3d'
     1  this a test text!
     4  my name is fw.
     5
//delete line 3 and everything after it
[root@localhost 桌面]# cat -n test.txt | sed '3,$d'
     1  this a test text!
     2  i like linux !
//append (after the line)
[root@localhost 桌面]# cat -n test.txt | sed '2a this line is new'
     1  this a test text!
     2  i like linux !
this line is new
     3  today is monday!
     4  my name is fw.
     5
//insert (before the line)
[root@localhost 桌面]# cat -n test.txt | sed '2i this line is new'
     1  this a test text!
this line is new
     2  i like linux !
     3  today is monday!
     4  my name is fw.
     5
//replace the line
[root@localhost 桌面]# cat -n test.txt | sed '2c this line is new'
     1  this a test text!
this line is new
     3  today is monday!
     4  my name is fw.
     5
//show lines 2-4
[root@localhost 桌面]# cat -n test.txt | sed -n '2,4p'
     2  i like linux !
     3  today is monday!
     4  my name is fw.
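The -e option is not used in the session above. A minimal sketch of giving sed several actions at once, each with its own -e, on made-up input lines:

```shell
# Illustrative sample data: four lines.
text=$(printf 'line one\nline two\nline three\nline four\n')
# -n suppresses the default output; each -e adds one action.
# Together they print only line 2 and line 4.
picked=$(printf '%s\n' "$text" | sed -n -e '2p' -e '4p')
echo "$picked"
```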
Search and replace: sed 's/string to replace/new string/g'
The search string may be a regular expression.
//view the original text
[root@localhost 桌面]# cat -n test.txt
     1  this a test text!
     2  i like linux !
     3  today is monday!
     4  my name is fw.
     5
//replace "this" with "that"
[root@localhost 桌面]# cat -n test.txt | sed 's/this/that/g'
     1  that a test text!
     2  i like linux !
     3  today is monday!
     4  my name is fw.
     5
//replace a trailing ! with a period
[root@localhost 桌面]# cat -n test.txt | sed 's/!$/\./g'
     1  this a test text.
     2  i like linux .
     3  today is monday.
     4  my name is fw.
     5
//delete everything from the start of the line through "this" (the line number added by cat -n goes too)
[root@localhost 桌面]# cat -n test.txt | sed 's/^.*this//g'
 a test text!
     2  i like linux !
     3  today is monday!
     4  my name is fw.
     5
[root@localhost 桌面]#
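A common use of s with a regular expression is stripping comments. The sketch below works on made-up configuration lines; it also uses a regular expression as the address of d (/^$/d), a form these notes do not cover, to delete the lines left empty.

```shell
# Illustrative sample data: a setting with a trailing comment, a comment-only
# line, and a plain setting.
conf=$(printf 'port=22 # default\n# comment only\nuser=root\n')
# First sed: remove '#' and everything after it, plus any spaces before it.
# Second sed: delete the lines that are now empty.
cleaned=$(printf '%s\n' "$conf" | sed 's/ *#.*$//' | sed '/^$/d')
echo "$cleaned"
```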
Modifying the file directly:
Use the -i option.
//view the original file
[root@localhost 桌面]# cat test.txt
this a test text!
i like linux !
today is monday!
my name is fw.
//replace "this" with "that" and write the result back to the file
[root@localhost 桌面]# sed -i 's/this/that/g' test.txt
//view the file again
[root@localhost 桌面]# cat test.txt
that a test text!
i like linux !
today is monday!
my name is fw.
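Since -i overwrites the file, it can be worth keeping a copy. A sketch assuming GNU sed, where a suffix attached to -i (here .bak) saves the original under that extension; the temporary file is made up for illustration.

```shell
# Work on a throwaway copy in a temporary directory.
tmpdir=$(mktemp -d)
printf 'this a test text!\n' > "$tmpdir/test.txt"
# -i.bak edits test.txt in place and keeps the original as test.txt.bak
# (assumption: GNU sed; other sed implementations may differ).
sed -i.bak 's/this/that/g' "$tmpdir/test.txt"
edited=$(cat "$tmpdir/test.txt")
backup=$(cat "$tmpdir/test.txt.bak")
echo "$edited"
echo "$backup"
rm -r "$tmpdir"
```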
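Extended Regular Expressions
This part is skipped for now.
File Formatting and Related Processing
Formatted printing: printf
printf 'format' content
Options:
Special escape sequences in the format:
\a: alert (bell) sound
\b: backspace
\f: form feed
\n: new line
\r: carriage return
\t: horizontal Tab
\v: vertical Tab
\xNN: NN is a two-digit hexadecimal number; prints the character with that code
Common format specifiers, as in C:
%ns: n is a number and s stands for string; a string padded to n characters
%ni: n is a number and i stands for integer; an integer padded to n characters
%N.nf: N and n are numbers and f stands for float; N characters in total, n of them after the decimal point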
//view the original text
[root@localhost 桌面]# cat test.txt
Name Chinese English Math Average
Tom 80 60 92 77.33
Sherry 75 55 80 70.00
John 60 90 70 73.33
[root@localhost 桌面]# printf '%s\t %s\t %s\t %s\t %s\t \n' $(cat test.txt)
Name     Chinese         English         Math    Average
Tom      80      60      92      77.33
Sherry   75      55      80      70.00
John     60      90      70      73.33
[root@localhost 桌面]# printf '%10s %5i %5i %5i %8.3f \n' $(cat test.txt)
bash: printf: Chinese: invalid number
bash: printf: English: invalid number
bash: printf: Math: invalid number
bash: printf: Average: invalid number
      Name     0     0     0    0.000
       Tom    80    60    92   77.330
    Sherry    75    55    80   70.000
      John    60    90    70   73.330
//print the character whose hexadecimal code is 45
[root@localhost 桌面]# printf '\x45\n'
E
[root@localhost 桌面]#
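The session works because printf reuses its format string until all arguments are consumed, so one format handles every row. A small sketch with made-up values, also showing left alignment with a minus sign (an addition not used in these notes):

```shell
# The format is reused for each pair of arguments.
pairs=$(printf '%s=%s\n' a 1 b 2)
echo "$pairs"
# %-8s left-aligns a string in 8 columns; %5i right-aligns an integer in 5;
# %8.2f prints a float in 8 columns with 2 digits after the decimal point.
row=$(printf '%-8s|%5i|%8.2f|\n' Tom 80 77.333)
echo "$row"
```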
awk: A Handy Data-Processing Tool
awk 'condition1{action1} condition2{action2} ...' filename
[root@localhost 桌面]# last -n 5
root     pts/0        :0               Mon Jul 18 14:19   still logged in
root     :0           :0               Mon Jul 18 14:10   still logged in
(unknown :0           :0               Mon Jul 18 14:08 - 14:10  (00:01)
reboot   system boot  3.10.0-327.el7.x Mon Jul 18 14:08 - 16:00  (01:52)
root     pts/0        :0               Sun Jul 17 15:44 - crash  (22:23)

wtmp begins Mon Apr 25 13:36:45 2016
[root@localhost 桌面]# last -n 5 | awk '{print $1 "\t" $4}'
root     Mon
root     Mon
(unknown         Mon
reboot   3.10.0-327.el7.x
root     Sun

wtmp     Apr
[root@localhost 桌面]#
awk splits each line on spaces or tabs and assigns the pieces, in order, to the variables $1, $2, and so on.
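With the default separator, any run of spaces and tabs counts as a single break between fields. A one-line sketch on made-up input:

```shell
# Three spaces and a Tab each act as one field break.
picked=$(printf 'a   b\tc\n' | awk '{print $2 "-" $3}')
echo "$picked"
```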
awk built-in variables
NF: number of fields in the current line
NR: number of the line awk is currently processing
FS: the current field separator; the default is whitespace
[root@localhost 桌面]# last -n 5 | awk '{print $1 "\t lines:" NR "\t columns:" NF}'
root     lines:1         columns:10
root     lines:2         columns:10
(unknown         lines:3         columns:10
reboot   lines:4         columns:11
root     lines:5         columns:10
         lines:6         columns:0
wtmp     lines:7         columns:7
awk comparison operators
>: greater than
<: less than
>=: greater than or equal to
<=: less than or equal to
==: equal to
!=: not equal to
[root@localhost 桌面]# cat /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
//use ":" as the separator; it does not take effect on the first line, because FS is set only after that line has already been read and split
[root@localhost 桌面]# cat /etc/passwd | \
> awk '{FS=":"} $3<10 {print $1 "\t" $3}'
root:x:0:0:root:/root:/bin/bash
bin      1
daemon   2
adm      3
lp       4
sync     5
shutdown         6
halt     7
mail     8
//setting FS in a BEGIN block applies it before any line is read, so the first line is handled correctly too
[root@localhost 桌面]# cat /etc/passwd | \
> awk 'BEGIN {FS=":"} $3<10 {print $1 "\t" $3}'
root     0
bin      1
daemon   2
adm      3
lp       4
sync     5
shutdown         6
halt     7
mail     8
[root@localhost 桌面]#
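An alternative to the BEGIN block is the -F option, which sets the separator on the command line before any input is read. A sketch on two sample lines in passwd format (copied from the listing above, not read from the real /etc/passwd):

```shell
# Two sample records in /etc/passwd format.
passwd_sample=$(printf 'root:x:0:0:root:/root:/bin/bash\nftp:x:14:50:FTP User:/var/ftp:/sbin/nologin\n')
# -F: sets FS to ":" up front, so the first line is split correctly as well.
low_uid=$(printf '%s\n' "$passwd_sample" | awk -F: '$3 < 10 {print $1 "\t" $3}')
echo "$low_uid"
```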
Calculations in awk
//view the original text
[root@localhost 桌面]# cat pay.txt
Name 1st 2nd 3th
Tom 2300 3200 1200
Sherry 3400 1200 7400
//variables in awk are used directly, without a $ prefix; multiple statements inside one {} action are separated by ";"
[root@localhost 桌面]# cat pay.txt | \
> awk 'NR==1{printf "%10s %10s %10s %10s %10s \n",$1,$2,$3,$4,"Total"}
> NR>=2{total=$2+$3+$4;printf "%10s %10d %10d %10d %10.2f \n",$1,$2,$3,$4,total}'
      Name        1st        2nd        3th      Total
       Tom       2300       3200       1200    6700.00
    Sherry       3400       1200       7400   12000.00
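A variable can also accumulate across lines. The sketch below adds an END block, which runs once after the last line and is not covered in these notes, to print a grand total for the same pay data:

```shell
# The same data as pay.txt, supplied inline.
pay=$(printf 'Name 1st 2nd 3th\nTom 2300 3200 1200\nSherry 3400 1200 7400\n')
# Skip the header (NR >= 2), add up the three pay columns of every row,
# and print the sum once all lines have been read.
grand=$(printf '%s\n' "$pay" | awk 'NR >= 2 {sum += $2 + $3 + $4} END {print sum}')
echo "$grand"
```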
File Comparison Tools
diff
Compares similar files line by line.
diff [-bBi] fileA fileB
Options:
-b: ignore differences in the amount of whitespace within a line
-B: ignore differences in blank lines
-i: ignore case differences
[root@localhost 桌面]# vim fileA
[root@localhost 桌面]# cp fileA fileB
[root@localhost 桌面]# vim fileB
[root@localhost 桌面]# cat fileA
this is fileA


[root@localhost 桌面]# cat fileB
this is fileB

ok
[root@localhost 桌面]# diff fileA fileB
1,2c1
< this is fileA
<
---
> this is fileB
3a3
> ok
[root@localhost 桌面]#
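The options can be checked through diff's exit status, which is 0 when no differences are found. A sketch on two made-up one-line files that differ only in case and in the number of spaces:

```shell
# Two illustrative files: different case, different amount of whitespace.
tmpdir=$(mktemp -d)
printf 'Hello  World\n' > "$tmpdir/a.txt"
printf 'hello World\n' > "$tmpdir/b.txt"
# Plain diff reports a difference.
if diff "$tmpdir/a.txt" "$tmpdir/b.txt" >/dev/null; then plain="same"; else plain="different"; fi
# With -b (whitespace amount) and -i (case) the files compare as equal.
if diff -b -i "$tmpdir/a.txt" "$tmpdir/b.txt" >/dev/null; then relaxed="same"; else relaxed="different"; fi
echo "plain: $plain, with -b -i: $relaxed"
rm -r "$tmpdir"
```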
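patch
This command goes hand in hand with diff. Suppose fileA and fileB are two versions of the same file and you want to bring fileA up to date with fileB: first use diff to compare the two files and save the differences as a patch file, then use the patch file to update the old file.
patch -pN < patchFile <== apply the update
patch -R -pN < patchFile <== revert it
Options:
-p: the N that follows is the number of leading directory levels to strip from the file names in the patch
-R: revert, that is, apply the patch in reverse
[root@localhost 桌面]# cat fileA
this is fileA


[root@localhost 桌面]# cat fileB
this is fileB

ok
//create the patch file
[root@localhost 桌面]# diff -Naur fileA fileB > file.patch
[root@localhost 桌面]# cat file.patch
--- fileA 2016-07-18 16:36:24.371373349 +0800
+++ fileB 2016-07-18 16:37:31.523401652 +0800
@@ -1,3 +1,3 @@
-this is fileA
-
+this is fileB
 
+ok
//update the old file with the patch; the files are in the current directory, so N is 0
[root@localhost 桌面]# patch -p0 < file.patch
patching file fileA
[root@localhost 桌面]# cat fileA
this is fileB

ok
[root@localhost 桌面]#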
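The -R direction is not shown in the session above. The sketch below applies a patch and then reverts it in a temporary directory. The file names are made up, and the target file is named explicitly on the patch command line instead of relying on -pN, so patch does not have to pick a file from the patch header. The whole block is skipped when patch is not installed.

```shell
roundtrip="skipped"   # stays "skipped" when patch is not installed
if command -v patch >/dev/null 2>&1; then
  tmpdir=$(mktemp -d)
  (
    cd "$tmpdir" || exit 1
    printf 'version 1\n' > old.txt
    printf 'version 2\n' > new.txt
    # diff exits with status 1 when the files differ, so ignore its status.
    diff -u old.txt new.txt > change.patch || true
    # Apply: old.txt now has the content of new.txt.
    patch old.txt < change.patch >/dev/null
    cat old.txt > after_apply.txt
    # Revert: -R applies the patch in reverse and restores old.txt.
    patch -R old.txt < change.patch >/dev/null
    cat old.txt > after_revert.txt
  )
  roundtrip="$(cat "$tmpdir/after_apply.txt") / $(cat "$tmpdir/after_revert.txt")"
  rm -r "$tmpdir"
fi
echo "$roundtrip"
```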