Awk Basics [2]: Awk Built-in Variables

1、FS - Input Field Separator


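When awk processes a file, the default field separator is whitespace (any run of spaces or tabs). If the fields in your input file are separated by something else, specify the separator with the -F option, as shown below: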

awk -F ',' '{print $2, $3}' employee.txt

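We can also set the separator through awk's built-in variable FS. Set it in the BEGIN block so that it is in effect before the first record is read: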

awk 'BEGIN {FS=","} {print $2, $3}' employee.txt
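The default separator and an explicit one do not count fields the same way, which is easy to miss. The sketch below uses two made-up input lines (they are not from employee.txt) to show the difference: with the default FS, leading blanks are ignored and a run of blanks counts as one separator, while with FS="," every comma separates, so an empty field between two commas is kept.

```shell
# Default FS: leading blanks are skipped, runs of blanks act as one separator.
default_nf=$(printf '  a   b\n' | awk '{ print NF }')

# Explicit FS=",": each comma separates, so "a,,b" has an empty middle field.
comma_nf=$(printf 'a,,b\n' | awk 'BEGIN { FS = "," } { print NF }')

printf 'default FS: %s fields\ncomma FS:   %s fields\n' "$default_nf" "$comma_nf"
```

The first count is 2 and the second is 3.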

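We can also specify more than one field separator. For example, suppose we have the following file, in which each record uses three different field separators: a comma, a colon, and a percent sign: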

$ vi employee-multiple-fs.txt
101,John Doe:CEO%10000
102,Jason Smith:IT Manager%5000
103,Raj Reddy:Sysadmin%4500
104,Anand Ram:Developer%4500
105,Jane Miller:Sales Manager%3000

You can specify multiple field separators using a regular expression. For example, FS = "[,:%]" means that a comma, a colon, or a percent sign each acts as a field separator.

So, the following example will print the name and the title from the employee-multiple-fs.txt file that contains different field separators.

$ awk 'BEGIN {FS="[,:%]"} {print $2, $3}' \
employee-multiple-fs.txt
John Doe CEO
Jason Smith IT Manager
Raj Reddy Sysadmin
Anand Ram Developer
Jane Miller Sales Manager
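Because FS is a regular expression here, it can also absorb optional blanks around the separators. The following is a minimal sketch on a made-up record with spaces around each separator; the pattern " *[,:%] *" is my own variation, not part of the original file format.

```shell
# Hypothetical record with blanks around each separator.
record='101 , John Doe : CEO % 10000'

# " *[,:%] *" matches a separator together with any blanks on either side,
# so the fields come out already trimmed.
trimmed=$(printf '%s\n' "$record" |
    awk 'BEGIN { FS = " *[,:%] *" } { print $2 "|" $3 "|" $4 }')

echo "$trimmed"
```

The fields are joined with "|" only to make the field boundaries visible in the output.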

2、FIELDWIDTHS


 

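By default awk splits each record into fields on the character, string, or regular expression given by FS. You can instead split on fixed column widths by setting FIELDWIDTHS to a space-separated list of widths. FIELDWIDTHS is a gawk extension, so the examples below assume that awk is gawk. As the output shows, characters beyond the listed widths are dropped: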

$ echo abcdefghigk | awk 'BEGIN{FIELDWIDTHS="1 2"} {$1=$1;print $0}'
a bc
$ echo abcdefghigk | awk 'BEGIN{FIELDWIDTHS="1 2 3"} {$1=$1;print $0}'
a bc def
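If gawk is not available, the same fixed-width split can be approximated with the standard substr() function. This is a swapped-in technique rather than FIELDWIDTHS itself, and the offsets below are simply the running totals of the widths 1, 2, and 3.

```shell
# substr(s, start, length): positions are 1-based.
# Widths 1 2 3 start at columns 1, 2 and 4 respectively.
fixed=$(echo 'abcdefghigk' |
    awk '{ print substr($0, 1, 1), substr($0, 2, 2), substr($0, 4, 3) }')

echo "$fixed"
```

This prints the same three columns as the FIELDWIDTHS="1 2 3" example, at the cost of computing the start positions by hand.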

Reference: http://www.gnu.org/software/gawk/manual/html_node/Constant-Size.html

 

3、FPAT


 

Suppose we have the following CSV (comma-separated values) file:

$ cat addresses.csv
Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA

Notice that the address field ("1234 A Pretty Street, NE") itself contains a comma. If we split the input with FS=",", the address is broken into two fields:

"1234 A Pretty Street
 NE"

That is not what we want.

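For this kind of input we can use the built-in variable FPAT, which is a gawk extension. The value of FPAT is a regular expression that describes the contents of each field, rather than what separates the fields.

In the CSV file above, each field is either a string that contains no comma, or a string enclosed in a pair of double quotes.

So we can solve the problem like this: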

$ cat simple-csv.awk 
BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"
}

{
    print "NF = ", NF
    for (i = 1; i <= NF; i++) {
        printf("$%d = <%s>\n", i, $i)
    }
}
$ gawk -f simple-csv.awk addresses.csv
NF =  7
$1 = <Robbins>
$2 = <Arnold>
$3 = <"1234 A Pretty Street, NE">
$4 = <MyTown>
$5 = <MyState>
$6 = <12345-6789>
$7 = <USA>

As you can see, the address is now treated as a single field.

Reference: http://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html
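FPAT keeps the surrounding double quotes as part of the field. If you want the bare address, one possible approach is to strip them afterwards with substr(). The sketch below assumes gawk is installed (it is skipped otherwise) and uses a shortened version of the sample record.

```shell
if command -v gawk >/dev/null 2>&1; then
    addr=$(printf '%s\n' 'Robbins,Arnold,"1234 A Pretty Street, NE",MyTown' |
        gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
              {
                  field = $3
                  # Drop one leading and one trailing double quote, if both are present.
                  if (field ~ /^".*"$/)
                      field = substr(field, 2, length(field) - 2)
                  print field
              }')
    echo "$addr"
else
    echo 'gawk not found; FPAT is a gawk extension' >&2
fi
```

Note that this pattern requires at least one character per field, so it does not cope with empty fields such as a,,b.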

3、OFS - Output Field Separator


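OFS is the output field separator: awk places it between consecutive fields when printing. The default output field separator is a single space.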

When you use a single print statement to print two variables separated by a comma (as shown below), it prints the values of those two variables separated by a space.

$ awk -F ',' '{print $2, $3}' employee.txt
John Doe CEO
Jason Smith IT Manager
Raj Reddy Sysadmin
Anand Ram Developer
Jane Miller Sales Manager

The following print statement prints two fields ($2 and $3) separated by a comma in the source. The output, however, has a colon between them instead of a space, because OFS is set to a colon.

$ awk -F ',' 'BEGIN { OFS=":" } \
{ print $2, $3 }' employee.txt
John Doe:CEO
Jason Smith:IT Manager
Raj Reddy:Sysadmin
Anand Ram:Developer
Jane Miller:Sales Manager

When you specify a comma in the print statement between different print values, awk will use the OFS. In the following example, the default OFS is used, so you'll see a space between the values in the output.

$ awk 'BEGIN { print "test1","test2" }'
test1 test2

When you don't separate values with a comma in the print statement, awk will not use the OFS; instead it will print the values with nothing in between.

$ awk 'BEGIN { print "test1" "test2" }'
test1test2
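OFS also comes into play when awk rebuilds $0. Printing an untouched $0 reproduces the input line as it was read; assigning to any field, even with the no-op $1 = $1, makes awk reassemble the record using OFS. A small sketch on a made-up line:

```shell
# $0 is untouched, so the original commas are printed as-is.
unchanged=$(echo 'a,b,c' | awk 'BEGIN { FS = ","; OFS = ":" } { print $0 }')

# Assigning to a field forces awk to rebuild $0 with OFS between the fields.
rebuilt=$(echo 'a,b,c' | awk 'BEGIN { FS = ","; OFS = ":" } { $1 = $1; print $0 }')

printf 'unchanged: %s\nrebuilt:   %s\n' "$unchanged" "$rebuilt"
```

This is the reason the FIELDWIDTHS examples earlier use $1=$1 before print $0.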

4、RS - Record Separator


Suppose we have the following data file, which consists of a single line:

$ vi employee-one-line.txt
101,John Doe:102,Jason Smith:103,Raj Reddy:104,Anand Ram:105,Jane Miller

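In this file each record has two parts (an ID and a name). Records are separated by colons rather than newlines, and the two fields within a record are separated by a comma.

awk uses the newline as the record separator by default, so if you try to print just the names of all employees, the following does not work: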

$ awk -F, '{print $2}' employee-one-line.txt
John Doe:102

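This happens because awk treats the whole line as one record and splits it on commas, so the second field is John Doe:102. To process the line as 5 records, you need to set the record separator explicitly: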

$ awk -F, 'BEGIN { RS=":" } \
{ print $2 }' employee-one-line.txt
John Doe
Jason Smith
Raj Reddy
Anand Ram
Jane Miller
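One detail worth knowing: once RS is ":", the newline at the end of the line is no longer a record separator, so it stays attached to the last record and print emits an extra blank line after the last name. The sketch below uses a shortened, made-up version of the data and removes that trailing newline with sub(); this is one possible workaround, not the only one.

```shell
line='101,John Doe:102,Jason Smith'

# Two records, but three output lines: the last $2 still ends in "\n".
raw_count=$(printf '%s\n' "$line" |
    awk -F, 'BEGIN { RS = ":" } { print $2 }' |
    awk 'END { print NR }')

# Strip a trailing newline from the record before using its fields.
clean=$(printf '%s\n' "$line" |
    awk -F, 'BEGIN { RS = ":" } { sub(/\n$/, ""); print $2 }')

printf 'lines without the fix: %s\n%s\n' "$raw_count" "$clean"
```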

5、ORS - Output Record Separator


 

By default awk ends each output record with a newline. You can choose a different output record separator by setting the ORS variable:

$ awk 'BEGIN { FS=","; ORS="\n---\n" } \
{print $2, $3}' employee.txt
John Doe CEO
---
Jason Smith IT Manager
---
Raj Reddy Sysadmin
---
Anand Ram Developer
---
Jane Miller Sales Manager
---
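As the trailing --- above shows, ORS is written after every record, including the last one. That matters when ORS is used to join lines. The sketch below, on made-up input, joins three lines with ORS="," and then shows an alternative that avoids the trailing comma by printing the separator before every record except the first.

```shell
# ORS follows every record, so the result ends with a comma (and no newline).
joined=$(printf 'a\nb\nc\n' | awk 'BEGIN { ORS = "," } { print }')

# Alternative without ORS: emit the comma before each record after the first.
tidy=$(printf 'a\nb\nc\n' |
    awk '{ printf "%s%s", (NR > 1 ? "," : ""), $0 } END { print "" }')

printf 'with ORS:      %s\nwith printf:   %s\n' "$joined" "$tidy"
```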

6、NR - Number of Records


 

NR holds the number of records read so far. Inside the body block it is the number of the current record, which is the line number when RS has its default value. In the END block it is the total number of records read.

The following example shows how NR works in the body block and in the END block:

$ awk 'BEGIN {FS=","} \
{print "Emp Id of record number",NR,"is",$1;} \
END {print "Total number of records:",NR}' employee.txt
Emp Id of record number 1 is 101
Emp Id of record number 2 is 102
Emp Id of record number 3 is 103
Emp Id of record number 4 is 104
Emp Id of record number 5 is 105
Total number of records: 5
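NR is also useful as a pattern that selects records by position. A common case is skipping a header line. The sketch below assumes a made-up file with an id,name header, which employee.txt does not have.

```shell
# NR > 1 skips the first record (the header); NR - 1 renumbers the data rows.
numbered=$(printf 'id,name\n101,John Doe\n102,Jason Smith\n' |
    awk -F, 'NR > 1 { print NR - 1, $2 }')

echo "$numbered"
```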

7、FILENAME – Current File Name


 

FILENAME is helpful when you are specifying multiple input-files to the awk program. This will give you the name of the file Awk is currently processing.

$ awk '{ print FILENAME }' \
employee.txt employee-multiple-fs.txt
employee.txt
employee.txt
employee.txt
employee.txt
employee.txt
employee-multiple-fs.txt
employee-multiple-fs.txt
employee-multiple-fs.txt
employee-multiple-fs.txt
employee-multiple-fs.txt

8、FNR - File "Number of Record"


 

NR keeps growing across multiple files. When the body block starts processing the 2nd file, NR is not reset to 1; it continues from the last NR value of the previous file.

FNR gives you the record number within the current file. So, when awk finishes executing the body block for the 1st file and starts the body block for the next file, FNR starts from 1 again.

The following example shows both NR and FNR:

$ vi fnr.awk
BEGIN {
    FS = ","
}
{
    printf "FILENAME=%s NR=%s FNR=%s\n", FILENAME, NR, FNR
}
END {
    printf "END Block: NR=%s FNR=%s\n", NR, FNR
}
$ awk -f fnr.awk employee.txt employee-multiple-fs.txt
FILENAME=employee.txt NR=1 FNR=1
FILENAME=employee.txt NR=2 FNR=2
FILENAME=employee.txt NR=3 FNR=3
FILENAME=employee.txt NR=4 FNR=4
FILENAME=employee.txt NR=5 FNR=5
FILENAME=employee-multiple-fs.txt NR=6 FNR=1
FILENAME=employee-multiple-fs.txt NR=7 FNR=2
FILENAME=employee-multiple-fs.txt NR=8 FNR=3
FILENAME=employee-multiple-fs.txt NR=9 FNR=4
FILENAME=employee-multiple-fs.txt NR=10 FNR=5
END Block: NR=10 FNR=5
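Because FNR restarts at 1 for every file, the test FNR == 1 is a simple way to detect the start of a new input file. The sketch below creates two small throwaway files in a temporary directory (the names a.txt and b.txt and their contents are made up) and prints a header with FILENAME before each file's records.

```shell
tmp=$(mktemp -d)
printf '101,John Doe\n102,Jason Smith\n' > "$tmp/a.txt"
printf '103,Raj Reddy\n' > "$tmp/b.txt"

# FNR == 1 is true for the first record of each file, so the header
# is printed once per file; FNR then numbers the records within that file.
per_file=$(cd "$tmp" &&
    awk 'FNR == 1 { print "==", FILENAME, "==" }
         { print FNR ": " $0 }' a.txt b.txt)

echo "$per_file"
rm -rf "$tmp"
```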

 

Posted 2013-06-07 16:34 by 风*依旧