【Writing sed / awk Scripts】

2018-08-07 15:56 ZealouSnesS

awk

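An awk program has three parts: a BEGIN block, pattern-matching rules, and an END block.

I usually define variables in the BEGIN block, use the pattern rules to match lines and do the parsing and counting, and print the results in the END block.

Overall structure: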

awk 'BEGIN{xxxx;xxxx;}{xxxx;xxxx;} /pattern/{xxxx;xxxx;} END{xxxx;xxxx;}' input_file
or
other_command | awk 'BEGIN{xxxx;xxxx;}{xxxx;xxxx;} /pattern/{xxxx;xxxx;} END{xxxx;xxxx;}'
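As a minimal sketch of this three-part layout, the following counts how many input lines contain the word error. The sample input and the variable names (errors, total, report) are made up for illustration.

```shell
# Hypothetical sample input: four log-like lines, two of which contain "error".
report=$(printf 'ok\nerror: disk\nok\nerror: net\n' | awk '
BEGIN   { errors = 0; total = 0 }    # runs once, before any input is read
        { total++ }                  # no pattern: runs for every line
/error/ { errors++ }                 # pattern rule: runs only for matching lines
END     { print errors "/" total }   # runs once, after the last line
')
echo "$report"
```

With this input the command prints 2/4.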

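Example 1:

Extract ID card numbers and mobile phone numbers from the web page test.html. Input: an HTML file. Output: a single line containing the page URL, the number of ID card numbers, the number of phone numbers, the phone numbers found, and the ID card numbers found.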

cat test.html \
  | sed -e 's/\(^\|[^0-9]\)\(13[0-9][0-9]\{8\}\|14[579][0-9]\{8\}\|15[0-35-9][0-9]\{8\}\|16[6][0-9]\{8\}\|17[0135678][0-9]\{8\}\|18[0-9][0-9]\{8\}\|19[89][0-9]\{8\}\)\($\|[^0-9]\)/\nfind_phone:\2\n/g' \
  | sed -e 's/\(^\|[^0-9]\)\([0-9]\{6\}[1-2][0-9]\{3\}\(\(0[1-9]\)\|\(10\|11\|12\)\)\(\([0-2][1-9]\)\|10\|20\|30\|31\)[0-9]\{3\}[0-9Xx]\)\($\|[^0-9]\)/\nfind_idcard:\2\n/g' \
  | awk 'BEGIN { OFS=":"; idcard_count=0; phone_count=0 }
    # The page URL is expected on a line containing "Url" within the first ten lines.
    /Url/         { if (NR<10) url=$3 }
    # Each sed stage puts every match on its own line, prefixed with a marker.
    /find_phone/  { phone_count++;  phone_list[phone_count]=$1 }
    /find_idcard/ { idcard_count++; idcard_list[idcard_count]=$1 }
    END {
        if (idcard_count!=0 || phone_count!=0) {
            printf "%s", url
            for (i=0; i<length(url)/7; i++) printf "\t"
            printf "%d\t%d\t", idcard_count, phone_count
            for (i=1; i<=phone_count; i++)  printf "%s\t", phone_list[i]
            for (i=1; i<=idcard_count; i++) printf "%s\t", idcard_list[i]
            print ""
        }
    }'
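The core trick in Example 1 is to have sed move every match onto its own line behind a marker, so that awk can count the marker lines. Here is a reduced sketch of that idea, assuming GNU sed (needed for \| and for \n in the replacement). The 11-digit pattern is deliberately simplified and the input line is made up.

```shell
# Hypothetical input line; the phone pattern is a simplification, not the full rule set.
line='call 13611112222 or write'
phones=$(printf '%s\n' "$line" \
  | sed 's/\(^\|[^0-9]\)\(1[3-9][0-9]\{9\}\)\($\|[^0-9]\)/\nfind_phone:\2\n/g' \
  | awk -F: '/^find_phone/ { n++; last = $2 } END { print n, last }')
echo "$phones"
```

Here sed produces three lines (call, find_phone:13611112222, or write) and awk reports the count and the last number found: 1 13611112222.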

Example 2:

num.txt:

1
2
3

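Addition: cat num.txt | awk '{sum += $1} END {print sum}'
Output: 6

Example 3:

Input: the output of Example 1. Output: the URL, whether an ID card number was found, whether a phone number was found, one matched ID card number, and one matched phone number.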

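Running it:
Write the output to a file
[spider@]$ cat sample1.output | sh test_filter.sh > sample3.output
Print to the console and page through it with less
[spider@]$ cat sample1.output | sh test_filter.sh | less
The less command:
less works much like more: both page through the contents of a text file. The difference is that less lets you scroll both backward and forward, whereas more traditionally only moves forward. While less is displaying a file, press PageUp to go up a page and PageDown to go down a page. Press Q to quit.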


#!/bin/bash

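awk '{
    # Per-line counters and the last match of each kind (reset for every input line).
    id_num=0;
    phone_num=0;
    temp_id="NULL";
    temp_phone="NULL";
    # Fields 1-3 are the URL and the two counts; matches start at field 4.
    # Use i<=NF so that the last field is checked too.
    for(i=4;i<=NF;i++){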
        if($i~/find_phone/) {
            phone_num++;
            temp_phone=$i;
        }
        if($i~/find_idcard/) {
            if(!($i~/find_idcard:20/)) {
                id_num++;
                temp_id=$i;
            }
        }
    }
    
    if(id_num>0){
        has_id="true"
    }else{
        has_id="false"
    }
    if(phone_num>0){
        has_phone="true"
    }else{
        has_phone="false"
    }

    print substr($1,1,length($1)-1)"\t"has_id"\t"has_phone"\t"temp_id"\t"temp_phone
}' | awk '{
    if($1!="NULL" && ($2=="true" || $3=="true")) {
        print $0
    }
}'

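The substr call strips the trailing ^M (carriage return) from the URL string. Without this, the ^M moves the cursor back to the start of the line and the rest of the output overwrites the URL.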
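To see the effect in isolation, here is a small sketch that feeds awk one line with a Windows (CRLF) line ending. Unlike the script above, it removes the last character only when it really is a carriage return. The URL is a made-up example.

```shell
# A line ending in \r\n: the carriage return stays attached to the last field.
clean=$(printf 'http://example.com/a\r\n' | awk '{
    url = $1
    if (substr(url, length(url)) == "\r")       # last character is ^M
        url = substr(url, 1, length(url) - 1)   # keep everything before it
    print url "|" length(url)
}')
echo "$clean"
```

This prints http://example.com/a|20, which shows that the 21st character (the ^M) is gone.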

1. The main difficulty in practice is parsing and printing different lines in different ways:

http://bbs.chinaunix.net/thread-4186958-1-1.html

2. Printing line numbers and whole lines

awk '{print NR}' filename   # print the line number of each line

awk '{print $0}' filename   # print each whole line
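NR and $0 can be combined to number the lines of the input. A minimal sketch on two made-up lines:

```shell
# NR is the current line number, $0 is the whole current line.
numbered=$(printf 'alpha\nbeta\n' | awk '{ print NR ": " $0 }')
echo "$numbered"
```

The output is 1: alpha followed by 2: beta.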

3. if statements

[chengmo@localhost nginx]# awk 'BEGIN{ 
test=100;
if(test>90)
{
    print "very good";
}
else if(test>60)
{
    print "good";
}
else
{
    print "no pass";
}
}'

very good

More on if/for/while/do:

https://www.cnblogs.com/chengmo/archive/2010/10/04/1842073.html

4. awk variables (including arrays) can be used directly without being declared

https://www.cnblogs.com/pangbing/p/7015745.html
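A short sketch of what that means in practice: the array seen below is never declared or initialized, yet it can be incremented and read. The input words and the name seen are chosen only for illustration.

```shell
# Count how often each word appears; unset entries behave like 0 (or the empty string).
counts=$(printf 'get\npost\nget\nget\n' | awk '
{ seen[$1]++ }     # seen[] and its elements are created on first use
END { print "get=" seen["get"], "post=" seen["post"], "put=" seen["put"] + 0 }
')
echo "$counts"
```

This prints get=3 post=1 put=0. Adding 0 to the never-assigned seen["put"] forces it to be treated as a number.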

5. Getting the length of a string in awk: length(variable_name)

http://bbs.chinaunix.net/thread-271694-1-1.html
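For example, length can be used as a filter on its own. The sketch below keeps only the lines that are exactly 11 characters long, which is the length of a mobile phone number. The three input lines are sample values.

```shell
# A bare condition with no action prints the matching lines.
eleven=$(printf '13611112222\n1361111\n370785199507319527\n' | awk 'length($0) == 11')
echo "$eleven"
```

Only 13611112222 is printed. The other two lines are 7 and 18 characters long.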

6. Counting the lines in a file

[spider@zhangsuosheng]$ cat test2.html 
dddd
bbb131102198910084421ccc eee13611112222fff13133334444
h15855556666j
aaaa
 13177778888 
13199990000 
 18611112222
370785199507319527
[spider@zhangsuosheng]$ cat test2.html |awk 'BEGIN{counter=0;}{counter++;}END{print counter;}'
8
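The explicit counter is not strictly needed, because awk already tracks the number of lines read in NR. A shorter sketch of the same count, on three made-up lines:

```shell
# In the END block, NR holds the number of the last line read.
lines=$(printf 'a\nb\nc\n' | awk 'END { print NR }')
echo "$lines"
```

This prints 3.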

sed

Use sed to repeat every line of a file ten times

sed 's/\(.*\)/\1\n\1\n\1\n\1\n\1\n\1\n\1\n\1\n\1\n\1/g' > ${file}
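The same duplication can also be written as an awk loop, which avoids spelling out \1 once per copy. The following is a sketch of that alternative, not the sed approach used in this post. It repeats each line three times on made-up input to keep the output short.

```shell
# print with no arguments prints the current line; the loop runs it three times per line.
tripled=$(printf 'x\ny\n' | awk '{ for (i = 0; i < 3; i++) print }')
echo "$tripled"
```

Changing the 3 to 10 gives ten copies of each line.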

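Repeat every line of 1.txt, 2.txt, 3.txt, and 4.txt in the parent directory ten times and write the result to the file of the same name in the current directory

Shell array iteration: the for loop over file_list

Shell string concatenation: the assignment to temp

Extracting the matched part with sed: wrap what you want to capture in \( \), then refer to the captured groups as \1, \2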

#!/bin/bash

file_list=('1.txt' '2.txt' '3.txt' '4.txt' )

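for file in "${file_list[@]}"
do
    # String concatenation: prefix the file name with the parent directory.
    temp='../'"${file}"
    # Repeat each line ten times and write the result into the current directory.
    sed 's/\(.*\)/\1\n\1\n\1\n\1\n\1\n\1\n\1\n\1\n\1\n\1/g' "${temp}" > "${file}"
    ../data_tools shuffle -if1 "${file}" -of1 "${file}"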
done
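Capture groups are not limited to repeating the whole line. As a small sketch, the command below captures the three parts of a date with \( \) and puts them back in a different order using \1, \2, and \3. The date is sample input.

```shell
# Three groups: year, month, day. The replacement reorders them as day/month/year.
swapped=$(printf '2018-08-07\n' | sed 's/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)/\3\/\2\/\1/')
echo "$swapped"
```

This prints 07/08/2018.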

Use sed to append a string to the end of lines that match a given string

Note that sed regular expressions are greedy: by default they match the longest possible string, not the shortest

Append '000' to the end of every line containing test in a txt file (file.txt below is a placeholder name)

sed -i '/test/ s/$/000/' file.txt
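To try the address-plus-substitution form without modifying a file, the same expression can be run on standard input without -i. A sketch on two made-up lines:

```shell
# /test/ selects the lines; s/$/000/ appends 000 at the end of each selected line.
appended=$(printf 'test one\nother\n' | sed '/test/ s/$/000/')
echo "$appended"
```

Only the first line changes: the output is test one000 followed by other.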
