Some notes on extracting FDNS data

Broadly, the workflow is:

Download the FDNS file (Aria2)
Extract the records under cn, com, org, net, gov.cn and edu.cn and load them into the database

Below are some of the commands used during the extraction.

awk

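awk is a powerful text-processing tool. In short, it reads a file line by line, splits each line into fields using whitespace as the default separator, and then runs whatever processing you specify on those fields. awk normally treats one line of the file as its unit of work.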

Usage: awk 'pattern {action}' {filenames}

pattern is what awk looks for in the data, and action is the series of commands executed on each line that matches.

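1. With no action, the matching line is printed by default.

2. The -F option sets the input field separator, so each line is split on the given character.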

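3. awk -F ':' '/root/ {print FILENAME, NR, NF, $1, $2}' /etc/passwd

Explanation: read /etc/passwd line by line, select the lines that match the regular expression root (that is, lines containing "root"; the slashes only delimit the pattern), split them on ':', and print the file name, the current line number (NR), the number of fields in the line (NF), the first field and the second field. PS: $0 is the whole line.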

4. awk -F ':' '{print FILENAME, NR, NF, $1, $2}' /etc/passwd

Explanation: same as above but without a pattern, so the action runs on every line.

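5. awk -F ':' '{if (NR>10 && NR<20) print $1}' /etc/passwd

Explanation: no pattern; print the first field of lines 11 through 19 (use NR>=10 && NR<=20 to include lines 10 and 20). awk's logical AND is &&, not and, and the closing quote has to come before the file name.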

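6. Using printf instead of print can make the code more concise and readable: awk -F ':' '{printf("%s,%s,%s,%s\n", $1, $2, $3, $4) >> $5}' /etc/passwd. This appends the first four fields of each line to a file named after the fifth field. Note that a line whose fifth field is empty has no valid file name to write to.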
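For readers more at home in Python, the awk model described above can be sketched roughly as follows. This is only an illustration of how NR, NF and the fields relate to each line, not how awk is implemented; the function name awk_like and its parameters are made up for this note.

```python
import re

def awk_like(lines, sep=None, pattern=None):
    """Yield (NR, NF, fields) for each line that matches pattern.

    sep=None splits on runs of whitespace, like awk's default;
    pattern is a regular expression, or None to match every line.
    """
    for nr, line in enumerate(lines, start=1):   # NR: 1-based line number
        line = line.rstrip("\n")
        if pattern is not None and not re.search(pattern, line):
            continue
        fields = line.split(sep)                 # $1 is fields[0], $2 is fields[1], ...
        yield nr, len(fields), fields            # NF: number of fields in this line

sample = [
    "root:x:0:0:root:/root:/bin/bash\n",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n",
]

# Roughly: awk -F ':' '/root/ {print NR, NF, $1, $2}'
for nr, nf, fields in awk_like(sample, sep=":", pattern="root"):
    print(nr, nf, fields[0], fields[1])   # prints: 1 7 root x
```

Only the first sample line contains "root", so only that line reaches the action, just as in example 3.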

parallel

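--pipe splits the input (stdin) into blocks and hands them to several jobs that run in parallel across the CPUs. By default the output appears in the order the jobs finish, which is not necessarily the input order; add --keep-order (-k) if the output has to follow the original order.

--block sets the size of each block; the default is 1M.

zcat 20200221-fdns.json.gz | parallel --pipe --block 50M python demo.py splits the input into blocks of about 50M and processes them in parallel.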
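With --pipe, each job receives its block on standard input, so the script only has to read stdin and write to stdout. The original demo.py is not shown in these notes; the following is a hypothetical sketch of what such a worker could look like. It assumes each input line is a JSON object with name, type, value and timestamp keys, and it emits CSV rows for A records whose name falls under one of the suffixes listed in the workflow above. The key names, function names and CSV layout are all assumptions.

```python
import csv
import io
import json
import sys

# Suffixes from the workflow above; longer ones first so gov.cn wins over cn.
SUFFIXES = ("gov.cn", "edu.cn", "cn", "com", "org", "net")

def match_suffix(name):
    """Return the first suffix in SUFFIXES that name falls under, else None."""
    name = name.rstrip(".").lower()
    for suffix in SUFFIXES:
        if name == suffix or name.endswith("." + suffix):
            return suffix
    return None

def convert(line):
    """Turn one JSON line into a (name, value, timestamp) row, or None to skip it."""
    try:
        rec = json.loads(line)
    except ValueError:
        return None                                  # skip malformed lines
    if not isinstance(rec, dict) or rec.get("type") != "a":
        return None                                  # keep A records only (assumed key)
    name = str(rec.get("name", ""))
    if match_suffix(name) is None:
        return None
    return name, rec.get("value", ""), rec.get("timestamp", "")

def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read JSON lines from stdin and write matching rows as CSV to stdout."""
    writer = csv.writer(stdout, lineterminator="\n")
    for line in stdin:
        row = convert(line)
        if row is not None:
            writer.writerow(row)

# Demo on two in-memory lines; a real worker would simply call main().
demo = io.StringIO(
    '{"timestamp":"1582243200","name":"www.example.com","type":"a","value":"192.0.2.1"}\n'
    '{"timestamp":"1582243200","name":"www.example.io","type":"a","value":"192.0.2.2"}\n'
)
main(demo, sys.stdout)   # prints: www.example.com,192.0.2.1,1582243200
```

Because every block is processed independently, the worker keeps no state between lines, which is what makes it safe to run many copies side by side.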

zcat

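Writes the contents of a compressed file to standard output without unpacking it on disk. (The original compressed file is left unchanged.)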

zcat 20200221-fdns.json.gz | wc -l
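The same count can be done from Python with the standard gzip module, which also streams the data instead of writing a decompressed copy to disk. A minimal sketch; the function name count_lines is made up here, and the demo writes a small temporary .gz file rather than touching the real FDNS dump.

```python
import gzip
import os
import tempfile

def count_lines(path):
    """Count lines in a gzip file by streaming it, much like `zcat path | wc -l`.

    Unlike wc -l, a final line without a trailing newline is counted too.
    """
    with gzip.open(path, "rb") as fh:
        return sum(1 for _ in fh)

# Small self-contained demo on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as fh:
        fh.write('{"name": "a.example.com"}\n{"name": "b.example.com"}\n')
    print(count_lines(demo_path))   # prints 2
```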

pg_bulkload

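A bulk-loading tool for PostgreSQL that loads data from a file into a table. It is a separately installed extension, not part of core PostgreSQL.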

pg_bulkload -i ${working_dir}/${i} -o "TABLE=a_record_$i" -o "TYPE=CSV" -o "WRITER=PARALLEL" -d ${current_db} -Ufdns -l /tmp/a_record_${i}.log

I have not fully understood the options yet; look them up when needed: https://www.cnblogs.com/lottu/p/9319016.html

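SQL insert that updates when the row already exists

On insert, add the row if it does not exist yet; if it already exists, update the specified columns: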

insert into table_name(subdomain, domain, ip, timestamp) values('...') on conflict(subdomain) do update set ip=excluded.ip, timestamp=excluded.timestamp

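Explanation: when the subdomain already exists, its ip and timestamp columns are updated instead. The on conflict(subdomain) clause requires a unique constraint or unique index on subdomain.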
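The behaviour is easy to try out without a PostgreSQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in, since SQLite (3.24 and later) accepts the same on conflict ... do update syntax; this is a substitution for illustration, not the PostgreSQL setup used in these notes. The table name a_record and the sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The primary key gives subdomain the uniqueness that on conflict relies on.
conn.execute(
    "create table a_record ("
    " subdomain text primary key, domain text, ip text, timestamp integer)"
)

UPSERT = (
    "insert into a_record (subdomain, domain, ip, timestamp) "
    "values (?, ?, ?, ?) "
    "on conflict (subdomain) do update "
    "set ip = excluded.ip, timestamp = excluded.timestamp"
)

def upsert(conn, subdomain, domain, ip, timestamp):
    """Insert a row, or refresh ip and timestamp if the subdomain is already there."""
    conn.execute(UPSERT, (subdomain, domain, ip, timestamp))

# The second call hits the existing key and takes the update branch.
upsert(conn, "www.example.com", "example.com", "192.0.2.1", 1582243200)
upsert(conn, "www.example.com", "example.com", "192.0.2.9", 1582329600)

print(conn.execute("select * from a_record").fetchall())
```

After both calls the table should hold a single row carrying the second ip and timestamp; excluded refers to the row that was proposed for insertion.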

psql

1. Log in to PG on a remote server

psql -h 192.168.199.17 -p 5432 socweb postgres

2. Run the commands in a SQL file without logging in to the database first

psql -U postgres -d tcp_scans -f 1.sql

3. Create and drop PG databases directly from the shell

createdb -U postgres -O postgres abc  # -O sets the owner
dropdb -U postgres abc  # drops the abc database

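Some thoughts on forward and reverse indexes

(Not sure whether this is right; still to be verified.)

With an ordinary (forward) index on the subdomain column, only prefix queries such as select * from table where subdomain like 'abc%' can use the index; it does not help like '%abc'. For suffix queries to benefit as well, a reverse index on subdomain is needed, for example an index on reverse(subdomain) that is queried with reverse(subdomain) like 'cba%'. In PostgreSQL, a B-tree index is only used for like 'abc%' when the database uses the C locale or the index is created with the text_pattern_ops operator class.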
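The intuition can be illustrated outside the database. A B-tree index keeps values in sorted order, so all values sharing a prefix sit in one contiguous range that a binary search can locate, while values sharing a suffix are scattered. Sorting the reversed strings turns a suffix into a prefix. The sketch below mimics this with a sorted Python list and bisect; it is an analogy only, not how PostgreSQL is implemented, and all names are made up.

```python
import bisect

def prefix_range(sorted_keys, prefix):
    """Return the keys that start with prefix, found via binary search."""
    lo = bisect.bisect_left(sorted_keys, prefix)   # first key >= prefix
    hi = lo
    # Keys with the prefix are contiguous in sorted order, so scan until one differs.
    while hi < len(sorted_keys) and sorted_keys[hi].startswith(prefix):
        hi += 1
    return sorted_keys[lo:hi]

names = ["mail.example.com", "www.example.com", "www.example.org", "ftp.test.net"]

forward = sorted(names)                      # plays the role of an index on subdomain
backward = sorted(n[::-1] for n in names)    # plays the role of an index on reverse(subdomain)

# like 'www%': one contiguous range in the forward order
print(prefix_range(forward, "www"))
# prints: ['www.example.com', 'www.example.org']

# like '%.com': reverse the suffix and search the reversed order instead
hits = [k[::-1] for k in prefix_range(backward, ".com"[::-1])]
print(sorted(hits))
# prints: ['mail.example.com', 'www.example.com']
```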

A neat piece of PG database code

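import psycopg2
from psycopg2 import sql

def get_conn():
    return psycopg2.connect(host='host', database='database',
                            user='user', password='password')

def load_records(fp, table):
    # fp is an open file of "name,domain,ip,timestamp" lines; table is the target table.
    # The table name is quoted as an identifier and the values are passed as
    # bound parameters, so nothing from the file is spliced into the SQL text.
    query = sql.SQL(
        "insert into {} (name, domain, ip, timestamp) "
        "values (%s, %s, %s, %s) on conflict (name) do update "
        "set ip = excluded.ip, timestamp = excluded.timestamp"
    ).format(sql.Identifier(table))
    count = 0
    with get_conn() as conn:
        with conn.cursor() as curs:
            for line in fp:
                name, domain, ip, timestamp = line.strip().split(',')
                curs.execute(query, (name, domain, ip, int(timestamp)))
                count += 1
                if count % 1000 == 0:  # commit every 1000 rows
                    conn.commit()
        conn.commit()  # commit whatever is left from the last partial batch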

posted @ 2020-03-17 15:19 961897