FOFA link crawler (fofa spider)

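I had been using someone else's FOFA scraping script from GitHub, but when I ran it a couple of days ago it could only fetch the first page of links. My guess is that FOFA changed some of its rules (or that I accidentally deleted some files and broke the script).

So I rewrote the FOFA scraping code myself. It is not pretty :(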

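The FOFA login page is https://i.nosec.org/login?service=https%3A%2F%2Ffofa.so%2Fusers%2Fservice

Logging in to FOFA differs from logging in to a typical site. After a successful login on nosec you hold only the nosec cookie, not a FOFA cookie, so FOFA still treats you as logged out. The FOFA cookie is only created once you also visit https://fofa.so/users/sign_in.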
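
To see why two sites are involved, it helps to decode the login URL: its `service` parameter appears to name the FOFA address the browser is sent back to after the nosec login. The sketch below uses only the standard library; `service_target` is a hypothetical helper name, not part of the script.

```python
from urllib.parse import urlsplit, parse_qs

LOGIN_URL = "https://i.nosec.org/login?service=https%3A%2F%2Ffofa.so%2Fusers%2Fservice"


def service_target(login_url):
    """Return the percent-decoded `service` URL embedded in the login URL."""
    query = urlsplit(login_url).query
    # parse_qs percent-decodes the values and returns a list per key
    return parse_qs(query)["service"][0]


print(service_target(LOGIN_URL))  # prints https://fofa.so/users/service
```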

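So I took a different approach: log in by adding the _fofapro_ars_session cookie by hand. After logging in to FOFA in the browser, you can read the _fofapro_ars_session value with F12 (the developer tools). This step is a bit of a hassle.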
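
A minimal sketch of that manual step, assuming the cookie value has already been copied out of the browser. `build_headers` is a hypothetical helper; the header layout mirrors the one used in the full script.

```python
def build_headers(session_value):
    """Build request headers carrying a manually copied FOFA session cookie."""
    if not session_value:
        # An empty cookie would silently produce logged-out responses
        raise ValueError("paste the _fofapro_ars_session value from the browser first")
    return {
        "Connection": "keep-alive",
        "Cookie": "_fofapro_ars_session=" + session_value,
    }
```

Failing early on an empty value is a design choice of this sketch: it avoids crawling pages that would only ever show logged-out results.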

With the session in place, the search input is base64-encoded, because the FOFA website itself base64-encodes whatever you type before running the search.
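
The encoding step on its own can be sketched as follows. `encode_query` is a hypothetical name; the logic is the same UTF-8 then base64 conversion the script performs.

```python
import base64


def encode_query(keyword):
    """Base64-encode a search keyword (UTF-8 bytes in, ASCII text out)."""
    return base64.b64encode(keyword.encode("utf-8")).decode("ascii")


print(encode_query("test"))  # prints dGVzdA==
```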

Then parse the result page to extract the links, and keep moving on to the next page.
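
The script does this with lxml and an XPath query. As a dependency-free illustration of the same idea, here is a sketch built on the standard library's `html.parser`. It assumes the result links are `<a target="_blank">` elements inside `<div class="list_mod_t">`, as in the script's XPath; `LinkExtractor`, `extract_links` and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values of <a target="_blank"> inside <div class="list_mod_t">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # div nesting depth inside a list_mod_t block (0 = outside)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1          # nested div inside the result block
            elif attrs.get("class") == "list_mod_t":
                self.depth = 1           # entering a result block
        elif tag == "a" and self.depth > 0 and attrs.get("target") == "_blank":
            href = attrs.get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


sample = (
    '<div class="list_mod_t"><a target="_blank" href="http://1.2.3.4">1.2.3.4</a></div>'
    '<div class="other"><a target="_blank" href="http://ignored">x</a></div>'
)
print(extract_links(sample))  # prints ['http://1.2.3.4']
```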

Note that FOFA has a mechanism against rapid crawling, so set a delay between requests; otherwise some of the IP addresses will be missed.
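
One way to structure the delay is to keep it separate from the fetching logic. In the sketch below, `crawl_pages`, `fetch_page` and the `sleep` parameter are hypothetical names; passing `sleep` in makes the loop easy to try out without actually waiting.

```python
import time


def crawl_pages(fetch_page, pages, delay=10, sleep=time.sleep):
    """Call fetch_page(n) for each page number, pausing `delay` seconds between calls."""
    results = []
    for index, page in enumerate(pages):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        results.extend(fetch_page(page))
    return results


# Dry run with a fake fetcher and a recorder in place of time.sleep
naps = []
links = crawl_pages(lambda n: ["url%d" % n], [1, 2, 3], delay=10, sleep=naps.append)
print(links)  # prints ['url1', 'url2', 'url3']
print(naps)   # prints [10, 10]
```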

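Once the search has run, the script first shows how many pages of results there are; how many of them to crawl is up to the user.

The code follows (with a pretty ASCII-art logo):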
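
A sketch of that page-count logic, assuming the pagination markup matches the regular expression used in the script (the last page number sits right before the `next_page` link). `parse_page_count`, `pages_to_crawl` and the sample HTML are hypothetical.

```python
import re


def parse_page_count(html):
    """Return the total page count, or None if the pagination link is not found."""
    found = re.findall(r'>(\d+)</a> <a class="next_page" rel="next"', html)
    return int(found[0]) if found else None


def pages_to_crawl(total_pages, stop_page):
    """Page numbers from 1 up to the user's stop page, capped at the total."""
    last = max(1, min(total_pages, stop_page))
    return range(1, last + 1)


sample = '<a href="/result?page=5">5</a> <a class="next_page" rel="next" href="/result?page=2">next</a>'
print(parse_page_count(sample))       # prints 5
print(list(pages_to_crawl(5, 3)))     # prints [1, 2, 3]
```

Returning `None` instead of raising leaves it to the caller to decide what a missing pagination link means (an invalid session, no results, or a single page).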

<code-pre class="code-pre" id="pre-tnsnT5"><code-line class="line-numbers-rows"></code-line># -*- coding:utf-8 -*-
<code-line class="line-numbers-rows"></code-line>import requests
<code-line class="line-numbers-rows"></code-line>from lxml import etree
<code-line class="line-numbers-rows"></code-line>import base64
<code-line class="line-numbers-rows"></code-line>import re
<code-line class="line-numbers-rows"></code-line>import time
<code-line class="line-numbers-rows"></code-line>
<code-line class="line-numbers-rows"></code-line>cookie = ''
<code-line class="line-numbers-rows"></code-line>
<code-line class="line-numbers-rows"></code-line>
<code-line class="line-numbers-rows"></code-line>def logo():
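<code-line class="line-numbers-rows"></code-line>    # Raw string: the banner contains backslashes that are not escape sequences
<code-line class="line-numbers-rows"></code-line>    print(r'''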
<code-line class="line-numbers-rows"></code-line>               
<code-line class="line-numbers-rows"></code-line>           
<code-line class="line-numbers-rows"></code-line>             /$$$$$$$$ /$$$$$$  /$$$$$$$$ /$$$$$$                                  
<code-line class="line-numbers-rows"></code-line>            | $$_____//$$__  $$| $$_____//$$__  $$                                 
<code-line class="line-numbers-rows"></code-line>            | $$     | $$  \ $$| $$     | $$  \ $$                                 
<code-line class="line-numbers-rows"></code-line>            | $$$$$  | $$  | $$| $$$$$  | $$$$$$$$                                 
<code-line class="line-numbers-rows"></code-line>            | $$__/  | $$  | $$| $$__/  | $$__  $$                                 
<code-line class="line-numbers-rows"></code-line>            | $$     | $$  | $$| $$     | $$  | $$                                 
<code-line class="line-numbers-rows"></code-line>            | $$     |  $$$$$$/| $$     | $$  | $$                                 
<code-line class="line-numbers-rows"></code-line>            |__/      \______/ |__/     |__/  |__/                                 
<code-line class="line-numbers-rows"></code-line>                                                                                   
<code-line class="line-numbers-rows"></code-line>                                                                                   
<code-line class="line-numbers-rows"></code-line>                                                                                   
<code-line class="line-numbers-rows"></code-line>                                /$$$$$$            /$$       /$$                   
<code-line class="line-numbers-rows"></code-line>                               /$$__  $$          |__/      | $$                   
<code-line class="line-numbers-rows"></code-line>                              | $$  \__/  /$$$$$$  /$$  /$$$$$$$  /$$$$$$   /$$$$$$
<code-line class="line-numbers-rows"></code-line>                              |  $$$$$$  /$$__  $$| $$ /$$__  $$ /$$__  $$ /$$__  $$
<code-line class="line-numbers-rows"></code-line>                               \____  $$| $$  \ $$| $$| $$  | $$| $$$$$$$$| $$  \__/
<code-line class="line-numbers-rows"></code-line>                               /$$  \ $$| $$  | $$| $$| $$  | $$| $$_____/| $$     
<code-line class="line-numbers-rows"></code-line>                              |  $$$$$$/| $$$$$$$/| $$|  $$$$$$$|  $$$$$$$| $$     
<code-line class="line-numbers-rows"></code-line>                               \______/ | $$____/ |__/ \_______/ \_______/|__/     
<code-line class="line-numbers-rows"></code-line>                                        | $$                                       
<code-line class="line-numbers-rows"></code-line>                                        | $$                                       
<code-line class="line-numbers-rows"></code-line>                                        |__/                                       
<code-line class="line-numbers-rows"></code-line>                               
<code-line class="line-numbers-rows"></code-line>                                                                                version:1.0
<code-line class="line-numbers-rows"></code-line>    ''')
<code-line class="line-numbers-rows"></code-line>
<code-line class="line-numbers-rows"></code-line>
<code-line class="line-numbers-rows"></code-line>def spider():
<code-line class="line-numbers-rows"></code-line>    header = {
<code-line class="line-numbers-rows"></code-line>        "Connection": "keep-alive",
<code-line class="line-numbers-rows"></code-line>        "Cookie": "_fofapro_ars_session=" + cookie,
<code-line class="line-numbers-rows"></code-line>    }
<code-line class="line-numbers-rows"></code-line>    search = input('please input your key: \n')
<code-line class="line-numbers-rows"></code-line>    searchbs64 = (str(base64.b64encode(search.encode('utf-8')), 'utf-8'))
<code-line class="line-numbers-rows"></code-line>    print("spider website is :https://fofa.so/result?&qbase64=" + searchbs64)
<code-line class="line-numbers-rows"></code-line>    html = requests.get(url="https://fofa.so/result?&qbase64=" + searchbs64, headers=header).text
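<code-line class="line-numbers-rows"></code-line>    # Raw string so \d is a regex digit class; \d+ avoids matching an empty number
<code-line class="line-numbers-rows"></code-line>    pagenum = re.findall(r'>(\d+)</a> <a class="next_page" rel="next"', html)
<code-line class="line-numbers-rows"></code-line>    if not pagenum:
<code-line class="line-numbers-rows"></code-line>        # Without this guard pagenum[0] raises IndexError when nothing matches
<code-line class="line-numbers-rows"></code-line>        print("page count not found, check the session cookie or the keyword")
<code-line class="line-numbers-rows"></code-line>        return
<code-line class="line-numbers-rows"></code-line>    print("have page: "+pagenum[0])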
<code-line class="line-numbers-rows"></code-line>    stop_page=input("please input stop page: \n")
<code-line class="line-numbers-rows"></code-line>    #print(stop_page)
<code-line class="line-numbers-rows"></code-line>    # Append results; the with-block closes the file even if a request fails
<code-line class="line-numbers-rows"></code-line>    with open("hello_world.txt", "a+") as doc:
<code-line class="line-numbers-rows"></code-line>        # +1 so the last page is included (range excludes its end value)
<code-line class="line-numbers-rows"></code-line>        for i in range(1, int(pagenum[0]) + 1):
<code-line class="line-numbers-rows"></code-line>            print("Now write " + str(i) + " page")
<code-line class="line-numbers-rows"></code-line>            pageurl = requests.get('https://fofa.so/result?page=' + str(i) + '&qbase64=' + searchbs64, headers=header)
<code-line class="line-numbers-rows"></code-line>            tree = etree.HTML(pageurl.text)
<code-line class="line-numbers-rows"></code-line>            urllist = tree.xpath('//div[@class="list_mod_t"]//a[@target="_blank"]/@href')
<code-line class="line-numbers-rows"></code-line>            for j in urllist:
<code-line class="line-numbers-rows"></code-line>                doc.write(j + "\n")
<code-line class="line-numbers-rows"></code-line>            if i == int(stop_page):
<code-line class="line-numbers-rows"></code-line>                break
<code-line class="line-numbers-rows"></code-line>            # Pause between pages so FOFA's anti-crawling mechanism is not triggered
<code-line class="line-numbers-rows"></code-line>            time.sleep(10)
<code-line class="line-numbers-rows"></code-line>    print("OK,Spider is End .")
<code-line class="line-numbers-rows"></code-line>
<code-line class="line-numbers-rows"></code-line>def start():
<code-line class="line-numbers-rows"></code-line>    print("Hello!My name is Spring bird.First you should make sure _fofapro_ars_session!!!")
<code-line class="line-numbers-rows"></code-line>    print("And time sleep is 10s")
<code-line class="line-numbers-rows"></code-line>
<code-line class="line-numbers-rows"></code-line>def main():
<code-line class="line-numbers-rows"></code-line>    logo()
<code-line class="line-numbers-rows"></code-line>    start()
<code-line class="line-numbers-rows"></code-line>    spider()
<code-line class="line-numbers-rows"></code-line>
<code-line class="line-numbers-rows"></code-line>if __name__ == '__main__':
<code-line class="line-numbers-rows"></code-line>    main()
</code-pre>

GitHub link: https://github.com/Cl0udG0d/Fofa-script

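I set the time.sleep() delay to 10 seconds; adjust it to suit your needs. Also, although the code base64-encodes the keyword, encoding problems sometimes keep the search from returning the expected results, and the page count in pagenum[0] comes back as 0. If changing the keyword does not help, run the search on the FOFA website yourself, copy the base64-encoded keyword from the URL, and use it as the value of searchbs64 in the code; the search is then bound to work. I will fix these shortcomings in the next revision. Let's go!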
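
One possible cause of the encoding problem, offered as a guess rather than a confirmed diagnosis: base64 output can contain `+`, `/` and `=`, and an unescaped `+` in a query string is commonly decoded as a space by the server. The sketch below percent-encodes the parameter with the standard library's `urlencode`; `build_search_url` is a hypothetical helper, and the exact URL shape FOFA expects is an assumption based on the URLs used in the script.

```python
import base64
from urllib.parse import urlencode


def build_search_url(keyword, page=None):
    """Build a result URL with the base64 keyword percent-encoded for the query string."""
    qbase64 = base64.b64encode(keyword.encode("utf-8")).decode("ascii")
    params = {"qbase64": qbase64}
    if page is not None:
        params = {"page": page, "qbase64": qbase64}
    # urlencode escapes '+', '/' and '=' so they survive query-string parsing
    return "https://fofa.so/result?" + urlencode(params)


print(build_search_url(">>>"))           # prints https://fofa.so/result?qbase64=Pj4%2B
print(build_search_url("test", page=2))  # prints https://fofa.so/result?page=2&qbase64=dGVzdA%3D%3D
```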



Author: 春告鳥
Link: https://www.cnblogs.com/Cl0ud/p/12384457.html
Copyright notice: unless otherwise stated, all articles on this blog are licensed under BY-NC-SA. Please credit the source when reposting!