FOFA Link Spider: crawling FOFA search results (fofa spider)
I had been using someone else's FOFA-scraping script from GitHub, but a couple of days ago it would only fetch the first page of results. My guess is that FOFA changed some of its rules (or that I accidentally deleted part of the files and broke it).
So I rewrote the FOFA crawler myself. The code is not pretty :(
FOFA's login works differently from most sites. The login page is https://i.nosec.org/login?service=https%3A%2F%2Ffofa.so%2Fusers%2Fservice, and after you log in successfully on nosec you only hold nosec's cookie, not FOFA's, so fofa.so still treats you as logged out. You have to visit https://fofa.so/users/sign_in afterwards before the FOFA cookie is generated.
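Roughly, that cookie handoff looks like the sketch below using requests.Session. The nosec login itself is only hinted at in a comment (its form fields are not covered in this post), so treat this as an illustration rather than something the final script uses:

```python
import requests

s = requests.Session()

# Assume the session has already authenticated against i.nosec.org
# (for example by replaying its login form, which is not covered here).
# At this point s.cookies only holds nosec cookies, so fofa.so still
# treats the session as a guest.

# Visiting FOFA's sign-in endpoint with the nosec session is the step
# that makes FOFA issue its own session cookie.
s.get("https://fofa.so/users/sign_in")
print(s.cookies.get_dict())  # should now include a fofa.so session cookie
```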
So I went with a different approach: logging in by manually supplying the _fofapro_ars_session cookie. You can read _fofapro_ars_session in the browser DevTools (F12) after logging in to FOFA normally; this step is a bit tedious.
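Before crawling it is worth making sure the copied value is still accepted. A minimal sanity check could look like this; the idea that a stale cookie gets bounced back to the nosec login page is my own heuristic, not something FOFA documents:

```python
import base64
import requests

# Hypothetical check that the manually copied session value still works
cookie = "paste the _fofapro_ars_session value copied from DevTools (F12) here"
headers = {"Cookie": "_fofapro_ars_session=" + cookie}

query = base64.b64encode('title="beijing"'.encode("utf-8")).decode("utf-8")
resp = requests.get("https://fofa.so/result?qbase64=" + query, headers=headers)

# If the session is stale, FOFA tends to send you back to the nosec login
# page instead of showing results; checking for that URL is just a guess.
print(resp.status_code, "i.nosec.org/login" not in resp.text)
```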
With that session cookie added, the script base64-encodes whatever you type in, because that is exactly what the FOFA site does when you search: your query is base64-encoded and then used for the search.
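For example, the encoding amounts to no more than this (the query string here is just an illustration):

```python
import base64

search = 'app="nginx"'  # any FOFA search syntax works the same way
searchbs64 = base64.b64encode(search.encode("utf-8")).decode("utf-8")
print(searchbs64)                                      # YXBwPSJuZ2lueCI=
print("https://fofa.so/result?qbase64=" + searchbs64)  # the URL the spider requests
```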
It then parses the result page, extracts the matching links, and keeps following the next page.
Note that FOFA throttles rapid crawling, so the spider sleeps a little between pages; otherwise some of the returned IP addresses get missed.
Once the query returns, the script first shows how many result pages there are, and the user decides how many pages to actually crawl.
The code is below (complete with a nice big ASCII-art logo):
```python
# -*- coding:utf-8 -*-
import requests
from lxml import etree
import base64
import re
import time

# Paste the _fofapro_ars_session value copied from DevTools (F12) here
cookie = ''


def logo():
    print(r'''
 /$$$$$$$$ /$$$$$$  /$$$$$$$$ /$$$$$$
| $$_____//$$__  $$| $$_____//$$__  $$
| $$     | $$  \ $$| $$     | $$  \ $$
| $$$$$  | $$  | $$| $$$$$  | $$$$$$$$
| $$__/  | $$  | $$| $$__/  | $$__  $$
| $$     | $$  | $$| $$     | $$  | $$
| $$     |  $$$$$$/| $$     | $$  | $$
|__/      \______/ |__/     |__/  |__/

  /$$$$$$            /$$       /$$
 /$$__  $$          |__/      | $$
| $$  \__/  /$$$$$$  /$$  /$$$$$$$  /$$$$$$   /$$$$$$
|  $$$$$$  /$$__  $$| $$ /$$__  $$ /$$__  $$ /$$__  $$
 \____  $$| $$  \ $$| $$| $$  | $$| $$$$$$$$| $$  \__/
 /$$  \ $$| $$  | $$| $$| $$  | $$| $$_____/| $$
|  $$$$$$/| $$$$$$$/| $$|  $$$$$$$|  $$$$$$$| $$
 \______/ | $$____/ |__/ \_______/ \_______/|__/
          | $$
          | $$
          |__/

                                  version:1.0
''')


def spider():
    # Every request carries the manually supplied FOFA session cookie
    header = {
        "Connection": "keep-alive",
        "Cookie": "_fofapro_ars_session=" + cookie,
    }
    search = input('please input your key: \n')
    # FOFA expects the query base64-encoded in the qbase64 URL parameter
    searchbs64 = str(base64.b64encode(search.encode('utf-8')), 'utf-8')
    print("spider website is :https://fofa.so/result?&qbase64=" + searchbs64)
    html = requests.get(url="https://fofa.so/result?&qbase64=" + searchbs64, headers=header).text
    # Pull the total page count out of the pagination bar
    pagenum = re.findall(r'>(\d*)</a> <a class="next_page" rel="next"', html)
    print("have page: " + pagenum[0])
    stop_page = input("please input stop page: \n")
    doc = open("hello_world.txt", "a+")
    for i in range(1, int(pagenum[0]) + 1):  # + 1 so the last page is included
        print("Now write " + str(i) + " page")
        pageurl = requests.get('https://fofa.so/result?page=' + str(i) + '&qbase64=' + searchbs64,
                               headers=header)
        tree = etree.HTML(pageurl.text)
        # Each result entry links its target through an <a target="_blank"> tag
        urllist = tree.xpath('//div[@class="list_mod_t"]//a[@target="_blank"]/@href')
        for j in urllist:
            doc.write(j + "\n")
        if i == int(stop_page):
            break
        # Throttle requests so FOFA's anti-crawling protection does not drop pages
        time.sleep(10)
    doc.close()
    print("OK,Spider is End .")


def start():
    print("Hello!My name is Spring bird.First you should make sure _fofapro_ars_session!!!")
    print("And time sleep is 10s")


def main():
    logo()
    start()
    spider()


if __name__ == '__main__':
    main()
```
GitHub link: https://github.com/Cl0udG0d/Fofa-script
The time.sleep() delay is set to 10 seconds; adjust it to your own needs. Also, even though the code does the base64 encoding itself, every now and then an encoding issue means the search returns nothing and pagenum[0] comes out as 0. If changing the keyword does not help, run the search on the FOFA website first, take the base64-encoded keyword from the URL, and substitute it for searchbs64 in the code; that is guaranteed to find results. I'll fix these rough edges in the next revision. Let's go!
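In practice that workaround just means skipping the local encoding step and pasting the qbase64 value from the browser's address bar straight into the request, roughly like this (the placeholder value is obviously hypothetical):

```python
import requests

cookie = "your _fofapro_ars_session value"
headers = {"Cookie": "_fofapro_ars_session=" + cookie}

# Take the qbase64=... value from the address bar after searching on fofa.so,
# instead of base64-encoding the keyword locally.
searchbs64 = "PASTE_QBASE64_VALUE_FROM_BROWSER_URL_HERE"

html = requests.get("https://fofa.so/result?page=1&qbase64=" + searchbs64,
                    headers=headers).text
```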