Simulating login with cookies, and a failed scraping attempt
It started when a friend noticed that every time he wanted to crack an MD5 hash he had to log into a website and run the lookup there, which felt like a hassle. He wanted a script: type in an MD5 value, get the plaintext straight back.
So I got on board.
1 Simulating the login
As usual, the first step was to submit the login form and capture the traffic (I used Fiddler). I could see the POSTed form, but on a whim I decided that simulating the login by replaying the form every time was boring, and since I had been doing some web work recently, I wanted to try logging in with cookies instead.
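For reference, the form-replay approach I skipped would look roughly like the sketch below. The login.aspx path comes from the comefrom cookie seen later; the field names are placeholders, since the real ones (plus the usual ASP.NET hidden fields such as __VIEWSTATE) would come out of the Fiddler capture:

import requests

LOGIN_URL = 'http://www.xxx.com/login.aspx'  # path taken from the comefrom cookie

def login_by_form(username, password):
    session = requests.session()
    # placeholder field names -- use the ones shown in the captured POST,
    # which on an ASP.NET site would also include __VIEWSTATE and friends
    form = {'username': username,
            'password': password}
    session.post(LOGIN_URL, data=form)
    # on success the session's cookie jar now holds the auth cookies
    return session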
Looking at this cookie, you can see that the ntime field inside CNZZDATA3819543 is a timestamp, user effectively plays the role of the session token, and everything else stays the same from request to request, so the login-simulation script practically writes itself:
import requests
from bs4 import BeautifulSoup
import time

URL = 'http://www.xxx.com/'

def get_html(url):
    session = requests.session()
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/30.0.1581.2 Safari/537.36'}
    cookies = {'ASP.NET_SessionId': "eqsrnjotcaj5qdf5kmrqwgpy",
               # ntime is rebuilt from the current Unix timestamp on every run
               'CNZZDATA3819543': "cnzz_eid=471312766-1484873928-&ntime=%d" % int(time.time()),
               'FirstVisit': "",
               '_test': "1",
               'comefrom': "http://www.xxx.com/login.aspx",
               'key': "",
               # 'user' acts as the session token, captured from the browser
               'user': "kPXxHtwrSPpCMgZoXs2VrPuwuuCUrDz7dLq5R3/DBEP59eqYGYFa23AZdDPP1KDR9"
                       "rblhGp0HWbYVkOsCg3QoRwWHIQESmZi4KqRlXxfnuZcFsrEta5SwAmrrvhpNvK"
                       "ghSMRdyV7PTmKuagc7m8IZQ=="}
    # fetch the page with the captured cookies attached; if they are
    # accepted, the logged-in page comes back
    text = session.get(url, headers=headers, cookies=cookies).text
    session.close()
    return text
Parse the HTML that comes back and you can pull out the logged-in username and email.
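A minimal sketch of that parsing step; the span ids here are hypothetical placeholders, though the real ones should follow the same ctl00_ContentPlaceHolder1_* naming pattern as the result label used further down:

from bs4 import BeautifulSoup

def parse_user_info(html):
    soup = BeautifulSoup(html, 'html.parser')
    # hypothetical element ids -- substitute the real ones from the page source
    username = soup.find('span', id='ctl00_ContentPlaceHolder1_LabelUser')
    email = soup.find('span', id='ctl00_ContentPlaceHolder1_LabelEmail')
    return (username.string if username else None,
            email.string if email else None)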
After that, the session can be used for further GETs and POSTs.
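One design note: get_html above closes its session before returning, so every call starts cold. To genuinely reuse one logged-in session across requests, build it once and keep it around, roughly like this (headers and cookies are the dicts from the script above):

import requests

def make_session(headers, cookies):
    # attach the captured login cookies so every request is authenticated
    session = requests.session()
    session.headers.update(headers)
    session.cookies.update(cookies)
    return session

# session = make_session(headers, cookies)
# html = session.get(URL).text              # authenticated GET
# html = session.post(URL, data=form).text  # authenticated POST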
2 Into the pit: after logging in, run an MD5 query and capture the traffic
Next, look at the query form.
Analyzing the fields:
__EVENTTARGET, __EVENTARGUMENT: these two are useless; their value is "" on every request.
__VIEWSTATE: this is the one that matters. It is produced by some encryption scheme that takes the hash you are querying, plus some other inputs I never managed to identify, as parameters (this is the wall that stopped my scraper; see the sketch after this list).
__VIEWSTATEGENERATOR: read literally, this is the generator for the viewstate above; my guess is it identifies whatever scheme produces it (I didn't bother digging further).
ctl00$ContentPlaceHolder1$TextBoxInput: the value we entered, the one we want decrypted.
ctl00$ContentPlaceHolder1$InputHashType: the hash algorithm we selected; the default seems to be md5.
The remaining fields don't matter much.
To put it plainly: as long as the __VIEWSTATE value and the ctl00$ContentPlaceHolder1$TextBoxInput value correspond and match each other, the query goes through.
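On a normal ASP.NET WebForms page, the standard way to keep that pair matched is not to replay a captured __VIEWSTATE but to GET the form first, scrape the fresh __VIEWSTATE and __VIEWSTATEGENERATOR out of their hidden inputs, and POST them straight back together with your own hash. Here is a minimal sketch of that two-step approach (query_md5 is a hypothetical helper, reusing the URL and field names from the script below); whether it works on this site depends on exactly those unidentified inputs, so treat it as the obvious next experiment rather than a fix:

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

def query_md5(session, url, md5_value):
    # step 1: GET the form so the server issues a viewstate for this session
    soup = BeautifulSoup(session.get(url).text, 'html.parser')
    payload = {
        '__EVENTTARGET': "",
        '__EVENTARGUMENT': "",
        # step 2: echo the freshly issued hidden fields straight back...
        '__VIEWSTATE': soup.find('input', id='__VIEWSTATE')['value'],
        '__VIEWSTATEGENERATOR': soup.find('input', id='__VIEWSTATEGENERATOR')['value'],
        # ...together with the value we actually want to look up
        'ctl00$ContentPlaceHolder1$TextBoxInput': md5_value,
        'ctl00$ContentPlaceHolder1$InputHashType': "md5",
        'ctl00$ContentPlaceHolder1$Button1': "查询",
    }
    # HiddenField1/HiddenField2 may need to be scraped the same way
    return session.post(url, data=payload).text

# usage, with a logged-in session from part 1:
# print query_md5(session, URL, "21232f297a57a5a743894a0e4a801fc3")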
3 Finally, here is my failed scraper, offered up as-is
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import time

URL = 'http://www.xxx.com/'


def get_html(url):
    session = requests.session()
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/30.0.1581.2 Safari/537.36'}
    cookies = {'ASP.NET_SessionId': "eqsrnjotcaj5qdf5kmrqwgpy",
               # ntime is rebuilt from the current Unix timestamp on every run
               'CNZZDATA3819543': "cnzz_eid=471312766-1484873928-&ntime=%d" % int(time.time()),
               'FirstVisit': "",
               '_test': "1",
               'comefrom': "http://www.xxxx.com/login.aspx",
               'key': "",
               'user': "kPXxHtwrSPpCMgZoXs2VrPuxxuCUrDz7dLq5R3/DBEP59eqYGYFa23AZdDPP1KDR9"
                       "rblhGp0HWbYVkOsCg3QoRwWHIQESmZi4KqRlXxfnuZcFsrEta5SwAmrrvhpNvK"
                       "ghSMRdyV7PTmKuagc7m8IZQ=="}
    # the captured form, replayed verbatim -- including the stale __VIEWSTATE,
    # which is where this whole approach falls over
    payloads = {'__EVENTTARGET': "",
                '__EVENTARGUMENT': "",
                '__VIEWSTATE': "XrQ+lfRMi82hZRL/drrjo0zDnT6/XJxrr0iphlxrVrNfVusZC2UHmQL5"
                               "i4TbbaD8N6zKVxODMamXqkA0k7T1qoNfW9dRGs/V6mEptB90XdBB4Qj1"
                               "n1jGG/iw+p7BW4oHPanh8mWCH3G5ZWuZM4TADQoGwOuXna0OWtVK/x8k00"
                               "+zZEwKXi0vI2T9OrysyhkZ8msq/yashFfMyDo+Qwqb3jNJWl8n844E9Kmb4"
                               "gcBuBmifviw7jvRJjpVQNqDH+Cbee7gMEvFK4rtKxKcCkxIGNvC46F59rl"
                               "62EfVX81NFVSD0dhGNnF7kP0WRWpcXZRoXrxd2HFodv5beAw8Gwe7IRHr59"
                               "T8/GmiS3KVRMDXMG9OgAg13mZv9f/LogkuNmPeiIVz9fBifx2D2kUdQQfT5x"
                               "T0wbqoGQnWqeQcEYndUCp5lA8kCID4V8p0TR3EfrzAHPlxPh7be8yNHL8iHu"
                               "50wgxJ6BD2W3VoeF3lOShhkpnHYAeQf7TLaCCPtKleCboctIO6dbcgt1KD6S"
                               "UvJZyWuRRxz/CBAGNEr6piRudKOgnGl+W9nBfJDS4wl3ao3Y3Rvuon0YMz68"
                               "o+Ef4FOExM300T51rL5HF5e8zyw+V68ISvXAoHJmhzt64j+ht0jOUzLI1UTXo"
                               "MOg894gucdsH8VOpVNPO5F+4/03JHqi8R4cSHnFu9U9gYpnGBhIhZuzzyiLHj"
                               "a3gqyHzehKBlWq53eOhXJH/IfVjGZ9ltjZHi9smWCMonqvZRTm0vD6nKCsQWi"
                               "JILUzb8YrI7xzYgjHihSEyYc3qi9ze6uSwUdeJbQdqKiGVWMWt+gRxi7JZDae"
                               "SMfN3NvavFtXdyBVyI1KFuP9LBYDYEH1RD6HXqVsblH4C1dIAq7yQnu4L20OzI"
                               "E841MIiwLdQVAQ9aAwD3wqvPqoBJfqbkMBKQ7xSiDF+FSRacJ/IHOAJkMoqKJe4LY"
                               "Csh0tPK1tK1pW7xF/X+PtQCQQ+Ldin76t3bpeY2KAQeF5cXEP94DIYydiJBfn4zJv+D"
                               "QBzb0zRabwy5GBB1YDY9Fxiw34G1rB18yOlTwl2bpFnUArplpB0TwfjGkA7Up2MCrOy"
                               "s6oDDdRn+1AQOETo7Ych274ymw+ThCzUrJeVNPf5/X2FJCJpqeH0TRCSs+0fxbaljihS9"
                               "p3t1WqTxTHWKsh4TsZBQsn90kSItZS/dGYhNH/XUVombBi92AhUrokHqQC4b0mGdIRFRzg"
                               "6l2lF4VfZbDfIayTgnZbT+N9RwcduCZCRWcUupLLcKnCZHuqd7WStG33dTk9IT/5q2xf57G"
                               "fRDxslLzN1VIDn8Wtcl494OJPSPqr5+FB8mTs24UjM+6IwgVNstkJFIH1urQWl31TVUg"
                               "nhtrIQEs4MpyeeUUwlV2CCfxP+JTGbZsuMHdd/RDwp9xH28dGQD0cikU8RlCut/XThG"
                               "W10bPC2akAXO5xmACNBhY9XKvyMzg8D43AFa3xAxV+e9lwPhNHIQCX7c6m/t5rQztzM"
                               "+TiraaMMGXZVyjFic757VcJHlU5We8r7lWsKBRbrqnIEV6JMi8dzmb5rLYbBbLI4N9Q"
                               "DIwy5r0HKDmepTjhZY3DIFLkdO9RakjAoiFUs2e9h+wPxBQGQ+UbyWXzfSWa8hXKSGL"
                               "kw774/Et5XfCPVaDBkqPPzKlX3QoV5ptuRuDCwzLdXpuBePhme64x09L9XOmIYFdaGJ"
                               "MXjw/tKRTv6AFgGLvZyso+Ch9XLI/j5abcaLyC/nSUdsxexRPkV/wRB5pSsaau43nMn"
                               "iMpuAVVxwryPTGnnAO38vl26BAo73jlvNvmP0Av22/3P+A2CmCcJt6S5bH7Jcw6S6HJ"
                               "QXWDtnFGg6sYCi6mzvwmYFcBEeVzOKHJ8f7TxP7n5CbNjXWnBguSFL1UzH83DTcij6s+1lctI"
                               "fw4NIN7NU5P+qInfSRvBH3754GAuSApuLZHOp/9k8fkkxlA==",
                '__VIEWSTATEGENERATOR': "CA0B0334",
                'ctl00$ContentPlaceHolder1$TextBoxInput': "21232f297a57a5a743894a0e4a801fc3",
                'ctl00$ContentPlaceHolder1$InputHashType': "md5",
                'ctl00$ContentPlaceHolder1$Button1': "查询",
                'ctl00$ContentPlaceHolder1$HiddenField1': "",
                'ctl00$ContentPlaceHolder1$HiddenField2': "gnSxKhU+42ESHE0pCcCyudmYfvxVL2+w4IhvdkwT37OI/"
                                                          "QODVV7mdVAN9puROPjh"}
    text = session.post(url, headers=headers, data=payloads, cookies=cookies).text
    session.close()
    return text


def parser_html(text):
    soup = BeautifulSoup(text, 'html.parser')
    # the .strings attribute returns a generator over the tag's string children
    string_gen = soup.find('div', class_='main').find('table', id='table3').\
        find('span', id='ctl00_ContentPlaceHolder1_LabelAnswer').strings
    result = list(string_gen)[0]
    return result


if __name__ == '__main__':
    text = get_html(URL)
    print parser_html(text)