20170820_Fetching a website's message-board posts in real time with Python

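    This mainly uses requests and bs4. The biggest problem was that the target site is encoded in gb2312; Python 3 handles encodings far better than Python 2, but it is still a hassle. At first I simulated the login with a cookie, but that is tedious in practice: you have to log in to the target site first, then copy the cookie and paste it into the code... Laziness is the prime motivator!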

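    I planned to use Firefox's HttpFox add-on to capture the data and URL that the target site POSTs, but Firefox had auto-updated to 55.x and the add-on only works on 35.x. Then, using Chrome, I found that this site submits its POST request by opening a new page, and by the time you press F12 on the new page it is too late: the POST is no longer visible. After some searching on Baidu I found that Chrome can be set to open the developer tools (F12) automatically for new tabs!

[Screenshot: the setting that opens F12 automatically on new tabs]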

 

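    Now I knew what data the site POSTs, so I started simulating the POST with requests. But every login failed, and the fetched page content was all garbled. Only after using str(info, encoding='utf-8'), where info is the raw response bytes, did it get somewhat better, and then I could see that the login had not succeeded at all: the page was prompting for a username and password.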
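
The garbling comes from decoding the page bytes with the wrong charset. Here is a minimal sketch, assuming the pages really are gb2312 as noted at the top of this post. decode_page is a hypothetical helper, not part of the script below, and the sample bytes are made up for illustration:

```python
def decode_page(raw: bytes, declared: str = "gb2312") -> str:
    """Decode response bytes, trying the site's declared charset first."""
    for encoding in (declared, "utf-8"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, replacing undecodable bytes.
    return raw.decode(declared, errors="replace")


# Made-up sample: the bytes a gb2312 site would send for this text.
sample = "请登录".encode("gb2312")
print(decode_page(sample))
```

With requests, setting res.encoding = 'gb2312' before reading res.text should have a similar effect, since res.text decodes with whatever res.encoding holds.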

Then it hit me!!!

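My guess was that the data I was POSTing was UTF-8 while the target site expects gb2312 when it receives the POST, so it could not make sense of it at all! I encoded the username (the username is Chinese!!!) with username.encode("gb2312"), and the login went through. Then I switched to a session to keep the cookie, so the login persists. After that, once a minute the script checks whether the newest id equals the saved id to decide whether to fetch.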
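
To see why the encoding of the form value matters, here is a small sketch using the standard library's urlencode in place of a real request. The field name "name" is a placeholder (the real field appears in the script below), and it is an assumption that requests treats str and bytes form values the same way urlencode does:

```python
from urllib.parse import urlencode, unquote_to_bytes

username = "用户名"  # placeholder value, as in the script below

# Handing over a str: the value is percent-encoded from its UTF-8 bytes.
as_utf8 = urlencode({"name": username})
# Handing over gb2312 bytes: the value is percent-encoded from those bytes.
as_gb2312 = urlencode({"name": username.encode("gb2312")})

print(as_utf8)
print(as_gb2312)

# Decoding the gb2312 form value recovers the original name.
raw = unquote_to_bytes(as_gb2312.split("=", 1)[1])
print(raw.decode("gb2312"))
```

The two printed form bodies differ, which is consistent with a gb2312 server failing to recognise a username that was sent as UTF-8.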

 

The result looks like this:

[Screenshot: the script's output]

 

# -*- coding: utf-8 -*-  # source-file encoding declaration
import os
import re
import time

import requests
from bs4 import BeautifulSoup
from time import strftime

LOGIN_URL = 'http://www.3456.tv/Default.aspx'  # URL the login form posts to
username = '用户名'  # placeholder for the (Chinese) username
password = 'password'
# Login form fields, as captured from the browser's POST request.
# The username is sent as gb2312 bytes because the site expects gb2312.
DATA = {
    "web_top_two2$txtName": username.encode("gb2312"),
    "web_top_two2$txtPass": password,
    '__VIEWSTATE': '/wEPDwULLTEyNzc4MjM2OTBkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBRh3ZWJfdG9wX3R3bzIkaW1nQnRuTG9naW6/pqbjQqV358GfYjdoiOK+Ek4VWA==',
    '__EVENTVALIDATION': '/wEWBAL3y5PLCgLHgt+5BgL3r9v/CgLX77PND5R1XxTeGn4lXvBDrb6OdRyc4Xlk',
    'web_top_two2$imgBtnLogin.x': '22',
    'web_top_two2$imgBtnLogin.y': '8',
}
# Browser User-Agent sent with the login request
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'}
S = requests.Session()  # the session keeps the login cookie for later requests
login = S.post(LOGIN_URL, data=DATA, headers=HEADERS)  # simulate the login

def getData(num):
    url = 'http://www.3456.tv/user/list_proxyall.html'
    res = S.get(url)
    content = res.content
    return content

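def getLast():
    """Return the id of the first (newest) message row, as a string of digits."""
    url = 'http://www.3456.tv/user/list_proxyall.html'
    res = S.get(url)
    content = res.content
    soup = BeautifulSoup(content, 'html.parser')
    tb = soup.find_all('tr', style='text-align:center;')
    for tag in tb:
        see = tag.find('a', attrs={'class': 'see'})
        seestr = see['onclick']
        # Keep only the digits of the onclick attribute; only the first row is needed.
        return re.sub(r"\D", "", seestr)
    # No matching row at all: most likely the login did not succeed.
    raise RuntimeError('no message rows found on the list page')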

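def isNew():
    """Compare the newest id on the site with the saved one and fetch if they differ."""
    newlastid = getLast()
    with open('lastid.txt') as txt:
        last = txt.read()
    if int(newlastid) != int(last):
        print('Current time: ' + strftime("%H-%M") + ', new messages found, fetching!')
        getNewuser()
    else:
        print('Current time: ' + strftime("%H-%M") + ', no new messages yet')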

def getNewuser():
    """Fetch every message newer than the saved id and write them to a file."""
    url = 'http://www.3456.tv/user/list_proxyall.html'
    res = S.get(url)
    content = res.content
    soup = BeautifulSoup(content, 'html.parser')
    tb = soup.find_all('tr', style='text-align:center;')

    with open('lastid.txt') as txt:
        last = txt.read()
    userinfo = ''
    for tag in tb:
        see = tag.find('a', attrs={'class': 'see'})
        seestr = see['onclick']
        seenum = re.sub(r"\D", "", seestr)  # keep only the digits

        if int(seenum) == int(last):
            break  # reached the message saved last time
        userinfo += (str(seeInfo(int(seenum)), encoding="utf-8") + '\n')

    # Save this batch to a file named after the current time (HH-MM.txt)
    userfilename = strftime("%H-%M") + '.txt'
    with open(userfilename, 'w') as f:
        f.write(userinfo)
    os.system(userfilename)  # on Windows this opens the file in its associated program

    with open('lastid.txt', 'w') as f2:
        f2.write(str(getLast()))
    print('Fetch complete, current time: ' + strftime("%H-%M") + ', running again in 60 seconds')

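def seeInfo(id):
    """Return the raw bytes of the detail page for one message id."""
    url = 'http://www.3456.tv/user/protel.html'
    info = {'id': id}
    # The id is passed via data= on a GET, as in the original script;
    # params=info would be the usual choice if the site reads a query string.
    res = S.get(url, data=info)
    content = res.content
    return content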

setsleep = 60  # interval between checks, in seconds

print('Is this the first run today?')
firststr = input('input yes or no and press enter: ')
if firststr == 'yes':
    print('Fetching...')
    lastid = getLast()
    with open('lastid.txt', 'w') as f:
        f.write(str(lastid))
    print('Current time: ' + strftime("%H:%M") + ', id of the first entry is ' + lastid)
    print('Running again in ' + str(setsleep) + ' seconds')
else:
    # Reuses the id saved in lastid.txt by an earlier run
    print('Running again in ' + str(setsleep) + ' seconds')
while True:
    isNew()
    time.sleep(setsleep)
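
The new-message check boils down to reading ids from newest to oldest and stopping at the one saved last time. Below is a minimal sketch of just that step, without any network access. extract_id and collect_new_ids are hypothetical names, and the onclick strings are made-up samples, since the real markup is not shown in this post:

```python
import re


def extract_id(onclick: str) -> int:
    """Keep only the digits of an onclick attribute, as getLast() does."""
    return int(re.sub(r"\D", "", onclick))


def collect_new_ids(onclicks, last_seen: int):
    """Return the ids listed before last_seen (rows are assumed newest first)."""
    new_ids = []
    for onclick in onclicks:
        current = extract_id(onclick)
        if current == last_seen:
            break  # everything from here on was handled in an earlier run
        new_ids.append(current)
    return new_ids


rows = ["see(105)", "see(104)", "see(103)", "see(102)"]  # made-up samples
print(collect_new_ids(rows, last_seen=103))
```

One caveat of this approach: if the saved id has since disappeared from the page, the loop never breaks and every row is treated as new. Assuming ids only grow, stopping at current <= last_seen would be one way to guard against that.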

 

posted @ 2017-08-20 11:07  emmmmmm1