A Simple Python Crawler Example

Crawl all the links on the sohu homepage, find every page whose content mentions 足球 (football), and download and save those pages. The saved files are numbered 1.html ... n.html.

Use requests to fetch the sohu homepage and grab its source.
Extract the links with a regular expression; it helps to print all the captured links first and then decide how to clean them up (a quick sanity check of the regex is sketched right below).
Process the links: prepend http:// where needed, and filter out useless targets such as jpg/css/js/png.
Put the survivors into the crawl list.
Crawl each one and check whether the page contains the keyword 足球; if it does, save the page to a file.
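As suggested above, it is worth printing what the regex actually captures before wiring everything together. A minimal sanity check on a hand-written fragment (the sample HTML here is made up for illustration):

import re

# A tiny hand-written HTML fragment, just to see what the pattern captures.
sample = '<a href="//www.sohu.com/a/123.html">news</a> <a href="/logo.png">logo</a>'

# Same pattern as the script below: grab whatever sits between href=" and the closing quote.
links = re.findall(r'href="(.*?)"', sample)
print(links)  # ['//www.sohu.com/a/123.html', '/logo.png']

With that confirmed, the first full version of the script: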

import requests
import re

# Fetch the sohu homepage and keep the raw HTML source.
r = requests.get("http://www.sohu.com")
#print(r.text)

valid_link = []
# Capture whatever sits between href=" and the closing quote.
all_links = re.findall(r'href="(.*?)"', r.text)
# Only walk a slice of the links to keep this first run short.
for link in all_links[50:250]:
    # Skip static assets; endswith also catches 2-letter extensions like ".js".
    if not link.endswith((".jpg", ".png", ".gif", ".js", ".css", ".ico")):
        if link.strip() == "/":
            continue
        if "javascript:" in link:
            continue
        # Protocol-relative links ("//...") need a scheme prepended.
        if link.startswith("//"):
            link = "http:" + link
        print(link.strip())
        valid_link.append(link.strip())
print(valid_link)

result = []
for link in valid_link:
    # Keep the link if the page body mentions the keyword 足球 (football).
    if "足球" in requests.get(link).text:
        result.append(link)
print(len(result))

 



****************************************************

The first version above only walks a slice of the links, and it crashes as soon as one of them is empty or unreachable. The improved version below covers the full list and guards against both cases.

import requests
import re

# Fetch the sohu homepage and keep the raw HTML source.
r = requests.get("http://www.sohu.com")
#print(r.text)

valid_link = []
all_links = re.findall(r'href="(.*?)"', r.text)
# This time walk the full link list instead of a slice.
for link in all_links:
    # Skip static assets; endswith also catches 2-letter extensions like ".js".
    if not link.endswith((".jpg", ".png", ".gif", ".js", ".css", ".ico")):
        if link.strip() == "/":
            continue
        if "javascript:" in link:
            continue
        # Prepend a scheme to protocol-relative links.
        if link.startswith("//"):
            link = "http:" + link
        print(link.strip())
        valid_link.append(link.strip())
print(valid_link)

result = []
for link in valid_link:
    # Skip empty strings left over after stripping.
    if link.strip() == "":
        continue
    # A dead or malformed link raises; skip it instead of crashing the crawl.
    try:
        if "足球" in requests.get(link, timeout=10).text:
            result.append(link)
    except Exception:
        continue
print(len(result))
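Neither version actually writes the pages to disk yet, even though the goal is to save them as 1.html ... n.html. A minimal sketch of that final step, reusing the result list collected above (the file-naming scheme comes from the goal statement; the utf-8 output encoding is my assumption, not from the original code):

# Save each matching page as 1.html, 2.html, ... n.html.
# Re-fetches every link in `result`; utf-8 output encoding is an assumption.
for i, link in enumerate(result, start=1):
    try:
        page = requests.get(link, timeout=10)
        with open(str(i) + ".html", "w", encoding="utf-8") as f:
            f.write(page.text)
    except Exception:
        continue  # skip links that fail on the second fetch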

 
