A Simple Python Crawler Example
Crawl all the links on the sohu homepage, find every page whose content contains the keyword 足球 (football), and download and save those pages.
Files are numbered 1.html ... n.html.
Use requests to fetch the sohu homepage and get its source.
Extract the links with a regular expression. It is worth printing all the links first, then
deciding how to process them (a toy run of this step follows the plan).
Process the links: prepend http: where needed, filter out invalid links such as jpg/css/js/png,
and put the rest into the crawl list.
Crawl each one and check whether the page contains the keyword 足球; if it does, save it to a file.
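To see what the regex step produces before pointing it at the real page, here is a toy run on a hypothetical HTML fragment (the fragment is made up for illustration; the pattern is the one used in the scripts below):

    import re

    # Hypothetical fragment, only to show what re.findall returns.
    html = '<a href="//www.sohu.com/a/123.html" >news</a> <a href="/pic/1.jpg" >img</a>'
    print(re.findall(r"href=\"(.*?)\" ", html))
    # ['//www.sohu.com/a/123.html', '/pic/1.jpg']

Note the first result is a protocol-relative URL and the second is a static image, which is exactly why the scripts below prepend the scheme and filter by file extension.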
import requests
import re

# Fetch the sohu homepage source.
r = requests.get("http://www.sohu.com")
#print(r.text)

valid_link = []
# Grab every href="..." value from the page.
all_links = re.findall(r"href=\"(.*?)\" ", r.text)

for link in all_links[50:250]:  # first try: only a slice of the links
    # Skip static resources; endswith also catches the two-letter "js"
    # suffix, which a naive link[-3:] check would miss.
    if not link.endswith(("jpg", "png", "gif", "js", "css", "ico")):
        if link.strip() == "/":
            continue
        if "javascript:" in link:
            continue
        if link.startswith("//"):  # protocol-relative URL: prepend the scheme
            link = "http:" + link
        print(link.strip())
        valid_link.append(link.strip())
print(valid_link)

result = []
for link in valid_link:
    # Keep the links whose page content mentions 足球.
    if "足球" in requests.get(link).text:
        result.append(link)
print(len(result))
****************************************************
The version below improves on the first: it walks all the links instead of a slice, skips empty links, and wraps each request in try/except so one bad link cannot abort the whole crawl.
import requests
import re

# Fetch the sohu homepage source.
r = requests.get("http://www.sohu.com")
#print(r.text)

valid_link = []
all_links = re.findall(r"href=\"(.*?)\" ", r.text)

for link in all_links:  # this time, walk every link
    if not link.endswith(("jpg", "png", "gif", "js", "css", "ico")):
        if link.strip() == "/":
            continue
        if "javascript:" in link:
            continue
        if link.startswith("//"):
            link = "http:" + link
        print(link.strip())
        valid_link.append(link.strip())
print(valid_link)

result = []
for link in valid_link:
    if link.strip() == "":
        continue
    try:
        if "足球" in requests.get(link).text:
            result.append(link)
    except requests.RequestException:  # skip links that fail to load
        continue
print(len(result))
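Both versions above only count the matching links; the saving step from the plan (1.html ... n.html) still has to be added. A minimal sketch of that last step, assuming it runs right after the second version so result and requests are already in scope:

    # Sketch of the saving step: write each matching page to 1.html ... n.html.
    # Assumes `result` was filled by the loop above.
    for i, link in enumerate(result, start=1):
        try:
            page = requests.get(link).text
        except requests.RequestException:
            continue
        with open(f"{i}.html", "w", encoding="utf-8") as f:
            f.write(page)

Re-fetching each page keeps the sketch short; caching the page text during the keyword check would avoid the second request.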