Introduction to the BeautifulSoup module

01 Crawler basics

　　Related libraries: 1. requests, re  2. BeautifulSoup  3. hackhttp

　　Use requests to send GET and POST requests and read back the status code and the response content.
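
A minimal sketch of those two request types follows. The URL, the form field, and the helper names fetch and submit are assumptions made up for illustration; only requests.get, requests.post, and requests.Request come from the library itself. The last lines build a request without sending it, so the snippet can be tried without touching the network.

```python
import requests

# Hypothetical endpoint and headers, used only for illustration.
URL = "https://example.com/search"
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch(url):
    """Send a GET request and return (status code, decoded body)."""
    r = requests.get(url, headers=HEADERS, timeout=10)
    return r.status_code, r.text


def submit(url, form):
    """Send a form-encoded POST request and return (status code, decoded body)."""
    r = requests.post(url, data=form, headers=HEADERS, timeout=10)
    return r.status_code, r.text


# Preparing a request without sending it shows what would go over the wire.
prepared = requests.Request("POST", URL, data={"q": "python"}, headers=HEADERS).prepare()
print(prepared.method, prepared.url, prepared.body)
```

Calling fetch(URL) or submit(URL, {"q": "python"}) performs the real request; r.content holds the raw bytes if the decoded r.text is not wanted.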

　　Use re to match any one of the posts.
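
As a rough illustration of matching posts with re alone, the sketch below runs a pattern over a hand-written HTML fragment. The fragment and its link texts are invented for the example; the thread-NNNNN-1-1.html naming follows the link format discussed later in these notes.

```python
import re

# Hypothetical HTML fragment standing in for a downloaded forum page.
html = '''
<a href="thread-41730-1-1.html">First post</a>
<a href="forum-59-1.html">Board index</a>
<a href="thread-41802-1-1.html">Second post</a>
'''

# Capture the thread id and the link text of every post link.
post_link = re.compile(r'<a href="thread-(\d+)-1-1\.html">(.*?)</a>')
posts = post_link.findall(html)

for thread_id, title in posts:
    print(thread_id, title)
```

This works on tidy markup, but it breaks as soon as attributes are reordered or tags are nested, which is one reason to hand the parsing to BeautifulSoup.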

 Using the BeautifulSoup module: be sure to read the official documentation at http://beautifulsoup.readthedocs.io/zh_CN/latest/

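　　1. Parse the content: soup = BeautifulSoup(html, 'lxml')

　　2. Navigate the data: soup.title (the whole <title> tag)   soup.title.string (only its text)
　　3. Use regular expressions with BeautifulSoup: soup.find_all(name='x', attrs={'xx': re.compile('x')})
　　　　　　　　　　　　name is the tag name; attrs maps attribute names to the values (plain strings or compiled patterns) they must match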

# How do we match links such as thread-41730-1-1.html?
bbs_new = soup.find_all(name='a', attrs={'href': re.compile(r'thread-\d+-1-1\.html')})
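
The three steps above (parse, navigate, filter with a regex) can be tried offline with a sketch like the following. The HTML string is a made-up stand-in for a real page, and html.parser is used here only because it ships with Python; 'lxml' should behave the same for this input if that package is installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page content, used so the example runs without a network call.
html = '''
<html><head><title>Demo forum</title></head>
<body>
<a href="thread-41730-1-1.html">First post</a>
<a href="forum-59-1.html">Board index</a>
<a href="thread-41802-1-1.html">Second post</a>
</body></html>
'''

# 1. Parse the content.
soup = BeautifulSoup(html, 'html.parser')

# 2. Navigate the data.
print(soup.title)         # the whole <title> tag
print(soup.title.string)  # only the text inside it

# 3. Keep only <a> tags whose href matches the thread pattern.
pattern = re.compile(r'thread-\d+-1-1\.html')
links = soup.find_all(name='a', attrs={'href': pattern})
titles = [a.string for a in links]
print(titles)
```

The raw string and the escaped dot keep the pattern from matching unintended characters in place of the literal ".html".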

02 A simple crawler implementation

03 Applying regular expressions

04 Multithreaded Python crawlers

05 Crawler in practice

#coding=utf-8
import re

import requests
from bs4 import BeautifulSoup

# URL to crawl
url = 'https://bbs.ichunqiu.com/portal.php'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}


# Send a GET request to the URL
r = requests.get(url=url, headers=headers)

print(r.status_code)
# r.content holds the raw HTML of the response
print(r.content)
# Hand the HTML to BeautifulSoup for parsing
soup = BeautifulSoup(r.content, 'lxml')  # 'lxml' selects the parser and requires the lxml package
print(soup.title)
print(soup.title.string)
# Catch-all way to grab content: match on exact attribute values, or use a regex as below
#bbs_new=soup.find_all(name='a',attrs={'target':"blank", 'class':"ui_colorG" ,'style':"color: #555555;"})

# How do we match links such as thread-41730-1-1.html?
bbs_new = soup.find_all(name='a', attrs={'href': re.compile(r'thread-\d+-1-1\.html')})

for new in bbs_new:
    print(new.string)  # without .string, the whole tag is printed
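
One caveat with new.string: it is None when a tag contains more than one child node. The sketch below shows that case along with get_text() and the href attribute. The anchor markup is invented for the example and is not taken from the real page; urljoin comes from the standard library.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = 'https://bbs.ichunqiu.com/portal.php'

# Hypothetical anchor with nested markup; not taken from the real page.
html = '<a href="thread-41730-1-1.html"><b>New</b> release notes</a>'
link = BeautifulSoup(html, 'html.parser').a

print(link.string)       # None, because the tag has more than one child
text = link.get_text()   # joins the text of all nested nodes
full_url = urljoin(base, link['href'])  # resolve the relative href against the page URL
print(text, full_url)
```

In the loop above, swapping new.string for new.get_text() would avoid printing None for such links.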

posted @ 2018-06-19 10:14 klsfct
