第一个爬虫和测试

1.网络爬虫

Python是简单网络爬虫：

简单的爬虫工具:BeautifulSoup

BeautifulSoup拥有强大的解析网页及查找元素的功能

在使用BeautifulSoup4 库之前，需要进行引用，由于这个库的名字非常特殊且采用面向对象方式组织，

可以用from…import 方式从库中直接引用BeautifulSoup 类，方法如下。

插入代码如下：

>>>from bs4 import BeautifulSoup

BeautifulSoup的安装：

在电脑上cmd中输入以下代码：

pip install python-bs4

BeautifulSoup 的属性：

在BeautifulSoup4库中每一个Tag标签，也称为Tag对象

标签Tag有4个常用属性为：

了解了BeautifulSoup，还要有一个依赖的工具requests

requests库是一个简洁且简单的处理HTTP请求的第三方库，并且request 库支持非常丰富的链接访问功能。

requests 库的安装代码为：

pip install requests

requests 库中的网页请求函数：

response 对象的属性

response 对象的方法

了解了基础的知识，现在我们来进行网页爬虫：

对google主页进行爬虫：

1.利用request库的get（）函数访问google 20次，输入代码为：

import requests
def getHTMLText(url):
    print("第",i+1,"次访问")
    try:
        r=requests.get(url,timeout=30)
        r.raise_for_status()
        r.encoding='utf-8'
        print("网络状态码:",r.status_code)
        print("text属性长度:",len(r.text))
        print("content属性长度:",len(r.content))
        return r.text
    except:
        return "error"
url="http://www.google.cn"
for i in range(20):
    print(getHTMLText(url))

打印返回状态，其中text（）内容即为url对应的内容，其显示的结果为：（由于数据太多，只截取其中第一次的显示代码）

Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 21:26:53) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> 
 RESTART: C:/Users/Administrator/AppData/Local/Programs/Python/Python37-32/python代码文件/py1.3.py 
第 1 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213
<!DOCTYPE html>
<html lang="zh">
  <meta charset="utf-8">
  <title>Google</title>
  <style>
    html { background: #fff; margin: 0 1em; }
    body { font: .8125em/1.5 arial, sans-serif; text-align: center; }
    h1 { font-size: 1.5em; font-weight: normal; margin: 1em 0 0; }
    p#footer { color: #767676; font-size: .77em; }
    p#footer a { background: url(//www.google.cn/intl/zh-CN_cn/images/cn_icp.gif) top right no-repeat; padding: 5px 20px 5px 0; }
    ul { margin: 2em; padding: 0; }
    li { display: inline; padding: 0 2em; }
    div { -moz-border-radius: 20px; -webkit-border-radius: 20px; border: 1px solid #ccc; border-radius: 20px; margin: 2em auto 1em; max-width: 650px; min-width: 544px; }
    div:hover, div:hover * { cursor: pointer; }
    div:hover { border-color: #999; }
    div p { margin: .5em 0 1.5em; }
    img { border: 0; }
  </style>
  <div>
    <a href="http://www.google.com.hk/webhp?hl=zh-CN&amp;sourceid=cnhp">
      <img src="//www.google.cn/landing/cnexp/google-search.png" alt="Google" width="586" height="257">
    </a>
    <h1><a href="http://www.google.com.hk/webhp?hl=zh-CN&amp;sourceid=cnhp"><strong id="target">google.com.hk</strong></a></h1>
    <p>请收藏我们的网址
  </div>
  <ul>
    <li><a href="http://translate.google.cn/?sourceid=cnhp">翻译</a>
  </ul>
  <p id="footer">&copy;2011 - <a href="http://www.miibeian.gov.cn/">ICP证合字B2-20070004号</a>
  <script>
    var gcn=gcn||{};gcn.IS_IMAGES=(/images\.google\.cn/.exec(window.location)||window.location.hash=='#images'||window.location.hash=='images');gcn.HOMEPAGE_DEST='http://www.google.com.hk/webhp?hl=zh-CN&sourceid=cnhp';gcn.IMAGES_DEST='http://images.google.com.hk/imghp?'+'hl=zh-CN&sourceid=cnhp';gcn.DEST_URL=gcn.IS_IMAGES?gcn.IMAGES_DEST:gcn.HOMEPAGE_DEST;gcn.READABLE_HOMEPAGE_URL='google.com.hk';gcn.READABLE_IMAGES_URL='images.google.com.hk';gcn.redirectIfLocationHasQueryParams=function(){if(window.location.search&&/google\.cn/.exec(window.location)&&!/webhp/.exec(window.location)){window.location=String(window.location).replace('google.cn','google.com.hk')}}();gcn.replaceHrefsWithImagesUrl=function(){if(gcn.IS_IMAGES){var a=document.getElementsByTagName('a');for(var i=0,len=a.length;i<len;i++){if(a[i].href==gcn.HOMEPAGE_DEST){a[i].href=gcn.IMAGES_DEST}}}}();gcn.listen=function(a,e,b){if(a.addEventListener){a.addEventListener(e,b,false)}else if(a.attachEvent){var r=a.attachEvent('on'+e,b);return r}};gcn.stopDefaultAndProp=function(e){if(e&&e.preventDefault){e.preventDefault()}else if(window.event&&window.event.returnValue){window.eventReturnValue=false;return false}if(e&&e.stopPropagation){e.stopPropagation()}else if(window.event&&window.event.cancelBubble){window.event.cancelBubble=true;return false}};gcn.resetChildElements=function(a){var b=a.childNodes;for(var i=0,len=b.length;i<len;i++){gcn.listen(b[i],'click',gcn.stopDefaultAndProp)}};gcn.redirect=function(){window.location=gcn.DEST_URL};gcn.setInnerHtmlInEl=function(a){if(gcn.IS_IMAGES){var b=document.getElementById(a);if(b){b.innerHTML=b.innerHTML.replace(gcn.READABLE_HOMEPAGE_URL,gcn.READABLE_IMAGES_URL)}}};
    gcn.listen(document, 'click', gcn.redirect);
    gcn.setInnerHtmlInEl('target');
  </script>

第 2 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 3 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 4 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 5 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 6 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 7 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 8 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 9 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 10 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 11 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 12 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 13 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 14 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 15 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 16 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 17 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 18 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 19 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 20 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

>>>

利用菜鸟教程进行内容叙写文章及表格：

<!DOCTYPE html>

<html>

<head>

<meta charset="utf-8">

<title>菜鸟教程（runoob.com)</title>

</head>

<body>

         <hl>我的第一个标题学号25</hl>

         <p id="first">我的第一个段落。</p>

</body>

                  <table border="1">

          <tr>

                  <td>row 1, cell 1</td>

                  <td>row 1, cell 2</td>

         </tr>

         <tr>

                  <td>row 2, cell 1</td>

                  <td>row 2, cell 2</td>

         <tr>

</table>

</html>

由上面编程代码，结果显示为：

爬取最好大学网2017年中国大学排名网站

软科中国大学排名2017网页片段

输入如下代码为：

import requests
from bs4 import BeautifulSoup
allUniv = []
def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = 'utf-8'
        return r.text
    except:
        return ""
def fillUnivList(soup):
    data = soup.find_all('tr')
    for tr in data:
        ltd = tr.find_all('td')
        if len(ltd)==0:
            continue
        singleUniv = []
        for td in ltd:
            singleUniv.append(td.string)
        allUniv.append(singleUniv)
def printUnivList(num):
    print("{1:^2}{2:{0}^10}{3:{0}^6}{4:{0}^4}{5:{0}^10}".format(chr(12288),"排名","学校名称","省市","总分","科研规模"))
    for i in range(num):
        u=allUniv[i]
        print("{1:^2}{2:{0}^10}{3:{0}^6}{4:{0}^8.1f}{5:{0}^10}".format(chr(12288),i+1,u[1],u[2],eval(u[3]),u[6]))
def main():
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2017.html'
    html = getHTMLText(url)
    soup = BeautifulSoup(html, "html.parser")
    fillUnivList(soup)
    printUnivList(10)
main()

得出最好大学排名的前10名，结果显示为：

此次的python爬虫学习到此结束..........................................................................

posted on 2019-05-23 13:11 孻孻阅读(652) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

导航

第一个爬虫和测试

1.网络爬虫

对google主页进行爬虫：

爬取最好大学网2017年中国大学排名网站