Python--34 Web Crawlers

How Python accesses the internet

  URL + lib -->  urllib 

The general format of a URL is

  protocol://hostname[:port]/path/[;parameters][?query]#fragment

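A URL consists of three parts

  The first part is the protocol: http, https, ftp, file, ed2k, ...

  The second part is the domain name or IP address of the server that hosts the resource (sometimes a port number must be included; every transfer protocol has a default port, e.g. 80 for http)

  The third part is the specific location of the resource, such as a directory or file name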
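
To make the general format concrete, here is a minimal sketch that splits a URL into its components with urllib.parse.urlparse. The URL is a hypothetical one, chosen only so that every component is present.

```python
from urllib.parse import urlparse

# Hypothetical URL, used only to exercise every component of the format.
url = 'http://www.example.com:8080/docs/index.html;v=1?lang=en&page=2#top'
parts = urlparse(url)

print(parts.scheme)    # the protocol
print(parts.hostname)  # the host name, without the port
print(parts.port)      # the port as an int, or None if the URL omits it
print(parts.path)      # the path to the resource
print(parts.params)    # the ;parameters of the last path segment
print(parts.query)     # everything after ?
print(parts.fragment)  # everything after #
```

When the URL has no explicit port, parts.port is None and the protocol's default port applies.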

urllib contains four modules

  urllib.request for opening and reading URLs

  urllib.error containing the exceptions raised by urllib.request

  urllib.parse for parsing URLs

  urllib.robotparser for parsing robots.txt files
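
Of the four modules, urllib.robotparser is the one not used elsewhere in these notes, so a small sketch may help. It assumes a hypothetical set of robots.txt rules passed in as a list of lines, so no network access is needed; a real crawler would more likely call set_url() and read() against the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, supplied inline for illustration.
rules = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(useragent, url) reports whether the rules allow the request.
allowed = parser.can_fetch('*', 'http://www.example.com/index.html')
blocked = parser.can_fetch('*', 'http://www.example.com/private/data.html')
print(allowed, blocked)
```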

    urllib.request.urlopen(url,data = None,[timeout,]*,cafile = None,capath=None,cadefault = False)

    Open the URL url, which can be either a string or a Request object.

>>> import urllib.request
>>> response = urllib.request.urlopen('http://www.weparts.net')
>>> html = response.read()
>>> print(html.decode('utf-8'))
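
The call above has no timeout and no error handling. A minimal sketch of a more defensive version follows; fetch is a hypothetical helper name, and the 10 second default is an arbitrary assumption rather than a recommended value.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return the body of url as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as exc:
        # HTTPError is a subclass of URLError, so HTTP status errors
        # such as 404 are caught here as well.
        print('request failed:', exc.reason)
        return None
```

It could be used as html = fetch('http://www.weparts.net'), checking for None before decoding the result.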

Hands-on practice

# Approach 1: pass the URL string directly to urlopen()
import urllib.request
response = urllib.request.urlopen('http://placekitten.com/g/500/600')
cat_img = response.read()
with open('cat_500_600.jpg','wb') as f:
    f.write(cat_img)

# Approach 2: wrap the URL in a Request object first; the result is the same
import urllib.request
req = urllib.request.Request('http://placekitten.com/g/500/600')
response = urllib.request.urlopen(req)
cat_img = response.read()
with open('cat_500_600.jpg','wb') as f:
    f.write(cat_img)

>>> response.geturl()
'http://placekitten.com/g/500/600'
>>> response.geturl  # without parentheses, only the bound method is shown
<bound method HTTPResponse.geturl of <http.client.HTTPResponse object at 0x7fe88d136f60>>
>>> print(response.info())
Date: Thu, 14 Sep 2017 08:10:46 GMT
Content-Type: image/jpeg
Content-Length: 26590
Connection: close
Set-Cookie: __cfduid=dc52691cf479658e05d15824990dabeb11505376646; expires=Fri, 14-Sep-18 08:10:46 GMT; path=/; domain=.placekitten.com; HttpOnly
Accept-Ranges: bytes
X-Powered-By: PleskLin
Access-Control-Allow-Origin: *
Cache-Control: public
Expires: Thu, 31 Dec 2020 20:00:00 GMT
Server: cloudflare-nginx
CF-RAY: 39e1df2a94ee77a2-LAX
>>> response.getcode()
200

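data: the urllib.parse.urlencode() function takes a mapping or a sequence of 2-tuples and returns a string in the application/x-www-form-urlencoded format (key1=value1&key2=value2). The string must be encoded to bytes before it is passed to urlopen() as data; when data is supplied, the request is sent as a POST.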
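
A short sketch of what urlencode() produces, using a couple of the form fields from the example in this section. Both accepted input shapes are shown: a mapping, and a sequence of 2-tuples (useful when a key repeats).

```python
import urllib.parse

# A mapping: each key/value pair becomes key=value, joined with &.
fields = {'i': 'I love Junjie', 'doctype': 'json'}
query = urllib.parse.urlencode(fields)
print(query)

# A sequence of 2-tuples: allows the same key to appear more than once.
pairs = [('smartresult', 'dict'), ('smartresult', 'rule')]
repeated = urllib.parse.urlencode(pairs)
print(repeated)

# urlopen() expects bytes for data, so the string is encoded before sending.
data = query.encode('utf-8')
print(type(data))
```

Spaces are encoded as +, which is why the text arrives at the server intact.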

 

import urllib.request
import urllib.parse
url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule&sessionFrom='
data= {}
data['i'] = 'I love Junjie'
data['from'] = 'AUTO'
data['to'] = 'AUTO'
data['smartresult'] = 'dict'
data['client'] = 'fanyideskweb'
data['salt'] = '1505376958945'
data['sign'] = '86bb3d2294c81c8d6718e800f939bf45'
data['doctype'] = 'json'
data['version'] = '2.1'
data['keyfrom'] = 'fanyi.web'
data['action'] = 'FY_BY_CLICKBUTTION'
data['typoResult'] = 'true'
data = urllib.parse.urlencode(data).encode('utf-8')
response = urllib.request.urlopen(url,data)
html = response.read().decode('utf-8')
print(html)
import json
json.loads(html)  # the result is a dict
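
To illustrate the last step without a network call, here is a minimal sketch of json.loads turning a response string into a dict. The payload and its field names are hypothetical and are not the actual response of the translation service.

```python
import json

# Hypothetical response body; the real service may return different fields.
html = '{"errorCode": 0, "result": "example translation"}'

result = json.loads(html)
print(type(result))      # json.loads maps a JSON object to a Python dict
print(result['result'])  # individual fields are then read by key
```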

 Hiding the crawler's identity

 urllib.request.Request(url,data = None, headers = {},origin_req_host = None,unverifiable = False, method = None)

headers should be a dictionary 

Alternatively, call the Request object's add_header(key, val) method after it has been created.
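
Both ways of setting headers can be sketched as follows. The endpoint URL and the User-Agent string are placeholder assumptions; in practice the User-Agent would be copied from a real browser. Nothing is sent here, since only the Request objects are built.

```python
import urllib.parse
import urllib.request

url = 'http://www.example.com/translate'        # hypothetical endpoint
user_agent = 'Mozilla/5.0 (X11; Linux x86_64)'  # example value only
data = urllib.parse.urlencode({'i': 'hello'}).encode('utf-8')

# Option 1: pass a headers dictionary when the Request is created.
req1 = urllib.request.Request(url, data, headers={'User-Agent': user_agent})

# Option 2: create the Request first, then attach the header.
req2 = urllib.request.Request(url, data)
req2.add_header('User-Agent', user_agent)

# Request stores header names capitalized, so the key is 'User-agent'.
print(req1.get_header('User-agent'))
print(req2.get_header('User-agent'))
print(req2.get_method())  # POST, because data was supplied
```

Either Request object can then be passed to urllib.request.urlopen() in place of the URL string.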

posted @ 2017-09-14 16:48  110528844