Post archive for May 17, 2020 - cltt

Beautifulsoup

Summary: Beautiful Soup: parsing HTML pages, information markup, and extraction methods. Fetching the page source: import requests from bs4 import BeautifulSoup kv = {'user-agent':'Mozilla/5.0'} url = "https://python123.io/w ...

posted @ 2020-05-17 22:37 cltt Views (361) Comments (0) Recommendations (0)

Example 5: Automatic lookup of an IP address's location

Summary: # full code for the IP lookup import requests import time url='http://www.ip138.com/ips138.asp?ip=202.204.80.112' r = requests.get(url) print(r.status_code) print(r.reques ...

posted @ 2020-05-17 22:14 cltt Views (1776) Comments (0) Recommendations (1)

Example 4: Crawling and storing web images

Summary: Format of a web image link: http://www.example.com/picture.jpg. Image crawling code: import requests import os #url = 'https://image.baidu.com/search/detail?ct=503316480&z=&tn=baiduim ...

posted @ 2020-05-17 17:18 cltt Views (383) Comments (0) Recommendations (0)
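As a rough illustration of the storage step for a link in the format above, the sketch below derives a local file path from an image URL. The helper name `local_path_for` and the `pics` directory are hypothetical and do not come from the original post, and the download itself is left out.

```python
import os
import posixpath
from urllib.parse import urlsplit


def local_path_for(url, root="pics"):
    """Return a local path under `root` named after the last URL path segment."""
    # urlsplit drops the query string, so "?x=1" does not end up in the file name.
    filename = posixpath.basename(urlsplit(url).path)
    return os.path.join(root, filename)


print(local_path_for("http://www.example.com/picture.jpg"))
```

A crawler would typically create `root` if it is missing and skip the download when the file already exists.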

Example 3: Submitting search keywords to Baidu and 360

Summary: Baidu search: import requests keyword = 'Python' try: kv = {'wd':keyword} r = requests.get('http://www.baidu.com/s',params=kv) print(r.request.url) r.raise_for ...

posted @ 2020-05-17 16:34 cltt Views (1103) Comments (0) Recommendations (0)
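The `params=kv` argument in the summary above is what turns the keyword into a query string. The sketch below shows that step without touching the network, using the standard library's `urlencode` as a stand-in for what `requests` does internally; the helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    return base_url + "?" + urlencode(params)


# 'wd' is the Baidu keyword parameter used in the summary above.
print(build_search_url("http://www.baidu.com/s", {"wd": "Python"}))
```

Printing `r.request.url` after a real `requests.get(..., params=kv)` call should show a comparable URL.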

Crawler practice 2: Amazon

Summary: import requests r= requests.get('https://www.amazon.cn/dp/B01MYH8A99') print(r.status_code) r.encoding = r.apparent_encoding print(r.text) print(r.req ...

posted @ 2020-05-17 11:58 cltt Views (403) Comments (0) Recommendations (0)

Crawler practice 1: JD.com

Summary: url="https://item.jd.com/100012881854.html" kv = {'user-agent':'Mozilla/5.0'} r = requests.get(url,headers = kv) print(r.status_code) print(r.encoding ...

posted @ 2020-05-17 11:51 cltt Views (464) Comments (0) Recommendations (1)

Problems caused by web crawlers

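Summary: Restrictions on crawlers: source review; published notices. Robots protocol examples. Basic Robots protocol syntax. The robots.txt file always sits in the site's root directory. How to comply with the Robots protocol: when using a web crawler, identify robots.txt automatically or manually, then crawl the content. Binding force. How to comply ...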

posted @ 2020-05-17 11:38 cltt Views (177) Comments (0) Recommendations (0)
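One possible way to identify robots.txt rules automatically before crawling is the standard library's `urllib.robotparser`. The sketch below is not taken from the original post: the rules in `SAMPLE_ROBOTS_TXT`, the `my-crawler` user agent, and the helper `is_allowed` are all made up for illustration, and the rules are parsed from a string so nothing is fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; a real crawler would read them from <site root>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "http://www.example.com/index.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "http://www.example.com/private/data.html"))
```

With these sample rules the first URL is permitted and the second, under `/private/`, is not.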

Introduction to requests

Summary: import requests r = requests.get('http://www.baidu.com') print(r.status_code) r.encoding = 'utf-8' # otherwise the text is garbled print(r.text) 200<!DOCTYPE html><!--STATUS OK ...

posted @ 2020-05-17 09:05 cltt Views (268) Comments (0) Recommendations (0)
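The garbling mentioned in the summary above most likely comes from UTF-8 bytes being decoded with the wrong charset; `requests` falls back to ISO-8859-1 for text responses whose headers name no charset. The sketch below reproduces the effect offline with a made-up sample string and a hypothetical `decode_body` helper.

```python
def decode_body(raw, encoding):
    """Decode response bytes with the given charset, as setting r.encoding would."""
    return raw.decode(encoding)


# Sample bytes standing in for a UTF-8 encoded response body (assumed, not from the post).
raw = "百度一下".encode("utf-8")

garbled = decode_body(raw, "iso-8859-1")  # wrong charset: mojibake
correct = decode_body(raw, "utf-8")       # right charset: readable text

print(repr(garbled))
print(correct)
```

Setting `r.encoding = 'utf-8'` before reading `r.text` corresponds to the second call.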
