青春叛逆者

December 25, 2018

Summary: import re import lxml.html test_data = """ Scrape the 10 IP addresses below 128 54 38 220 . 107 12 . 99 75 . 79 . . . 82 196 . 74 179 141 . . . 180 162 45 196 . 119 157 188 222 . 37 . 165 25 79 154 . 11 61 . 239 102 ...
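The excerpt is cut off before the extraction logic, so the following is only a minimal sketch of pulling IPv4 addresses out of text with `re`, not the post's own solution. The sample string and the `find_ipv4` helper are hypothetical, and the addresses are documentation-range placeholders rather than the post's data.

```python
import re

# Hypothetical sample text (not the post's truncated test_data).
sample = "proxy 192.0.2.10 ok, bad 999.1.1.1, also 198.51.100.7:8080"

# Four dot-separated groups of 1-3 digits that are not glued to other digits or dots.
CANDIDATE = re.compile(
    r"(?<![\d.])(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})(?![\d.])"
)


def find_ipv4(text):
    """Return dotted-quad strings whose four octets are all in the range 0..255."""
    found = []
    for match in CANDIDATE.finditer(text):
        # The pattern alone would accept 999.1.1.1, so check each octet numerically.
        if all(int(octet) <= 255 for octet in match.groups()):
            found.append(match.group(0))
    return found


print(find_ipv4(sample))
```

Checking the octet range in Python rather than inside the pattern keeps the regular expression short and readable.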

posted @ 2018-12-25 09:49 青春叛逆者 Views(392) Comments(0) Recommended(0)

December 24, 2018

Python crawler: Tieba cat forum, advanced

Summary: import requests from lxml import etree import json class Tieba: def __init__(self,tieba_name): self.tieba_name = tieba_name # receives the Tieba forum name # set a mobile User-Agent self.headers = {"User-Agent...

posted @ 2018-12-24 20:23 青春叛逆者 Views(317) Comments(0) Recommended(0)

Python multithreaded crawler (Qiushibaike)

Summary: # coding=utf-8 import requests from lxml import etree import json from queue import Queue import threading class Qiubai: def __init__(self): self.headers = { "User-Agent": "M...
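The excerpt stops before the threading logic, so here is a rough sketch of the general `Queue` plus `threading` worker pattern that such a crawler typically relies on. It is not the post's `Qiubai` class: `fake_fetch`, `worker`, and `crawl` are hypothetical names, and the fetch step is a stand-in that makes no network request.

```python
import threading
from queue import Queue


def fake_fetch(url):
    # Stand-in for a real HTTP request (assumption: no network access here).
    return "<html>" + url + "</html>"


def worker(url_queue, results, lock):
    """Take URLs off the queue until a None sentinel arrives."""
    while True:
        url = url_queue.get()
        if url is None:
            url_queue.task_done()
            break
        page = fake_fetch(url)
        with lock:  # protect the shared dict while writing
            results[url] = len(page)
        url_queue.task_done()


def crawl(urls, num_threads=3):
    """Process all URLs with a small pool of worker threads."""
    url_queue = Queue()
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(url_queue, results, lock))
        for _ in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for url in urls:
        url_queue.put(url)
    for _ in threads:
        url_queue.put(None)  # one sentinel per worker
    url_queue.join()  # wait until every queued item is marked done
    for thread in threads:
        thread.join()
    return results


pages = ["http://example.com/page/%d" % number for number in range(1, 6)]
print(crawl(pages))
```

Sending one sentinel per worker lets every thread exit cleanly instead of relying on daemon threads being killed at interpreter shutdown.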

posted @ 2018-12-24 20:21 青春叛逆者 Views(240) Comments(0) Recommended(0)

Python crawler: obtaining the login cookie

Summary: import lxml.html import requests def parse_form(html): tree=lxml.html.fromstring(html) data={} for e in tree.cssselect('form input'): if e.get('name'): data[e.get('nam...
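The post's `parse_form` uses `lxml` with `cssselect`, and the excerpt is truncated. As an illustration of the same idea (collecting the named `<input>` fields of a form so they can be posted back on login), here is a sketch that swaps in the standard library's `html.parser`. The `FormInputParser` class, the sample HTML, and the `_formkey` field name are all hypothetical.

```python
from html.parser import HTMLParser


class FormInputParser(HTMLParser):
    """Collect name/value pairs from <input> tags that appear inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.data = {}

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            attributes = dict(attrs)
            name = attributes.get("name")
            if name:  # skip inputs without a name, e.g. a bare submit button
                self.data[name] = attributes.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def parse_form(html):
    parser = FormInputParser()
    parser.feed(html)
    return parser.data


# Hypothetical login page fragment.
sample_html = """
<form action="/login" method="post">
  <input type="hidden" name="_formkey" value="abc123">
  <input type="text" name="email">
  <input type="password" name="password" value="">
  <input type="submit" value="Login">
</form>
<input name="outside" value="x">
"""

print(parse_form(sample_html))
```

Hidden fields such as the form key matter here: a login request that omits them is commonly rejected, which is why the whole form is parsed rather than just the email and password inputs.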

posted @ 2018-12-24 16:53 青春叛逆者 Views(2812) Comments(0) Recommended(0)

Python CSS selectors: a crawler pitfall

Summary: 菜鸟教程 (runoob.com) My name is Donald I live in Ducksburg Note: for :before to work in IE8, a DOCTYPE must be declared. 菜鸟教程 (runoob.com) My name is Donald I live in Ducksburg Note: for :after to work in IE8, a !DOCTYPE must be declared ...

posted @ 2018-12-24 15:00 青春叛逆者 Views(509) Comments(0) Recommended(0)

18 commonly used Linux commands

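Summary: pwd: show the working directory; ls: list the files in a directory; cd /home: enter the '/home' directory; cd ..: go up one directory level; cd ../..: go up two directory levels; mkdir dir1: create a directory named 'dir1'; rm -f file1: delete a file named 'file1' (the -f option ignores nonexistent files and never prompts); rmdir dir1: delete a directory named 'dir1'; groupadd group...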

posted @ 2018-12-24 08:52 青春叛逆者 Views(178) Comments(0) Recommended(0)

December 22, 2018

Python basics: regular expressions

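Summary:
# Regular expressions
### Use cases
- Finding, splitting, and replacing strings that follow a specific pattern
- Validating specific formats (email addresses, phone numbers, IPs, URLs, etc.)
- Extracting specific content in crawler projects
### Guidelines
- If string functions can solve the problem, do not use a regular expression
- Regular expressions are relatively slow and also reduce code readability
- The three hardest things in the world to understand: a doctor's prescription, a Taoist priest's talisman, and a programmer's regex
- Reminder: regular expressions are meant to be written, not read; without a clear idea of what one is for...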
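The guidelines above can be illustrated with a small sketch: string methods handle the fixed-separator part, and `re` is used only for format validation. The sample line and both patterns are simplified assumptions for illustration, not complete validators for real email addresses or phone numbers.

```python
import re

# Hypothetical input line.
line = "user=alice; email=alice@example.com; phone=13800138000"

# Fixed separators: plain string methods are enough, no regex needed.
fields = dict(part.strip().split("=", 1) for part in line.split(";"))

# Format checks: this is where a regex earns its keep.
# Both patterns are deliberately simplified (assumptions).
PHONE = re.compile(r"1\d{10}")  # 11 digits starting with 1
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")


def is_phone(text):
    return PHONE.fullmatch(text) is not None


def is_email(text):
    return EMAIL.fullmatch(text) is not None


print(fields)
print(is_email(fields["email"]), is_phone(fields["phone"]))
```

Using `fullmatch` rather than `search` means the whole string has to fit the pattern, which is what a validation check needs.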

posted @ 2018-12-22 17:59 青春叛逆者 Views(215) Comments(0) Recommended(0)

Python: a multi-feature web page downloader

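Summary: # Download a web page # Features: catch exceptions, retry the download, and set the user agent import urllib.request import urllib.error # Download a web page # wscp: the default user agent, short for "web scraping with python" def download(url, user_agent='wscp',num_retries=2): print('Downloading:',url)...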
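The excerpt shows only the signature and the first `print`, so the body below is a sketch of how the three listed features (catching exceptions, retrying, setting the user agent) could fit together with `urllib`. Retrying only on 5xx status codes is an assumption, not something the excerpt confirms.

```python
import urllib.error
import urllib.request


def download(url, user_agent="wscp", num_retries=2):
    """Return the page body as bytes, or None if the download fails."""
    print("Downloading:", url)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print("Download error:", error.code, error.reason)
        # Assumption: only server-side (5xx) errors are worth retrying.
        if num_retries > 0 and 500 <= error.code < 600:
            return download(url, user_agent, num_retries - 1)
        return None
    except urllib.error.URLError as error:
        # No HTTP response at all, e.g. a DNS failure or refused connection.
        print("Download error:", error.reason)
        return None
```

A call such as `download('http://example.webscraping.com')` would return the raw bytes; decoding them is left to the caller because the page encoding is not known in advance.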

posted @ 2018-12-22 11:43 青春叛逆者 Views(284) Comments(0) Recommended(0)

A small pitfall when installing the urllib2 package on Python 3

Summary: Here is an example to help you understand:
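The post's own example is not included in the excerpt. The underlying point is that Python 3 has no separate `urllib2` module: its functionality lives in the standard library's `urllib.request` and `urllib.error`. The sketch below shows the equivalent imports; the `build_request` helper and its default user agent are hypothetical, and nothing is actually fetched.

```python
# Python 2 style (fails on Python 3 because urllib2 no longer exists):
#     import urllib2
#     response = urllib2.urlopen(url)
#
# Python 3 equivalent: the same features, split across two standard modules.
import urllib.error
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a request object the way urllib2.Request was used on Python 2."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("http://example.com/")
print(request.full_url, request.get_method())
# Opening it would be: urllib.request.urlopen(request)
# Errors to catch live in urllib.error: HTTPError and URLError.
```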

posted @ 2018-12-22 11:01 青春叛逆者 Views(26124) Comments(0) Recommended(2)

Libraries needed for Python crawlers

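Summary: pip install builtwith This module takes a URL as its argument, downloads and analyzes that URL, and then returns the technologies the website uses. Below is an example of using the module. import builtwith builtwith.parse('http://example.webscraping.com') {'web-servers': ['Nginx'], 'web-frameworks': ['Web2py'...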

posted @ 2018-12-22 10:21 青春叛逆者 Views(374) Comments(0) Recommended(0)
