slowlydance2me

October 29, 2022

Summary: The code is as follows:
import jieba
jieba.setLogLevel(jieba.logging.INFO)  # do not print jieba's own log messages
sentence = input("Enter a sentence: ")
seg_list = jieba.cut(sentence)
print("Output sentence: ", "/".j…

posted @ 2022-10-29 14:15 slowlydance2me Views (52) Comments (0) Recommendations (0)
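
The jieba excerpt above is cut off in the middle of the print call. Here is a minimal sketch of the same idea, assuming the truncated call joins the segments with "/". The function name segment and its cut parameter are hypothetical; they are only there so the joining logic can be tried with a stand-in tokenizer when jieba is not installed.

```python
def segment(sentence, cut=None):
    """Split a sentence into words and join them with '/'."""
    if cut is None:
        # Import lazily so the helper also works with a stand-in tokenizer.
        import jieba
        jieba.setLogLevel(jieba.logging.INFO)  # hide jieba's start-up log lines
        cut = jieba.cut
    return "/".join(cut(sentence))


# With jieba installed: print(segment("我来到北京清华大学"))
# With a stand-in tokenizer that splits on whitespace:
print(segment("text data mining", cut=str.split))
```

jieba.cut returns a generator, so joining it directly avoids building an intermediate list.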

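Python: an error when installing the third-party jieba word-segmentation library, and how to fix it

Summary: While installing the third-party jieba library, pip failed with pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.

posted @ 2022-10-29 14:04 slowlydance2me Views (203) Comments (0) Recommendations (0)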
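
The excerpt does not show which fix the post settles on. One common workaround for a read timeout is to give pip more time through its --default-timeout option, and optionally to point it at a closer package index with --index-url. The sketch below only builds and runs such a command; the helper names and the 100-second value are assumptions, not taken from the post.

```python
import subprocess
import sys


def build_pip_command(package, timeout=100, index_url=None):
    """Build a 'pip install' command with a longer network timeout."""
    cmd = [sys.executable, "-m", "pip", "install",
           "--default-timeout", str(timeout), package]
    if index_url:
        # Optional: a PyPI mirror that is faster from your network.
        cmd += ["--index-url", index_url]
    return cmd


def install(package, **options):
    """Run the command; raises CalledProcessError if pip fails."""
    subprocess.run(build_pip_command(package, **options), check=True)


# Usage (not run here because it changes the environment):
# install("jieba", timeout=100)
```

Calling pip through sys.executable keeps the install tied to the interpreter that will later import the package.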

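Python Tips ----- Common data structures

Summary: Common data structures
list []
Creation:
1. Create a list by enclosing known elements in square brackets:
mylist = ['orange', 'apple', 1, 2, 3.14]
2. Create an empty list with square brackets, then add elements dynamically with append():
mylist = []
mylist.append('orange')

posted @ 2022-10-29 10:29 slowlydance2me Views (26) Comments (0) Recommendations (0)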
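
The two ways of creating a list described in the summary above can be tried side by side. This is a small sketch; the second list is named fruits here (an assumption) so that both results stay visible.

```python
# 1. Create a list from known elements; a list may mix types.
mylist = ['orange', 'apple', 1, 2, 3.14]
print(len(mylist))

# 2. Start from an empty list and append elements as they become available.
fruits = []
fruits.append('orange')
fruits.append('apple')
print(fruits)
```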

October 28, 2022

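Text data mining assignment, experiment 1 ----- Scraping data

Summary:
# 1. Locate the call-category section
# 2. Extract the subpage link address child_href1
# 3. Extract the desired data from the subpage
# 4. Locate the detailed call records and enter the second-level subpage
# 5. Extract the second-level subpage link address child_href2
# 6. Extract the desired data from the second-level subpage (call details)
The code is as follows:
# 1. Locate…

posted @ 2022-10-28 22:57 slowlydance2me Views (92) Comments (0) Recommendations (0)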
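
The six steps in the summary above describe a crawl that goes two levels deep. The sketch below shows only that control flow; it is not the post's code. The function crawl_two_levels and its fetch, extract_links and extract_data parameters are hypothetical, and a small in-memory "site" stands in for real HTTP requests.

```python
def crawl_two_levels(start_url, fetch, extract_links, extract_data):
    """Follow links two levels deep and collect data from both levels."""
    results = []
    for child_href1 in extract_links(fetch(start_url)):
        page1 = fetch(child_href1)
        results.append((child_href1, extract_data(page1)))
        for child_href2 in extract_links(page1):
            page2 = fetch(child_href2)
            results.append((child_href2, extract_data(page2)))
    return results


# A made-up site: each "page" lists its links and the data it carries.
pages = {
    "/": {"links": ["/category"], "data": "home"},
    "/category": {"links": ["/call/1", "/call/2"], "data": "category summary"},
    "/call/1": {"links": [], "data": "call 1 details"},
    "/call/2": {"links": [], "data": "call 2 details"},
}

results = crawl_two_levels(
    "/",
    fetch=pages.__getitem__,
    extract_links=lambda page: page["links"],
    extract_data=lambda page: page["data"],
)
for url, data in results:
    print(url, data)
```

In a real crawler, fetch would wrap an HTTP request and the two extract functions would parse the returned HTML.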

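Python crawler ----- Scraping ZBJ (猪八戒网)

Summary:
1. Use element inspection to find the container of one module and copy its full XPath.
2. Adapt that XPath and use a loop to reach every module.
import requests
from lxml import etree
# fetch the page source
url = "https://chengdu.zbj.com/search/service/?kw=saas"…

posted @ 2022-10-28 20:40 slowlydance2me Views (274) Comments (0) Recommendations (0)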

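Python crawler ----- XPath

Summary: XPath is a language for searching content in XML documents. HTML is a tree-structured markup language closely related to XML (strictly speaking it is not a subset of XML), so XPath can also be applied to parsed HTML.
XML code example:
"""
<book>
<id>1</id>
<name>野花遍地香</name>
<price>1.23</price>
<author>
<nick>周大枪</nick>
<nick>周芷若</nick>…

posted @ 2022-10-28 19:57 slowlydance2me Views (30) Comments (0) Recommendations (0)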
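
To make the XPath idea above concrete, here is a small sketch that runs path queries against the book sample. Two things are assumptions: the closing tags, which were added because the excerpt is cut off, and the use of the standard library's xml.etree.ElementTree, which supports only a subset of XPath. Other posts on this page import lxml's etree for XPath work, but the stdlib module keeps this sketch free of third-party packages.

```python
import xml.etree.ElementTree as ET

# The sample from the excerpt, with closing tags added (an assumption).
xml_text = """
<book>
    <id>1</id>
    <name>野花遍地香</name>
    <price>1.23</price>
    <author>
        <nick>周大枪</nick>
        <nick>周芷若</nick>
    </author>
</book>
"""

root = ET.fromstring(xml_text)  # root is the <book> element

# A child of the root, addressed by tag name.
name = root.findtext("name")

# A path: every <nick> that is a direct child of <author>.
nicks = [node.text for node in root.findall("./author/nick")]

# ".//" searches at any depth below the current element.
all_nicks = [node.text for node in root.findall(".//nick")]

print(name, nicks, all_nicks)
```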

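Python crawler ----- Scraping and downloading images with Bs4

Summary:
# 1. Get the source of the main page and, from it, the subpage link addresses (href)
# 2. Use each href to fetch the subpage and find the image download address in it: img -> src
# 3. Download the image
import requests
from bs4 import BeautifulSoup
import time
import url…

posted @ 2022-10-28 19:30 slowlydance2me Views (120) Comments (0) Recommendations (0)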
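
Steps 2 and 3 above (find img -> src, then save the file) can be sketched as follows. This is not the post's code: it uses the standard library's html.parser in place of BeautifulSoup so that it runs without third-party packages, and the helper names, the example.com URL and the img folder are all made up for illustration. The bytes to save are passed in, so no network call happens here.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_urls(page_url, html):
    """Return absolute image URLs found in the HTML of page_url."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return [urljoin(page_url, src) for src in collector.sources]


def file_name_for(image_url):
    """Use the last path segment of the URL as the local file name."""
    return os.path.basename(urlparse(image_url).path)


def save_image(image_url, content, folder="img"):
    """Write already-downloaded bytes to folder/<file name>."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, file_name_for(image_url))
    with open(path, "wb") as f:
        f.write(content)
    return path


# A made-up subpage with one image.
html = '<div><img src="/pics/cat.jpg" alt="cat"></div>'
urls = image_urls("https://example.com/album/1.html", html)
print(urls, file_name_for(urls[0]))
```

With a real download, the content argument of save_image would be the raw bytes of the HTTP response for the image URL.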

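Python crawler, Bs4 parsing ----- HTML syntax

Summary: Bs4. Full name: beautifulsoup4, literally "beautiful soup, version 4". It is a library for extracting data from HTML or XML files. How it differs from the re and xpath approaches: re: cumbersome to use and hard to read. xpath: requires a specific syntax. bs4: you only need to remember a few methods, such as find(), find_al…

posted @ 2022-10-28 13:33 slowlydance2me Views (95) Comments (0) Recommendations (0)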

October 27, 2022

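Python crawler ----- Scraping 电影天堂 (dytt8.net)

Summary: The code is as follows:
# 1. Locate the latest-movie-updates section on 电影天堂
# 2. Extract the subpage link addresses from it
# 3. Request each subpage link and get the download address
import requests
import re
domain = "https://dy.dytt8.net/index2.htm"
resp = reque…

posted @ 2022-10-27 23:02 slowlydance2me Views (893) Comments (0) Recommendations (0)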
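
Steps 1 and 2 above can be sketched with re alone: one pattern isolates the section, a second pulls the links out of it, and urljoin turns them into absolute addresses. The HTML fragment, the class name latest and the link paths below are invented for illustration; only the domain value comes from the excerpt, and the real page's markup may differ.

```python
import re
from urllib.parse import urljoin

domain = "https://dy.dytt8.net/index2.htm"

# Hypothetical page fragment; the real markup is not shown in the excerpt.
html = """
<div class="latest">
<ul>
<li><a href='/html/gndy/dyzz/1.html'>Movie A</a></li>
<li><a href='/html/gndy/dyzz/2.html'>Movie B</a></li>
</ul>
</div>
<ul><li><a href='/other.html'>Other</a></li></ul>
"""

# re.S lets "." match newlines, so a pattern can span several lines.
section_re = re.compile(r'<div class="latest">.*?<ul>(?P<ul>.*?)</ul>', re.S)
link_re = re.compile(r"<a href='(?P<href>.*?)'", re.S)

# Step 1: narrow the search to the section of interest.
section = section_re.search(html)

# Step 2: collect the subpage links inside it and make them absolute.
child_urls = [
    urljoin(domain, match.group("href"))
    for match in link_re.finditer(section.group("ul"))
]
print(child_urls)
```

Narrowing to the section first keeps unrelated links elsewhere on the page, such as /other.html here, out of the result.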

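Python crawler ----- Scraping the Douban Top 250 movie chart

Summary: Step 1. Open the page and view its source. Use Ctrl+F to search the source for content you saw on the page. If the content is present in the source, the data is rendered by the server, so the task is to fetch the source and extract the data with re regular expressions. Step 2. Run code that requests the page source. No response comes back, which indicates that the site uses anti-scraping measures, so the user agent needs to be changed (U…

posted @ 2022-10-27 21:31 slowlydance2me Views (140) Comments (0) Recommendations (0)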
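
The two steps above, sending a browser-like User-Agent and then extracting fields with re, can be sketched as follows. Several things here are assumptions rather than the post's code: the URL is not in the excerpt, the User-Agent string is just an example, the sample HTML is invented, and urllib.request from the standard library is used where the post most likely uses requests. The network call is left commented out.

```python
import re
import urllib.request

url = "https://movie.douban.com/top250"  # assumed; not shown in the excerpt

# Step 2: identify as a browser instead of the default Python client.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
request = urllib.request.Request(url, headers=headers)
# html = urllib.request.urlopen(request).read().decode("utf-8")

# Step 1: the data is in the page source, so a regex can pull it out.
# Invented sample markup; the real page's structure may differ.
sample = """
<li><span class="title">Movie A</span><span class="rating_num">9.7</span></li>
<li><span class="title">Movie B</span><span class="rating_num">9.6</span></li>
"""

pattern = re.compile(
    r'<span class="title">(?P<name>.*?)</span>'
    r'.*?<span class="rating_num">(?P<score>.*?)</span>',
    re.S,
)
movies = [(m.group("name"), m.group("score")) for m in pattern.finditer(sample)]
print(movies)
```

Named groups such as (?P<name>...) keep the extraction readable when more fields are added later.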
