Practice Problems (14)

Problem

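Write code that meets the following requirements:
(1) Download the content of the page http://www.newsgd.com/news/2020-05/05/content_190848024.htm and save it as mt.html.
(2) Count all the words inside the <p> tags in mt.html and store them in mt_word.txt, with these requirements:
a) One word per line: the word first, then the number of times it occurs, separated by a Tab (\t).
b) Words are ordered from most to least frequent. For example, if word a occurs 100 times and word b occurs 10 times, a must come before b.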
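Requirements a) and b) can be illustrated with a small sketch. The counts below are made up for illustration and `format_lines` is a hypothetical helper name; the point is only the tab-separated layout and the descending order.

```python
# Hypothetical counts, used only to illustrate the required output format.
counts = {"b": 10, "a": 100, "c": 1}


def format_lines(word_counts):
    """Return 'word<TAB>count' lines, most frequent word first."""
    # Sort the (word, count) pairs by count, largest first.
    ordered = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
    return [f"{word}\t{count}" for word, count in ordered]


lines = format_lines(counts)
print("\n".join(lines))
```

Because `sorted` is stable, words with equal counts keep the order in which they were inserted into the dictionary.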

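Analysis

The key points are knowing how to use BeautifulSoup to pull the text out of the <p> tags and how to clean up that text into words. The sorting step at the end is similar to earlier problems in this series.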
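As a minimal sketch of the BeautifulSoup part, the snippet below extracts the text of every <p> tag from a small made-up HTML string (an assumption standing in for the downloaded page). It uses the built-in "html.parser" backend so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the downloaded page.
html = """
<html><body>
  <h1>Title</h1>
  <p>First <b>paragraph</b>.</p>
  <div><p>Second paragraph.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all("p") returns every <p> tag in document order, however deeply nested;
# get_text() joins the text of the tag and of all its children.
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)
```

The text inside the nested <b> tag is included, while the <h1> heading is left out because it is not inside a <p> tag.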

Code implementation

import requests
from bs4 import BeautifulSoup

# 1. Download the page content and save it as mt.html
response = requests.request("get", "http://www.newsgd.com/news/2020-05/05/content_190848024.htm")

with open("mt.html", "wb") as f:
    f.write(response.content)


# 2. Count all the words inside the <p> tags and how often each occurs, and save the result to mt_word.txt

# Parse the page and collect the text of every <p> tag
soup = BeautifulSoup(response.text, features="lxml")
tag = soup.find_all(name = "p")
list_p = []
for i in tag:
    list_p.append(i.get_text())

# Merge all the text into a single string
str_p = " ".join(list_p)
print(str_p)

# Count each word. str.split() with no argument splits on any run of whitespace
# (spaces, newlines, tabs); strip() removes punctuation from both ends of a word.
# Counting the split tokens avoids str.count(), which would also match a word
# when it appears inside a longer word.
word_count = {}
for word in str_p.split():
    word = word.strip(',.()""/; ')
    if word == "":
        continue
    word_count[word] = word_count.get(word, 0) + 1
print(word_count)

# Sort the words by count in descending order, then write them to the file
blist = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
with open("mt_word.txt", "w", encoding="utf-8") as f:
    for key, value in blist:
        line = key + "\t" + str(value) + "\n"
        f.write(line)
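As an alternative sketch of the counting step, assuming the standard library's `collections.Counter` is acceptable for this exercise: the function name `count_words` and the sample sentence are made up for illustration. It counts whole tokens, and the last line shows why calling `str.count` on the full text would give a different number for short words.

```python
from collections import Counter


def count_words(text):
    """Count whitespace-separated tokens after trimming punctuation."""
    counter = Counter()
    for token in text.split():
        # Trim punctuation from both ends of the token.
        word = token.strip(',.()"/; ')
        if word:
            counter[word] += 1
    return counter


sample = "a cat sat; a cat ran."
counts = count_words(sample)

# most_common() returns (word, count) pairs, highest count first.
print(counts.most_common())

# str.count() counts substring matches, so it also finds the "a" inside
# "cat", "sat" and "ran".
print(sample.count("a"))
```

Each pair from `most_common()` can be written to mt_word.txt as the word, a tab, and the count.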

References

《Python基础面试题整理》 (Python Basics Interview Questions Roundup)
《面试题（三）》 (Interview Questions (3))

posted @ 2020-05-05 10:54 cnhkzyy
