word_cloud: word cloud usage and a basic wrapper

wordcloud word clouds

References: Using the jieba and wordcloud libraries, [official documentation]

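wordcloud is a Python library for generating word clouds. It is easy to use and quite powerful.

Before using it, install the wordcloud module: pip3 install wordcloud (it pulls in quite a few dependencies, so using the Taobao mirror is recommended)

It renders the keywords you supply as a single image, and you can customize the font, colors, shape, and so on.

It can be used to show the keywords of an article or trending search terms (the size of a word reflects how often it appears, which is very intuitive).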

Basic usage of the wordcloud library

Code

import wordcloud

c = wordcloud.WordCloud()                           # create the word cloud object
c.generate("wordcloud by Python")                   # load the text to draw
c.to_file("wordcloud.png")                          # write the word cloud image to a file

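The WordCloud constructor accepts the following parameters (all of them are optional):

Parameter	Description
width	Width of the generated image (default 400 px)
height	Height of the generated image (default 200 px)
min_font_size	Smallest font size used in the word cloud (default 4)
max_font_size	Largest font size used in the word cloud (default None, in which case the image height is used)
font_step	Step size between font sizes (default 1)
font_path	Path to the font file (required when the text contains Chinese, otherwise it will not render properly)
max_words	Maximum number of words shown in the word cloud (default 200)
stopwords	Words that should not appear in the word cloud (a set)
background_color	Background color of the image (default black)
mask	Image array that defines the shape of the word cloud (the loaded pixel data, not a file path)
margin	Spacing reserved around each word when it is placed (default 2)
...	other parameters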
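As a hedged illustration of the table above, the sketch below collects a few of these parameters in a plain dictionary and passes them to the constructor. The option values, the extra stop words and the helper name build_wordcloud are assumptions made for this example; wordcloud.STOPWORDS is the library's built-in English stop word set.

```python
# Illustrative settings only; every value here is an assumption for the example.
OPTIONS = {
    "width": 400,                 # library default
    "height": 200,                # library default
    "background_color": "white",
    "max_words": 100,
    "min_font_size": 4,
    "margin": 2,
}

EXTRA_STOPWORDS = {"said", "one"}  # hypothetical words to hide


def build_wordcloud(options=None, extra_stopwords=None):
    """Create a WordCloud from a dict of keyword arguments (hypothetical helper)."""
    import wordcloud  # imported lazily so the settings can be inspected without the library

    options = dict(OPTIONS if options is None else options)
    extra = EXTRA_STOPWORDS if extra_stopwords is None else extra_stopwords
    # Passing `stopwords` replaces the built-in set, so merge the two explicitly.
    stopwords = set(wordcloud.STOPWORDS) | set(extra)
    return wordcloud.WordCloud(stopwords=stopwords, **options)
```

Calling build_wordcloud().generate("some text").to_file("demo.png") would then draw the cloud with these settings.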

Specifying the shape of the word cloud

For the jieba module, see my previous post: [Personal notes on jieba word segmentation: a usage guide]

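import jieba
import wordcloud
from imageio import imread              # scipy.misc.imread was removed from recent SciPy releases; pip3 install imageio

mask = imread("yun.png")                # read the image data into mask

with open("文档.txt", "r", encoding="utf-8") as f:
    data = f.read()

ls = jieba.lcut(data)                   # segment the text into words
txt = " ".join(ls)                      # join the words into one space-separated string

w = wordcloud.WordCloud(mask=mask)      # the mask sets the shape; also pass font_path when the text is Chinese
w.generate(txt)
w.to_file("output.png")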
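The mask handed to WordCloud is an array of pixel values: pure white (255) entries are treated as masked out and words are drawn on everything else. To make that concrete without needing an image file, here is a minimal sketch that builds a circular mask by hand. The helper names circle_mask and circle_wordcloud, and the default size and radius, are assumptions for this example.

```python
def circle_mask(size, radius):
    """Return a size x size grid: 0 inside the circle (drawable), 255 outside (masked out)."""
    center = (size - 1) / 2
    return [
        [0 if (x - center) ** 2 + (y - center) ** 2 <= radius ** 2 else 255
         for x in range(size)]
        for y in range(size)
    ]


def circle_wordcloud(text, output_file, size=300, radius=140):
    """Draw `text` inside a circle; third-party imports are kept local to this function."""
    import numpy as np
    import wordcloud

    mask = np.array(circle_mask(size, radius), dtype=np.uint8)
    w = wordcloud.WordCloud(mask=mask, background_color="white")
    w.generate(text)
    w.to_file(output_file)
```

When a mask is supplied, the image takes its size from the mask and the width and height arguments are ignored.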

Word cloud demo: lines from The Legend of 1900

import jieba  # jieba segments the text; meaningless words are filtered out below via stopwords
import wordcloud
import imageio  # install this module first: pip3 install imageio

# Lines from The Legend of 1900
text = """
1、All that city you just couldn't see an end to it.
城市那么大，看不到尽头。
2、The end? Please? Can you please just show me where it ends?
尽头在哪里？可以给我看看尽头么？
3、It was all very fine on that gangway and I was grand too in my overcoat. I cut quite a figure, and I was getting off, guaranteed. That wasn't the problem.
当年，我踏上跳板，不觉得困难。我穿上大衣，很神气，自觉一表人才，有决心，有把握，有信心。
4、It wasn't what I saw that stopped me, Max, it was what I didn't see. Do you understand that? What I didn't see.
我停下来，不是因为所见，是因为所不见。你明不明白？是因为看不见的东西。
5、In all that sprawling city, there was everything except an end. There was no end.
连绵不绝的城市，什么都有，除了尽头，没有尽头。
6、What I did not see was where the whole thing came to an end, the end of the world.
我看不见城市的尽头，我需要看见世界尽头。
7、Take a piano, hmm? The keys begin, the keys end. You know there are 88 of them. Nobody can tell you different. They are not infinite
拿钢琴来说，键盘有始也有终。有88个键，错不了，并不是无限的。
8、You are infinite. And on those keys, the music that you can make is infinite.
音乐是无限的。在琴键上，奏出无限的音乐
9、I like that. That I can live by.
我喜欢那样，我应付得来。
"""

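# Segment the text with jieba, then join the words with spaces into a single string
seg_list = jieba.cut(text, cut_all=True)
txt = " ".join(seg_list)
# print(txt)

stopwords = {"end", "尽头"}  # words to exclude from the word cloud

mask = imageio.imread("钢琴师.jpg")  # shape of the word cloud (derived from this image)

# Create the word cloud object and set the basic options
wc = wordcloud.WordCloud(background_color="white",  # background color of the image
                         max_words=50,  # maximum number of words to show
                         font_path="苹方黑体-细-简.ttf",  # Chinese text needs a font file (download and install it, otherwise the image shows small boxes)
                         mask=mask,
                         margin=20,  # spacing reserved around each word
                         max_font_size=50,  # largest font size
                         stopwords=stopwords)
wc.generate(txt)
wc.to_file("output.png")  # save the word cloud to a file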
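Besides the stopwords parameter, unwanted tokens can also be dropped before the words are joined, which gives finer control over what reaches wordcloud. Full-mode segmentation may emit punctuation or empty tokens, and the sketch below removes those together with the stop words. The function name filter_tokens and the min_length threshold are assumptions for this example.

```python
def filter_tokens(tokens, stopwords, min_length=2):
    """Keep tokens that are long enough and are not stop words (case-insensitive)."""
    blocked = {word.lower() for word in stopwords}
    kept = []
    for token in tokens:
        token = token.strip()
        if len(token) < min_length:
            continue  # drops empty strings and single characters such as punctuation
        if token.lower() in blocked:
            continue
        kept.append(token)
    return kept
```

In the demo this could be used as txt = " ".join(filter_tokens(jieba.cut(text), {"end", "尽头"})).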

钢琴师.jpg (the mask image)

output.png (the generated word cloud)

General-purpose word cloud code

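I found an example online and tried to run it, but it failed with numpy.AxisError: axis 1 is out of bounds for array of dimension 1

I spent a long time tracing the error without finding the cause. My hunch was that the arguments were at fault: the .get calls had no default value, so a missing key yields None, and None does not look like a sensible value for width and the like. I gave .get a default value, and that turned out to be exactly the problem.

So I added default values and adjusted the order and names of the parameters.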
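The failure described above comes from dict.get returning None for a missing key, which then overrides the library's own default. An alternative to hard-coding defaults is to forward only the options the caller actually supplied, so that WordCloud falls back to its own defaults for the rest. This is a minimal sketch of that idea; ALLOWED_OPTIONS and pick_options are hypothetical names, not part of wordcloud.

```python
ALLOWED_OPTIONS = ("width", "height", "background_color", "font_path", "max_words")


def pick_options(kwargs):
    """Return only the recognised options that were supplied with a real value."""
    return {
        name: value
        for name, value in kwargs.items()
        if name in ALLOWED_OPTIONS and value is not None
    }


# dict.get without a default silently yields None for a missing key:
assert {"height": 200}.get("width") is None
```

The constructor call then becomes wordcloud.WordCloud(**pick_options(kwargs)).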

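You can use this example directly to generate word clouds without writing your own wrapper (if you need more customization you can extend it, for example by adding stop words to wordcloud or a custom dictionary to jieba so that the segmentation matches what you expect).

One more note: keep the source text reasonably small, otherwise it will use a lot of memory.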
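One possible way to keep memory in check is to count words line by line and hand wordcloud a frequency table through generate_from_frequencies, instead of building one huge joined string. The sketch below assumes that approach; count_tokens and cloud_from_file are hypothetical helper names, and how much this helps depends on the input.

```python
from collections import Counter


def count_tokens(lines, tokenize=str.split):
    """Accumulate token counts one line at a time instead of joining the whole text."""
    counts = Counter()
    for line in lines:
        counts.update(token for token in tokenize(line) if token.strip())
    return counts


def cloud_from_file(input_text_file, output_file, **options):
    """Stream a UTF-8 text file through jieba and render the counts as a word cloud."""
    import jieba
    import wordcloud

    with open(input_text_file, "r", encoding="utf-8") as f:
        counts = count_tokens(f, tokenize=jieba.lcut)  # a file object yields lines lazily

    w = wordcloud.WordCloud(**options)
    w.generate_from_frequencies(counts)
    w.to_file(output_file)
```

Note that generate_from_frequencies does not apply the stopwords option, so unwanted words have to be removed from the counts beforehand.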

import jieba
import wordcloud
from imageio import imread  # install it if it is missing: pip3 install imageio


def make_cloud(input_text_file, output_file, **kwargs):
    """
    General-purpose word cloud generator.
    :param input_text_file: path or name of the input text file
    :param output_file: path or name of the output image
    :param kwargs: WordCloud options (width, height, background_color, font_path, max_words)
    :return: None
    """
    width = kwargs.get("width", 400)  # a default is required: WordCloud expects an integer, and .get() without a default returns None
    height = kwargs.get("height", 200)
    background_color = kwargs.get("background_color", "black")  # same as the library default
    font_path = kwargs.get("font_path")  # None falls back to the bundled font, which cannot render Chinese
    max_words = kwargs.get("max_words", 200)

    with open(input_text_file, "r", encoding="utf-8") as f:
        data = f.read()

    ls = jieba.lcut(data)  # segment the text into words
    txt = " ".join(ls)  # join the words into one space-separated string

    w = wordcloud.WordCloud(width=width, height=height, background_color=background_color, font_path=font_path,
                            max_words=max_words)
    w.generate(txt)
    w.to_file(output_file)


def make_cloud_png(input_text_file, shape_file, output_file, **kwargs):
    """
    General-purpose word cloud generator that follows the shape of a given image.
    :param input_text_file: path or name of the input text file
    :param shape_file: path or name of the image that defines the shape of the word cloud
    :param output_file: path or name of the output image
    :param kwargs: WordCloud options (width, height, background_color, font_path, max_words)
    :return: None
    """
    width = kwargs.get("width", 400)  # width and height are ignored once a mask is given
    height = kwargs.get("height", 200)
    background_color = kwargs.get("background_color", "black")  # same as the library default
    font_path = kwargs.get("font_path")
    max_words = kwargs.get("max_words", 200)
    mask = imread(shape_file)  # load the shape image as an array

    with open(input_text_file, "r", encoding="utf-8") as f:
        data = f.read()

    ls = jieba.lcut(data)  # segment the text into words
    txt = " ".join(ls)  # join the words into one space-separated string

    w = wordcloud.WordCloud(width=width, height=height, background_color=background_color, font_path=font_path,
                            max_words=max_words, mask=mask)
    w.generate(txt)
    w.to_file(output_file)


if __name__ == '__main__':
    make_cloud_png("test_text.txt", "钢琴师.jpg", "output.png", font_path="苹方黑体-细-简.ttf")

posted @ 2019-12-23 02:17 suwanbin
