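A word cloud is a visual representation of text data in which the font size or color of each term reflects its importance. Word clouds are widely used on social media because they let readers grasp the most prominent terms at a glance. However, there is no agreed standard for how a word cloud is laid out, and the placement of words carries no logical meaning. Words whose frequencies differ greatly are easy to tell apart, but words with similar frequencies or similar colors are not. For these reasons word clouds are poorly suited to scientific figures. This article draws word clouds with the Python library wordcloud, which can be installed as follows: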

pip install wordcloud

Contents

0 Notes on plotting with wordcloud
1 Plotting examples
2 References

0 Notes on plotting with wordcloud

All word-cloud drawing functionality in the wordcloud library is provided by its WordCloud class.

The WordCloud constructor is:

WordCloud(font_path=None, width=400, height=200, margin=2,
          ranks_only=None, prefer_horizontal=.9, mask=None, scale=1,
          color_func=None, max_words=200, min_font_size=4,
          stopwords=None, random_state=None, background_color='black',
          max_font_size=None, font_step=1, mode="RGB",
          relative_scaling='auto', regexp=None, collocations=True,
          colormap=None, normalize_plurals=True, contour_width=0,
          contour_color='black', repeat=False,
          include_numbers=False, min_word_length=0, collocation_threshold=30)

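The constructor parameters are:

Parameter	Type	Description
font_path	str	Path to the font file; required when drawing Chinese word clouds
width	int	Canvas width
height	int	Canvas height
margin	int	Spacing kept around each word
prefer_horizontal	float	Ratio of attempts to place a word horizontally rather than vertically
mask	numpy array	If None, a width x height rectangle is used; otherwise words are drawn only where the mask is not pure white, and width and height are ignored
scale	float	Factor by which the canvas is enlarged when the image is drawn
color_func	callable	Function that returns the color of each word
max_words	int	Maximum number of words shown
min_font_size	int	Smallest font size used
stopwords	set of str	Words to exclude; if None, the built-in STOPWORDS list is used
random_state	int	Random seed, used mainly when picking colors
background_color	str	Background color
max_font_size	int	Largest font size used
font_step	int	Step size for the font
mode	str	Pillow image mode, for example "RGB" or "RGBA"
relative_scaling	float or 'auto'	How strongly font size follows word frequency
regexp	str	Regular expression used to split the input text into tokens
collocations	bool	Whether to include collocations (bigrams) of two words
colormap	str	Matplotlib colormap from which a color is drawn at random for each word; ignored if color_func is given
normalize_plurals	bool	Whether to merge English plurals into their singular form
contour_width	float	Width of the mask contour; 0 draws no contour
contour_color	str	Color of the mask contour
repeat	bool	Whether to repeat words until max_words is reached
include_numbers	bool	Whether to include numbers as words
min_word_length	int	Minimum number of letters a word must have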
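
The color_func entry in the table is easier to picture with a concrete function. The following is a minimal sketch, not code from the wordcloud library: lightness_by_size is a hypothetical name, and the font-size range of 4 to 100 and the lightness mapping are assumptions chosen for illustration. It accepts the word plus keyword arguments such as font_size, position, orientation and random_state, and returns a color string that Pillow understands.

```python
def lightness_by_size(word, font_size, position=None, orientation=None,
                      random_state=None, **kwargs):
    """Grey color function: larger words are drawn darker.

    Hypothetical example. The 4..100 font-size range and the mapping to
    lightness are assumptions made for this illustration only.
    """
    # Clamp the font size into [4, 100], then map it linearly to 80%..20% lightness
    size = max(4, min(100, font_size))
    lightness = round(80 - (size - 4) * 60 / 96)
    return "hsl(0, 0%, {}%)".format(lightness)


print(lightness_by_size("hello", font_size=4))    # hsl(0, 0%, 80%)
print(lightness_by_size("hello", font_size=100))  # hsl(0, 0%, 20%)
```

Such a function could then be passed as WordCloud(color_func=lightness_by_size) or applied to an existing layout with wc.recolor(color_func=lightness_by_size).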

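The main methods of the WordCloud class are:

generate_from_frequencies(frequencies): build the word cloud from a dictionary of word frequencies
fit_words(frequencies): alias of generate_from_frequencies
process_text(text): split the text into words, remove stopwords, and return the word counts
generate_from_text(text): build the word cloud from raw text
generate(text): alias of generate_from_text
to_image: return the result as a Pillow image
recolor: recolor the existing layout
to_array: return the result as a NumPy array
to_file(filename): save the result to an image file
to_svg: return the result as an SVG string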

1 Plotting examples

1.1 Word cloud from a single word

import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "hello"

# np.ogrid returns two open grids, of shape (300, 1) and (1, 300)
x, y = np.ogrid[:300, :300]

# Drawing area: pixels outside a circle of radius 130 are set to 255 (masked out)
mask = (x - 150) ** 2 + (y - 150) ** 2 > 130 ** 2
mask = 255 * mask.astype(int)

# repeat=True repeats the input until max_words is reached; scale enlarges the output
wc = WordCloud(background_color="white", repeat=True, max_words=32, mask=mask, scale=1.5)
wc.generate(text)

plt.axis("off")
plt.imshow(wc, interpolation="bilinear")
plt.show()

# Save to a file
_ = wc.to_file("result.jpg")
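
The mask convention used above can be checked in isolation. The following is a minimal sketch that assumes only NumPy; circular_mask is a hypothetical helper name and not part of wordcloud. It builds the same kind of array as the example and confirms that the centre is 0 (free to draw on) while the corners are 255 (masked out).

```python
import numpy as np


def circular_mask(size=300, radius=130):
    """Return a (size, size) int array: 0 inside the circle, 255 outside.

    Hypothetical helper for illustration. WordCloud treats white (255)
    mask entries as masked out and draws only on the remaining entries.
    """
    center = size // 2
    # Open grids of shape (size, 1) and (1, size) broadcast to (size, size)
    x, y = np.ogrid[:size, :size]
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    return 255 * outside.astype(int)


mask = circular_mask()
print(mask.shape)      # (300, 300)
print(mask[150, 150])  # 0
print(mask[0, 0])      # 255
```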

1.2 Basic plotting

from wordcloud import WordCloud

# Path of the text file
text_path = 'test.txt'

# Sample text

scr_text = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

Complex is better than complicated.

Flat is better than nested.

Sparse is better than dense.

Readability counts.

Special cases aren't special enough to break the rules.

Although practicality beats purity.

Errors should never pass silently.

Unless explicitly silenced.

In the face of ambiguity, refuse the temptation to guess.

There should be one-- and preferably only one --obvious way to do it.

Although that way may not be obvious at first unless you're Dutch.

Now is better than never.

Although never is often better than *right* now.

If the implementation is hard to explain, it's a bad idea.

If the implementation is easy to explain, it may be a good idea.

Namespaces are one honking great idea -- let's do more of those!'''

# Save the sample text
with open(text_path, 'w', encoding='utf-8') as f:
    f.write(scr_text)

# Read the text back
with open(text_path, 'r', encoding='utf-8') as f:
    # text is a single string
    text = f.read()

# Build the word cloud; WordCloud tokenizes the input text itself
wordcloud = WordCloud().generate(text)

import matplotlib.pyplot as plt
plt.axis("off")
plt.imshow(wordcloud, interpolation='bilinear')
plt.show()

# Cap the largest font size
wordcloud = WordCloud(max_font_size=50).generate(text)

# Another way to display the result
image = wordcloud.to_image()
image.show()

1.3 Custom word-cloud shape

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Path of the text file
text_path = 'test.txt'

# Sample text

scr_text = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

Complex is better than complicated.

Flat is better than nested.

Sparse is better than dense.

Readability counts.

Special cases aren't special enough to break the rules.

Although practicality beats purity.

Errors should never pass silently.

Unless explicitly silenced.

In the face of ambiguity, refuse the temptation to guess.

There should be one-- and preferably only one --obvious way to do it.

Although that way may not be obvious at first unless you're Dutch.

Now is better than never.

Although never is often better than *right* now.

If the implementation is hard to explain, it's a bad idea.

If the implementation is easy to explain, it may be a good idea.

Namespaces are one honking great idea -- let's do more of those!'''

# Save the sample text
with open(text_path, 'w', encoding='utf-8') as f:
    f.write(scr_text)

# Read the text back
with open(text_path, 'r', encoding='utf-8') as f:
    # text is a single string
    text = f.read()

# To get a word cloud of a given shape, first prepare a mask image with that shape
# Everything outside the target shape in the mask image must be pure white
mask = np.array(Image.open("mask.png"))

# Words to skip
stopwords = set(STOPWORDS)
# Also drop "better"
stopwords.add("better")

# contour_width sets the width of the mask contour; contour_color sets the color of the contour
# If the contour is drawn inaccurately, set contour_width=0 to skip drawing it
wc = WordCloud(background_color="white", max_words=2000, mask=mask,
               stopwords=stopwords, contour_width=2, contour_color='red', scale=2, repeat=True)

# Generate the word cloud
wc.generate(text)

# Save to a file
wc.to_file("result.png")

# Show the word cloud
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.figure()

# Show the mask image
plt.imshow(mask, cmap=plt.cm.gray, interpolation='bilinear')
plt.axis("off")
plt.show()
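
A mask image whose background is only nearly white (light grey, for instance) will not be treated as masked out, because only pure white counts. The following is a minimal sketch of one way to clean up such an image, assuming a greyscale NumPy array as input; binarize_mask and its threshold of 200 are hypothetical choices for illustration, not part of wordcloud.

```python
import numpy as np


def binarize_mask(gray, threshold=200):
    """Turn a greyscale array into a strict 0/255 mask.

    Hypothetical helper: pixels at or above the (assumed) threshold become
    255 (masked out), and all darker pixels become 0 (free to draw on).
    """
    gray = np.asarray(gray)
    return np.where(gray >= threshold, 255, 0).astype(np.uint8)


sample = np.array([[250, 30], [199, 255]])
print(binarize_mask(sample).tolist())  # [[255, 0], [0, 255]]
```

A greyscale array for a real image could be obtained with np.array(Image.open("mask.png").convert("L")) before passing it through the helper.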

1.4 Plotting from a word-frequency dictionary

# Install with: pip install multidict
import multidict as multidict
import numpy as np
import re
from PIL import Image
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Count word frequencies
def getFrequencyDictForText(sentence):
    fullTermsDict = multidict.MultiDict()
    tmpDict = {}
    # Split on any whitespace, so newlines separate words as well as spaces
    for text in sentence.split():
        # Count case-insensitively: look up and store under the same lower-cased key
        word = text.lower()
        # Skip common function words for a more tailored result;
        # fullmatch avoids dropping words that merely start with one of them
        if re.fullmatch("a|the|an|to|in|for|of|or|by|with|is|on|that|be", word):
            continue
        tmpDict[word] = tmpDict.get(word, 0) + 1
    # Build the frequency dictionary
    for key in tmpDict:
        fullTermsDict.add(key, tmpDict[key])
    return fullTermsDict

def makeImage(text):
    mask = np.array(Image.open("mask.png"))
    wc = WordCloud(background_color="white", max_words=1000, mask=mask, repeat=True)
    wc.generate_from_frequencies(text)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()

# Path of the text file
text_path = 'test.txt'

# Sample text

scr_text = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

Complex is better than complicated.

Flat is better than nested.

Sparse is better than dense.

Readability counts.

Special cases aren't special enough to break the rules.

Although practicality beats purity.

Errors should never pass silently.

Unless explicitly silenced.

In the face of ambiguity, refuse the temptation to guess.

There should be one-- and preferably only one --obvious way to do it.

Although that way may not be obvious at first unless you're Dutch.

Now is better than never.

Although never is often better than *right* now.

If the implementation is hard to explain, it's a bad idea.

If the implementation is easy to explain, it may be a good idea.

Namespaces are one honking great idea -- let's do more of those!'''

# Save the sample text
with open(text_path, 'w', encoding='utf-8') as f:
    f.write(scr_text)

# Read the text back
with open(text_path, 'r', encoding='utf-8') as f:
    # text is a single string
    text = f.read()

# Build the frequency dictionary
fullTermsDict = getFrequencyDictForText(text)

# Plot
makeImage(fullTermsDict)
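
The frequency dictionary can also be built with the standard library alone. The following is a minimal sketch, assuming that a plain dict of counts is acceptable input for generate_from_frequencies; count_words, render, SKIP and the output file name are hypothetical names introduced here. Unlike splitting on whitespace, the regular expression also strips punctuation such as the trailing full stop in "ugly.".

```python
import re
from collections import Counter

# Function words to leave out (same list as in the example above)
SKIP = {"a", "the", "an", "to", "in", "for", "of", "or", "by",
        "with", "is", "on", "that", "be"}


def count_words(text, skip=SKIP):
    """Return {word: count} for lower-cased words that are not in skip."""
    # A word is a run of letters, optionally followed by an apostrophe part ("aren't")
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return dict(Counter(w for w in words if w not in skip))


def render(freqs, out_path="freq_cloud.png"):
    """Draw the counts as a word cloud (needs the wordcloud package)."""
    from wordcloud import WordCloud  # imported lazily so counting works without it
    WordCloud(background_color="white").generate_from_frequencies(freqs).to_file(out_path)


print(count_words("Simple is better than complex. Complex is better than complicated."))
# {'simple': 1, 'better': 2, 'than': 2, 'complex': 2, 'complicated': 1}
```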

1.5 Changing colors

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Path of the text file
text_path = 'test.txt'

# Sample text

scr_text = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

Complex is better than complicated.

Flat is better than nested.

Sparse is better than dense.

Readability counts.

Special cases aren't special enough to break the rules.

Although practicality beats purity.

Errors should never pass silently.

Unless explicitly silenced.

In the face of ambiguity, refuse the temptation to guess.

There should be one-- and preferably only one --obvious way to do it.

Although that way may not be obvious at first unless you're Dutch.

Now is better than never.

Although never is often better than *right* now.

If the implementation is hard to explain, it's a bad idea.

If the implementation is easy to explain, it may be a good idea.

Namespaces are one honking great idea -- let's do more of those!'''

# Save the sample text
with open(text_path, 'w', encoding='utf-8') as f:
    f.write(scr_text)

# Read the text back
with open(text_path, 'r', encoding='utf-8') as f:
    # text is a single string
    text = f.read()

# Image source: https://github.com/amueller/word_cloud/blob/master/examples/alice_color.png
alice_coloring = np.array(Image.open("alice_color.png"))

stopwords = set(STOPWORDS)
stopwords.add("better")

wc = WordCloud(background_color="white", max_words=500, mask=alice_coloring,
               stopwords=stopwords, max_font_size=50, random_state=42, repeat=True)

# Generate the word cloud
wc.generate(text)

# Show it with the default colors
image = wc.to_image()
image.show()

# Draw the word cloud in colors similar to alice_coloring
# Extract the colors from the image
image_colors = ImageColorGenerator(alice_coloring)

# Recolor the word cloud
wc.recolor(color_func=image_colors)

# Show the recolored result
image = wc.to_image()
image.show()

# Show the mask image
plt.imshow(alice_coloring, cmap=plt.cm.gray, interpolation='bilinear')
plt.axis("off")
plt.show()
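
ImageColorGenerator colors each word using the mean color of the image region that the word covers. The following is a simplified sketch of that idea with NumPy only. It is an illustration, not the library's implementation; region_mean_color and its arguments are hypothetical.

```python
import numpy as np


def region_mean_color(image, top, left, height, width):
    """Return the mean RGB color of a rectangular patch as an "rgb(...)" string.

    Hypothetical helper that mimics the idea behind image-based coloring:
    a word placed over the patch would take the patch's average color.
    """
    # Keep only the first three channels so that RGBA images work too
    patch = np.asarray(image)[top:top + height, left:left + width, :3]
    r, g, b = patch.reshape(-1, 3).mean(axis=0)
    return "rgb({}, {}, {})".format(int(r), int(g), int(b))


# Toy 4x4 image: left half red, right half blue
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:, :2] = (255, 0, 0)
img[:, 2:] = (0, 0, 255)

print(region_mean_color(img, 0, 0, 4, 2))  # rgb(255, 0, 0)
print(region_mean_color(img, 0, 0, 4, 4))  # rgb(127, 0, 127)
```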

1.6 Setting colors for specific words

from wordcloud import (WordCloud, get_single_color_func)
import matplotlib.pyplot as plt

# Direct color function: maps each word to exactly one fixed color
class SimpleGroupedColorFunc(object):
    def __init__(self, color_to_words, default_color):
        # Colors of the specified words
        self.word_to_color = {word: color
                              for (color, words) in color_to_words.items()
                              for word in words}
        # Color of all other words
        self.default_color = default_color

    def __call__(self, word, **kwargs):
        return self.word_to_color.get(word, self.default_color)

class GroupedColorFunc(object):

    def __init__(self, color_to_words, default_color):

        self.color_func_to_words = [

            (get_single_color_func(color), set(words))

            for (color, words) in color_to_words.items()]

        self.default_color_func = get_single_color_func(default_color)

    def get_color_func(self, word):

        """Returns a single_color_func associated with the word"""

        try:

            color_func = next(

                color_func for (color_func, words) in self.color_func_to_words

                if word in words)

        except StopIteration:

            color_func = self.default_color_func

        return color_func

    def __call__(self, word, **kwargs):

        return self.get_color_func(word)(word, **kwargs)

text = """The Zen of Python, by Tim Peters

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

Complex is better than complicated.

Flat is better than nested.

Sparse is better than dense.

Readability counts.

Special cases aren't special enough to break the rules.

Although practicality beats purity.

Errors should never pass silently.

Unless explicitly silenced.

In the face of ambiguity, refuse the temptation to guess.

There should be one-- and preferably only one --obvious way to do it.

Although that way may not be obvious at first unless you're Dutch.

Now is better than never.

Although never is often better than *right* now.

If the implementation is hard to explain, it's a bad idea.

If the implementation is easy to explain, it may be a good idea.

Namespaces are one honking great idea -- let's do more of those!"""

# collocations=False: do not count two-word collocations when generating from raw text
wc = WordCloud(collocations=False).generate(text.lower())

# Colors for specific words
color_to_words = {
    'green': ['beautiful', 'explicit', 'simple', 'sparse',
              'readability', 'rules', 'practicality',
              'explicitly', 'one', 'now', 'easy', 'obvious', 'better'],
    '#FF00FF': ['ugly', 'implicit', 'complex', 'complicated', 'nested',
                'dense', 'special', 'errors', 'silently', 'ambiguity',
                'guess', 'hard']
}

# All other words are drawn in grey
default_color = 'grey'

# Direct color function: paints each word with exactly the color given in
# color_to_words, so there is no variation in shade
# grouped_color_simple = SimpleGroupedColorFunc(color_to_words, default_color)

# Finer color function: keeps the hue and saturation of each color in
# color_to_words and varies the brightness (HSV value) from word to word
grouped_color = GroupedColorFunc(color_to_words, default_color)

# Apply the color function
wc.recolor(color_func=grouped_color)

# Plot
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

1.7 Drawing a Chinese word cloud

import jieba
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

# Read the text
# Download: https://github.com/amueller/word_cloud/blob/master/examples/wc_cn/CalltoArms.txt
with open('CalltoArms.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# A font file must be set for Chinese text
# Download: https://github.com/amueller/word_cloud/blob/master/examples/fonts/SourceHanSerif/SourceHanSerifK-Light.otf
font_path = 'SourceHanSerifK-Light.otf'

# List of words excluded from the word cloud
# Download: https://github.com/amueller/word_cloud/blob/master/examples/wc_cn/stopwords_cn_en.txt
stopwords_path = 'stopwords_cn_en.txt'

# Template image, used as the mask and as the color source
back_coloring = np.array(Image.open("alice_color.png"))

# New words to add to the jieba dictionary
userdict_list = ['阿Ｑ', '孔乙己', '单四嫂子']

# Word segmentation
def jieba_processing_txt(text):
    for word in userdict_list:
        jieba.add_word(word)
    mywordlist = []
    # Segment in accurate mode
    seg_list = jieba.cut(text, cut_all=False)
    liststr = "/ ".join(seg_list)
    with open(stopwords_path, encoding='utf-8') as f_stop:
        f_stop_text = f_stop.read()
        f_stop_seg_list = f_stop_text.splitlines()
    # Keep words longer than one character that are not stopwords
    for myword in liststr.split('/'):
        if not (myword.strip() in f_stop_seg_list) and len(myword.strip()) > 1:
            mywordlist.append(myword)
    return ' '.join(mywordlist)

# Preprocess the text
text = jieba_processing_txt(text)

# margin sets the spacing around each word;
# width and height are ignored here because a mask is given
wc = WordCloud(font_path=font_path, background_color="black", max_words=2000, mask=back_coloring,
               max_font_size=100, random_state=42, width=1000, height=860, margin=5,
               contour_width=2, contour_color='blue')
wc.generate(text)

# Take the colors from the template image
image_colors_byImg = ImageColorGenerator(back_coloring)
plt.imshow(wc.recolor(color_func=image_colors_byImg), interpolation="bilinear")
plt.axis("off")
plt.figure()
plt.imshow(back_coloring, interpolation="bilinear")
plt.axis("off")
plt.show()
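
The filtering step in jieba_processing_txt joins the tokens into one string and splits it again, and it looks each word up in a list. The following is a minimal sketch of a more direct variant, assuming the same rule (keep tokens longer than one character that are not stopwords). filter_tokens and segment are hypothetical names; the stopwords are put in a set so that each lookup is fast, and the tokens are stripped before they are kept.

```python
def filter_tokens(tokens, stopwords, min_length=2):
    """Keep stripped tokens of at least min_length characters that are not stopwords."""
    stopwords = set(stopwords)  # set membership tests are faster than list scans
    kept = []
    for token in tokens:
        word = token.strip()
        if len(word) >= min_length and word not in stopwords:
            kept.append(word)
    return kept


def segment(text, stopwords):
    """Segment Chinese text with jieba and return a space-separated string."""
    import jieba  # imported lazily; only needed for real segmentation
    return " ".join(filter_tokens(jieba.cut(text, cut_all=False), stopwords))


print(filter_tokens(["孔乙己", "的", " ", "长衫", "他"], {"长衫"}))  # ['孔乙己']
```

The string returned by segment could be passed to wc.generate in place of the output of jieba_processing_txt.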

2 References
