Can MATLAB do word cloud analysis? Generating a word cloud from Douban short reviews

Published: 2023/12/10 · 豆豆

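In the previous article, we scraped short reviews from Douban and gathered them into a single file. Data that is never put to use, however, is meaningless.

As mentioned before, a WeChat article documented a hands-on word cloud exercise, and I decided to try the same approach here.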

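Approach

Reading the file

Use a with open() as ... block to read the file. Pay attention to the size of the file contents: a small file can be read in one go, while a large one is better read in pieces.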
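The note about file size can be made concrete. The following is a minimal sketch, not part of the original script: the helper name read_in_chunks and the chunk size are illustrative assumptions, and the demo writes its own throwaway file so it runs without the scraped data.

```python
import os
import tempfile


def read_in_chunks(path, chunk_chars=64 * 1024, encoding="utf-8"):
    """Yield pieces of a text file, each at most chunk_chars characters long."""
    with open(path, "r", encoding=encoding) as file:
        while True:
            chunk = file.read(chunk_chars)
            if not chunk:  # an empty string means end of file
                break
            yield chunk


# Demo on a throwaway file (hypothetical content, 30 characters in total).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as demo_file:
        demo_file.write("好看！" * 10)

    # Consume the generator while the temporary file still exists.
    chunks = list(read_in_chunks(demo_path, chunk_chars=8))

print(len(chunks), "chunks read")
```

Each chunk can be cleaned and segmented on its own, so the whole file never has to sit in memory at once. A word may be cut at a chunk boundary, so reading line by line is the safer choice when each review occupies one line.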

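Word segmentation

The data is a large body of short review text, but a word cloud is built from individual words: without words there is no word cloud. So the first requirement is to split the review text into individual Chinese words.

This is where jieba comes in, a library I had heard of that splits Chinese sentences into individual words. Here I use its lcut() method, which splits a Chinese string into a list in which each item is one word.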

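Removing non-Chinese characters

The analysis only needs Chinese text, so all non-Chinese characters have to be removed, and a regular expression does the job. The short pattern [\u4e00-\u9fa5]+ matches runs of Chinese characters.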
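A small sketch of the cleaning step, using only the standard library. The helper name keep_chinese and the sample string are made up for illustration; the pattern is the one quoted above.

```python
import re

# Same pattern as in the text: one or more characters in U+4E00..U+9FA5.
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')


def keep_chinese(text):
    """Return text with every character outside the pattern's range removed."""
    return ''.join(CHINESE_RUN.findall(text))


sample = "秘密森林 Stranger 2017，好看!!"
cleaned = keep_chinese(sample)
print(cleaned)  # 秘密森林好看
```

Note that full-width punctuation such as "，" falls outside this range, so it is dropped together with Latin letters, digits, and spaces.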

When working with regular expressions, my habit is to test them directly in one of the online regex tools first. The tool site from oschina is well made and also provides some examples.

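Handling stop words

Many of these words have no real analytical value, so we use a stop-word file to filter out the unnecessary ones.

The reference article reads the stop-word file with read_csv() from the pandas library and then filters the words with the isin() method.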
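The isin() filtering can be sketched on toy data. The in-memory frames below are assumptions standing in for the segmented reviews and for the stop-word file; only the filtering idiom is the point.

```python
import pandas as pd

# Toy data standing in for the segmented reviews and for stopwords.txt.
words_df = pd.DataFrame({'words': ['这部', '剧', '的', '剧情', '很', '精彩', '的']})
stopwords = pd.DataFrame({'stopword': ['的', '很']})

# isin() marks rows whose word is a stop word; ~ inverts the mask to keep the rest.
is_stopword = words_df['words'].isin(stopwords['stopword'])
kept = words_df[~is_stopword]

print(kept['words'].tolist())  # ['这部', '剧', '剧情', '精彩']
```

The boolean mask has one entry per row, so the filter removes every occurrence of a stop word, not just the first.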

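Aggregation

The words are now separated and mostly cleaned up. Next comes the question of building the word cloud.

The goal is to highlight the core opinions across all the reviews, which is why we segmented the text into words.

This is arguably a somewhat "out of context" approach: the word cloud relies on the word-frequency distribution and emphasizes the high-frequency terms.

So what remains is simply to aggregate the words, that is, to count word frequencies.

The reference article uses the DataFrame grouping method groupby() and the aggregation method agg().

In the reference article (see the reference article at the end of the post), agg() is given an explicit dict that calls numpy.size. That usage appears to be slated for removal: according to the articles I checked, agg() can still be used this way, but you can no longer supply your own dict. The warning reads: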

FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version
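One way to count frequencies without passing a dict to agg() is GroupBy.size(). This is a sketch on made-up words, not the reference article's code; the column name size is chosen to match the script further down.

```python
import pandas as pd

# Toy word list standing in for the cleaned, segmented reviews.
words_df = pd.DataFrame({'words': ['剧情', '演技', '剧情', '节奏', '剧情', '演技']})

# size() counts the rows in each group; reset_index(name=...) turns the
# resulting Series into a DataFrame with columns 'words' and 'size'.
words_stat = (
    words_df.groupby('words')
    .size()
    .reset_index(name='size')
    .sort_values('size', ascending=False)
)

print(words_stat.values.tolist())  # [['剧情', 3], ['演技', 2], ['节奏', 1]]
```

Because no dict is involved, this form does not trigger the deprecation warning quoted above.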

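Word cloud

This uses the third-party library wordcloud. Installing it directly with pip install wordcloud failed for me with an error about the Microsoft build tools. After struggling with it for a long time, I ended up downloading its whl file from a very comprehensive site that collects third-party libraries and installing that with pip install.

After that it worked normally.

To display and process images, matplotlib.pyplot and numpy are used as well.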

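Mask setup

According to the README on the wordcloud project home page, a binary image can be used to set the mask.

To make the data more expressive, and also as a learning exercise, I wrote the functions rgb2gray() and gray2bw() myself to convert a true-color image into a binary image, which yields the final binary mask.

At first I did not know what kind of image was needed. Only after seeing that the sample code uses a binary image did I understand, having wasted a lot of time.

Also, as I understand it, converting a color image to a binary image has to go through a grayscale image.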
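The color to grayscale to binary chain can be sketched with numpy alone. The weights and the threshold of 50 follow the script in this post; the vectorized np.where form and the tiny 2x2 test image are my own illustrative choices.

```python
import numpy as np


def rgb2gray(rgb):
    """Weighted sum of the R, G, B channels; returns a float array of shape (M, N)."""
    return np.dot(rgb[..., :3], [0.299, 0.587, 0.114])


def gray2bw(gray, threshold=50):
    """Pixels brighter than threshold become 0, all others become 255."""
    return np.where(gray > threshold, 0, 255).astype(np.uint8)


# Hypothetical 2x2 RGB image: white, black, dark red, near-black.
rgb = np.array(
    [[[255, 255, 255], [0, 0, 0]],
     [[200, 30, 30], [10, 10, 10]]],
    dtype=np.uint8,
)

bw = gray2bw(rgb2gray(rgb))
print(bw.tolist())  # [[0, 255], [0, 255]]
```

With this mapping the dark regions of the photo end up as 255, which the wordcloud docstring quoted at the end of the post treats as masked out, so the words are drawn over the bright regions.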

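As for matplotlib.pyplot, people online say its syntax is very similar to MATLAB's. I had learned a little MATLAB before, so on seeing imshow() in the examples I naturally thought of imread(), which reads the image.

While reading the documentation, I found something interesting.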

Return value is a numpy.array. For grayscale images, the return array is MxN. For RGB images, the return value is MxNx3. For RGBA images the return value is MxNx4.

matplotlib can only read PNGs natively, but if PIL is installed, it will use it to load the image and return an array (if possible) which can be used with imshow(). Note, URL strings may not be compatible with PIL. Check the PIL documentation for more information.

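This post uses a JPG image, so PIL must have been called to handle it.

Getting the binary image started with a wrong turn. Most of what I found online uses the open() and convert() methods of the PIL Image module, where passing the mode 1 to convert() produces a binary image. But when that result is used here, the word cloud step later complains about a missing attribute, so that approach does not fit here.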

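Word cloud settings

The word cloud supports a custom font, background color, mask, and more. You can jump to the source file from the IDE to look at them; each option is documented there.

An excerpt of some of the parameters is included at the end of this post.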

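Selecting by word frequency

Using the data aggregated and sorted above, I take the top 1000 words, build a dict from them, and pass it to the fit_words() method of the word cloud instance to generate the word cloud.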
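The shape of the dict that fit_words() receives (word mapped to frequency) can be illustrated without pandas. This sketch swaps in collections.Counter from the standard library instead of the DataFrame pipeline used in this post; the helper name top_frequencies and the toy word list are assumptions.

```python
from collections import Counter


def top_frequencies(words, n):
    """Return a {word: count} dict holding the n most frequent words."""
    return dict(Counter(words).most_common(n))


words = ['剧情', '演技', '剧情', '节奏', '剧情', '演技']
word_frequence = top_frequencies(words, 2)
print(word_frequence)  # {'剧情': 3, '演技': 2}
```

Capping the dict at the top n entries keeps rare words out of the cloud, which is the same effect as calling head(1000) on the sorted DataFrame.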

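Displaying the word cloud

A few matplotlib.pyplot functions are used to save the image, display it, and hide the axes.

One small point of confusion: I cannot quite tell imshow() and show() apart, and the documentation did not make it clear to me. The most obvious difference is that the latter depends on a figure window, while the former does not seem to need one.

If you understand this, please leave a comment or send me an email.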
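On the imshow() versus show() question, a small experiment may help. As far as I understand matplotlib, imshow() only draws an array onto an axes and returns an image object, while show() asks the backend to display the open figures. The sketch below assumes the non-interactive Agg backend, so no window exists at all, yet the figure can still be rendered to PNG bytes.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no window can be opened
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.image import AxesImage

pixels = np.zeros((4, 4))  # placeholder image data

fig, ax = plt.subplots()
image = ax.imshow(pixels)  # adds the image to the axes; nothing is displayed
ax.axis("off")

# Rendering to a file-like object works without ever calling show().
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)

print(isinstance(image, AxesImage), len(buffer.getvalue()) > 0)
```

So imshow() belongs to the drawing stage, and show() is only needed when the figure should actually appear on screen, for example when the script runs outside an IDE that displays figures on its own.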

Complete code

# -*- coding: utf-8 -*-
"""
Created on Thu Aug 17 16:31:35 2017

@note: For readability, each import is placed next to the code that uses it.
@author: lart
"""

# Read the previously scraped file. It is small, so it is read in one go;
# a larger file would be better read in chunks.
with open('秘密森林的短评.txt', 'r', encoding='utf-8') as file:
    comments = file.readlines()

comment = ''.join(comments)

# Keep only the Chinese characters. This was not done at download time,
# which conveniently preserves the raw data.
import re

pattern = re.compile(r'[\u4e00-\u9fa5]+')
data = pattern.findall(comment)
filted_comment = ''.join(data)

# Word segmentation
import jieba

word = jieba.lcut(filted_comment)

# Organize the words into a DataFrame
import pandas as pd

words_df = pd.DataFrame({'words': word})

# Stop-word settings. quoting=3 (csv.QUOTE_NONE) gives quote characters no special treatment.
stopwords = pd.read_csv(
    "stopwords.txt",
    index_col=False,
    quoting=3,
    sep="\t",
    names=['stopword'],
    encoding='utf-8'
)
words_df = words_df[~words_df.words.isin(stopwords.stopword)]

# Aggregate: count the occurrences of each word, then sort by frequency (highest first)
words_stat = words_df.groupby('words').size().reset_index(name='size')
words_stat = words_stat.sort_values("size", ascending=False)

# Word cloud setup
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np


def rgb2gray(rgb):
    # Weighted sum of the R, G, B channels gives a grayscale image
    return np.dot(rgb[..., :3], [0.299, 0.587, 0.114])


def gray2bw(gray):
    # Threshold at 50: brighter pixels become 0 (drawable), the rest 255 (masked out)
    return np.where(gray > 50, 0, 255).astype(np.uint8)


img = plt.imread('4.jpg')
mask = rgb2gray(img)
bw = gray2bw(mask)

wordcloud = WordCloud(
    font_path="YaHei Consolas Hybrid.ttf",
    background_color="white",
    mask=bw,
    max_font_size=80
)

# word_frequence is a dict, so it can be passed straight to wordcloud.fit_words()
word_frequence = {
    x[0]: x[1] for x in words_stat.head(1000).values
}
wordcloud = wordcloud.fit_words(word_frequence)

# Save and display
plt.imsave('img.jpg', wordcloud.to_array())

plt.subplot(131)
plt.imshow(img)
plt.axis("off")

plt.subplot(132)
plt.imshow(bw)
plt.axis("off")

plt.subplot(133)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

# Display the figure; needed when running as a plain script
plt.show()

Result files

Original image used for the mask:

Still from 秘密森林 (Secret Forest)

Output image

IDE output

Stop-word file

Parameters
----------
font_path : string
    Font path to the font that will be used (OTF or TTF).
    Defaults to DroidSansMono path on a Linux machine. If you are on
    another OS or don't have this font, you need to adjust this path.

width : int (default=400)
    Width of the canvas.

height : int (default=200)
    Height of the canvas.

prefer_horizontal : float (default=0.90)
    The ratio of times to try horizontal fitting as opposed to vertical.
    If prefer_horizontal < 1, the algorithm will try rotating the word
    if it doesn't fit. (There is currently no built-in way to get only
    vertical words.)

mask : nd-array or None (default=None)
    If not None, gives a binary mask on where to draw words. If mask is not
    None, width and height will be ignored and the shape of mask will be
    used instead. All white (#FF or #FFFFFF) entries will be considered
    "masked out" while other entries will be free to draw on. [This
    changed in the most recent version!]

scale : float (default=1)
    Scaling between computation and drawing. For large word-cloud images,
    using scale instead of larger canvas size is significantly faster, but
    might lead to a coarser fit for the words.

min_font_size : int (default=4)
    Smallest font size to use. Will stop when there is no more room in this
    size.

font_step : int (default=1)
    Step size for the font. font_step > 1 might speed up computation but
    give a worse fit.

max_words : number (default=200)
    The maximum number of words.

stopwords : set of strings or None
    The words that will be eliminated. If None, the built-in STOPWORDS
    list will be used.

background_color : color value (default="black")
    Background color for the word cloud image.

max_font_size : int or None (default=None)
    Maximum font size for the largest word. If None, height of the image is
    used.

mode : string (default="RGB")
    Transparent background will be generated when mode is "RGBA" and
    background_color is None.

relative_scaling : float (default=.5)
    Importance of relative word frequencies for font-size. With
    relative_scaling=0, only word-ranks are considered. With
    relative_scaling=1, a word that is twice as frequent will have twice
    the size. If you want to consider the word frequencies and not only
    their rank, relative_scaling around .5 often looks good.

    .. versionchanged: 2.0
        Default is now 0.5.

color_func : callable, default=None
    Callable with parameters word, font_size, position, orientation,
    font_path, random_state that returns a PIL color for each word.
    Overwrites "colormap".
    See colormap for specifying a matplotlib colormap instead.

regexp : string or None (optional)
    Regular expression to split the input text into tokens in process_text.
    If None is specified, ``r"\w[\w']+"`` is used.

collocations : bool, default=True
    Whether to include collocations (bigrams) of two words.

    .. versionadded: 2.0

colormap : string or matplotlib colormap, default="viridis"
    Matplotlib colormap to randomly draw colors from for each word.
    Ignored if "color_func" is specified.

    .. versionadded: 2.0

normalize_plurals : bool, default=True
    Whether to remove trailing 's' from words. If True and a word
    appears with and without a trailing 's', the one with trailing 's'
    is removed and its counts are added to the version without
    trailing 's' -- unless the word ends with 'ss'.
