Tokenizing an Excel file and counting word frequencies with Python: Python word frequency analysis

Published: 2024/10/8 | python | 36 | 豆豆

Python word frequency analysis

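Yesterday I came across a few lines of Python for word frequency analysis, and it really drove home how powerful Python is. (Especially since I have been learning C lately and its syntax has been driving me up the wall; Python never makes that many demands.)

The code is as follows: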

import re

def parse(text):
    # Use a regular expression to replace punctuation and newlines with spaces
    text = re.sub(r'[^\w ]', ' ', text)
    text = text.lower()  # convert to lowercase
    word_list = text.split(' ')  # split into a list of words

    # Drop the empty strings left behind by consecutive spaces
    word_list = filter(None, word_list)

    # Build a dict mapping each word to its frequency
    word_cnt = {}
    for word in word_list:
        if word not in word_cnt:
            word_cnt[word] = 0
        word_cnt[word] += 1

    # Sort by frequency, highest first
    sorted_word_cnt = sorted(word_cnt.items(), key=lambda kv: kv[1], reverse=True)

    return sorted_word_cnt

with open('in.txt', 'r') as fin:  # read the input file
    text = fin.read()

word_and_freq = parse(text)

with open('out.txt', 'w') as fout:  # write the results to a file
    for word, freq in word_and_freq:
        fout.write('{} {}\n'.format(word, freq))
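
The manual dictionary loop above can also be written with `collections.Counter` from the standard library. A minimal sketch, assuming the same cleaning rule as `parse` (the name `parse_with_counter` and the sample string are mine, for illustration only):

```python
import re
from collections import Counter

def parse_with_counter(text):
    """Hypothetical variant of parse(): same cleaning, counting done by Counter."""
    # Replace every character that is not a word character or a space with a space
    text = re.sub(r'[^\w ]', ' ', text).lower()
    # split() with no argument collapses runs of whitespace, so no empty strings appear
    words = text.split()
    # most_common() returns (word, count) pairs sorted by count, highest first
    return Counter(words).most_common()

sample = "I have a dream today. I have a dream that one day..."
result = parse_with_counter(sample)
print(result[:4])
```

Using `split()` without an argument also makes the separate `filter(None, ...)` step unnecessary, since no empty strings are produced in the first place.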

The text being analyzed is as follows:

I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"

The code is simple, so here is just a brief walkthrough.

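with open('in.txt', 'r') as fin:  # the nice thing about with is that you never have to worry about when to call close(): the file is closed for you when the block ends.
An earlier, very short post of mine has a brief introduction to with.
Also, did you notice a problem with how the file is read? The `in.txt` file here is small, so calling `fin.read()` directly is fine, but with a large file it could exhaust your memory. Reading line by line (with `readline()` or by iterating over the file object) is the better fit in that case: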
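```
word_cnt = {}
with open('in.txt', 'r') as fin:
    for line in fin:  # iterating over the file object yields one line at a time
        for word, freq in parse(line):  # merge this line's counts into the running total
            word_cnt[word] = word_cnt.get(word, 0) + freq
```
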
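For Chinese text, you can use the Chinese word segmentation library `jieba`.
> A quick look at how jieba segments Chinese:
1. It uses a Chinese dictionary to determine the association probabilities between characters
2. Characters with a high probability of belonging together are grouped into words, which gives the segmentation result
3. Besides segmentation, users can also add their own custom words
> Common functions:
jieba.cut(str): takes three arguments: the string to segment, cut_all to control whether full mode is used, and HMM to control whether the HMM model is used; returns a generator
jieba.lcut(str): precise mode; returns the segmentation result directly as a list. Same parameters as above.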

jieba.lcut("中國是一個偉大的國家")
['中國', '是', '一個', '偉大', '的', '國家']

jieba.lcut_for_search(str): search engine mode; returns the segmentation result as a list

jieba.lcut_for_search("中華人民共和國是偉大的")
['中華', '華人', '人民', '共和', '共和國', '中華人民共和國', '是', '偉大', '的']

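Installing `jieba`
Most Python installations ship with pip.
> pip --version  # check your current pip version; if it is too old, upgrade it first
> pip install jieba  # normally pip can download jieba directly; if the download fails, retry a few times (it worked on my second attempt)
After installing, run python on the command line to enter interactive mode and try `import jieba`; it should normally succeed without errors.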
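
If you would rather check from a script than from the interactive prompt, the standard library can report whether a module is importable without actually importing it. A small sketch (the helper name `is_installed` is mine, not part of any library):

```python
import importlib.util

def is_installed(module_name):
    """Return True if module_name can be found on the current import path."""
    return importlib.util.find_spec(module_name) is not None

print("jieba installed:", is_installed("jieba"))
print("re installed:", is_installed("re"))  # standard library module, always present
```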
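#### Now let's play with Chinese word frequency analysis
The anime adaptation of 《斗羅大陸》 was quite popular a while back, so I downloaded the novel and will use it as material.
> First, a test on a simple piece of text, with the following content: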
巴蜀，歷來有天府之國的美譽，其中，最有名的門派莫過于唐門。
import jieba

def parse(text):
    words = jieba.lcut(text)  # jieba.lcut() returns a list of tokens
    print(words)

with open("douluo_1.txt", "r", encoding="utf-8") as fin:  # utf-8 must be specified here because my txt file is saved as utf-8; re-save yours in an editor if needed, most text editors can do this
    text = fin.read()
    parse(text)

Let's run it first and see whether the code works.

Output:
Building prefix dict from the default dictionary …
Loading model from cache C:\Users\92039\AppData\Local\Temp\jieba.cache
Loading model cost 1.289 seconds.
Prefix dict has been built succesfully.
['巴蜀', '，', '歷來', '有', '天府之國', '的', '美譽', '，', '其中', '，', '最', '有名', '的', '門派', '莫過于', '唐門', '。', '\n']
[Finished in 3.4s]
Next, count the word frequencies:

import jieba

def parse(text):
    words = jieba.lcut(text)  # jieba.lcut() returns a list of tokens
    words_dict = {}  # maps each word to its frequency
    for word in words:
        # get() returns 0 for an unseen word, so the count starts at 1; otherwise it is incremented
        words_dict[word] = words_dict.get(word, 0) + 1
    words_dict_sorted = sorted(words_dict.items(), key=lambda kv: kv[1], reverse=True)
    return words_dict_sorted

with open('douluo_1.txt', 'r', encoding='utf-8') as fin:
    text = fin.read()
word_and_freq = parse(text)

with open('douluo_out.txt', 'w') as fout:
    for word, freq in word_and_freq:
        print("{} {}\n".format(word, freq))  # printed to the console for now; nothing is written to fout yet

Output:

Building prefix dict from the default dictionary …
Loading model from cache C:\Users\92039\AppData\Local\Temp\jieba.cache
Loading model cost 1.337 seconds.
Prefix dict has been built succesfully.
， 3

的 2

巴蜀 1

歷來 1

有 1

天府之國 1

美譽 1

其中 1

最 1

有名 1

門派 1

莫過于 1

唐門 1

。 1

[Finished in 3.4s]
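
To see why the output comes out in this order, the counting step can be replayed without jieba on the token list printed by the earlier run. This is only an illustration: the list below is copied from that output, and the variable names are mine. Ties keep their first-seen order because dicts preserve insertion order (Python 3.7+) and `sorted` is stable.

```python
# Token list copied from the jieba.lcut() output shown earlier
words = ['巴蜀', '，', '歷來', '有', '天府之國', '的', '美譽', '，', '其中', '，',
         '最', '有名', '的', '門派', '莫過于', '唐門', '。', '\n']

words_dict = {}
for word in words:
    # get() returns 0 for a word not seen yet, so the first occurrence becomes 1
    words_dict[word] = words_dict.get(word, 0) + 1

# Sort by count, highest first; equal counts stay in first-seen order
ranked = sorted(words_dict.items(), key=lambda kv: kv[1], reverse=True)
for word, freq in ranked[:3]:
    print(word, freq)
```

Note that the newline at the end of the file is counted as a token of its own, just like the punctuation marks.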
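The result counts punctuation marks too, so we need to bring in regular expressions.
Add one line (together with import re):
text = re.sub(r'[^\w]', ' ', text)
After running it, the spaces turn out to be counted as well.
Exhausting... add one more line:
text = filter(None, text)
Running it fails with: 'filter' object has no attribute 'decode'
This is because filter() returns a lazy filter object instead of a str; jieba.lcut() expects a string and, given anything else, tries to call .decode() on it.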
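
What goes wrong here can be reproduced without jieba. `filter(None, text)` yields the truthy characters of `text` one by one, and a space is a truthy character, so it would not have removed the spaces even if the call had worked. A sketch of filtering tokens after segmentation instead (the token list is hand-written to stand in for segmenter output, on the assumption that each space comes back as its own token):

```python
text = "巴蜀 歷來"

filtered = filter(None, text)
print(type(filtered).__name__)      # the result is a filter object, not a str
print(''.join(filtered) == text)    # spaces are truthy, so nothing was removed

# Hand-written stand-in for what a segmenter might return once punctuation
# has been replaced by spaces (an assumption, not real jieba output)
tokens = ['巴蜀', ' ', '歷來', ' ', '有', ' ']

# Drop tokens that are empty or consist only of whitespace
clean = [tok for tok in tokens if tok.strip()]
print(clean)
```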

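Never mind, let's try a different approach.

First, one more concept:

Stop word lists
Stop words: in information retrieval, certain characters or words are automatically filtered out before or after processing natural language data (or text) in order to save storage space and improve search efficiency. These characters or words are called stop words.
A stop word list is simply a file that stores these stop words. Download one from the web and name it filter.txt.
This approach should do the trick, so let's try it (I downloaded the Baidu stop word list from GitHub).
After trying it, punctuation was still being counted, so I manually added all the Chinese punctuation marks to the filter file.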

  4
巴蜀 1
歷來 1
天府之國 1
美譽 1
最 1
有名 1
門派 1
莫過于 1
唐門 1

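Don't ask me why there is a 4; I tried all sorts of things and still don't know why it is there. It was originally a 5, and it became a 4 after I deleted one space from the file being analyzed. I checked, and there are 4 punctuation marks in total... I was baffled.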
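
One possible explanation for the stray count, offered as a guess rather than something the output above proves: if the `re.sub` line was still in place, each punctuation mark had already been turned into a space before segmentation, so four punctuation marks would show up as four space tokens. A stop word list read with `line.strip()` can never contain a bare space, because `strip()` reduces such a line to an empty string. The sketch below simulates this with a hand-written token list and an in-memory stop word file (both are stand-ins, not the real data):

```python
import io

# Stand-in for filter.txt: one stop word per line, including a line holding only a space
stopword_file = io.StringIO("的\n有\n \n，\n。\n")
stopwords = [line.strip() for line in stopword_file.readlines()]
print(' ' in stopwords)   # the space-only line was stripped down to ''

# Stand-in for segmenter output after the 4 punctuation marks became spaces
tokens = ['巴蜀', ' ', '歷來', '有', '的', '美譽', ' ', '其中', ' ', '唐門', ' ']

counts = {}
for tok in tokens:
    if tok not in stopwords:
        counts[tok] = counts.get(tok, 0) + 1
print(counts[' '])

# Skipping whitespace-only tokens explicitly removes the stray entry
stopset = set(stopwords)
kept = [tok for tok in tokens if tok.strip() and tok not in stopset]
print(kept)
```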
Then another approach came to mind: discard tokens of length 1, that is, throw away every single-character token. The code is as follows:

import jieba
import re

def parse(text):
    text = re.sub(r'[^\w]', ' ', text)  # replace punctuation with spaces

    words = jieba.lcut(text)  # jieba.lcut() returns a list of tokens
    # Load the stop words into a set for fast membership tests
    with open('filter.txt', encoding='utf-8') as f:
        stopwords = {line.strip() for line in f}

    words_dict = {}  # maps each word to its frequency
    for word in words:
        # Skip stop words and single-character tokens (this also drops single-space tokens)
        if word in stopwords or len(word) == 1:
            continue
        # get() returns 0 for an unseen word, so the count starts at 1; otherwise it is incremented
        words_dict[word] = words_dict.get(word, 0) + 1

    words_dict_sorted = sorted(words_dict.items(), key=lambda kv: kv[1], reverse=True)
    return words_dict_sorted

with open('douluo_1.txt', 'r', encoding='utf-8') as fin:
    text = fin.read()

word_and_freq = parse(text)

with open('douluo_out.txt', 'w', encoding='utf-8') as fout:
    for word, freq in word_and_freq:
        fout.write("{} {}\n".format(word, freq))

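OK, now let's feed in the whole text of 《斗羅大陸》. I expect it to run very slowly on my junk computer... sigh... once I have the money I am definitely buying an MBP; I'll work hard now and get one when I make it into grad school. Rant over, and the code has finally finished running.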
