Using the jieba word segmentation tool (Python code)

Published: 2023/12/10


jieba

"Jieba" (結巴, Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.

Scroll down for English documentation.

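Features

Three segmentation modes are supported:
- Accurate mode tries to cut the sentence into the most precise segments; it is suited to text analysis.
- Full mode scans out every word that can be formed from the sentence; it is very fast but cannot resolve ambiguity.
- Search engine mode builds on accurate mode and re-splits long words to raise recall; it is suited to search engine segmentation.
Supports Traditional Chinese segmentation
Supports custom dictionaries
MIT License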

Online demo

http://jiebademo.ap01.aws.af.cm/

(Powered by Appfog)

Site source code: https://github.com/fxsjy/jiebademo

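Installation

The code is compatible with both Python 2 and Python 3.

Fully automatic installation: easy_install jieba, or pip install jieba / pip3 install jieba
Semi-automatic installation: download http://pypi.python.org/pypi/jieba/ , extract it, then run python setup.py install
Manual installation: place the jieba directory in the current directory or in the site-packages directory
Then load it with import jieba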

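Algorithm

Efficient word-graph scanning based on a prefix dictionary, producing a directed acyclic graph (DAG) of every possible word that the characters in the sentence can form
Dynamic programming to find the maximum-probability path, that is, the best segmentation based on word frequency
For out-of-vocabulary words, an HMM model based on the word-forming ability of Chinese characters, decoded with the Viterbi algorithm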
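The first two steps can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical toy dictionary (`FREQ`) with made-up counts; the helper names `build_dag` and `best_path` are illustrative, and jieba's real dictionary and scoring details may differ.

```python
import math

# Toy prefix dictionary: word -> frequency (hypothetical numbers, not jieba's data).
FREQ = {"我": 50, "來": 20, "到": 30, "來到": 40, "北": 5, "京": 5, "北京": 60}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i, len(sentence)) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # fall back to the single character
    return dag


def best_path(sentence):
    """Right-to-left dynamic programming over the DAG, maximising summed log-probability."""
    dag = build_dag(sentence)
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1)) - math.log(TOTAL) + route[j + 1][0], j)
            for j in dag[i]
        )
    # Walk the stored best end index from left to right to read off the words.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


print("/".join(best_path("我來到北京")))
```

Working in log space keeps the product of small probabilities from underflowing on long sentences.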

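Main functions

1. Segmentation

jieba.cut accepts three input parameters: the string to segment; cut_all, which controls whether full mode is used; and HMM, which controls whether the HMM model is used
jieba.cut_for_search accepts two parameters: the string to segment, and whether to use the HMM model. It is suited to building inverted indexes for search engines, and its granularity is fairly fine
The string to segment can be a unicode string, a UTF-8 string, or a GBK string. Note: passing GBK strings directly is not recommended, because they may be unpredictably mis-decoded as UTF-8
jieba.cut and jieba.cut_for_search both return an iterable generator; use a for loop to obtain each segmented word (unicode), or use jieba.lcut and jieba.lcut_for_search to get a list directly
jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new custom tokenizer, which makes it possible to use different dictionaries at the same time. jieba.dt is the default tokenizer, and all global segmentation functions are mappings onto it.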

Code example

# encoding=utf-8
import jieba

seg_list = jieba.cut("我來到北京清華大學", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我來到北京清華大學", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # accurate mode

seg_list = jieba.cut("他來到了網易杭研大廈")  # accurate mode is the default
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明碩士畢業于中國科學院計算所，后在日本京都大學深造")  # search engine mode
print(", ".join(seg_list))

Output:

[Full mode]: 我/ 來到/ 北京/ 清華/ 清華大學/ 華大/ 大學
[Accurate mode]: 我/ 來到/ 北京/ 清華大學
[New word recognition]: 他, 來到, 了, 網易, 杭研, 大廈 (here "杭研" is not in the dictionary, but the Viterbi algorithm still recognized it)
[Search engine mode]: 小明, 碩士, 畢業, 于, 中國, 科學, 學院, 科學院, 中國科學院, 計算, 計算所, 后, 在, 日本, 京都, 大學, 日本京都大學, 深造
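The search engine output above lists the short words contained in each long word before the long word itself. One way to get that ordering is sketched below; it is an assumption about the approach rather than a copy of jieba's code, and the mini-dictionary `KNOWN` and the function `split_for_search` are hypothetical names.

```python
# Hypothetical mini-dictionary; jieba consults its own frequency table instead.
KNOWN = {"中國", "科學", "學院", "科學院", "中國科學院", "計算", "計算所"}


def split_for_search(words):
    """Emit in-dictionary 2-grams and 3-grams of each long word, then the word itself."""
    out = []
    for w in words:
        if len(w) > 2:
            out.extend(w[i:i + 2] for i in range(len(w) - 1) if w[i:i + 2] in KNOWN)
        if len(w) > 3:
            out.extend(w[i:i + 3] for i in range(len(w) - 2) if w[i:i + 3] in KNOWN)
        out.append(w)
    return out


# Input is the accurate-mode result for part of the example sentence.
print(", ".join(split_for_search(["中國科學院", "計算所"])))
```

Because the long word is still emitted, an index built this way can match both the full name and its parts, which is where the higher recall comes from.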

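2. Adding a custom dictionary

Loading a dictionary

Developers can specify their own custom dictionary to include words that are missing from the jieba dictionary. jieba can recognize new words on its own, but adding them yourself ensures higher accuracy
Usage: jieba.load_userdict(file_name) # file_name is a file-like object or the path to the custom dictionary
The dictionary format is the same as dict.txt: one word per line; each line has three parts, the word, the frequency (optional), and the POS tag (optional), separated by spaces and in that fixed order. If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.
When the frequency is omitted, jieba uses an automatically computed frequency that guarantees the word can be segmented out.

For example:

創新辦 3 i
云計算 5
凱特琳 nz
臺中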
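To make the line format concrete, here is a small parser for it. It is a sketch only: the regular expression and the function name `parse_userdict` are assumptions for illustration, and jieba's own loader may accept or reject different edge cases.

```python
import re

# One entry per line: word, optional integer frequency, optional POS tag.
LINE_RE = re.compile(r"^(\S+)(?: (\d+))?(?: ([a-zA-Z]+))?$")


def parse_userdict(text):
    """Return a list of (word, freq_or_None, tag_or_None) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = LINE_RE.match(line)
        if m is None:
            raise ValueError("bad dictionary line: %r" % line)
        word, freq, tag = m.groups()
        entries.append((word, int(freq) if freq else None, tag))
    return entries


sample = "創新辦 3 i\n云計算 5\n凱特琳 nz\n臺中\n"
for entry in parse_userdict(sample):
    print(entry)
```

Each of the four sample lines exercises a different combination of the optional fields.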

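Changing the tmp_dir and cache_file attributes of a tokenizer (jieba.dt by default) sets the directory and the file name of the cache file respectively, which is useful on restricted file systems.
Examples:
- Custom dictionary: https://github.com/fxsjy/jieba/blob/master/test/userdict.txt
- Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_userdict.py
  - Before: 李小福 / 是 / 創新 / 辦 / 主任 / 也 / 是 / 云 / 計算 / 方面 / 的 / 專家 /
  - After loading the custom dictionary: 李小福 / 是 / 創新辦 / 主任 / 也 / 是 / 云計算 / 方面 / 的 / 專家 /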

Adjusting the dictionary

Use add_word(word, freq=None, tag=None) and del_word(word) to modify the dictionary dynamically in a program.
Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be segmented out.
Note: automatically computed frequencies may have no effect when the HMM new-word discovery feature is in use.

Code example:

>>> print('/'.join(jieba.cut('如果放到post中將出錯。', HMM=False)))
如果/放到/post/中將/出錯/。
>>> jieba.suggest_freq(('中', '將'), True)
494
>>> print('/'.join(jieba.cut('如果放到post中將出錯。', HMM=False)))
如果/放到/post/中/將/出錯/。
>>> print('/'.join(jieba.cut('「臺中」正確應該不會被切開', HMM=False)))
「/臺/中/」/正確/應該/不會/被/切開
>>> jieba.suggest_freq('臺中', True)
69
>>> print('/'.join(jieba.cut('「臺中」正確應該不會被切開', HMM=False)))
「/臺中/」/正確/應該/不會/被/切開

"Using a user-defined dictionary to strengthen ambiguity correction" --- https://github.com/fxsjy/jieba/issues/14

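3. Keyword extraction

Keyword extraction based on the TF-IDF algorithm

import jieba.analyse

jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
- sentence: the text to extract keywords from
- topK: how many keywords with the highest TF/IDF weights to return; the default is 20
- withWeight: whether to return the keyword weights as well; the default is False
- allowPOS: include only words with the given POS tags; the default is empty, meaning no filtering
jieba.analyse.TFIDF(idf_path=None) creates a new TFIDF instance; idf_path is the IDF frequency file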
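The scoring idea behind TF-IDF keyword extraction can be sketched without jieba. Everything here is an assumption for illustration: the `IDF` table and `DEFAULT_IDF` are invented values, the parameter names only loosely mirror topK and withWeight, and the input is an already segmented word list.

```python
from collections import Counter

# Hypothetical IDF values; jieba loads its values from an IDF file.
IDF = {"分詞": 6.0, "詞典": 5.0, "的": 0.5}
DEFAULT_IDF = 4.0  # assumed fallback for words missing from the table


def tfidf_keywords(words, top_k=20, with_weight=False):
    """Score each word by term frequency times IDF and return the top_k."""
    counts = Counter(words)
    total = sum(counts.values())
    scores = {w: (c / total) * IDF.get(w, DEFAULT_IDF) for w, c in counts.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return ranked if with_weight else [w for w, _ in ranked]


tokens = ["分詞", "的", "詞典", "分詞", "的", "的"]
print(tfidf_keywords(tokens, top_k=2))
```

The most frequent token, "的", ranks last because its low IDF marks it as common across documents.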

Code example (keyword extraction)

https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py

The inverse document frequency (IDF) corpus used for keyword extraction can be switched to the path of a custom corpus

Usage: jieba.analyse.set_idf_path(file_name) # file_name is the path to the custom corpus
Custom corpus example: https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py

The stop words corpus used for keyword extraction can be switched to the path of a custom corpus

Usage: jieba.analyse.set_stop_words(file_name) # file_name is the path to the custom corpus
Custom corpus example: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py

Example of returning keyword weights together with the keywords

Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_with_weight.py

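Keyword extraction based on the TextRank algorithm

jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')) can be used directly; the interface is the same, but note that it filters by POS by default.
jieba.analyse.TextRank() creates a new custom TextRank instance

Algorithm paper: TextRank: Bringing Order into Texts

Basic idea:

Segment the text from which keywords are to be extracted

Build a graph from the co-occurrence relations between words inside a fixed-size window (5 by default, adjustable through the span attribute)

Compute the PageRank of the nodes in the graph; note that the graph is undirected and weighted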
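The three steps can be sketched as follows, starting from an already segmented word list. This is a simplified illustration, not jieba's implementation: the function name `textrank`, the damping factor 0.85, and the fixed iteration count are assumptions.

```python
from collections import defaultdict


def textrank(words, span=5, damping=0.85, iterations=20):
    """Rank words by PageRank on an undirected, weighted co-occurrence graph."""
    weight = defaultdict(float)
    for i, w in enumerate(words):
        # Words within the same window of `span` tokens co-occur.
        for j in range(i + 1, min(i + span, len(words))):
            if words[j] != w:
                weight[(w, words[j])] += 1.0
                weight[(words[j], w)] += 1.0  # undirected: store both directions

    neighbours = defaultdict(dict)
    for (a, b), wt in weight.items():
        neighbours[a][b] = wt

    rank = {w: 1.0 for w in neighbours}
    for _ in range(iterations):
        new_rank = {}
        for w in neighbours:
            # Each neighbour passes on its rank in proportion to the edge weight.
            incoming = sum(
                wt / sum(neighbours[v].values()) * rank[v]
                for v, wt in neighbours[w].items()
            )
            new_rank[w] = (1 - damping) + damping * incoming
        rank = new_rank
    return sorted(rank, key=rank.get, reverse=True)


tokens = ["分詞", "詞典", "分詞", "算法", "分詞", "詞典"]
print(textrank(tokens, span=2))
```

With span=2 only adjacent words are linked, so "分詞" becomes the hub of the graph and ranks first.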

Usage example:

See test/demo.py

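4. Part-of-speech tagging

jieba.posseg.POSTokenizer(tokenizer=None) creates a new custom tokenizer; the tokenizer parameter specifies the jieba.Tokenizer to use internally. jieba.posseg.dt is the default POS tagging tokenizer.
Tags the POS of each word after segmentation, using labels compatible with ictclas.
Usage example

>>> import jieba.posseg as pseg
>>> words = pseg.cut("我愛北京天安門")
>>> for word, flag in words:
...     print('%s %s' % (word, flag))
...
我 r
愛 v
北京 ns
天安門 ns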

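5. Parallel segmentation

How it works: the target text is split by line, the lines are distributed to multiple Python processes that segment them in parallel, and the results are then merged, which gives a considerable speedup
Based on Python's built-in multiprocessing module; Windows is not currently supported
Usage:
- jieba.enable_parallel(4) # enable parallel segmentation; the argument is the number of processes
- jieba.disable_parallel() # disable parallel segmentation
Example: https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
Experimental result: on a 4-core 3.4GHz Linux machine, accurate-mode segmentation of the complete works of Jin Yong reached 1MB/s, 3.3 times the speed of the single-process version.
Note: parallel segmentation supports only the default tokenizers jieba.dt and jieba.posseg.dt.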

6. Tokenize: return the start and end positions of words in the original text

Note that the input must be unicode
Default mode

result = jieba.tokenize(u'永和服裝飾品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))

word 永和 start: 0 end:2
word 服裝 start: 2 end:4
word 飾品 start: 4 end:6
word 有限公司 start: 6 end:10

Search mode

result = jieba.tokenize(u'永和服裝飾品有限公司', mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))

word 永和 start: 0 end:2
word 服裝 start: 2 end:4
word 飾品 start: 4 end:6
word 有限 start: 6 end:8
word 公司 start: 8 end:10
word 有限公司 start: 6 end:10
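In default mode the positions follow directly from the word lengths, since the words tile the input without gaps. A minimal sketch, assuming a pre-segmented word list and a hypothetical helper named `with_offsets`:

```python
def with_offsets(words):
    """Yield (word, start, end) for consecutive words; end is exclusive."""
    start = 0
    for w in words:
        yield (w, start, start + len(w))
        start += len(w)


for word, start, end in with_offsets(["永和", "服裝", "飾品", "有限公司"]):
    print("word %s\t\t start: %d \t\t end:%d" % (word, start, end))
```

Search mode cannot be reproduced this way, because its sub-words overlap the long word they come from.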

7. ChineseAnalyzer for the Whoosh search engine

Import: from jieba.analyse import ChineseAnalyzer
Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py

8. Command-line segmentation

Usage example: python -m jieba news.txt > cut_result.txt

Output of the --help option:

$> python -m jieba --help
usage: python -m jieba [options] filename

Jieba command line interface.

positional arguments:
  filename              input file

optional arguments:
  -h, --help            show this help message and exit
  -d [DELIM], --delimiter [DELIM]
                        use DELIM instead of ' / ' for word delimiter; or a
                        space if it is used without DELIM
  -p [DELIM], --pos [DELIM]
                        enable POS tagging; if DELIM is specified, use DELIM
                        instead of '_' for POS delimiter
  -D DICT, --dict DICT  use DICT as dictionary
  -u USER_DICT, --user-dict USER_DICT
                        use USER_DICT together with the default dictionary or
                        DICT (if specified)
  -a, --cut-all         full pattern cutting (ignored with POS tagging)
  -n, --no-hmm          don't use the Hidden Markov Model
  -q, --quiet           don't print loading messages to stderr
  -V, --version         show program's version number and exit

If no filename specified, use STDIN instead.

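Lazy loading

jieba uses lazy loading: import jieba and jieba.Tokenizer() do not trigger dictionary loading immediately; the dictionary is loaded and the prefix dictionary built only when it is needed. You can also initialize jieba manually if you prefer.

import jieba
jieba.initialize()  # manual initialization (optional)

Versions before 0.28 could not specify the path of the main dictionary; with lazy loading, you can change the main dictionary path:

jieba.set_dictionary('data/dict.txt.big')

Example: https://github.com/fxsjy/jieba/blob/master/test/test_change_dictpath.py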
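The lazy loading pattern itself is easy to show in isolation. The class below is an illustrative stand-in, not jieba's Tokenizer: `LazyTokenizer`, its `lookup` method, and the hard-coded frequency table are all hypothetical.

```python
class LazyTokenizer:
    """Illustrative stand-in, not jieba's class: loads its dictionary on first use."""

    def __init__(self, dictionary="dict.txt"):
        self.dictionary = dictionary
        self.initialized = False
        self.freq = {}

    def set_dictionary(self, path):
        self.dictionary = path
        self.initialized = False  # force a reload on next use

    def initialize(self):
        if self.initialized:
            return
        # A real implementation would read self.dictionary here.
        self.freq = {"北京": 60}
        self.initialized = True

    def lookup(self, word):
        self.initialize()  # lazy: triggered by the first real request
        return self.freq.get(word, 0)


tok = LazyTokenizer()
print(tok.initialized)    # nothing is loaded at construction time
print(tok.lookup("北京"))  # the first lookup triggers loading
print(tok.initialized)
```

Deferring the load is what lets the dictionary path be changed after import but before first use.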

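Other dictionaries

A dictionary file with a smaller memory footprint: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small

A dictionary file with better support for Traditional Chinese segmentation: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big

Download the dictionary you need and overwrite jieba/dict.txt with it, or call jieba.set_dictionary('data/dict.txt.big')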

Implementations in other languages

Jieba for Java

Author: piaolingxue URL: https://github.com/huaban/jieba-analysis

Jieba for C++

Author: yanyiwu URL: https://github.com/yanyiwu/cppjieba

Jieba for Node.js

Author: yanyiwu URL: https://github.com/yanyiwu/nodejieba

Jieba for Erlang

Author: falood URL: https://github.com/falood/exjieba

Jieba for R

Author: qinwf URL: https://github.com/qinwf/jiebaR

Jieba for iOS

Author: yanyiwu URL: https://github.com/yanyiwu/iosjieba

Jieba for PHP

Author: fukuball URL: https://github.com/fukuball/jieba-php

Jieba for .NET (C#)

Author: anderscui URL: https://github.com/anderscui/jieba.NET/

Jieba for Go

Author: wangbin URL: https://github.com/wangbin/jiebago
Author: yanyiwu URL: https://github.com/yanyiwu/gojieba

System integration

Solr: https://github.com/sing1ee/jieba-solr

Segmentation speed

1.5 MB / Second in Full Mode
400 KB / Second in Default Mode
Test environment: Intel(R) Core(TM) i7-2600 CPU @ 3.4GHz; 《圍城》.txt

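FAQ

1. How was the model data generated?

See: https://github.com/fxsjy/jieba/issues/7

2. "臺中" is always split into "臺 中"? (and similar cases)

P(臺中) < P(臺)×P(中): the frequency of "臺中" is too low, so its probability of forming a word is low

Solution: force the frequency up

jieba.add_word('臺中') or jieba.suggest_freq('臺中', True)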
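The inequality also tells you how high the frequency has to go. The sketch below computes the smallest count that makes the joined word more probable than its parts; the counts and the function name `min_freq_to_join` are hypothetical, and the value jieba itself suggests may be computed differently.

```python
def min_freq_to_join(parts, freq, total):
    """Smallest count for the joined word so that P(word) > product of P(part)."""
    prob = 1.0
    for part in parts:
        prob *= freq.get(part, 1) / total
    return int(prob * total) + 1


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
freq = {"臺": 1234, "中": 30000}
total = 1000000
needed = min_freq_to_join(["臺", "中"], freq, total)
print(needed)
```

With these numbers P(臺)×P(中) is 0.001234 × 0.03, which corresponds to about 37.02 occurrences per million, so any count from 38 upward keeps the word in one piece.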

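3. "今天天氣 不錯" should be split into "今天 天氣 不錯"? (and similar cases)

Solution: force the frequency down

jieba.suggest_freq(('今天', '天氣'), True)

Or delete the word directly: jieba.del_word('今天天氣')

4. Words that are not in the dictionary are being produced, with poor results?

Solution: turn off new-word discovery

jieba.cut('豐田太省了', HMM=False)
jieba.cut('我們中出了一個叛徒', HMM=False)

More questions: https://github.com/fxsjy/jieba/issues?sort=updated&state=closed

Revision history

https://github.com/fxsjy/jieba/blob/master/Changelog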

jieba

"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.

Features

Supports three segmentation modes:

Accurate Mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis.

Full Mode gets all the possible words from the sentence. Fast but not accurate.

Search Engine Mode, based on the Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.

Supports Traditional Chinese
Supports customized dictionaries
MIT License

Online demo

http://jiebademo.ap01.aws.af.cm/

(Powered by Appfog)

Usage

Fully automatic installation: easy_install jieba or pip install jieba
Semi-automatic installation: download http://pypi.python.org/pypi/jieba/ , then run python setup.py install after extracting.
Manual installation: place the jieba directory in the current directory or in the Python site-packages directory.
Then load it with import jieba.

Algorithm

Based on a prefix dictionary structure to achieve efficient word graph scanning. Build a directed acyclic graph (DAG) for all possible word combinations.
Use dynamic programming to find the most probable combination based on the word frequency.
For unknown words, a HMM-based model is used with the Viterbi algorithm.

Main Functions

1. Cut

The jieba.cut function accepts three input parameters: the first is the string to be cut; the second is cut_all, which controls the cut mode; the third controls whether to use the Hidden Markov Model.
jieba.cut_for_search accepts two parameters: the string to be cut, and whether to use the Hidden Markov Model. It cuts the sentence into short words suitable for search engines.
The input string can be a unicode/str object, or a str/bytes object encoded in UTF-8 or GBK. Note that GBK encoding is not recommended because it may be unexpectedly decoded as UTF-8.
jieba.cut and jieba.cut_for_search return a generator, which you can iterate over with a for loop to get the segmentation result (in unicode).
jieba.lcut and jieba.lcut_for_search return a list.
jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new customized Tokenizer, which enables you to use different dictionaries at the same time. jieba.dt is the default Tokenizer, to which almost all global functions are mapped.

Code example: segmentation

# encoding=utf-8
import jieba

seg_list = jieba.cut("我來到北京清華大學", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我來到北京清華大學", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # default mode

seg_list = jieba.cut("他來到了網易杭研大廈")
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明碩士畢業于中國科學院計算所，后在日本京都大學深造")  # search engine mode
print(", ".join(seg_list))

Output:

[Full Mode]: 我/ 來到/ 北京/ 清華/ 清華大學/ 華大/ 大學
[Accurate Mode]: 我/ 來到/ 北京/ 清華大學
[Unknown Words Recognize]: 他, 來到, 了, 網易, 杭研, 大廈 (In this case, "杭研" is not in the dictionary, but is identified by the Viterbi algorithm)
[Search Engine Mode]: 小明, 碩士, 畢業, 于, 中國, 科學, 學院, 科學院, 中國科學院, 計算, 計算所, 后, 在, 日本, 京都, 大學, 日本京都大學, 深造

2. Add a custom dictionary

Load dictionary

Developers can specify their own custom dictionary to supplement the jieba default dictionary. Jieba is able to identify new words, but adding your own new words ensures higher accuracy.
Usage: jieba.load_userdict(file_name) # file_name is a file-like object or the path of the custom dictionary
The dictionary format is the same as that of dict.txt: one word per line; each line is divided into three parts separated by a space: word, word frequency, POS tag. If file_name is a path or a file opened in binary mode, the dictionary must be UTF-8 encoded.
The word frequency and the POS tag can each be omitted. If the word frequency is omitted, it is filled with a suitable value.

For example:

創新辦 3 i
云計算 5
凱特琳 nz
臺中

Change a Tokenizer's tmp_dir and cache_file to specify the path of the cache file, for use on a restricted file system.
Example:

云計算 5
李小福 2
創新辦 3

[Before]: 李小福 / 是 / 創新 / 辦 / 主任 / 也 / 是 / 云 / 計算 / 方面 / 的 / 專家 /
[After]: 李小福 / 是 / 創新辦 / 主任 / 也 / 是 / 云計算 / 方面 / 的 / 專家 /

Modify dictionary

Use add_word(word, freq=None, tag=None) and del_word(word) to modify the dictionary dynamically in programs.
Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be segmented.
Note that HMM may affect the final result.

Example: the suggest_freq interactive session shown earlier in this document applies here unchanged.

3. Keyword Extraction

import jieba.analyse

jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
- sentence: the text to extract keywords from
- topK: how many keywords with the highest TF/IDF weights to return. The default value is 20
- withWeight: whether to return the TF/IDF weights with the keywords. The default value is False
- allowPOS: include only words with the given POS tags. Empty means no filtering.
jieba.analyse.TFIDF(idf_path=None) creates a new TFIDF instance; idf_path specifies the IDF file path.

Example (keyword extraction)

https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py

Developers can specify their own custom IDF corpus in jieba keyword extraction

Usage: jieba.analyse.set_idf_path(file_name) # file_name is the path for the custom corpus
Custom Corpus Sample: https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
Sample Code: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py

Developers can specify their own custom stop words corpus in jieba keyword extraction

Usage: jieba.analyse.set_stop_words(file_name) # file_name is the path for the custom corpus
Custom Corpus Sample: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
Sample Code: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py

There's also a TextRank implementation available.

Use: jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))

Note that it filters POS by default.

jieba.analyse.TextRank() creates a new TextRank instance.

4. Part of Speech Tagging

jieba.posseg.POSTokenizer(tokenizer=None) creates a new customized Tokenizer. tokenizer specifies the jieba.Tokenizer to use internally. jieba.posseg.dt is the default POSTokenizer.
Tags the POS of each word after segmentation, using labels compatible with ictclas.
Example:

>>> import jieba.posseg as pseg
>>> words = pseg.cut("我愛北京天安門")
>>> for w in words:
...     print('%s %s' % (w.word, w.flag))
...
我 r
愛 v
北京 ns
天安門 ns

5. Parallel Processing

Principle: split the target text by line, assign the lines to multiple Python processes, and then merge the results, which is considerably faster.
Based on the multiprocessing module of Python.
Usage:
- jieba.enable_parallel(4) # Enable parallel processing. The parameter is the number of processes.
- jieba.disable_parallel() # Disable parallel processing.
Example: https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
Result: on a four-core 3.4GHz Linux machine, accurate-mode segmentation of the Complete Works of Jin Yong reaches 1MB/s, which is 3.3 times the speed of the single-process version.
Note that parallel processing supports only the default tokenizers, jieba.dt and jieba.posseg.dt.

6. Tokenize: return words with position

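The input must be unicode
Default mode

result = jieba.tokenize(u'永和服裝飾品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))

word 永和 start: 0 end:2
word 服裝 start: 2 end:4
word 飾品 start: 4 end:6
word 有限公司 start: 6 end:10

Search mode

result = jieba.tokenize(u'永和服裝飾品有限公司', mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))

word 永和 start: 0 end:2
word 服裝 start: 2 end:4
word 飾品 start: 4 end:6
word 有限 start: 6 end:8
word 公司 start: 8 end:10
word 有限公司 start: 6 end:10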

7. ChineseAnalyzer for Whoosh

from jieba.analyse import ChineseAnalyzer
Example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py

8. Command Line Interface

Initialization

By default, Jieba doesn't build the prefix dictionary until it is needed. Building it takes 1-3 seconds, and it is not built again afterwards. If you want to initialize Jieba manually, you can call:

import jieba
jieba.initialize()  # (optional)

You can also specify the dictionary (not supported before version 0.28):

jieba.set_dictionary('data/dict.txt.big')

Using Other Dictionaries

It is possible to use your own dictionary with Jieba, and there are also two dictionaries ready for download:

A smaller dictionary for a smaller memory footprint: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small

There is also a bigger dictionary that has better support for traditional Chinese (繁體): https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big

By default, an in-between dictionary is used, called dict.txt and included in the distribution.

In either case, download the file you want, and then call jieba.set_dictionary('data/dict.txt.big') or just replace the existing dict.txt.

Segmentation speed

1.5 MB / Second in Full Mode
400 KB / Second in Default Mode
Test Env: Intel(R) Core(TM) i7-2600 CPU @ 3.4GHz；《圍城》.txt

# -*- coding: utf-8 -*-
__author__ = '蘇葉'

# ### jieba features
# Three segmentation modes are supported:
#   Accurate mode tries to cut the sentence most precisely; suited to text analysis.
#   Full mode scans out every word that can be formed from the sentence; very fast, but it cannot resolve ambiguity.
#   Search engine mode re-splits long words on top of accurate mode to raise recall; suited to search engines.
# Supports Traditional Chinese segmentation.
# Supports custom dictionaries.
# MIT License.

# ### Segmentation speed
# 1.5 MB / Second in Full Mode
# 400 KB / Second in Default Mode
# Test environment: Intel(R) Core(TM) i7-2600 CPU @ 3.4GHz; 《圍城》.txt

# # I. Part One

# ## Part 1. Segmentation
# jieba.cut takes three parameters; its signature in the jieba source is
#   cut(self, sentence, cut_all=False, HMM=True)
# They are: the input text, whether to use full mode, and whether to enable the HMM (hidden Markov model).
# jieba.cut_for_search takes two parameters: the string to segment and whether to use the HMM model.
# It is suited to building inverted indexes for search engines; its granularity is fairly fine.
# The input can be a unicode, UTF-8, or GBK string. Passing GBK directly is not recommended,
# because it may be unpredictably mis-decoded as UTF-8.
# jieba.cut and jieba.cut_for_search return an iterable generator; loop over it to get each word (unicode),
# or use jieba.lcut and jieba.lcut_for_search to get a list directly.
# jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a custom tokenizer, so different dictionaries can be
# used at the same time. jieba.dt is the default tokenizer; all global segmentation functions map onto it.

import jieba

# 1. Full mode
data1 = jieba.cut("我來到山東師范大學", cut_all=True)
print("Full mode: " + "/".join(data1))
# Full mode: 我/來到/山東/山東師范大學/師范/師范大學/大學

# 2. Accurate mode (also the default)
data2 = jieba.cut("我來到山東師范大學", cut_all=False)  # same as jieba.cut("我來到山東師范大學")
print("Accurate mode: " + "/".join(data2))
# Accurate mode: 我/來到/山東師范大學

# 3. Search engine mode
data3 = jieba.cut_for_search("我來到山東師范大學")
print("Search engine mode: " + "/".join(data3))
# Search engine mode: 我/來到/山東/師范/大學/山東師范大學

# ## Part 2. Adding a custom dictionary

# ### Loading a dictionary
# A custom dictionary adds words that are missing from the jieba dictionary. jieba can recognize
# new words, but adding them yourself ensures higher accuracy.
# Usage: jieba.load_userdict(file_name)  # file_name is the path to the custom dictionary
# The format matches dict.txt: one word per line, as word, frequency (optional), POS tag (optional),
# separated by spaces.
# If the frequency is omitted, a computed value that guarantees the word is segmented out is used.
# Changing a tokenizer's tmp_dir and cache_file attributes sets the cache file location,
# for restricted file systems.

# For example, jieba may split 創新辦 into 創新 and 辦; this is where a custom vocabulary helps.
# Before loading the custom vocabulary
data4 = jieba.cut("李小福是創新辦主任也是云計算方面的專家")
print("Before loading the custom vocabulary: " + "/".join(data4))
# Before loading the custom vocabulary: 李小福/是/創新/辦/主任/也/是/云/計算/方面/的/專家

# After loading the custom vocabulary
jieba.load_userdict("ext_words.txt")  # contains just two words: 創新辦 and 云計算
data5 = jieba.cut("李小福是創新辦主任也是云計算方面的專家")
print("After loading the custom vocabulary: " + "/".join(data5))
# After loading the custom vocabulary: 李小福/是/創新辦/主任/也/是/云計算/方面/的/專家

# ### Adjusting the dictionary
# add_word(word, freq=None, tag=None) and del_word(word) modify the dictionary at run time.
# suggest_freq(segment, tune=True) adjusts the frequency of a single word so that it can
# (or cannot) be segmented out.
# Note: automatically computed frequencies may have no effect when HMM new-word discovery is in use.

print("/".join(jieba.cut("如果放到post中將出錯。", HMM=False)))
# 如果/放到/post/中將/出錯/。

# Tune the frequency so that "中" and "將" are split apart
jieba.suggest_freq(("中", "將"), tune=True)
print("/".join(jieba.cut("如果放到post中將出錯。", HMM=False)))
# 如果/放到/post/中/將/出錯/。

# Modify the dictionary dynamically
Original = "/".join(jieba.cut("江州市長江大橋參加了長江大橋的通車儀式。", HMM=False))
print("Original: " + Original)
# Original: 江州/市/長江大橋/參加/了/長江大橋/的/通車/儀式/。

# After adding a word
jieba.add_word("江大橋", freq=20000, tag=None)
print("/".join(jieba.cut("江州市長江大橋參加了長江大橋的通車儀式。")))
# 江州/市長/江大橋/參加/了/長江大橋/的/通車/儀式/。

# ## Part 3. Part-of-speech tagging
# jieba.posseg.POSTokenizer(tokenizer=None) creates a custom tokenizer; the tokenizer parameter
# specifies the jieba.Tokenizer used internally. jieba.posseg.dt is the default POS tagging tokenizer.
# It tags the POS of each word after segmentation, using labels compatible with ictclas.

import jieba.posseg as pseg

words = pseg.cut("我愛北京天安門。")
for w in words:
    print("%s %s" % (w.word, w.flag))
# 我 r
# 愛 v
# 北京 ns
# 天安門 ns
# 。 x

# ## Part 4. Keyword extraction

# ### Keyword extraction based on the TF-IDF algorithm
# import jieba.analyse
# jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
#   sentence: the text to extract keywords from.
#   topK: how many keywords with the highest TF/IDF weights to return; default 20.
#   withWeight: whether to return the keyword weights as well; default False.
#   allowPOS: include only words with the given POS tags; default empty, meaning no filtering.
# jieba.analyse.TFIDF(idf_path=None) creates a new TFIDF instance; idf_path is the IDF frequency file.

# Notes on OptionParser from the optparse module
# optparse is a module for adding options to command-line programs.

from optparse import OptionParser

MSG_USAGE = "myprog[ -f ][-s ] arg1[,arg2..]"
optParser = OptionParser(MSG_USAGE)
# This creates an OptionParser object, optParser. MSG_USAGE is shown when the help text is printed.

optParser.add_option("-f", "--file", action="store", type="string", dest="fileName")
optParser.add_option("-v", "--vison", action="store_false", dest="verbose", default='gggggg',
                     help="make lots of noise [default]")
# add_option() parameters:
#   action: how the value is stored, for example store, store_false, store_true
#   type: the value type
#   dest: the attribute the value is stored in
#   default: the default value
#   help: the help text

fakeArgs = ['-f', 'file.txt', '-v', 'good luck to you', 'arg2', 'arge']
options, args = optParser.parse_args(fakeArgs)
print(options.fileName)
print(options.verbose)
print(options)
print(args)
# parse_args() returns an options object and a list.
# If no arguments are passed, parse_args uses sys.argv[1:]; here fakeArgs simulates the input.
# options holds the option/value pairs parsed from fakeArgs.
# args is a list of what remains of fakeArgs once the options are removed.
# options.verbose and options.fileName read their values from options.

optParser.print_help()  # print the help text
# Notes on print_help():
# 1. The MSG_USAGE value from the start is displayed here.
# 2. A -h option was added automatically.

import jieba.analyse as anl

# The input files below are assumed to be UTF-8 encoded.
f = open("C:\\Users\\Luo Chen\\Desktop\\demo.txt", "r", encoding="utf-8").read()
seg = anl.extract_tags(f, topK=20, withWeight=True)
for tag, weight in seg:
    print("%s %s" % (tag, weight))

# The IDF corpus used for keyword extraction can be switched to the path of a custom corpus.
# jieba.analyse.set_idf_path(file_name)  # file_name is the path to the custom corpus
# e.g. jieba.analyse.set_idf_path("../extra_dict/idf.txt.big")
# idf.txt.big is the sample IDF corpus file in jieba's extra_dict directory.
#
# The stop words corpus used for keyword extraction can be switched to the path of a custom corpus.
# jieba.analyse.set_stop_words(file_name)  # file_name is the path to the custom corpus
# e.g. jieba.analyse.set_stop_words("../extra_dict/stop_words.txt")

# ### Keyword extraction based on the TextRank algorithm
# Basic idea:
#   Segment the text from which keywords are to be extracted;
#   build a graph from word co-occurrence inside a fixed-size window (default 5, adjustable via span);
#   compute the PageRank of the nodes in the graph; note that the graph is undirected and weighted.
# jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
# filters by POS by default.
# jieba.analyse.TextRank() creates a new custom TextRank instance.

s = "此外，公司擬對全資子公司吉林歐亞置業有限公司增資4.3億元，增資后，吉林歐亞置業注冊資本由7000萬元增加到5億元。吉林歐亞置業主要經營范圍為房地產開發及百貨零售等業務。目前在建吉林歐亞城市商業綜合體項目。2013年，實現營業收入0萬元，實現凈利潤-139.13萬元。"
for x, w in jieba.analyse.textrank(s, topK=5, withWeight=True):
    print("%s %s" % (x, w))

# ## Part 5. Parallel segmentation (multi-process)
# How it works: the text is split by line, the lines are distributed to multiple Python processes
# that segment them in parallel, and the results are merged, giving a considerable speedup.
# Based on Python's built-in multiprocessing module; Windows is not currently supported.
# Usage:
#   jieba.enable_parallel(4)  # enable parallel segmentation; the argument is the number of processes
#   jieba.disable_parallel()  # disable parallel segmentation
# Experimental result: on a 4-core 3.4GHz Linux machine, accurate-mode segmentation of the complete
# works of Jin Yong reached 1MB/s, 3.3 times the speed of the single-process version.
# Note: parallel segmentation supports only the default tokenizers jieba.dt and jieba.posseg.dt.

# ## Part 6. Tokenize: return the start and end positions of words in the original text
# Note: the input must be unicode.
# Two modes: default mode and search mode.

# ### Default mode
result = jieba.tokenize(u"永和服裝飾品有限公司")
for tk in result:
    print("%s \t start at: %d \t end at: %d" % (tk[0], tk[1], tk[2]))

# ### Search mode
# Scans out every word that can be formed from the sentence and reports its position.
result = jieba.tokenize(u"永和服裝飾品有限公司", mode="search")
for tk in result:
    print("%s \t start at: %d \t end at: %d" % (tk[0], tk[1], tk[2]))

# ## Part 7. Lazy loading
# jieba uses lazy loading: import jieba and jieba.Tokenizer() do not load the dictionary immediately;
# the dictionary is loaded and the prefix dictionary built only when needed.
# You can also initialize jieba manually:
#   import jieba
#   jieba.initialize()  # manual initialization (optional)
# Versions before 0.28 could not specify the main dictionary path; with lazy loading you can change it:
#   jieba.set_dictionary("data/dict.txt.big")
# You can also download the dictionary you need and overwrite jieba/dict.txt with it.

# # II. Part Two

# ## Part 1. Word frequency counts, sorted in descending order
article = open("C:\\Users\\Luo Chen\\Desktop\\demo_long.txt", "r", encoding="utf-8").read()
words = jieba.cut(article, cut_all=False)
word_freq = {}
for word in words:
    if word in word_freq:
        word_freq[word] += 1
    else:
        word_freq[word] = 1
freq_word = []
for word, freq in word_freq.items():
    freq_word.append((word, freq))
freq_word.sort(key=lambda x: x[1], reverse=True)

max_number = int(input(u"How many of the most frequent words do you need? "))

for word, freq in freq_word[:max_number]:
    print(word, freq)

# ## Part 2. Removing stop words by hand
# Punctuation, function words, and conjunctions are excluded from the counts.
stopwords = []
for word in open("C:\\Users\\Luo Chen\\Desktop\\stop_words.txt", "r", encoding="utf-8"):
    stopwords.append(word.strip())
article = open("C:\\Users\\Luo Chen\\Desktop\\demo_long.txt", "r", encoding="utf-8").read()
words = jieba.cut(article, cut_all=False)
stayed_line = ""
for word in words:
    if word not in stopwords:  # compare str with str; encoding to bytes would never match
        stayed_line += word + " "
print(stayed_line)

# ## Part 3. Merging synonyms
# List the synonyms on one line, separated by tabs; the first word is the one to display
# and the rest are synonyms to be replaced by it.
# Here "北京", "首都", "京城", "北平城", and "故都" are synonyms.
combine_dict = {}
for line in open("C:\\Users\\Luo Chen\\Desktop\\tongyici.txt", "r", encoding="utf-8"):
    seperate_word = line.strip().split("\t")
    num = len(seperate_word)
    for i in range(1, num):
        combine_dict[seperate_word[i]] = seperate_word[0]

jieba.suggest_freq("北平城", tune=True)
seg_list = jieba.cut("北京是中國的首都，京城的景色非常優美，就像當年的北平城，我愛這故都的一草一木。", cut_all=False)
f = ",".join(seg_list)
result = open("C:\\Users\\Luo Chen\\Desktop\\output.txt", "w", encoding="utf-8")
result.write(f)  # a text-mode file takes str, so no .encode() call is needed
result.close()

for line in open("C:\\Users\\Luo Chen\\Desktop\\output.txt", "r", encoding="utf-8"):
    line_1 = line.split(",")
    final_sentence = ""
    for word in line_1:
        if word in combine_dict:
            word = combine_dict[word]
            final_sentence += word
        else:
            final_sentence += word
    print(final_sentence)

# ## Part 4. Word mention rate
# Main steps: segment -> filter stop words (omitted) -> replace synonyms -> compute how often
# each word occurs in the text.
origin = open("C:\\Users\\Luo Chen\\Desktop\\tijilv.txt", "r", encoding="utf-8").read()
jieba.suggest_freq("晨媽媽", tune=True)
jieba.suggest_freq("大黑牛", tune=True)
jieba.suggest_freq("能力者", tune=True)
seg_list = jieba.cut(origin, cut_all=False)
f = ",".join(seg_list)

output_1 = open("C:\\Users\\Luo Chen\\Desktop\\output_1.txt", "w", encoding="utf-8")
output_1.write(f)
output_1.close()

combine_dict = {}
for w in open("C:\\Users\\Luo Chen\\Desktop\\tongyici.txt", "r", encoding="utf-8"):
    w_1 = w.strip().split("\t")
    num = len(w_1)
    for i in range(0, num):
        combine_dict[w_1[i]] = w_1[0]

seg_list_2 = ""
for i in open("C:\\Users\\Luo Chen\\Desktop\\output_1.txt", "r", encoding="utf-8"):
    i_1 = i.split(",")
    for word in i_1:
        if word in combine_dict:
            word = combine_dict[word]
            seg_list_2 += word
        else:
            seg_list_2 += word
print(seg_list_2)

freq_word = {}
seg_list_3 = jieba.cut(seg_list_2, cut_all=False)
for word in seg_list_3:
    if word in freq_word:
        freq_word[word] += 1
    else:
        freq_word[word] = 1

freq_word_1 = []
for word, freq in freq_word.items():
    freq_word_1.append((word, freq))
freq_word_1.sort(key=lambda x: x[1], reverse=True)
for word, freq in freq_word_1:
    print(word, freq)

total_freq = 0
for i in freq_word_1:
    total_freq += i[1]

for word, freq in freq_word.items():
    freq = float(freq) / float(total_freq)
    print(word, freq)

# ## Part 5. Extracting words by part of speech
import jieba.posseg as pseg

word = pseg.cut("李晨好帥，又能力超強，是“大黑牛”，也是一個能力者，還是隊里貼心的晨媽媽。")
for w in word:
    if w.flag in ["n", "v", "x"]:
        print(w.word, w.flag)
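The stop word, synonym, and mention-rate steps in Part Two can be condensed with collections.Counter. This is a compact sketch under stated assumptions: the input is an already segmented word list, and the function name `mention_rates`, the stop word set, and the synonym map are all hypothetical examples rather than the files used in the script above.

```python
from collections import Counter


def mention_rates(words, stopwords=(), synonyms=None):
    """Drop stop words, map synonyms to a canonical word, return (word, count, rate) rows."""
    synonyms = synonyms or {}
    kept = [synonyms.get(w, w) for w in words if w not in stopwords]
    counts = Counter(kept)
    total = sum(counts.values())
    # most_common() already sorts by descending count.
    return [(w, c, c / total) for w, c in counts.most_common()]


words = ["北京", "是", "首都", "，", "京城", "很", "美"]
rows = mention_rates(
    words,
    stopwords={"是", "，", "很"},
    synonyms={"首都": "北京", "京城": "北京"},
)
for word, count, rate in rows:
    print(word, count, rate)
```

Working on the token list directly avoids the round trip through intermediate output files, and a set makes the stop word test a constant-time lookup.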
