Natural Language Processing Series: Chinese Word Segmentation Techniques
Outline
3.1 Introduction to Chinese Word Segmentation
- Rule-based segmentation
  The earliest approach: a manually built dictionary is matched against the text according to fixed rules. It is simple and efficient, but handles new (out-of-vocabulary) words poorly.
- Statistical segmentation
  Copes well with new-word discovery, but depends heavily on the quality of the training corpus.
- Hybrid segmentation
  A combination of rule-based and statistical segmentation.
3.2 規(guī)則分詞
-
定義
一種機械分詞方法,主要通過維護詞典,切分語句時,將語句中的每個字符串與詞表中的詞逐一匹配,找到則切分,否則不切分;
-
分類
  - Forward maximum matching (Maximum Match Method, MM)
    - Basic idea
      Assume the longest entry in the segmentation dictionary contains i characters; take the first i characters of the current string of the document being processed as the matching field and look it up in the dictionary.
    - Algorithm
      - From left to right, take the first m characters of the Chinese sentence to be segmented as the matching field, where m is the number of characters in the longest dictionary entry.
      - Look the field up in the dictionary. If the match succeeds, cut the matching field off as a word; if it fails, drop the last character of the matching field and use the remaining string as the new matching field. Repeat this process until all words have been cut out.
  - Backward maximum matching (Reverse Maximum Match Method, RMM)
    - Basic idea
      Scan and match from the end of the document being processed: each time, take the last i characters as the matching field; if the match fails, drop the first character of the matching field and continue matching.
  - Bi-directional Matching Method
    - Basic idea
      Compare the segmentations produced by forward and backward maximum matching, then, following the maximum-matching principle, choose the one that splits the sentence into fewer words (a minimal sketch combining the two methods is given after the code below).
- Code
  Forward maximum matching:
train_data = './data/train.txt'
test_data = './data/test.txt'
result_data = './data/test_sc_zhengxiang.txt'


def get_dic(train_data):
    # Build the dictionary: every whitespace-separated token in the training file is a word.
    with open(train_data, 'r', encoding='utf-8') as f:
        file_content = f.read().split()
    return list(set(file_content))


def MM(test_data, result_data, dic):
    # Forward maximum matching with a maximum word length of 5 characters.
    max_length = 5
    h = open(result_data, 'w', encoding='utf-8')
    with open(test_data, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    for line in lines:
        my_list = []
        len_hang = len(line)
        while len_hang > 0:
            # Take the first max_length characters and shrink from the right
            # until the candidate is in the dictionary or only one character is left.
            tryWord = line[0:max_length]
            while tryWord not in dic:
                if len(tryWord) == 1:
                    break
                tryWord = tryWord[0:len(tryWord) - 1]
            my_list.append(tryWord)
            line = line[len(tryWord):]
            len_hang = len(line)
        for t in my_list:
            if t == '\n':
                h.write('\n')
            else:
                print(t)
                h.write(t + " ")
    h.close()


if __name__ == '__main__':
    print('Loading dictionary')
    dic = get_dic(train_data)
    print('Matching')
    MM(test_data, result_data, dic)
  Backward maximum matching:
train_data = './data/train.txt'
test_data = './data/test.txt'
result_data = './data/test_sc.txt'


def get_dic(train_data):
    # Same dictionary construction as in the forward-matching example.
    with open(train_data, 'r', encoding='utf-8') as f:
        file_content = f.read().split()
    return list(set(file_content))


def RMM(test_data, result_data, dic):
    # Backward maximum matching with a maximum word length of 5 characters.
    max_length = 5
    h = open(result_data, 'w', encoding='utf-8')
    with open(test_data, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    for line in lines:
        my_stack = []
        len_hang = len(line)
        while len_hang > 0:
            # Take the last max_length characters and shrink from the left
            # until the candidate is in the dictionary or only one character is left.
            tryWord = line[-max_length:]
            while tryWord not in dic:
                if len(tryWord) == 1:
                    break
                tryWord = tryWord[1:]
            my_stack.append(tryWord)
            line = line[0:len(line) - len(tryWord)]
            len_hang = len(line)
        # Words were collected right to left, so pop the stack to restore sentence order.
        while len(my_stack):
            t = my_stack.pop()
            if t == '\n':
                h.write('\n')
            else:
                print(t)
                h.write(t + " ")
    h.close()


if __name__ == '__main__':
    print('Loading dictionary')
    dic = get_dic(train_data)
    print('Matching...')
    RMM(test_data, result_data, dic)
3.3 統(tǒng)計分詞
-
主要操作
- 建立統(tǒng)計語言模型;
- 對句子進行單詞劃分,然后對劃分結(jié)果進行概率計算,獲得概率最大額分詞方式,常用統(tǒng)計學(xué)習(xí)算法有隱含馬爾可夫(HMM)、條件隨機場(CRF)等;
- n-gram conditional probability
  $P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \dfrac{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1})}$
  where $\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1})$ is the total number of times the word sequence $w_{i-(n-1)}, \ldots, w_{i-1}$ occurs in the corpus.
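  As an illustration of the count ratio above, here is a tiny bigram (n = 2) estimate computed from a toy segmented corpus; the corpus and the helper name bigram_prob are made up for the example.

from collections import Counter

# Toy segmented corpus: each sentence is already split into words.
corpus = [
    ['我', '爱', '自然', '语言', '处理'],
    ['我', '爱', '机器', '学习'],
    ['他', '爱', '自然', '科学'],
]

unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))


def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    if unigram[prev] == 0:
        return 0.0
    return bigram[(prev, word)] / unigram[prev]


print(bigram_prob('爱', '自然'))   # 2 / 3, since "爱" occurs 3 times, twice followed by "自然"
print(bigram_prob('爱', '机器'))   # 1 / 3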
數(shù)學(xué)理論
假設(shè)用λ=λ1λ2…λn\lambda=\lambda _1 \lambda_2 …\lambda_nλ=λ1?λ2?…λn?代表輸入的句子,nnn表示句子長度,λi\lambda_iλi?表示字,o=o1o2…ono=o_1o_2…o_no=o1?o2?…on?表示輸出的標簽,則理想輸出為:
max=maxP(o=o1o2…on∣λ=λ1λ2…λn)=P(o1∣λ1)P(o2∣λ2)…P(on∣λn)max = maxP(o=o_1o_2…o_n|\lambda=\lambda _1 \lambda_2 …\lambda_n)=P(o_1|\lambda_1)P(o_2|\lambda_2)…P(o_n|\lambda_n)max=maxP(o=o1?o2?…on?∣λ=λ1?λ2?…λn?)=P(o1?∣λ1?)P(o2?∣λ2?)…P(on?∣λn?)
在這個算法中,求解結(jié)果的常用方法是Veterbi算法,這是一種動態(tài)規(guī)劃算法,核心是:如果最終的最優(yōu)路徑經(jīng)過某個節(jié)點oio_ioi?,則從初始節(jié)點到oi?1o_{i-1}oi?1?點的路徑必然也是一個最優(yōu)路徑,因為每個節(jié)點oio_ioi?只會影響前后兩個P(oi?1∣oi)P(o_{i-1}|o_i)P(oi?1?∣oi?)和P(oi∣oi+1)P(o_i|o_{i+1})P(oi?∣oi+1?);
- Python implementation (HMM)
import os
import pickle


class HMM(object):
    def __init__(self):
        # Path used to persist the trained model parameters.
        self.model_file = './data/models/hmm_model.pkl'
        # Character states: B (begin), M (middle), E (end), S (single-character word).
        self.state_list = ['B', 'M', 'E', 'S']
        self.load_para = False

    def try_load_model(self, trained):
        if trained:
            with open(self.model_file, 'rb') as f:
                self.A_dic = pickle.load(f)   # transition probabilities
                self.B_dic = pickle.load(f)   # emission probabilities
                self.Pi_dic = pickle.load(f)  # initial state probabilities
                self.load_para = True
        else:
            self.A_dic = {}
            self.B_dic = {}
            self.Pi_dic = {}
            self.load_para = False

    def train(self, path):
        self.try_load_model(False)
        Count_dic = {}

        def init_parameters():
            for state in self.state_list:
                self.A_dic[state] = {s: 0.0 for s in self.state_list}
                self.Pi_dic[state] = 0.0
                self.B_dic[state] = {}
                Count_dic[state] = 0

        def makeLabel(text):
            # Label a word with the BMES scheme.
            out_text = []
            if len(text) == 1:
                out_text.append('S')
            else:
                out_text += ['B'] + ['M'] * (len(text) - 2) + ['E']
            return out_text

        init_parameters()
        line_num = -1
        words = set()
        with open(path, encoding='utf8') as f:
            for line in f:
                line_num += 1
                line = line.strip()
                if not line:
                    continue
                word_list = [i for i in line if i != ' ']
                words |= set(word_list)
                linelist = line.split()
                line_state = []
                for w in linelist:
                    line_state.extend(makeLabel(w))
                assert len(word_list) == len(line_state)
                for k, v in enumerate(line_state):
                    Count_dic[v] += 1
                    if k == 0:
                        self.Pi_dic[v] += 1   # first state of each sentence
                    else:
                        self.A_dic[line_state[k - 1]][v] += 1   # transition counts
                        self.B_dic[line_state[k]][word_list[k]] = \
                            self.B_dic[line_state[k]].get(word_list[k], 0) + 1.0  # emission counts
        # Normalize counts into probabilities (add-one smoothing for emissions).
        self.Pi_dic = {k: v * 1.0 / line_num for k, v in self.Pi_dic.items()}
        self.A_dic = {k: {k1: v1 / Count_dic[k] for k1, v1 in v.items()}
                      for k, v in self.A_dic.items()}
        self.B_dic = {k: {k1: (v1 + 1) / Count_dic[k] for k1, v1 in v.items()}
                      for k, v in self.B_dic.items()}
        with open(self.model_file, 'wb') as f:
            pickle.dump(self.A_dic, f)
            pickle.dump(self.B_dic, f)
            pickle.dump(self.Pi_dic, f)
        return self

    def viterbi(self, text, states, start_p, trans_p, emit_p):
        V = [{}]
        path = {}
        for y in states:
            V[0][y] = start_p[y] * emit_p[y].get(text[0], 0)
            path[y] = [y]
        for t in range(1, len(text)):
            V.append({})
            newpath = {}
            # Was this character never seen during training?
            neverSeen = text[t] not in emit_p['S'].keys() and \
                text[t] not in emit_p['M'].keys() and \
                text[t] not in emit_p['E'].keys() and \
                text[t] not in emit_p['B'].keys()
            for y in states:
                emitP = emit_p[y].get(text[t], 0) if not neverSeen else 1.0
                (prob, state) = max(
                    [(V[t - 1][y0] * trans_p[y0].get(y, 0) * emitP, y0)
                     for y0 in states if V[t - 1][y0] > 0])
                V[t][y] = prob
                newpath[y] = path[state] + [y]
            path = newpath
        if emit_p['M'].get(text[-1], 0) > emit_p['S'].get(text[-1], 0):
            (prob, state) = max([(V[len(text) - 1][y], y) for y in ('E', 'M')])
        else:
            (prob, state) = max([(V[len(text) - 1][y], y) for y in states])
        return (prob, path[state])

    def cut(self, text):
        if not self.load_para:
            self.try_load_model(os.path.exists(self.model_file))
        prob, pos_list = self.viterbi(text, self.state_list, self.Pi_dic,
                                      self.A_dic, self.B_dic)
        begin, nexti = 0, 0
        for i, char in enumerate(text):
            pos = pos_list[i]
            if pos == 'B':
                begin = i
            elif pos == 'E':
                yield text[begin: i + 1]
                nexti = i + 1
            elif pos == 'S':
                yield char
                nexti = i + 1
        if nexti < len(text):
            yield text[nexti:]


if __name__ == '__main__':
    hmm = HMM()
    hmm.train('./data/trainCorpus.txt_utf8')
    print('Training finished')
    text = input('Enter a sentence: ')
    result = hmm.cut(text)
    print(str(list(result)))
- Other statistical segmentation algorithms (a minimal CRF sketch follows this list)
  - Conditional Random Fields (CRF)
  - Neural-network segmentation algorithms (CNN, LSTM)
3.5 A Chinese Segmentation Tool: Jieba
import jieba

sent = '中文分詞是文本處理中不可或缺的一步!'

# Full mode: output every word the dictionary can find in the sentence.
seg_list = jieba.cut(sent, cut_all=True)
print('Full mode: ', '/'.join(seg_list))

# Accurate mode (the default): the most precise segmentation.
seg_list = jieba.cut(sent, cut_all=False)
print('Accurate mode: ', '/'.join(seg_list))

# Search-engine mode: further splits long words to improve recall.
seg_list = jieba.cut_for_search(sent)
print('Search-engine mode: ', '/'.join(seg_list))
High-frequency word statistics over a news corpus, combining jieba segmentation with a stop-word list:

import glob
import random
import jieba


def get_content(path):
    # Read a GBK-encoded news file into a single string.
    with open(path, 'r', encoding='gbk', errors='ignore') as f:
        content = ''
        for l in f:
            content += l.strip()
        return content


def get_TF(words, topK=10):
    # Count word frequencies and return the topK most frequent words.
    tf_dic = {}
    for w in words:
        tf_dic[w] = tf_dic.get(w, 0) + 1
    return sorted(tf_dic.items(), key=lambda x: x[1], reverse=True)[:topK]


def stop_words(path):
    with open(path) as f:
        return [l.strip() for l in f]


def main(path, stop_words_path):
    files = glob.glob(path)
    corpus = [get_content(x) for x in files[:5]]
    # Process a random number of the first five documents.
    sample_inxs = random.randint(0, len(corpus))
    for sample_inx in range(sample_inxs):
        split_words = [x for x in jieba.cut(corpus[sample_inx])
                       if x not in stop_words(stop_words_path)]
        print('Sample text: ' + corpus[sample_inx])
        print('Segmented sample: ' + '/ '.join(split_words))
        print('Top 10 words in the sample: ' + str(get_TF(split_words)))


if __name__ == '__main__':
    path = './data/news/C000010/*.txt'
    stop_words_path = './data/stop_words.utf8'
    main(path, stop_words_path)