Python English Tokenization | Natural Language Processing: Trying Out NLTK for English Tokenization

Published: 2023/12/15 python 23 豆豆

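NLTK is a platform for building Python programs that work with natural language data. It provides easy-to-use interfaces to more than 50 corpora and lexical resources (such as WordNet), together with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. NLTK runs on Windows, Mac OS, and Linux.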

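1. Installing NLTK

Install the NLTK library with pip install nltk. NLTK includes a package manager for corpora and models. Running the following code in the Python interpreter

import nltk
nltk.download()

opens the package manager window, where you can download corpora, pre-trained models, and other data.

Besides individual data packages, you can download the whole collection (using "all"), only the data used in the examples and exercises of the NLTK book (using "book"), or only the corpora, without grammars or trained models (using "all-corpora").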
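On a machine without a graphical interface, the same data can be fetched by passing a package identifier to nltk.download. The sketch below is a hypothetical helper, not part of NLTK. The package list is an assumption about what the examples in this article need, and the identifiers depend on the NLTK version (newer releases may use different names, such as "punkt_tab").

```python
# Hypothetical helper: the package list is an assumption for this article's
# examples, and identifiers can differ between NLTK versions.
REQUIRED_PACKAGES = [
    "punkt",                       # sentence and word tokenizer models
    "averaged_perceptron_tagger",  # part-of-speech tagger model
    "maxent_ne_chunker",           # named entity chunker model
    "words",                       # word list used by the chunker
    "wordnet",                     # lexical database used for lemmatization
]


def ensure_nltk_data(packages=REQUIRED_PACKAGES, quiet=True):
    """Download each package by identifier and return the ones that failed."""
    import nltk  # imported here so this module loads even before NLTK is installed

    failed = []
    for name in packages:
        # nltk.download returns True on success and False on failure
        if not nltk.download(name, quiet=quiet):
            failed.append(name)
    return failed
```

Calling ensure_nltk_data() once before running the examples should be enough. An empty return value means every package was downloaded or was already up to date.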

2. Simple text analysis

Tokenization

Part-of-speech tagging

Named entity recognition

import nltk

# Split into sentences first, then tokenize each sentence
sents = nltk.sent_tokenize("And now for something completely different. I love you.")
word = []
for sent in sents:
    word.append(nltk.word_tokenize(sent))
print(word)

# Tokenization
text = nltk.word_tokenize("And now for something completely different.")
print(text)

# Part-of-speech tagging
tagged = nltk.pos_tag(text)
print(tagged[0:6])

# Named entity recognition
entities = nltk.chunk.ne_chunk(tagged)
print(entities)

>>>[['And', 'now', 'for', 'something', 'completely', 'different', '.'], ['I', 'love', 'you', '.']]
>>>['And', 'now', 'for', 'something', 'completely', 'different', '.']
>>>[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
>>>(S And/CC now/RB for/IN something/NN completely/RB different/JJ ./.)
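To get a feel for what the two tokenization steps produce, here is a rough approximation using only the standard library. It is not NLTK's algorithm: simple_sent_tokenize and simple_tokenize are hypothetical names, and the regular expressions are simplifying assumptions. They will, for instance, split abbreviations and contractions differently from NLTK.

```python
import re


def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def simple_tokenize(sentence):
    """Return runs of word characters and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


sentences = simple_sent_tokenize("And now for something completely different. I love you.")
print([simple_tokenize(s) for s in sentences])
```

For this input the result has the same shape as the first line of output above: one list of tokens per sentence, with the final period as its own token.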

3. The FreqDist frequency distribution class

1. Use NLTK's FreqDist class to count how many times each token occurs in a text:

import nltk
from nltk import FreqDist

# Read the file
with open("D:\\App\\test.txt", "r") as f:
    text = f.read()
text1 = nltk.word_tokenize(text)

# Or tokenize a string directly
text1 = nltk.word_tokenize("And now for something completely different. I love you. This is my friend. You are my friend.")

# FreqDist() builds the frequency distribution of the tokens in the text
fdist = FreqDist(text1)
print(fdist)

# Total number of tokens
print(fdist.N())

# Number of distinct tokens
print(fdist.B())

>>><FreqDist with 16 samples and 21 outcomes>
21
16

2. Getting relative frequencies and counts:

# Relative frequency (as a percentage)
print(fdist.freq('friend') * 100)

# Count
print(fdist['friend'])

# Most frequent token
print(fdist.max())

>>>9.523809523809524
2
.
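The numbers above can be reproduced with collections.Counter from the standard library, which makes it easier to see what N(), B(), freq(), and max() measure. This is only a sketch. The token list is written out by hand and is assumed to match what nltk.word_tokenize returns for the example sentence.

```python
from collections import Counter

# Tokens of the example sentence, assumed to match the NLTK tokenization
tokens = ("And now for something completely different . I love you . "
          "This is my friend . You are my friend .").split()

counts = Counter(tokens)
total = sum(counts.values())    # what fdist.N() reports: all tokens
distinct = len(counts)          # what fdist.B() reports: distinct tokens
friend_percent = counts["friend"] / total * 100   # fdist.freq('friend') * 100
most_frequent = counts.most_common(1)[0][0]       # fdist.max()

print(total, distinct)
print(friend_percent)
print(most_frequent)
```

The word "friend" occurs 2 times among 21 tokens, which is where the 9.52% comes from.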

3. Tabulating and plotting the 5 most frequent tokens with cumulative counts:

fdist.tabulate(5, cumulative=True)

# Plot
fdist.plot(5, cumulative=True)

>>> . my friend And now
4 6 8 9 10
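With cumulative=True, each column shows a running total, not the token's own count. A minimal standard-library sketch of that calculation follows. It again assumes a hand-written token list that matches the NLTK tokenization.

```python
from collections import Counter
from itertools import accumulate

# Tokens of the example sentence, assumed to match the NLTK tokenization
tokens = ("And now for something completely different . I love you . "
          "This is my friend . You are my friend .").split()

top = Counter(tokens).most_common(5)            # five most frequent (token, count) pairs
words = [word for word, _ in top]
running = list(accumulate(count for _, count in top))   # running totals of the counts

print(words)
print(running)  # [4, 6, 8, 9, 10]
```

The individual counts are 4, 2, 2, 1, and 1, so the running totals are 4, 6, 8, 9, and 10, matching the table.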

4. Counting word pairs (bigrams):

text = "I want go to the Great Wall. This is my friend. You are my friend. The Great Wall in China."
text1 = nltk.word_tokenize(text)
bgrams = nltk.bigrams(text1)      # returns a generator
bgfdist = FreqDist(list(bgrams))  # frequency of each word pair
bgfdist.plot(10)                  # plot the 10 most frequent pairs
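A bigram is a pair of adjacent tokens, so the same count can be sketched without NLTK by zipping the token list with itself shifted by one position. The regular expression used for tokenizing here is a simplifying assumption, not the rule nltk.word_tokenize applies.

```python
import re
from collections import Counter

text = "I want go to the Great Wall. This is my friend. You are my friend. The Great Wall in China."

# Naive tokenization: runs of word characters, or single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", text)

# Pair every token with its successor and count the pairs
bigram_counts = Counter(zip(tokens, tokens[1:]))

for pair, count in bigram_counts.most_common(3):
    print(pair, count)
```

In this text the pairs ('Great', 'Wall'), ('my', 'friend'), and ('friend', '.') each occur twice; every other pair occurs once.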

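4. Lemmatization

Counting English word frequencies has to take inflection into account. For example, move and moved, or is and are, should be counted as one word instead of two, so the different forms of a word need to be reduced to a common base form (lemmatization).

Lemmatization here relies on NLTK. In NLTK, a tag describes a word's part of speech (including notions such as form and tense). Passing the tokenization result to nltk.pos_tag returns the tag of every word. For example, ('John', 'NNP') means that the tag of "John" is NNP (Proper noun, singular).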
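The tags returned by nltk.pos_tag follow the Penn Treebank scheme, while the WordNet lemmatizer expects one of four single-letter parts of speech. A minimal sketch of that mapping, using plain strings, is shown below. The helper name to_wordnet_pos is hypothetical, and the letters are the values of nltk.corpus.wordnet.ADJ, VERB, NOUN, and ADV.

```python
# First letter of the Treebank tag -> WordNet part-of-speech letter
WORDNET_POS_BY_PREFIX = {"J": "a", "V": "v", "N": "n", "R": "r"}


def to_wordnet_pos(treebank_tag):
    """Return the WordNet POS letter for a Treebank tag, or None if there is none."""
    return WORDNET_POS_BY_PREFIX.get(treebank_tag[:1])


for tag in ["NNP", "VBD", "JJR", "RB", "IN"]:
    print(tag, to_wordnet_pos(tag))
```

Tags such as IN (preposition) have no WordNet counterpart, which is why words carrying them are left unchanged when lemmatizing.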

Complete code for lemmatizing words and counting their frequencies:

import re
import collections
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Replace special characters (double quotes, periods, commas, ...) with a space;
# letters, spaces, and apostrophes are kept
pat_letter = re.compile(r"[^a-zA-Z ']+")

# Patterns for expanding common contractions
pat_is = re.compile(r"(it|he|she|that|this|there|here)('s)", re.I)  # 's as "is" after common pronouns (assumed word list)
pat_s = re.compile(r"(?<=[a-zA-Z])'s")      # possessive 's after a letter
pat_s2 = re.compile(r"(?<=s)'s?")           # possessive ' or 's after a trailing s
pat_not = re.compile(r"(?<=[a-zA-Z])n't")   # contraction of "not"
pat_would = re.compile(r"(?<=[a-zA-Z])'d")  # contraction of "would"
pat_will = re.compile(r"(?<=[a-zA-Z])'ll")  # contraction of "will"
pat_am = re.compile(r"(?<=[Ii])'m")         # contraction of "am"
pat_are = re.compile(r"(?<=[a-zA-Z])'re")   # contraction of "are"
pat_ve = re.compile(r"(?<=[a-zA-Z])'ve")    # contraction of "have"

lmtzr = WordNetLemmatizer()

def replace_abbreviations(text):
    # Strip special characters and lowercase, then expand the contractions
    new_text = pat_letter.sub(' ', text).strip().lower()
    new_text = pat_is.sub(r"\1 is", new_text)
    new_text = pat_s.sub("", new_text)
    new_text = pat_s2.sub("", new_text)
    new_text = pat_not.sub(" not", new_text)
    new_text = pat_would.sub(" would", new_text)
    new_text = pat_will.sub(" will", new_text)
    new_text = pat_am.sub(" am", new_text)
    new_text = pat_are.sub(" are", new_text)
    new_text = pat_ve.sub(" have", new_text)
    # Any apostrophe still left becomes a space
    new_text = new_text.replace("'", ' ')
    return new_text

# POS and tag are closely related: derive the WordNet POS from the Treebank tag
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return nltk.corpus.wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return nltk.corpus.wordnet.VERB
    elif treebank_tag.startswith('N'):
        return nltk.corpus.wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return nltk.corpus.wordnet.ADV
    else:
        return ''

def merge(words):
    new_words = []
    for word in words:
        if word:
            tag = nltk.pos_tag(word_tokenize(word))  # tag is like [('bigger', 'JJR')]
            pos = get_wordnet_pos(tag[0][1])
            if pos:
                # lemmatize() reduces the word to its base form for the given POS
                lemmatized_word = lmtzr.lemmatize(word, pos)
                new_words.append(lemmatized_word)
            else:
                new_words.append(word)
    return new_words

def get_words(file):
    with open(file) as f:
        words_box = []
        for line in f:
            # Expand contractions, split on whitespace, then lemmatize each word
            words_box.extend(merge(replace_abbreviations(line).split()))
    return collections.Counter(words_box)  # words and their frequencies

def append_ext(words):
    new_words = []
    for item in words:
        word, count = item
        tag = nltk.pos_tag(word_tokenize(word))[0][1]  # tag is like [('bigger', 'JJR')]
        new_words.append((word, count, tag))
    return new_words

# Write the results to a file
def write_to_file(words, file="D:\\App\\result.txt"):
    # The with block closes the file even if a write fails
    with open(file, 'w') as f:
        for item in words:
            for field in item:
                f.write(str(field) + ',')
            f.write('\n')

if __name__ == '__main__':
    print("counting...")
    words = get_words("D:\\App\\test.txt")
    print("writing file...")
    write_to_file(append_ext(words.most_common()))

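Results: each line of the output file holds a word, its count, and its part-of-speech tag, separated by commas.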
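The contraction step is easier to follow in isolation. The sketch below is a simplified, standard-library-only variant of it. The names CONTRACTIONS and expand are hypothetical, and it covers only a subset of the patterns in the full listing. It shares one limitation with that approach: irregular forms are not handled, so "can't" becomes "ca not".

```python
import re

# (pattern, replacement) pairs applied in order to lowercased text
CONTRACTIONS = [
    (re.compile(r"(?<=[a-z])n't"), " not"),
    (re.compile(r"(?<=[a-z])'ll"), " will"),
    (re.compile(r"(?<=[a-z])'re"), " are"),
    (re.compile(r"(?<=[a-z])'ve"), " have"),
    (re.compile(r"(?<=i)'m"), " am"),
]


def expand(text):
    """Lowercase the text and expand a few common English contractions."""
    text = text.lower()
    for pattern, replacement in CONTRACTIONS:
        text = pattern.sub(replacement, text)
    return text


print(expand("They don't know we've left"))  # they do not know we have left
print(expand("I'm sure you'll agree"))       # i am sure you will agree
```

Each lookbehind makes sure the contraction is attached to a preceding letter, so a stray apostrophe elsewhere in the text is left alone.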

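Besides NLTK, the CoreNLP natural language processing toolkit from Stanford is also widely used and is worth a closer look if you need it.

CoreNLP version: 3.9.1; download the stanford-corenlp-full-2018-02-27.zip archive

Features:

Tokenization (tokenize) and sentence splitting (ssplit)

Part-of-speech tagging (pos)

Lemmatization (lemma)

Named entity recognition (ner)

Syntactic parsing (parse)

Sentiment analysis (sentiment)

Supported languages: Chinese, English, French, German, Spanish, Arabic, and others.

Requirements: Java 1.8+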
