NLTK 2: Part-of-Speech Tagging

Published: 2023/12/18 · Programming Q&A · 26 · 豆豆


1. Tagging English text with NLTK
- 1.1 A tagging example
- 1.2 Pre-tagged corpus data
2. Taggers
- 2.1 Default tagger
- 2.2 Regular-expression tagger
- 2.3 Lookup tagger
3. Training N-gram taggers
- 3.1 General N-gram tagging
- 3.2 Combining taggers
4. Going further
5. Training a Chinese tagger
6. Brown corpus methods


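Natural language is a system of rules that people have formed through communication. Some rules are strict and others loose: spoken language is used in informal settings, written language in formal ones. Processing natural language has to follow these rules as well, otherwise the results will make no sense. A few related terms are briefly distinguished below.
Grammar: the rules for writing text, used to describe a language and its structure; it comprises syntactic and lexical rules.
Syntax: the rules governing the structure of sentences and how their constituents are formed and related.
Lexis: the rules governing how words are formed and inflected.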

Part-of-speech (POS) tagging is a way of analysing the components of a sentence: it identifies the part of speech of each word.

The meaning of a simplified POS tagset is summarised below; for the detailed Brown tags see nltk.help.brown_tagset().

Tag	Part of speech	Examples

ADJ	adjective	new, good, high, special, big, local
ADV	adverb	really, already, still, early, now
CONJ	conjunction	and, or, but, if, while, although
DET	determiner	the, a, some, most, every, no
EX	existential	there, there's
MOD	modal verb	will, can, would, may, must, should
NN	noun	year, home, costs, time
NNP	proper noun	April, China, Washington
NUM	number	fourth, 2016, 09:30
PRON	pronoun	he, they, us
P	preposition	on, over, with, of
TO	the word to	to
UH	interjection	ah, ha, oops
VB	verb	
VBD	past tense verb	made, said, went
VBG	present participle	going, lying, playing
VBN	past participle	taken, given, gone
WH	wh-determiner	who, where, when, what

1. Tagging English text with NLTK

1.1 A tagging example

import nltk

# Requires NLTK's tokenizer and tagger data (install with nltk.download()).
sent = "I am going to Beijing tomorrow."

# nltk.sent_tokenize(text) splits a text into sentences.
# nltk.word_tokenize(sentence) splits a sentence into tokens.
# NLTK tokenization works at the sentence level, so a document is first
# split into sentences and each sentence is then tokenized.

# Tokenize the sentence
words = nltk.word_tokenize(sent)
print(words)
# ['I', 'am', 'going', 'to', 'Beijing', 'tomorrow', '.']

# Tag the parts of speech
tagged_sent = nltk.pos_tag(words)
print(tagged_sent)
# [('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('Beijing', 'NNP'), ('tomorrow', 'NN'), ('.', '.')]

1.2 Pre-tagged corpus data

The corpus reader classes provide the following methods for returning pre-tagged data.

Method	Description

tagged_words(fileids, categories)	Returns the tagged data as a list of words
tagged_sents(fileids, categories)	Returns the tagged data as a list of sentences
tagged_paras(fileids, categories)	Returns the tagged data as a list of paragraphs
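The three return shapes nest inside one another. As a minimal sketch, using hand-made toy data rather than real corpus output (the sentences below are assumptions for illustration), a paragraph-level structure can be flattened step by step into the sentence-level and word-level shapes:

```python
from itertools import chain

# Hypothetical toy data in the shape of tagged_paras():
# paragraphs -> sentences -> (word, tag) pairs.
tagged_paras = [
    [[("The", "AT"), ("jury", "NN"), ("said", "VBD")],
     [("It", "PPS"), ("ended", "VBD")]],
    [[("Dogs", "NNS"), ("bark", "VB")]],
]

# Shape of tagged_sents(): drop the paragraph level.
tagged_sents = list(chain.from_iterable(tagged_paras))

# Shape of tagged_words(): drop the sentence level as well.
tagged_words = list(chain.from_iterable(tagged_sents))

print(len(tagged_paras), len(tagged_sents), len(tagged_words))
# prints: 2 3 7
```

Taggers are trained and evaluated on the sentence-level shape, which is why a flat word list has to be wrapped in an extra list before it is passed to an evaluation method.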

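2. Taggers

2.1 Default tagger

The simplest part-of-speech tagger assigns the tag NN (noun) to every word. It has little practical value because its accuracy is very low, but it serves as a baseline. The example below shows how to use the default tagger that NLTK provides.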

import nltk
from nltk.corpus import brown

# Load the data
brown_tagged_sents = brown.tagged_sents(categories='news')
# [[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...
brown_sents = brown.sents(categories='news')

# The simplest tagger assigns the same tag to every token. This may seem
# trivial, but it establishes an important baseline for tagger performance.
# For the best result, every word is tagged with the most likely tag.
# The following finds out which tag that is.
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
# ['AT', 'NP-TL', 'NN-TL', 'JJ-TL', 'NN-TL', 'VBD', 'NR', 'AT', 'NN', 'IN', 'NP$', 'JJ', 'NN', 'NN', 'VBD', ...]
tag = nltk.FreqDist(tags).max()
# 'NN'

# Now create a tagger that tags every word as NN.
default_tagger = nltk.DefaultTagger('NN')
sent = "I am going to Beijing tomorrow."
default_tagger.tag(nltk.word_tokenize(sent))
# [('I', 'NN'), ('am', 'NN'), ('going', 'NN'), ('to', 'NN'), ('Beijing', 'NN'), ('tomorrow', 'NN'), ('.', 'NN')]

# In recent NLTK releases, accuracy() is the replacement for evaluate().
default_tagger.evaluate(brown_tagged_sents)
# 0.13089484257215028
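The score of a default tagger is simply the share of tokens whose gold tag equals the default tag. The following plain-Python sketch illustrates that arithmetic on a hypothetical toy data set; it does not use NLTK, and the function names are made up for illustration.

```python
from collections import Counter

# Hypothetical gold-standard data: sentences of (word, tag) pairs.
gold_sents = [
    [("The", "AT"), ("jury", "NN"), ("praised", "VBD"), ("the", "AT"), ("city", "NN")],
    [("The", "AT"), ("election", "NN"), ("result", "NN"), ("surprised", "VBD")],
]

def most_frequent_tag(tagged_sents):
    """Return the tag that occurs most often in the data."""
    counts = Counter(tag for sent in tagged_sents for _, tag in sent)
    return counts.most_common(1)[0][0]

def default_accuracy(tagged_sents, default_tag):
    """Fraction of tokens whose gold tag equals the default tag."""
    tags = [tag for sent in tagged_sents for _, tag in sent]
    return sum(tag == default_tag for tag in tags) / len(tags)

best = most_frequent_tag(gold_sents)
print(best, default_accuracy(gold_sents, best))
```

In the toy data NN covers 4 of 9 tokens, so the baseline accuracy is 4/9; on the Brown news category the same calculation gives the roughly 13% reported above.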

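2.2 Regular-expression tagger

A regular-expression tagger assigns tags to tokens by pattern matching. For example, one may assume that any word ending in ed is a past form of a verb and that any word ending in 's is a possessive noun. These rules can be expressed as a list of regular expressions, as in the example below.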

patterns = [
    (r'.*ing$', 'VBG'),               # gerunds
    (r'.*ed$', 'VBD'),                # simple past
    (r'.*es$', 'VBZ'),                # 3rd singular present
    (r'.*ould$', 'MD'),               # modals
    (r'.*\'s$', 'NN$'),               # possessive nouns
    (r'.*s$', 'NNS'),                 # plural nouns
    # Note: the unescaped "." matches any character; write "\." to match only a decimal point.
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),                    # nouns (default)
]

The patterns are processed in order, and the first one that matches is applied. Now build a tagger and use it to tag a sentence.

regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown_sents[3])
regexp_tagger.evaluate(brown_tagged_sents)
# 0.20326391789486245  (about one fifth of the tags are correct)
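To make the first-match-wins behaviour concrete, here is a minimal plain-Python sketch of the idea. It is not NLTK's implementation: the function name regexp_tag is hypothetical, and the decimal point in the number pattern is escaped here.

```python
import re

# Patterns are tried in order; the first one that matches wins.
PATTERNS = [
    (r".*ing$", "VBG"),                # gerunds
    (r".*ed$", "VBD"),                 # simple past
    (r".*es$", "VBZ"),                 # 3rd singular present
    (r".*ould$", "MD"),                # modals
    (r".*'s$", "NN$"),                 # possessive nouns
    (r".*s$", "NNS"),                  # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
    (r".*", "NN"),                     # default: noun
]
COMPILED = [(re.compile(pattern), tag) for pattern, tag in PATTERNS]

def regexp_tag(tokens):
    """Tag each token with the tag of the first matching pattern."""
    tagged = []
    for token in tokens:
        for regex, tag in COMPILED:
            if regex.match(token):
                tagged.append((token, tag))
                break
    return tagged

print(regexp_tag(["running", "walked", "3.14", "cats", "would", "table"]))
```

Because order decides the outcome, a word such as "bed" is tagged VBD by the second pattern even though it is a noun, which is one reason this kind of tagger only reaches about 20% accuracy on its own.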

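2.3 Lookup tagger

Many high-frequency words do not have the NN tag. We find the 100 most frequent words and store their most likely tags, then use this information as the model for a "lookup tagger" (an NLTK UnigramTagger), as in the following example: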

# Count the words first
fd = nltk.FreqDist(brown.words(categories='news'))
# brown.words(categories='news'): ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

# A ConditionalFreqDist is a collection of frequency distributions for a single
# experiment run under different conditions. It records how often each sample
# occurs under a given condition, for example the frequency of each word (type)
# of a given length in a document. Formally, it maps each condition to the
# FreqDist observed under that condition.
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# print(cfd.items())
# dict_items([('The', FreqDist({'AT': 775, 'AT-TL': 28, 'AT-HL': 3})), ('Fulton', FreqDist({'NP-TL': 10, 'NP': 4})), ...

# Top 100 words. In Python 3.6+ keys() returns a dict_keys view, which must be
# converted to a list before slicing.
# Note: this slice takes the first 100 distinct words in order of first
# appearance, not the 100 most frequent; fd.most_common(100) returns the latter.
most_freq_words = fd.keys()
most_freq_words = list(most_freq_words)[:100]
# ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', ...

# For each of these words, take the most frequent tag in its distribution
likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
# {'The': 'AT', 'Fulton': 'NP-TL', 'County': 'NN-TL', 'Grand': 'JJ-TL', 'Jury': 'NN-TL', 'said': 'VBD', 'Friday': 'NR', ...

# UnigramTagger finds the most likely tag for each word in the training corpus
# and then uses that information to tag new tokens.
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
baseline_tagger.evaluate(brown_tagged_sents)
# 0.3329355371243312

# brown.tagged_words(categories='news') is a flat list: [('The', 'AT'), ('Fulton', 'NP-TL'), ...]
# Wrap it in a list to get the two-dimensional shape that evaluate() expects.
baseline_tagger.evaluate([brown.tagged_words(categories='news')])
# 0.3329355371243312

# Individual sentences can score very high
baseline_tagger.evaluate([brown.tagged_sents(categories='news')[3]])
# 0.972972972972973

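This result differs from the book, which reports about 0.45, meaning that knowing the tags of just the 100 most frequent words is enough to tag a large share of tokens correctly. A likely cause of the gap is that list(fd.keys())[:100] yields the first 100 distinct words in order of appearance rather than the 100 most frequent ones.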

Let's see how it does on untagged input text:

sent = brown.sents(categories='news')[10]
baseline_tagger.tag(sent)
# [('It', 'PPS'), ('urged', None), ('that', 'CS'), ('the', 'AT'), ('city', 'NN'), ('``', '``'), ('take', None), ('steps', None), ('to', 'TO'), ('remedy', None), ("''", "''"), ('this', 'DT'), ('problem', None), ('.', '.')]

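Many words are assigned the tag None because they are not among the 100 most frequent words. In these cases we would like to assign the default tag NN. In other words, we use the lookup table first and, if it cannot assign a tag, fall back to the default tagger. This process is called backoff.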

# Set a default tagger to use when the lookup finds no match
baseline_tagger = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))
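The lookup-then-backoff idea fits in a few lines of plain Python. This is a sketch under stated assumptions rather than NLTK's implementation: build_lookup and lookup_tag are hypothetical helper names, and the training pairs are invented toy data.

```python
from collections import Counter, defaultdict

def build_lookup(tagged_words, size):
    """Map each of the `size` most frequent words to its most frequent tag."""
    word_counts = Counter(word for word, _ in tagged_words)
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts[word][tag] += 1
    return {word: tag_counts[word].most_common(1)[0][0]
            for word, _ in word_counts.most_common(size)}

def lookup_tag(tokens, table, default=None):
    """Tag with the table; use `default` for words the table does not know."""
    return [(token, table.get(token, default)) for token in tokens]

# Hypothetical training data as flat (word, tag) pairs.
train = [("the", "AT"), ("city", "NN"), ("the", "AT"), ("to", "TO"),
         ("to", "TO"), ("to", "IN"), ("steps", "NNS")]
table = build_lookup(train, size=2)  # keeps only "to" and "the"

sent = ["the", "city", "to"]
print(lookup_tag(sent, table))                # unknown words get None
print(lookup_tag(sent, table, default="NN"))  # backoff to the default tag
```

Without a default, "city" comes back as None because it is not among the two most frequent words; with the default it is tagged NN, which mirrors what the backoff parameter does.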

Finally, with the lookup tagger and the default tagger combined, let's see how it performs with models of different sizes:

def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

def display():
    import pylab
    # most_common() orders the words from most to least frequent
    word_freqs = nltk.FreqDist(brown.words(categories='news')).most_common()
    words_by_freq = [w for (w, _) in word_freqs]
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(16)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()

display()

As the model grows, performance rises quickly at first and then levels off; beyond that point, further increases in model size bring only small gains.

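3. Training N-gram taggers

3.1 General N-gram tagging

The previous section already used a 1-gram, i.e. the Unigram tagger. Taking more context into account gives 2-gram and 3-gram taggers, collectively called N-gram taggers here. Note that a longer context does not necessarily improve accuracy.
Besides supplying an N-gram tagger with a lookup model, another way to build a tagger is to train it. The constructor of an N-gram tagger is __init__(train=None, model=None, backoff=None); tagged corpus data can be passed as training data to build the tagger.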

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')

# Use 90% of the sentences for training and the rest for testing
train_num = int(len(brown_tagged_sents) * 0.9)
x_train = brown_tagged_sents[0:train_num]
x_test = brown_tagged_sents[train_num:]

tagger = nltk.UnigramTagger(train=x_train)
print(tagger.evaluate(x_test))
# 0.8121200039868434

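With the Unigram tagger, training on 90% of the data and testing on the remaining 10% gives an accuracy of 81%. Switching to a Bigram tagger on its own drops the accuracy to about 10%: a context that never occurred in training is tagged None, and every following word in the sentence then has an unseen context as well.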
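The collapse of a stand-alone bigram tagger can be reproduced with a small plain-Python sketch. It is an illustration of the idea, not NLTK's code: the helper names and the two training sentences are assumptions. The context is the pair (previous tag, current word).

```python
from collections import Counter, defaultdict

def train_bigram(tagged_sents):
    """Learn the most frequent tag for each (previous tag, word) context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = None  # None marks the start of a sentence
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def bigram_tag(tokens, table):
    """Tag left to right; a context not seen in training yields None."""
    tagged, prev = [], None
    for token in tokens:
        tag = table.get((prev, token))
        tagged.append((token, tag))
        prev = tag
    return tagged

# Hypothetical training sentences.
train = [[("the", "AT"), ("jury", "NN"), ("said", "VBD")],
         [("the", "AT"), ("city", "NN"), ("said", "VBD")]]
table = train_bigram(train)

print(bigram_tag(["the", "jury", "said"], table))   # fully seen in training
print(bigram_tag(["the", "mayor", "said"], table))  # "mayor" is unseen
```

In the second sentence the unknown word "mayor" gets None, and "said" then also gets None because the context (None, "said") never occurred in training, even though "said" itself is a known word.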

3.2 Combining taggers

The backoff parameter can be used to chain several taggers together and improve tagging accuracy.

import nltk
from nltk.corpus import brown

pattern = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*\'s$', 'NN$'),
    (r'.*s$', 'NNS'),
    (r'.*', 'NN'),  # anything unmatched is still tagged NN
]

brown_tagged_sents = brown.tagged_sents(categories='news')
train_num = int(len(brown_tagged_sents) * 0.9)
x_train = brown_tagged_sents[0:train_num]
x_test = brown_tagged_sents[train_num:]

# Each tagger backs off to the previous one when it cannot assign a tag
t0 = nltk.RegexpTagger(pattern)
t1 = nltk.UnigramTagger(x_train, backoff=t0)
t2 = nltk.BigramTagger(x_train, backoff=t1)
print(t2.evaluate(x_test))
# 0.8627529153792485

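As the results above show, part-of-speech tagging can be done well enough using statistics alone, without any linguistic knowledge.
For Chinese, an N-gram tagger can be trained by the same procedure as long as a tagged corpus is available.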

4. Going further

nltk.tag.BrillTagger implements transformation-based tagging: starting from the output of a base tagger, it applies rule-based corrections to reach higher accuracy.

import nltk
import nltk.tag.brill
import nltk.tag.brill_trainer
from nltk.corpus import brown

pattern = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*\'s$', 'NN$'),
    (r'.*s$', 'NNS'),
    (r'.*', 'NN'),  # anything unmatched is still tagged NN
]

# Split the data set
brown_tagged_sents = brown.tagged_sents(categories=['news'])
train_num = int(len(brown_tagged_sents) * 0.9)
x_train = brown_tagged_sents[:train_num]
x_test = brown_tagged_sents[train_num:]

# Base tagger: unigram with a regular-expression backoff
baseline_tagger = nltk.UnigramTagger(x_train, backoff=nltk.RegexpTagger(pattern))
tt = nltk.tag.brill_trainer.BrillTaggerTrainer(baseline_tagger, nltk.tag.brill.brill24())
brill_tagger = tt.train(x_train, max_rules=20, min_acc=0.99)

# Evaluate
print(brill_tagger.evaluate(x_test))
# 0.8683344961626632

# Compare the gold tags with the predicted tags for one sentence
brown_sents = brown.sents(categories="news")
print(brown_tagged_sents[2007])
print(brill_tagger.tag(brown_sents[2007]))
# gold:      [('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
# predicted: [('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
# The only difference is 'so': CS in the gold data, QL in the prediction.
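What a learned transformation rule does can be shown with a small plain-Python sketch. This is an illustration under stated assumptions, not NLTK's BrillTagger: the Rule class, the apply_rules helper, the baseline output and the single rule are all hypothetical. The rule has the form "change tag A to tag B when the preceding tag is C".

```python
from typing import List, NamedTuple, Tuple

class Rule(NamedTuple):
    """Change `old` to `new` when the preceding tag is `prev_tag`."""
    old: str
    new: str
    prev_tag: str

def apply_rules(tagged: List[Tuple[str, str]], rules: List[Rule]):
    """Apply each rule in order to the output of a base tagger."""
    tags = [tag for _, tag in tagged]
    for rule in rules:
        # Find every position where the rule fires, then rewrite them together.
        hits = [i for i in range(1, len(tags))
                if tags[i] == rule.old and tags[i - 1] == rule.prev_tag]
        for i in hits:
            tags[i] = rule.new
    return [(word, tag) for (word, _), tag in zip(tagged, tags)]

# Hypothetical base-tagger output with one error, and one hypothetical rule.
baseline = [("to", "TO"), ("increase", "NN"), ("grants", "NNS")]
rules = [Rule(old="NN", new="VB", prev_tag="TO")]
print(apply_rules(baseline, rules))
```

Here "increase" is corrected from NN to VB because it follows a word tagged TO. Training a Brill tagger amounts to searching for the rules of this kind that fix the most errors in the training data.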

5. Training a Chinese tagger

Below, a Chinese part-of-speech tagger is trained with a Unigram tagger, using the tagged People's Daily corpus for January 1998, which can be downloaded online.

import nltk

# Each line of the file holds word/tag tokens, for example:
#   19980101-01-001-001/m 邁向/v 充滿/v 希望/n 的/u 新/a 世紀/n ——/w 一九九八年/t 新年/t 講話/n （/w 附/v 圖片/n １/m 張/q ）/w
# nltk.str2tuple() turns each token into a (word, TAG) pair with the tag upper-cased:
#   [('19980101-01-001-001', 'M'), ('邁向', 'V'), ('充滿', 'V'), ('希望', 'N'), ('的', 'U'), ('新', 'A'), ('世紀', 'N'), ('——', 'W'), ('一九九八年', 'T'), ('新年', 'T'), ('講話', 'N'), ('（', 'W'), ('附', 'V'), ('圖片', 'N'), ('１', 'M'), ('張', 'Q'), ('）', 'W')]
with open('./詞性標注人民日報199801.txt', encoding='utf-8') as f:
    lines = f.readlines()

all_tagged_sents = []
for line in lines:
    sent = line.split()
    tagged_sent = []
    for item in sent:
        pair = nltk.str2tuple(item)
        tagged_sent.append(pair)
    if len(tagged_sent) > 0:  # skip empty lines
        all_tagged_sents.append(tagged_sent)

# Use 80% of the sentences for training and the rest for testing
train_size = int(len(all_tagged_sents) * 0.8)
x_train = all_tagged_sents[:train_size]
x_test = all_tagged_sents[train_size:]

# The reported score was obtained with the default tag 'n'. Note that the gold
# tags are upper-case after str2tuple(), so 'N' is the tag that would match them.
tagger = nltk.UnigramTagger(train=x_train, backoff=nltk.DefaultTagger('n'))
print(tagger.evaluate(x_test))
# 0.8714095491725319
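The parsing step does not depend on the corpus file and can be tried on a single line. The sketch below is a plain-Python stand-in for the word/tag splitting; str_to_pair and parse_line are hypothetical helper names, and the sample line is a shortened piece of the example shown above.

```python
def str_to_pair(token, sep="/"):
    """Split 'word/tag' at the last separator and upper-case the tag."""
    word, found, tag = token.rpartition(sep)
    if not found:  # no separator: the whole token is the word
        return (token, None)
    return (word, tag.upper())

def parse_line(line):
    """Turn one whitespace-separated corpus line into (word, TAG) pairs."""
    return [str_to_pair(item) for item in line.split()]

line = "邁向/v 充滿/v 希望/n 的/u 新/a"
print(parse_line(line))
```

Splitting at the last separator keeps tokens such as "1/2/m" intact as the word "1/2" with tag "M". The upper-casing is the reason a lower-case default tag never matches the parsed gold tags.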

6. Brown corpus methods

# List of corpus file names
brown.fileids()
# ['ca01', 'ca02', 'ca03', 'ca04', ..., 'cr06', 'cr07', 'cr08', 'cr09']

# File names in the given category ('news')
brown.fileids('news')
# ['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', ..., 'ca41', 'ca42', 'ca43', 'ca44']

# Raw text of the given category
brown.raw(categories=['news'])
# "\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.\n\n\n\tThe/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns ..."

# Raw text of the given files
brown.raw(fileids=['ca01', 'ca02'])
# "\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.\n\n\n\t ... Research/nn projects/nns as/ql soon/rb as/cs possible/jj on/in the/at causes/nns and/cc prevention/nn of/in dependency/nn and/cc illegitimacy/nn ./.\n\n"

# Sentences of the given files
brown.sents(fileids=['ca01', 'ca02'])
# [['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', ...], ...]

# Sentences of the given category
brown.sents(categories=['news'])
# [['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', ...], ...]

# Words of the given file
brown.words('ca01')
# ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

# Words of the given category
brown.words(categories=['news'])
# ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

# Tagged sentences of the given category, as a list of lists of (word, tag) pairs
brown.tagged_sents(categories=['news'])
# [[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ...], ...]
