NLP Principles and Fundamentals

Published: 2023/12/15


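Explaining the principles of natural language processing, using NLTK as the basis

http://www.nltk.org/
A well-known natural language processing library for Python
Ships with corpora and part-of-speech tagging resources
Includes classification, tokenization, and other functionality
Strong community support
Many simplified wrappers are also available, such as TextBlob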

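Installing NLTK

# Mac/Unix
sudo pip install -U nltk
# Optionally install NumPy as well
sudo pip install -U numpy
# Check that the installation succeeded: start python from the shell, then import nltk
python
>>> import nltk

Installing the corpora

import nltk
nltk.download()

Downloading this way can be slow. Alternatively, look up the package links at https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml and fetch them with a download manager such as Xunlei.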

Feature overview

NLTK's built-in corpora

>>> from nltk.corpus import brown
>>> brown.categories()  # categories
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> len(brown.sents())  # total number of sentences
57340
>>> len(brown.words())  # total number of words
1161192

Text processing pipeline

Text -> preprocessing (tokenization, stopword removal) -> feature engineering -> machine learning algorithm -> label
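To make the first stages of this pipeline concrete, here is a minimal sketch in plain Python. It is not NLTK code: the tiny `STOPWORDS` set and the helper names `preprocess` and `extract_features` are assumptions made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list (a real one is much longer).
STOPWORDS = {"what", "does", "the"}

def preprocess(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOPWORDS]

def extract_features(tokens):
    """Turn a token list into a word -> count mapping (bag of words)."""
    return dict(Counter(tokens))

features = extract_features(preprocess("What does the beautiful fox say"))
print(features)
```

The resulting dictionary is the kind of feature representation that a machine learning algorithm would then map to a label.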

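Tokenization

Split a long sentence into smaller, meaningful units.

>>> import nltk
>>> sentence = "hello, world"
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['hello', ',', 'world']

Differences between English and Chinese NLP:
English can largely be tokenized on whitespace, while Chinese has no spaces between words and needs a dedicated segmentation method:

Chinese word segmentation: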

import jieba

seg_list = jieba.cut('我來到北京清華大學', cut_all=True)
print('Full Mode:', '/'.join(seg_list))  # full mode

seg_list = jieba.cut('我來到北京清華大學', cut_all=False)
print('Default Mode:', '/'.join(seg_list))  # accurate mode

seg_list = jieba.cut('他來到了網易杭研大廈')  # accurate mode is the default
print('/'.join(seg_list))

seg_list = jieba.cut_for_search('小明碩士畢業于中國科學院計算所，后在日本京都大學深造')  # search engine mode
print('Search Engine Mode:', '/'.join(seg_list))

seg_list = jieba.cut('小明碩士畢業于中國科學院計算所，后在日本京都大學深造', cut_all=True)
print('Full Mode:', '/'.join(seg_list))

Full Mode: 我/來到/北京/清華/清華大學/華大/大學
Default Mode: 我/來到/北京/清華大學
他/來到/了/網易/杭研/大廈 (jieba can discover new words: "杭研" is not in the dictionary, but the Viterbi algorithm still recognized it)
Search Engine Mode: 小明/碩士/畢業/于/中國/科學/學院/科學院/中國科學院/計算/計算所/，/后/在/日本/京都/大學/日本京都大學/深造
Full Mode: 小/明/碩士/畢業/于/中國/中國科學院/科學/科學院/學院/計算/計算所///后/在/日本/日本京都大學/京都/京都大學/大學/深造
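To see why whitespace-based tokenization is not enough for Chinese, the following sketch compares the two languages using only the standard library. The helper `simple_english_tokenize` is a hypothetical stand-in written for this example; it is not part of NLTK or jieba.

```python
import re

def simple_english_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

english = "hello, world"
chinese = "我來到北京清華大學"

english_tokens = simple_english_tokenize(english)
# The Chinese sentence contains no spaces, so splitting on whitespace
# returns the whole sentence as one "token".
chinese_tokens = chinese.split()

print(english_tokens)
print(chinese_tokens)
```

The English sentence falls apart into words and punctuation, while the Chinese sentence comes back unsplit, which is the gap a dedicated segmenter fills.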

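Other Chinese word segmentation tools: CoreNLP, written in Java, which also offers named entity recognition, part-of-speech tagging, word stemming, parse-tree construction, and coreference resolution

Text from social networks contains many expressions that follow neither grammar nor ordinary logic:
@mentions, emoticons, URLs, #hashtags, and so on

For example, a tweet from Twitter:
RT @angelababy: love you baby! :D http://ah.love #168cm

If we tokenize it directly: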

from nltk.tokenize import word_tokenize

tweet = 'RT @angelababy: love you baby! :D http://ah.love #168cm'
print(word_tokenize(tweet))

['RT', '@', 'angelababy', ':', 'love', 'you', 'baby', '!', ':', 'D', 'http', ':', '//ah.love', '#', '168cm']

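We need regular expressions so that emoticons, URLs, hashtags, @mentions, and the like are each kept as a single token.
Regex reference table: http://www.regexlab.com/zh/regref.htm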

import re

emoticons_str = r"""
    (?:
        [:=;]                # eyes
        [oO\-]?              # nose
        [D\)\]\(\]/\\OpP]    # mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>',  # HTML tags
    r'(?:@[\w_]+)',  # @mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w]+)",  # hashtags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',  # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])",  # words containing - and '
    r'(?:[\w_]+)',  # other words
    r'(?:\S)'  # anything else
]

tokens_re = re.compile(r'(' + '|'.join(regex_str) + ')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^' + emoticons_str + '$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        # Lowercase everything except emoticons such as :D
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens

tweet = 'RT @angelababy: love you baby! :D http://ah.love #168cm'
print(preprocess(tweet))

['RT', '@angelababy', ':', 'love', 'you', 'baby', '!', ':D', 'http://ah.love', '#168cm']

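The many forms a word can take

Inflection: walk => walking => walked; the part of speech does not change
Derivation: nation (noun) => national (adjective) => nationalize (verb); the part of speech changes

Word form normalization

Stemming: chop off the inflectional endings that do not affect the part of speech (the NLTK stemmers used here do this with suffix-stripping rules)
- walking: strip "ing" => walk
- walked: strip "ed" => walk
Lemmatization: map the different inflected forms of a word to a single form (using WordNet)
- went => go
- are => be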
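The difference between the two approaches can be sketched with a toy example. The suffix list and lookup table below are assumptions invented for illustration; real stemmers use far richer rule sets and real lemmatizers consult a lexical database such as WordNet.

```python
# Hypothetical toy rules and lookup table, for illustration only.
SUFFIXES = ("ing", "ed", "s")
LEMMA_TABLE = {"went": "go", "are": "be", "is": "be"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

stems = [toy_stem(w) for w in ["walking", "walked", "went"]]
lemmas = [toy_lemmatize(w) for w in ["went", "are", "walk"]]
print(stems)
print(lemmas)
```

Suffix stripping handles regular endings such as "ing" and "ed" but leaves an irregular form like "went" untouched, whereas a lookup-based approach can map it to "go".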

Stemming with NLTK

from nltk.stem.lancaster import LancasterStemmer

lancaster_stemmer = LancasterStemmer()
print(lancaster_stemmer.stem('maximum'))
print(lancaster_stemmer.stem('multiply'))
print(lancaster_stemmer.stem('provision'))
print(lancaster_stemmer.stem('went'))
print(lancaster_stemmer.stem('wenting'))
print(lancaster_stemmer.stem('walked'))
print(lancaster_stemmer.stem('national'))

maxim
multiply
provid
went
went
walk
nat

from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('maximum'))
print(porter_stemmer.stem('multiply'))
print(porter_stemmer.stem('provision'))
print(porter_stemmer.stem('went'))
print(porter_stemmer.stem('wenting'))
print(porter_stemmer.stem('walked'))
print(porter_stemmer.stem('national'))

maximum
multipli
provis
went
went
walk
nation

from nltk.stem import SnowballStemmer

snowball_stemmer = SnowballStemmer("english")
print(snowball_stemmer.stem('maximum'))
print(snowball_stemmer.stem('multiply'))
print(snowball_stemmer.stem('provision'))
print(snowball_stemmer.stem('went'))
print(snowball_stemmer.stem('wenting'))
print(snowball_stemmer.stem('walked'))
print(snowball_stemmer.stem('national'))

maximum
multipli
provis
went
went
walk
nation

Lemmatization with NLTK

from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize('dogs'))
print(wordnet_lemmatizer.lemmatize('churches'))
print(wordnet_lemmatizer.lemmatize('aardwolves'))
print(wordnet_lemmatizer.lemmatize('abaci'))
print(wordnet_lemmatizer.lemmatize('hardrock'))

dog
church
aardwolf
abacus
hardrock

Problem: "went" can be a verb (the past tense of "go") or a noun (the English surname Went).
Adding part-of-speech information therefore helps NLTK lemmatize correctly.

from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
# Without a POS tag, the word is treated as a noun by default
print(wordnet_lemmatizer.lemmatize('are'))
print(wordnet_lemmatizer.lemmatize('is'))
# With a POS tag
print(wordnet_lemmatizer.lemmatize('is', pos='v'))
print(wordnet_lemmatizer.lemmatize('are', pos='v'))

are
is
be
be

POS tagging with NLTK

import nltk

text = nltk.word_tokenize('what does the beautiful fox say')
print(text)
print(nltk.pos_tag(text))

['what', 'does', 'the', 'beautiful', 'fox', 'say']
[('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('beautiful', 'JJ'), ('fox', 'NNS'), ('say', 'VBP')]

POS tag reference table

CC Coordinating conjunction

CD Cardinal number

DT Determiner

EX Existential there

FW Foreign word

IN Preposition or subordinating conjunction

JJ Adjective

JJR Adjective, comparative

JJS Adjective, superlative

LS List item marker

MD Modal

NN Noun, singular or mass

NNS Noun, plural

NNP Proper noun, singular

NNPS Proper noun, plural

PDT Predeterminer

POS Possessive ending

PRP Personal pronoun

PRP$ Possessive pronoun

RB Adverb

RBR Adverb, comparative

RBS Adverb, superlative

RP Particle

SYM Symbol

TO to

UH Interjection

VB Verb, base form

VBD Verb, past tense

VBG Verb, gerund or present participle

VBN Verb, past participle

VBP Verb, non-3rd person singular present

VBZ Verb, 3rd person singular present

WDT Wh-determiner

WP Wh-pronoun

WP$ Possessive wh-pronoun

WRB Wh-adverb

Stopwords

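A thousand occurrences of "he" refer to a thousand different people, and a thousand occurrences of "the" point to a thousand different things.
For applications that focus on understanding the meaning of a text, such words are too ambiguous to be useful.

English stopword list: https://www.ranks.nl/stopwords
NLTK also ships with a stopword list

import nltk
from nltk.corpus import stopwords

word_list = nltk.word_tokenize('what does the beautiful fox say')
print(word_list)
# Build the stopword set once instead of reloading the list for every word
english_stopwords = set(stopwords.words('english'))
filter_words = [word for word in word_list if word not in english_stopwords]
print(filter_words)

['what', 'does', 'the', 'beautiful', 'fox', 'say']
['beautiful', 'fox', 'say']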

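A typical text preprocessing pipeline

The exact steps depend on the task. For duplicate-text detection or writing-style analysis, for example, you may not want to remove stopwords.

What is natural language processing?

Natural language -> data a computer can work with

What does text preprocessing give us?

Classic NLP applications of NLTK

Sentiment analysis
Text similarity
Text classification

Application: sentiment analysis

The simplest method is based on a sentiment dictionary.
It works like a keyword scoring scheme: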

like 1
good 2
bad -2
terrible -3

For example: AFINN-111
http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

snowball_stemmer = SnowballStemmer("english")

# Load the AFINN-111 lexicon: one "word<TAB>score" entry per line
sentiment_dictionary = {}
with open('AFINN-111.txt') as lexicon:
    for line in lexicon:
        word, score = line.split('\t')
        sentiment_dictionary[word] = int(score)

text = 'I went to Chicago yesterday, what a fucking day!'
word_list = nltk.word_tokenize(text)  # tokenize
# Stemming is shown for reference only; lemmatization works best with POS tags and is skipped here.
# The stemmed list is not used below, so scoring runs on the original tokens.
stemmed_words = [snowball_stemmer.stem(word) for word in word_list]
english_stopwords = stopwords.words('english')
words = [word for word in word_list if word not in english_stopwords]  # remove stopwords
print('Words after preprocessing:', words)
total_score = sum(sentiment_dictionary.get(word, 0) for word in words)
print('Sentiment score of the sentence:', total_score)
if total_score > 0:
    print('positive')
elif total_score == 0:
    print('neutral')
else:
    print('negative')

Words after preprocessing: ['I', 'went', 'Chicago', 'yesterday', ',', 'fucking', 'day', '!']
Sentiment score of the sentence: -4
negative

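Drawbacks: it cannot handle new words, it depends on subjective manual scoring, and it cannot capture the deeper meaning of a sentence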
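The scoring idea and one of its limits can be shown without any external file. The sketch below uses a hypothetical four-word lexicon built from the example scores listed earlier in this section; the function names `sentiment_score` and `label` are made up for illustration.

```python
# Hypothetical mini-lexicon using the example scores from the text.
sentiment_dictionary = {"like": 1, "good": 2, "bad": -2, "terrible": -3}

def sentiment_score(text):
    """Sum the scores of known words; unknown words count as 0."""
    words = text.lower().split()
    return sum(sentiment_dictionary.get(word, 0) for word in words)

def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for sentence in ["I like this good book", "what a terrible bad day", "not good"]:
    score = sentiment_score(sentence)
    print(sentence, score, label(score))
```

The last sentence, "not good", still comes out positive because the lexicon only sees the word "good". This is the kind of meaning a pure dictionary lookup cannot capture.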

Sentiment analysis with ML

from nltk.classify import NaiveBayesClassifier

# A tiny hand-made training set
s1 = 'this is a good book'
s2 = 'this is a awesome book'
s3 = 'this is a bad book'
s4 = 'this is a terrible book'

def preprocess(s):
    dic = ['this', 'is', 'a', 'good', 'book', 'awesome', 'bad', 'terrible']
    # Compare against whole tokens, so that 'a' or 'is' do not match inside other words
    tokens = s.split()
    # Return the bag-of-words representation of the sentence
    return {word: word in tokens for word in dic}

# Put the training set into the standard (features, label) form
training_data = [(preprocess(s1), 'pos'),
                 (preprocess(s2), 'pos'),
                 (preprocess(s3), 'neg'),
                 (preprocess(s4), 'neg')]

# Train the model
model = NaiveBayesClassifier.train(training_data)
# Print the prediction
print(model.classify(preprocess('this is a terrible book')))

neg

Text similarity

Represent text features with Bag of Words term frequencies

Use cosine similarity to measure how similar two vectors are

import nltk
from nltk import FreqDist

corpus = 'this is my sentence ' \
         'this is my life ' \
         'this is the day'

# Preprocess as needed: tokenization, stemming, lemmatization, stopwords, etc.
tokens = nltk.word_tokenize(corpus)
print(tokens)

# Count word frequencies with NLTK's FreqDist
fdist = FreqDist(tokens)
# It behaves like a dict: look up a word to see how often it occurs in the whole text
print(fdist['is'])
# Take the 50 most common words
standard_freq_vector = fdist.most_common(50)
size = len(standard_freq_vector)
print(standard_freq_vector)

# Record the position of each word, ordered by frequency
def position_lookup(v):
    res = {}
    counter = 0
    for word in v:
        res[word[0]] = counter
        counter += 1
    return res

# Record the position of every word in the vocabulary
standard_position_dict = position_lookup(standard_freq_vector)
print(standard_position_dict)

# A new sentence
sentence = 'this is cool'
# Build a vector the same size as the vocabulary
freq_vector = [0] * size
# Simple preprocessing
tokens = nltk.word_tokenize(sentence)
# For each word in the new sentence
for word in tokens:
    try:
        # If the word is in the vocabulary, add 1 at its standard position
        freq_vector[standard_position_dict[word]] += 1
    except KeyError:
        continue
print(freq_vector)

['this', 'is', 'my', 'sentence', 'this', 'is', 'my', 'life', 'this', 'is', 'the', 'day']
3
[('this', 3), ('is', 3), ('my', 2), ('sentence', 1), ('life', 1), ('the', 1), ('day', 1)]
{'this': 0, 'is': 1, 'my': 2, 'sentence': 3, 'life': 4, 'the': 5, 'day': 6}
[1, 1, 0, 0, 0, 0, 0]
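The listing above stops once the frequency vector is built. The comparison step itself is sketched below in plain Python, assuming the same seven-word vocabulary; the helpers `bag_of_words` and `cosine_similarity` are illustrative names, not NLTK functions.

```python
import math
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)

vocabulary = ["this", "is", "my", "sentence", "life", "the", "day"]
v1 = bag_of_words("this is my sentence", vocabulary)
v2 = bag_of_words("this is my life", vocabulary)
v3 = bag_of_words("the day", vocabulary)

print(cosine_similarity(v1, v2))  # three of four words shared
print(cosine_similarity(v1, v3))  # no words shared
```

Sentences that share most of their words score close to 1, and sentences with no words in common score 0.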

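Application: text classification

TF-IDF

TF: Term Frequency measures how frequently a term appears in a document.

$$\mathrm{TF}(t) = \frac{\text{number of times } t \text{ appears in the document}}{\text{total number of terms in the document}}$$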

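IDF: Inverse Document Frequency measures how important a term is.

Some words appear very often but are clearly not very informative, such as 'is', 'the', and 'and'.

$$\mathrm{IDF}(t) = \log_e\left(\frac{\text{total number of documents}}{\text{number of documents containing } t}\right)$$

(The more common a word is, the larger the denominator, and the smaller the inverse document frequency, which approaches 0. In practice 1 is usually added to the denominator so that it cannot be 0, which would otherwise happen when no document contains the term. log means taking the logarithm of the resulting value.)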

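If a word is relatively rare overall but appears many times in a particular article, it probably reflects what that article is about, which makes it exactly the kind of keyword we want.

$$\text{TF-IDF}(t) = \mathrm{TF}(t) \times \mathrm{IDF}(t)$$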
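The formulas can be checked by hand with a few lines of plain Python. This sketch follows the definitions as written (natural logarithm, nothing added to the denominator) and returns an IDF of 0 for terms that occur in no document, which is an assumption made here to avoid dividing by zero. The three toy documents are illustrative.

```python
import math

# Three toy documents, already tokenized on whitespace.
documents = [
    "this is sentence one".split(),
    "this is sentence two".split(),
    "is sentence three".split(),
]

def tf(term, document):
    """Occurrences of the term divided by the document length."""
    return document.count(term) / len(document)

def idf(term, documents):
    """Natural log of (number of documents / documents containing the term)."""
    containing = sum(1 for document in documents if term in document)
    if containing == 0:
        return 0.0  # assumption: unseen terms get an IDF of 0
    return math.log(len(documents) / containing)

def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)

doc = documents[0]
for term in ["this", "is", "one"]:
    print(term, tf_idf(term, doc, documents))
```

"is" occurs in every document, so its IDF and therefore its TF-IDF are 0, while "one" occurs in a single document and receives the highest weight of the three.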

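TF-IDF with NLTK

from nltk.text import TextCollection

# First, put all the documents into a TextCollection.
# Each document is passed as a list of tokens so that terms are counted as whole words.
corpus = TextCollection([
    'this is sentence one'.split(),
    'this is sentence two'.split(),
    ' is sentence three'.split(),
])

# The TF-IDF value can then be computed directly
# (term: a term in the sentence, text: the tokenized sentence)
print(corpus.tf_idf('this', 'this is sentence four'.split()))

# For each new sentence
new_sentence = 'this is sentence five'.split()
# Loop over every word in the vocabulary
standard_vocab = ['this', 'is', 'sentence', 'one', 'two', 'five']
for word in standard_vocab:
    print(corpus.tf_idf(word, new_sentence))

Once we have the TF-IDF vector representation, we can feed it to an ML model for classification.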

Case study: keyword search

Kaggle competition: https://www.kaggle.com/c/home-depot-product-search-relevance

Step 1: Imports

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from nltk.stem.snowball import SnowballStemmer

Read in the training and test sets

df_train = pd.read_csv('../input/train.csv', encoding="ISO-8859-1")
df_test = pd.read_csv('../input/test.csv', encoding="ISO-8859-1")
df_desc = pd.read_csv('../input/product_descriptions.csv')  # product descriptions

Take a look at what the data looks like

df_train.head()

df_desc.head()

# Combine the data so it can be processed together
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)
# Join the product descriptions on product_uid
df_all = pd.merge(df_all, df_desc, how='left', on='product_uid')
df_all.head()

Step 2: Text preprocessing

The preprocessing here is fairly simple, because what matters most is whether the keywords are contained in the text.
So we normalize the text so that every term has only one form throughout the dataset.

stemmer = SnowballStemmer('english')

def str_stemmer(s):
    # Lowercase, split on whitespace, and stem every word
    return " ".join([stemmer.stem(word) for word in s.lower().split()])

def str_common_word(str1, str2):
    # Count how many words of str1 occur (as substrings) in str2
    return sum(int(str2.find(word) >= 0) for word in str1.split())

Next, run every text column through it to clean all the text

# Stem the text columns
df_all['search_term'] = df_all['search_term'].map(str_stemmer)
df_all['product_title'] = df_all['product_title'].map(str_stemmer)
df_all['product_description'] = df_all['product_description'].map(str_stemmer)

Step 3: Hand-crafted text features

# Length of the query, in words
df_all['len_of_query'] = df_all['search_term'].map(lambda x: len(x.split())).astype(np.int64)
# Number of query words that also appear in the title
df_all['commons_in_title'] = df_all.apply(lambda x: str_common_word(x['search_term'], x['product_title']), axis=1)
# Number of query words that also appear in the description
df_all['commons_in_desc'] = df_all.apply(lambda x: str_common_word(x['search_term'], x['product_description']), axis=1)
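To see what these features look like for a single row, here is a small stand-alone sketch. The query and title strings are hypothetical examples of already-stemmed text, and `str_common_word` is repeated so the snippet runs on its own.

```python
def str_common_word(str1, str2):
    """Count how many words of str1 occur as substrings of str2."""
    return sum(int(str2.find(word) >= 0) for word in str1.split())

query = "angl bracket"            # hypothetical stemmed search term
title = "simpson strong-ti angl"  # hypothetical stemmed product title

len_of_query = len(query.split())
commons_in_title = str_common_word(query, title)
print(len_of_query, commons_in_title)

# Matching is substring-based, so a short word can match inside a longer one.
substring_hits = str_common_word("an", title)
print(substring_hits)
```

Because `str.find` matches substrings, "an" counts as a hit inside "angl". That is a limitation to keep in mind when interpreting these counts.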

Drop the columns that the machine learning model cannot handle

df_all = df_all.drop(['search_term','product_title','product_description'],axis=1)

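Step 4: Rebuild the training/test sets

After processing everything together, split the training and test sets apart again

# df_all was built with ignore_index=True, so the training rows come first and the test rows follow.
# Split by position rather than by the original indexes, which overlap between the two frames.
n_train = df_train.shape[0]
df_train = df_all.iloc[:n_train]
df_test = df_all.iloc[n_train:]

Record the test set ids
so that the predictions can be matched up at submission time

test_ids = df_test['id']

Extract y_train

y_train = df_train['relevance'].values

Remove the label from the original sets

X_train = df_train.drop(['id', 'relevance'], axis=1).values
X_test = df_test.drop(['id', 'relevance'], axis=1).values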

Step 5: Build the model

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Use cross-validation for a fair estimate while trying different max_depth values
params = [1, 3, 5, 6, 7, 8, 9, 10]
test_scores = []
for param in params:
    clf = RandomForestRegressor(n_estimators=30, max_depth=param)
    # cross_val_score returns the negated MSE of each fold; convert it to RMSE
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=5, scoring='neg_mean_squared_error'))
    # Store the mean RMSE over the folds for this max_depth
    test_scores.append(np.mean(test_score))
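The `neg_mean_squared_error` scoring reports the mean squared error with its sign flipped, so the loop negates it and takes the square root to get an RMSE per fold. The arithmetic is sketched below in plain Python; the relevance labels and predictions are made-up numbers, not competition data.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical relevance labels and predictions for two folds.
folds = [
    ([3.0, 2.0, 1.0], [2.5, 2.0, 1.5]),
    ([2.0, 3.0], [2.0, 2.0]),
]

# What a negated-MSE scorer would report per fold (always <= 0).
neg_mse_per_fold = [-mean_squared_error(t, p) for t, p in folds]
# Flip the sign back and take the square root to get RMSE per fold.
rmse_per_fold = [math.sqrt(-score) for score in neg_mse_per_fold]
mean_rmse = sum(rmse_per_fold) / len(rmse_per_fold)
print(rmse_per_fold, mean_rmse)
```

Averaging the per-fold RMSE values gives one number per parameter setting, which is what gets plotted against `params`.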

Plot the results:

import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(params, test_scores)
plt.title("Param vs CV Error");

The best result is reached at a max_depth of roughly 6 to 7

Step 6: Submit the results

rf = RandomForestRegressor(n_estimators=30, max_depth=6)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
pd.DataFrame({"id": test_ids, "relevance": y_pred}).to_csv('submission.csv', index=False)

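Summary:
Although this tutorial uses only the simplest methods, the basic framework is complete.
Parts you can try to modify, tune, or upgrade:
Text preprocessing: there are many different ways to make the text data cleaner
Hand-crafted features: think of more ways to express features (number of full-phrase keyword matches, match ratio, and so on)
A better regression model: use the Ensemble methods covered in earlier lessons to push the model as far as it will go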
