Feature Engineering with NLP

Machine Learning, Natural Language Processing

Now that we have processed the data and removed unwanted noise from our text in the last story, it is time to turn that text into something a model can learn from. Machine learning algorithms learn from a pre-defined set of features in the training data in order to produce output for the test data. The main problem in language processing is that machine learning algorithms cannot work on raw text directly, so we need feature extraction techniques to convert text into a matrix (or vector) of features.

This article will take you through different methods of transforming text into features and using them in both machine learning and deep learning. Feature extraction is mainly focused on two methods:

  • Bag-of-Words & TF-IDF
  • Word embedding

Bag-of-Words (BoW)

BoW is one of the simplest ways to convert tokens into features. The BoW model is widely used in document classification, where the frequency of each word serves as a feature for training the classifier.

BoW creates a matrix of features: each column corresponds to a word (feature) and each row to a sentence, with cell values giving how many times that word occurs in that sentence. Here, I took two sentences, removed some noise from them, preprocessed them as discussed earlier, and applied BoW, which produced a count matrix like the small sketch shown after the code below.

Here is sample code for Bag-of-Words in Python. Scikit-learn, one of the most popular machine learning libraries, provides an API for feature extraction that makes this simple. The max_features argument limits the vocabulary to the given number of most frequent terms, i.e., the total number of features used to train the model.

# Creating the Bag-of-Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=3250)
X = cv.fit_transform(corpus)   # corpus: a list of preprocessed documents; X is the count matrix
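
As a minimal sketch (with two hypothetical, already-preprocessed sentences standing in for the illustration above), the fitted vectorizer exposes the learned vocabulary and the count matrix:

from sklearn.feature_extraction.text import CountVectorizer
corpus = ["deep learning great", "machine learning great fun"]   # hypothetical toy corpus
cv = CountVectorizer()
X = cv.fit_transform(corpus)
print(cv.get_feature_names_out())   # ['deep' 'fun' 'great' 'learning' 'machine']
print(X.toarray())                  # [[1 0 1 1 0]
                                    #  [0 1 1 1 1]]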

A major drawback of BoW is that every word is weighted purely by its raw count (and the order of occurrence of words in a sentence is lost), so very frequent but uninformative words can dominate the representation. The weighting problem is addressed by TF-IDF.

TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency. BoW might underweight or drop important words simply because they occur rarely, but that's not the case with TF-IDF. As the name suggests, TF-IDF is composed of two parts:

  • Term Frequency: the relative frequency (empirical probability) of a given word in a document, i.e.

tf(w) = (frequency of word w in document d) / (total number of words in document d)

A different technique for calculating tf is log normalization:

tf(w) = 1 + log(f), where f is the frequency of word w in the document

  • Inverse Document Frequency: a measure of how rare a given word is across the corpus. The fewer documents a word appears in, the higher its IDF. It is given by the formula:

idf(w) = log(N / f), where N is the total number of documents and f is the number of documents containing word w

Lastly, combining both formulas gives the TF-IDF score:

tf-idf(w) = tf(w) * idf(w)
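
As a quick worked example (with hypothetical numbers and a base-10 logarithm): a word that occurs 3 times in a 100-word document has tf = 3/100 = 0.03; if it appears in 10 of 1,000 documents, idf = log(1000/10) = 2, so tf-idf = 0.03 * 2 = 0.06. A word that appears in every document gets idf = log(1000/1000) = 0, and therefore a tf-idf of 0 no matter how frequent it is.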

sentences = ["Hello! How are you all? This is Deep, an AI enthusiast. AI is a great thing to explore and can make computer smarter."]

# import library
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(max_features=3250, stop_words='english')
tf.fit(sentences)
tf.vocabulary_

# Output:
{'hello': 6,
 'deep': 2,
 'ai': 0,
 'enthusiast': 3,
 'great': 5,
 'thing': 9,
 'explore': 4,
 'make': 7,
 'computer': 1,
 'smarter': 8}
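
As a small follow-up sketch (using the same tf vectorizer fitted above), transform() returns the TF-IDF weighted document-term matrix rather than just the learned vocabulary; fit_transform() combines both steps in one call.

X = tf.transform(sentences)   # sparse matrix of TF-IDF weights
print(X.shape)                # (1, 10): one document, ten features
print(X.toarray())            # the TF-IDF weight of each feature in the document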

Word Embedding


Word embedding represents each word as a vector in a continuous vector space, where similar words have similar vectors. This technique preserves the similarity, or relationship, between related words.
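
To make "similar vectors" concrete, here is a minimal sketch (with made-up 4-dimensional vectors; real embeddings typically have hundreds of dimensions) of the cosine similarity that embedding models use to compare words:

import numpy as np
# hypothetical embeddings for two related words
v_king = np.array([0.5, 0.1, 0.8, 0.3])
v_queen = np.array([0.45, 0.2, 0.75, 0.35])
cos = np.dot(v_king, v_queen) / (np.linalg.norm(v_king) * np.linalg.norm(v_queen))
print(cos)   # roughly 0.99 -- close to 1.0, i.e. the two words are treated as similar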

Word embedding methods learn a real-valued vector representation for a predefined, fixed-size vocabulary from a corpus of text. Two models commonly used to create word embeddings are:

  • Word2vec
  • GloVe

Word2Vec

Word2Vec is a popular statistical method developed at Google for learning word embeddings efficiently from a text corpus. It uses a shallow, two-layer neural network rather than a deep one, which keeps training fast. It first constructs a vocabulary from the training corpus and then learns the embedding representation of each word. Two different approaches are used in the word2vec technique:

  • Continuous Bag-of-Words (CBOW) model: the CBOW model learns the embedding by predicting the current word from its surrounding context.
  • Continuous Skip-Gram model: the skip-gram model learns by predicting the surrounding words given the current word.

Here, we are using Gensim, a library that provides implementations of algorithms such as word2vec for NLP.

# pip install gensim, if you haven't installed it yet
from gensim.test.utils import common_texts   # tiny toy corpus bundled with gensim
from gensim.models import Word2Vec
# CBOW model (sg=0 is the default); note: in gensim >= 4.0 the argument is vector_size instead of size
cbow_model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
print("Cosine similarity between 'W1' and 'W2' - CBOW : ", cbow_model.wv.similarity('W1', 'W2'))
# Skip-gram model (sg=1); 'W1' and 'W2' stand for two words present in the training vocabulary
sg_model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4, sg=1)
print("Cosine similarity between 'W1' and 'W2' - Skip Gram : ", sg_model.wv.similarity('W1', 'W2'))

GloVe

Global Vectors, or GloVe, is an unsupervised algorithm and an extension of the word2vec method for learning word vectors efficiently, developed by Pennington et al. at Stanford.

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors
glove_input_file = 'glove.txt'
word2vec_output_file = 'word2vec.txt'
glove2word2vec(glove_input_file, word2vec_output_file)   # convert GloVe format to word2vec format
# load the Stanford GloVe model
model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
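
With the pre-trained Stanford GloVe vectors loaded this way, the nearest neighbour returned for this analogy is typically ('queen', ...), the classic king - man + woman ≈ queen result. (The file names glove.txt and word2vec.txt above are placeholders for whichever GloVe file you downloaded and the converted output path.)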

Here, I haven't gone into much detail on GloVe and Word2vec. I will try to discuss these topics in detail in further articles.

Thank you for reading!


Translated from: https://medium.com/towards-artificial-intelligence/feature-engineering-with-nlp-23a97e933e8e
