
Sentiment Analysis and the Dataset

Published: 2023/11/28


Natural Language Processing: Applications

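Fig. 1 outlines the basic ideas behind designing natural language processing models with different types of deep learning architectures, such as MLPs, CNNs, RNNs, and attention. Although Fig. 1 allows any pretrained text representation to be combined with any architecture for any downstream natural language processing task, we select a few representative combinations. Specifically, we explore popular architectures based on RNNs and CNNs for sentiment analysis. For natural language inference, we choose attention and MLPs to demonstrate how to analyze text pairs. Finally, we show how to fine-tune a pretrained BERT model at the sequence level (single text classification and text pair classification) and at the token level (text tagging and question answering). As a concrete empirical case, we fine-tune BERT for natural language inference.

BERT requires only minimal architecture changes for a wide range of natural language processing applications. However, this benefit comes at the cost of fine-tuning a large number of BERT parameters for each downstream application. When space or time is limited, models built from MLPs, CNNs, RNNs, and attention are more feasible. In the following, we start with the sentiment analysis application and describe model designs based on RNNs and CNNs, respectively.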

Fig. 1 Pretrained text representations can be fed to various deep learning architectures for different downstream natural language processing applications. This chapter focuses on how to design models for different downstream natural language processing applications.

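Text classification is a common task in natural language processing. It maps a text sequence of indefinite length to a category. It is similar to image classification, the most frequently used application in this book. The only difference is that a text classification example is a text sentence rather than an image.

This section focuses on loading data for one subproblem in this field: using text sentiment classification to analyze the emotions of the text's author. This problem is also called sentiment analysis and has a wide range of applications. For example, we can analyze user reviews of products to obtain user satisfaction statistics, or analyze user sentiment about market conditions and use it to predict future trends.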

from d2l import mxnet as d2l
from mxnet import gluon, np, npx
import os

npx.set_np()

The Sentiment Analysis Dataset

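We use Stanford's Large Movie Review Dataset as the dataset for sentiment analysis. It is split into a training set and a test set, each containing 25,000 movie reviews downloaded from IMDb. In each set, the number of reviews labeled "positive" equals the number labeled "negative".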

1.1. Reading the Dataset

First, download this dataset to the path "…/data" and extract it to "…/data/aclImdb".

#@save
d2l.DATA_HUB['aclImdb'] = (
    'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
    '01ada507287d82875905620988597833ad4e0903')

data_dir = d2l.download_extract('aclImdb', 'aclImdb')

Downloading …/data/aclImdb_v1.tar.gz from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz…

Next, read the training and test datasets. Each example is a review together with its label: 1 for "positive" and 0 for "negative".

#@save
def read_imdb(data_dir, is_train):
    """Read the IMDb reviews and their labels from the extracted folder."""
    data, labels = [], []
    for label in ('pos', 'neg'):
        # Reviews live in <data_dir>/<train|test>/<pos|neg>/
        folder_name = os.path.join(data_dir, 'train' if is_train else 'test',
                                   label)
        for file in os.listdir(folder_name):
            with open(os.path.join(folder_name, file), 'rb') as f:
                review = f.read().decode('utf-8').replace('\n', '')
                data.append(review)
                labels.append(1 if label == 'pos' else 0)
    return data, labels

train_data = read_imdb(data_dir, is_train=True)
print('# trainings:', len(train_data[0]))
for x, y in zip(train_data[0][:3], train_data[1][:3]):
    print('label:', y, 'review:', x[0:60])

# trainings: 25000

label: 1 review: Normally the best way to annoy me in a film is to include so

label: 1 review: The Bible teaches us that the love of money is the root of a

label: 1 review: Being someone who lists Night of the Living Dead at number t
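The folder layout that read_imdb walks can be tried out without downloading anything. The following is a minimal sketch using only the standard library: it builds a tiny made-up dataset in a temporary directory with the same train/pos and train/neg layout and reads it back. The function name read_reviews, the file names, and the review texts are all hypothetical and exist only for illustration.

```python
import os
import tempfile

def read_reviews(data_dir, split):
    """Read reviews from data_dir/<split>/{pos,neg}; 1 = positive, 0 = negative."""
    data, labels = [], []
    for label in ('pos', 'neg'):
        folder = os.path.join(data_dir, split, label)
        # Sort file names so the result does not depend on directory order
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding='utf-8') as f:
                data.append(f.read().replace('\n', ''))
            labels.append(1 if label == 'pos' else 0)
    return data, labels

# Build a tiny, made-up dataset with the same folder layout (hypothetical files)
toy_dir = tempfile.mkdtemp()
toy_files = {
    ('train', 'pos', '0_9.txt'): 'A wonderful,\nmoving film.',
    ('train', 'neg', '0_2.txt'): 'Dull and far too long.',
}
for (split, label, name), text in toy_files.items():
    folder = os.path.join(toy_dir, split, label)
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, name), 'w', encoding='utf-8') as f:
        f.write(text)

toy_data, toy_labels = read_reviews(toy_dir, 'train')
print(toy_data, toy_labels)
```

Because the 'pos' folder is read before the 'neg' folder, all positive reviews come first in the returned lists, which is why the first three real examples printed above all carry label 1.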

1.2. Tokenization and Vocabulary

We treat each word as a token and then build a vocabulary from the training dataset.

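train_tokens = d2l.tokenize(train_data[0], token='word')
# Keep only words that appear at least 5 times; reserve '<pad>' for padding
vocab = d2l.Vocab(train_tokens, min_freq=5, reserved_tokens=['<pad>'])

# Histogram of review lengths (number of tokens per review)
d2l.set_figsize((3.5, 2.5))
d2l.plt.hist([len(line) for line in train_tokens], bins=range(0, 1000, 50));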
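To make the two steps above concrete, here is a rough pure-Python sketch of word-level tokenization and frequency-filtered vocabulary building. It assumes that tokens are obtained by splitting on whitespace, and that index 0 is the unknown token followed by any reserved tokens. The helper names tokenize_words and build_vocab are hypothetical, and the index ordering is an assumption of this sketch rather than a description of d2l.Vocab internals.

```python
import collections

def tokenize_words(lines):
    """Split each line into word tokens on whitespace."""
    return [line.split() for line in lines]

def build_vocab(tokens, min_freq=1, reserved_tokens=()):
    """Map tokens to indices: '<unk>' first, then reserved tokens, then the
    remaining tokens in order of decreasing frequency (ties broken by name)."""
    counter = collections.Counter(tok for line in tokens for tok in line)
    idx_to_token = ['<unk>'] + list(reserved_tokens)
    for tok, freq in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        # Drop rare tokens; they will be looked up as '<unk>'
        if freq >= min_freq and tok not in idx_to_token:
            idx_to_token.append(tok)
    return {tok: i for i, tok in enumerate(idx_to_token)}

toy_reviews = ['a great great film', 'a dull film']
toy_tokens = tokenize_words(toy_reviews)
toy_vocab = build_vocab(toy_tokens, min_freq=2, reserved_tokens=['<pad>'])

# Words below min_freq (here 'dull') fall back to the '<unk>' index 0
toy_indices = [[toy_vocab.get(tok, 0) for tok in line] for line in toy_tokens]
print(toy_vocab)
print(toy_indices)
```

With min_freq=2 the word "dull", which occurs once, is excluded from the vocabulary, mirroring how min_freq=5 filters rare words in the real dataset.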

1.3. Padding to the Same Length

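Because the reviews have different lengths, they cannot be combined directly into minibatches. Here we fix the length of each review to 500 by truncating it or padding it with the "<pad>" index.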

num_steps = 500  # sequence length
train_features = np.array([d2l.truncate_pad(
    vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
train_features.shape

(25000, 500)
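The truncate-or-pad step can be sketched in a few lines of plain Python. This is an illustrative stand-in written for this explanation, assuming padding is appended on the right and truncation keeps the leading tokens; it is not claimed to be the library's exact implementation.

```python
def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad a list of token indices to exactly num_steps entries."""
    if len(line) > num_steps:
        return line[:num_steps]  # keep only the first num_steps tokens
    # Append the padding index until the sequence reaches num_steps
    return line + [padding_token] * (num_steps - len(line))

long_review = truncate_pad([5, 6, 7, 8, 9], 3, 1)
short_review = truncate_pad([5, 6], 4, 1)
print(long_review)   # [5, 6, 7]
print(short_review)  # [5, 6, 1, 1]
```

After this step every review has the same length, so the list of reviews can be stacked into a single array of shape (number of reviews, num_steps).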

1.4. Creating the Data Iterator

Now we create a data iterator. Each iteration returns a minibatch of data.

train_iter = d2l.load_array((train_features, train_data[1]), 64)

# Inspect the shape of the first minibatch only
for X, y in train_iter:
    print('X', X.shape, 'y', y.shape)
    break
'# batches:', len(train_iter)

X (64, 500) y (64,)

('# batches:', 391)
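The batch count of 391 follows from the dataset size and the batch size. As a quick check, assuming the final incomplete minibatch is kept rather than dropped (which is consistent with the count reported above):

```python
import math

num_examples, batch_size = 25000, 64

# 25000 / 64 = 390.625, so 390 full batches plus one partial batch
num_batches = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (num_batches - 1) * batch_size
print(num_batches, last_batch_size)  # 391 40
```

So the last minibatch holds only 40 reviews instead of 64.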

Putting All Things Together

Finally, we save a function load_data_imdb in d2l that returns the data iterators and the vocabulary.

#@save
def load_data_imdb(batch_size, num_steps=500):
    """Return the data iterators and the vocabulary of the IMDb dataset."""
    data_dir = d2l.download_extract('aclImdb', 'aclImdb')
    train_data = read_imdb(data_dir, True)
    test_data = read_imdb(data_dir, False)
    train_tokens = d2l.tokenize(train_data[0], token='word')
    test_tokens = d2l.tokenize(test_data[0], token='word')
    # The vocabulary is built from the training set only
    vocab = d2l.Vocab(train_tokens, min_freq=5)
    # Note: this function pads with the unknown-token index (vocab.unk),
    # not with a dedicated '<pad>' token as in Section 1.3
    train_features = np.array([d2l.truncate_pad(
        vocab[line], num_steps, vocab.unk) for line in train_tokens])
    test_features = np.array([d2l.truncate_pad(
        vocab[line], num_steps, vocab.unk) for line in test_tokens])
    train_iter = d2l.load_array((train_features, train_data[1]), batch_size)
    test_iter = d2l.load_array((test_features, test_data[1]), batch_size,
                               is_train=False)
    return train_iter, test_iter, vocab

Summary

· Text classification can classify a text sequence into a category.

· To classify text sentiment, we load the IMDb dataset and tokenize its words. Then we truncate long reviews or pad short ones to a fixed length and create a data iterator.
