
Text-CNN Text Classification with Keras



Text CNN

1. Introduction

TextCNN is an algorithm that applies convolutional neural networks to text classification. It was proposed by Yoon Kim in "Convolutional Neural Networks for Sentence Classification" (2014).

We will implement a model similar to Kim's CNN for sentence classification. The model proposed in that paper achieves strong performance on a range of text classification tasks (such as sentiment analysis) and has become a standard baseline for new text classification architectures.


2. Required Libraries and Dataset

  • tensorflow
  • h5py
  • hdf5
  • keras
  • numpy
  • itertools
  • collections
  • re
  • sklearn 0.19.0
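
These can be installed with pip (itertools, collections, and re ship with the Python standard library, and the hdf5 runtime comes bundled with the h5py wheels). A minimal sketch, assuming a pre-TensorFlow-2 environment to match the era of this code; exact versions beyond the sklearn pin are not critical:

pip install "tensorflow<2" keras h5py numpy scikit-learn==0.19.0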

Download the dataset:

Link: https://pan.baidu.com/s/1oO4pDHeu3xIgkDtkLgQEVA  Password: 6wrv
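
The preprocessing code below expects the two raw files from the archive to sit in a ./data/ directory next to the scripts (this layout is an assumption based on the paths hard-coded in data_helpers.py):

./data/rt-polarity.pos   (the 5,331 positive snippets, one per line)
./data/rt-polarity.neg   (the 5,331 negative snippets, one per line)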

3. Data and Preprocessing

The dataset we use is the Movie Review data from Rotten Tomatoes, one of the datasets used in the original paper. It contains 10,662 example review sentences, half positive and half negative, and is only about 1 MB in size. Note that because the dataset is so small, a powerful model is likely to overfit. The dataset also does not come with a predefined train/test split, so we simply hold out 20% of the data as the test set.

The preprocessing functions in data_helpers.py do the following:

  • load_data_and_labels(): loads the positive and negative sentences from the raw data files and assigns each sentence a one-hot label: [0, 1] for positive, [1, 0] for negative.
  • clean_str(): uses regular expressions to strip punctuation and normalize the text.
  • pad_sentences(): pads every sentence to the length of the longest sentence, filling the remainder with <PAD/>. This lets us batch the data efficiently, because every example in a batch must have the same length.
  • build_vocab(): builds the vocabulary: deduplicates the words, sorts them, and maps each word to an integer between 0 and the vocabulary size. Each sentence then becomes a vector of integers.
  • build_input_data(): converts the processed sentences and labels into numpy arrays.
  • load_data(): wraps all of the above into a single function.

import numpy as np
import re
import itertools
from collections import Counter


def clean_str(string):
    """
    Tokenization/string cleaning for datasets.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()


def load_data_and_labels():
    """
    Loads polarity data from files, splits the data into words and generates labels.
    Returns split sentences and labels.
    """
    # Load data from files
    positive_examples = list(open("./data/rt-polarity.pos", "r", encoding='latin-1').readlines())
    positive_examples = [s.strip() for s in positive_examples]
    negative_examples = list(open("./data/rt-polarity.neg", "r", encoding='latin-1').readlines())
    negative_examples = [s.strip() for s in negative_examples]
    # Split by words
    x_text = positive_examples + negative_examples
    x_text = [clean_str(sent) for sent in x_text]
    x_text = [s.split(" ") for s in x_text]
    # Generate labels
    positive_labels = [[0, 1] for _ in positive_examples]
    negative_labels = [[1, 0] for _ in negative_examples]
    y = np.concatenate([positive_labels, negative_labels], 0)
    return [x_text, y]


def pad_sentences(sentences, padding_word="<PAD/>"):
    """
    Pads all sentences to the same length. The length is defined by the longest sentence.
    Returns padded sentences.
    """
    sequence_length = max(len(x) for x in sentences)
    padded_sentences = []
    for i in range(len(sentences)):
        sentence = sentences[i]
        num_padding = sequence_length - len(sentence)
        new_sentence = sentence + [padding_word] * num_padding
        padded_sentences.append(new_sentence)
    return padded_sentences


def build_vocab(sentences):
    """
    Builds a vocabulary mapping from word to index based on the sentences.
    Returns vocabulary mapping and inverse vocabulary mapping.
    """
    # Build vocabulary
    word_counts = Counter(itertools.chain(*sentences))
    # Mapping from index to word
    vocabulary_inv = [x[0] for x in word_counts.most_common()]
    vocabulary_inv = list(sorted(vocabulary_inv))
    # Mapping from word to index
    vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
    return [vocabulary, vocabulary_inv]


def build_input_data(sentences, labels, vocabulary):
    """
    Maps sentences and labels to vectors based on a vocabulary.
    """
    x = np.array([[vocabulary[word] for word in sentence] for sentence in sentences])
    y = np.array(labels)
    return [x, y]


def load_data():
    """
    Loads and preprocesses data for the dataset.
    Returns input vectors, labels, vocabulary, and inverse vocabulary.
    """
    # Load and preprocess data
    sentences, labels = load_data_and_labels()
    sentences_padded = pad_sentences(sentences)
    vocabulary, vocabulary_inv = build_vocab(sentences_padded)
    x, y = build_input_data(sentences_padded, labels, vocabulary)
    return [x, y, vocabulary, vocabulary_inv]
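
A quick sanity check of the pipeline (a minimal sketch; the shapes are the ones this dataset produces, as echoed in the training script's comments below):

from data_helpers import load_data

x, y, vocabulary, vocabulary_inv = load_data()
print(x.shape)          # (10662, 56): 10,662 sentences, each padded to 56 tokens
print(y.shape)          # (10662, 2): one-hot labels
print(len(vocabulary))  # 18765 distinct tokens, including "<PAD/>"

# Each row of x is a sequence of word indices; mapping it back through
# vocabulary_inv recovers the padded, cleaned sentence.
print([vocabulary_inv[i] for i in x[0]][:8])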

4. The Model

The first layer embeds the words into low-dimensional vectors. The next layer performs convolutions over the embedded word vectors with several filter sizes, e.g., sliding over 3, 4, or 5 words at a time; max pooling is used after each convolution. The three convolution-and-pooling branches are then combined: a Flatten layer merges the max-pooled features into one long feature vector, dropout is applied for regularization, and a softmax layer classifies the result.

With 'valid' padding, a filter spanning h words fits into a 56-word sentence in 56 - h + 1 positions, which is why the three convolution layers in the summary below have output lengths 54, 53, and 52.

________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
================================================================================
input_1 (InputLayer)             (None, 56)            0
________________________________________________________________________________
embedding_1 (Embedding)          (None, 56, 256)       4803840     input_1[0][0]
________________________________________________________________________________
reshape_1 (Reshape)              (None, 56, 256, 1)    0           embedding_1[0][0]
________________________________________________________________________________
conv2d_1 (Conv2D)                (None, 54, 1, 512)    393728      reshape_1[0][0]
________________________________________________________________________________
conv2d_2 (Conv2D)                (None, 53, 1, 512)    524800      reshape_1[0][0]
________________________________________________________________________________
conv2d_3 (Conv2D)                (None, 52, 1, 512)    655872      reshape_1[0][0]
________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)   (None, 1, 1, 512)     0           conv2d_1[0][0]
________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D)   (None, 1, 1, 512)     0           conv2d_2[0][0]
________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D)   (None, 1, 1, 512)     0           conv2d_3[0][0]
________________________________________________________________________________
concatenate_1 (Concatenate)      (None, 3, 1, 512)     0           max_pooling2d_1[0][0]
                                                                   max_pooling2d_2[0][0]
                                                                   max_pooling2d_3[0][0]
________________________________________________________________________________
flatten_1 (Flatten)              (None, 1536)          0           concatenate_1[0][0]
________________________________________________________________________________
dropout_1 (Dropout)              (None, 1536)          0           flatten_1[0][0]
________________________________________________________________________________
dense_1 (Dense)                  (None, 2)             3074        dropout_1[0][0]
================================================================================
Total params: 6,381,314
Trainable params: 6,381,314
Non-trainable params: 0
________________________________________________________________________________
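
The parameter counts in the summary can be reproduced by hand; a small sketch that checks the arithmetic:

# Embedding: one 256-dim vector per vocabulary word (18,765 words)
assert 18765 * 256 == 4803840

# Conv2D with kernel (h, 256) over a 1-channel input and 512 filters, plus biases:
# params = h * 256 * 1 * 512 + 512
assert 3 * 256 * 512 + 512 == 393728  # conv2d_1
assert 4 * 256 * 512 + 512 == 524800  # conv2d_2
assert 5 * 256 * 512 + 512 == 655872  # conv2d_3

# Dense: 3 branches x 512 pooled features = 1536 inputs, 2 outputs, plus biases
assert 1536 * 2 + 2 == 3074

assert 4803840 + 393728 + 524800 + 655872 + 3074 == 6381314  # total params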


  • Optimizer: Adam
  • Loss: binary_crossentropy (a two-class problem)
  • Metric: standard classification accuracy (whether each example is classified correctly)

from keras.layers import Input, Dense, Embedding, Conv2D, MaxPool2D
from keras.layers import Reshape, Flatten, Dropout, Concatenate
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
from keras.models import Model
from sklearn.model_selection import train_test_split
from data_helpers import load_data

print('Loading data')
x, y, vocabulary, vocabulary_inv = load_data()

# x.shape -> (10662, 56)
# y.shape -> (10662, 2)
# len(vocabulary) -> 18765
# len(vocabulary_inv) -> 18765

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# X_train.shape -> (8529, 56)
# y_train.shape -> (8529, 2)
# X_test.shape -> (2133, 56)
# y_test.shape -> (2133, 2)

sequence_length = x.shape[1]  # 56
vocabulary_size = len(vocabulary_inv)  # 18765
embedding_dim = 256
filter_sizes = [3, 4, 5]
num_filters = 512
drop = 0.5

epochs = 100
batch_size = 30

# this returns a tensor
print("Creating Model...")
inputs = Input(shape=(sequence_length,), dtype='int32')
embedding = Embedding(input_dim=vocabulary_size, output_dim=embedding_dim,
                      input_length=sequence_length)(inputs)
reshape = Reshape((sequence_length, embedding_dim, 1))(embedding)

conv_0 = Conv2D(num_filters, kernel_size=(filter_sizes[0], embedding_dim), padding='valid',
                kernel_initializer='normal', activation='relu')(reshape)
conv_1 = Conv2D(num_filters, kernel_size=(filter_sizes[1], embedding_dim), padding='valid',
                kernel_initializer='normal', activation='relu')(reshape)
conv_2 = Conv2D(num_filters, kernel_size=(filter_sizes[2], embedding_dim), padding='valid',
                kernel_initializer='normal', activation='relu')(reshape)

maxpool_0 = MaxPool2D(pool_size=(sequence_length - filter_sizes[0] + 1, 1),
                      strides=(1, 1), padding='valid')(conv_0)
maxpool_1 = MaxPool2D(pool_size=(sequence_length - filter_sizes[1] + 1, 1),
                      strides=(1, 1), padding='valid')(conv_1)
maxpool_2 = MaxPool2D(pool_size=(sequence_length - filter_sizes[2] + 1, 1),
                      strides=(1, 1), padding='valid')(conv_2)

concatenated_tensor = Concatenate(axis=1)([maxpool_0, maxpool_1, maxpool_2])
flatten = Flatten()(concatenated_tensor)
dropout = Dropout(drop)(flatten)
output = Dense(units=2, activation='softmax')(dropout)

# this creates a model that includes the whole graph from inputs to the softmax output
model = Model(inputs=inputs, outputs=output)

checkpoint = ModelCheckpoint('weights.{epoch:03d}-{val_acc:.4f}.hdf5', monitor='val_acc',
                             verbose=1, save_best_only=True, mode='auto')
adam = Adam(lr=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)

model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['accuracy'])
print("Training Model...")
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1,
          callbacks=[checkpoint], validation_data=(X_test, y_test))  # starts training
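
Once training finishes, the best checkpoint can be reloaded for inference. A minimal sketch under some assumptions: the checkpoint filename below is a placeholder for whatever ModelCheckpoint actually saved, and classify() is a hypothetical helper that reuses clean_str, the training vocabulary, and the same 56-token padding:

import numpy as np
from data_helpers import clean_str

model.load_weights('weights.001-0.7500.hdf5')  # hypothetical filename; use your best checkpoint

def classify(sentence, vocabulary, sequence_length=56, padding_word="<PAD/>"):
    """Hypothetical helper: preprocess one sentence exactly like the training data."""
    tokens = clean_str(sentence).split(" ")[:sequence_length]
    tokens = tokens + [padding_word] * (sequence_length - len(tokens))
    # Words unseen during training have no index; fall back to the padding token.
    idx = [vocabulary.get(w, vocabulary[padding_word]) for w in tokens]
    probs = model.predict(np.array([idx]))[0]
    return "positive" if np.argmax(probs) == 1 else "negative"  # labels are [neg, pos]

print(classify("a thoughtful and moving film", vocabulary))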


Reposted from: https://www.cnblogs.com/ansang/p/9010370.html
