Text-CNN Text Classification with Keras
Text CNN

1. Introduction
TextCNN is an algorithm that applies a convolutional neural network to text classification. It was proposed by Yoon Kim in the 2014 paper "Convolutional Neural Networks for Sentence Classification".
We will implement a model similar to Kim's CNN for sentence classification. The model achieves good performance on a range of text classification tasks (such as sentiment analysis) and has become a standard baseline for new text classification architectures.
2. Required Libraries and Dataset
- tensorflow
- h5py
- hdf5
- keras
- numpy
- itertools
- collections
- re
- sklearn 0.19.0
準(zhǔn)備數(shù)據(jù)集:
鏈接: https://pan.baidu.com/s/1oO4pDHeu3xIgkDtkLgQEVA 密碼: 6wrv
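As a quick sanity check that the environment is set up, the following minimal sketch imports every library used in this tutorial (the exact versions printed will depend on your installation):

# Sanity check: import every library used in this tutorial and print versions.
import re
import itertools
from collections import Counter

import numpy as np
import h5py
import sklearn
import tensorflow as tf
import keras

print("numpy:", np.__version__)
print("h5py:", h5py.__version__)
print("sklearn:", sklearn.__version__)  # 0.19.0 in this tutorial
print("tensorflow:", tf.__version__)
print("keras:", keras.__version__)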
3. Data and Preprocessing
The dataset we use is the Movie Review data from Rotten Tomatoes, one of the datasets used in the original paper. It contains 10,662 example review sentences, half positive and half negative, and is about 1 MB in size. Note that because the dataset is so small, a powerful model is likely to overfit. Also, the dataset does not come with a predefined train/test split, so we simply hold out 20% of the data as the test set.
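A 20% hold-out split along those lines might look like this (a sketch using sklearn's train_test_split; x and y are the arrays produced by load_data() below, and the random_state is an arbitrary choice):

from sklearn.model_selection import train_test_split

# Shuffle the data and hold out 20% as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)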
數(shù)據(jù)預(yù)處理的函數(shù)包括以下幾點(diǎn)(data_helpers.py):
?
- load_data_and_labels() loads the positive and negative sentences from the raw data files and attaches a one-hot label to each sentence: [0, 1] for positive, [1, 0] for negative.
- clean_str() uses regular expressions to strip punctuation and normalize each sentence.
- pad_sentences() pads every sentence to the length of the longest sentence, filling the gap with <PAD/>. This lets us batch the data efficiently, because every example in a batch must have the same length.
- build_vocab() builds the word-to-index mapping: it deduplicates the words, sorts them, and maps each word to an integer between 0 and the vocabulary size. Each sentence then becomes a vector of integers.
- build_input_data() converts the processed sentences into numpy arrays.
- load_data() ties all of the above together in a single function.
import numpy as np
import re
import itertools
from collections import Counter


def clean_str(string):
    """Tokenization/string cleaning for datasets.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()


def load_data_and_labels():
    """Loads polarity data from files, splits the data into words and generates labels.
    Returns split sentences and labels.
    """
    # Load data from files
    positive_examples = list(open("./data/rt-polarity.pos", "r", encoding='latin-1').readlines())
    positive_examples = [s.strip() for s in positive_examples]
    negative_examples = list(open("./data/rt-polarity.neg", "r", encoding='latin-1').readlines())
    negative_examples = [s.strip() for s in negative_examples]
    # Split by words
    x_text = positive_examples + negative_examples
    x_text = [clean_str(sent) for sent in x_text]
    x_text = [s.split(" ") for s in x_text]
    # Generate labels
    positive_labels = [[0, 1] for _ in positive_examples]
    negative_labels = [[1, 0] for _ in negative_examples]
    y = np.concatenate([positive_labels, negative_labels], 0)
    return [x_text, y]


def pad_sentences(sentences, padding_word="<PAD/>"):
    """Pads all sentences to the same length. The length is defined by the longest sentence.
    Returns padded sentences.
    """
    sequence_length = max(len(x) for x in sentences)
    padded_sentences = []
    for i in range(len(sentences)):
        sentence = sentences[i]
        num_padding = sequence_length - len(sentence)
        new_sentence = sentence + [padding_word] * num_padding
        padded_sentences.append(new_sentence)
    return padded_sentences


def build_vocab(sentences):
    """Builds a vocabulary mapping from word to index based on the sentences.
    Returns vocabulary mapping and inverse vocabulary mapping.
    """
    # Build vocabulary
    word_counts = Counter(itertools.chain(*sentences))
    # Mapping from index to word
    vocabulary_inv = [x[0] for x in word_counts.most_common()]
    vocabulary_inv = list(sorted(vocabulary_inv))
    # Mapping from word to index
    vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
    return [vocabulary, vocabulary_inv]


def build_input_data(sentences, labels, vocabulary):
    """Maps sentences and labels to vectors based on a vocabulary."""
    x = np.array([[vocabulary[word] for word in sentence] for sentence in sentences])
    y = np.array(labels)
    return [x, y]


def load_data():
    """Loads and preprocesses data for the dataset.
    Returns input vectors, labels, vocabulary, and inverse vocabulary.
    """
    # Load and preprocess data
    sentences, labels = load_data_and_labels()
    sentences_padded = pad_sentences(sentences)
    vocabulary, vocabulary_inv = build_vocab(sentences_padded)
    x, y = build_input_data(sentences_padded, labels, vocabulary)
    return [x, y, vocabulary, vocabulary_inv]
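A small usage sketch of load_data() (the shapes in the comments follow from the 10,662 sentences above and the padded length of 56 visible in the model summary later; the exact vocabulary size depends on the data files):

x, y, vocabulary, vocabulary_inv = load_data()
print(x.shape)          # (10662, 56): 10,662 sentences padded to length 56
print(y.shape)          # (10662, 2): one-hot labels
print(len(vocabulary))  # vocabulary size (18765 would match the Embedding layer below)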
4. The Model
The first layer embeds the words into low-dimensional vectors. The next layer performs convolutions over the embedded word vectors using multiple filter sizes, for example sliding over 3, 4, or 5 words at a time. Max pooling is used for the pooling layers.
The three convolution-and-pooling branches are then combined: a Flatten layer merges the max-pooled features into one long feature vector, a Dropout layer adds regularization, and a softmax Dense layer classifies the result.
________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
================================================================================
input_1 (InputLayer)             (None, 56)            0
________________________________________________________________________________
embedding_1 (Embedding)          (None, 56, 256)       4803840     input_1[0][0]
________________________________________________________________________________
reshape_1 (Reshape)              (None, 56, 256, 1)    0           embedding_1[0][0]
________________________________________________________________________________
conv2d_1 (Conv2D)                (None, 54, 1, 512)    393728      reshape_1[0][0]
________________________________________________________________________________
conv2d_2 (Conv2D)                (None, 53, 1, 512)    524800      reshape_1[0][0]
________________________________________________________________________________
conv2d_3 (Conv2D)                (None, 52, 1, 512)    655872      reshape_1[0][0]
________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)   (None, 1, 1, 512)     0           conv2d_1[0][0]
________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D)   (None, 1, 1, 512)     0           conv2d_2[0][0]
________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D)   (None, 1, 1, 512)     0           conv2d_3[0][0]
________________________________________________________________________________
concatenate_1 (Concatenate)      (None, 3, 1, 512)     0           max_pooling2d_1[0][0]
                                                                   max_pooling2d_2[0][0]
                                                                   max_pooling2d_3[0][0]
________________________________________________________________________________
flatten_1 (Flatten)              (None, 1536)          0           concatenate_1[0][0]
________________________________________________________________________________
dropout_1 (Dropout)              (None, 1536)          0           flatten_1[0][0]
________________________________________________________________________________
dense_1 (Dense)                  (None, 2)             3074        dropout_1[0][0]
================================================================================
Total params: 6,381,314
Trainable params: 6,381,314
Non-trainable params: 0
________________________________________________________________________________
- Optimizer: adam
- Loss: binary_crossentropy (this is a two-class problem)
- Metric: accuracy, the standard evaluation for classification (whether each example is classified correctly); the sketch below wires these up
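Putting these choices together, compiling and training might look like this (a sketch; the batch size, epoch count, and the use of the test set for validation are illustrative assumptions, not values from the original post):

# Assumes model from section 4 and the train/test split from section 3.
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train,
          batch_size=64,   # assumed batch size
          epochs=10,       # assumed number of epochs
          validation_data=(X_test, y_test))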
轉(zhuǎn)載于:https://www.cnblogs.com/ansang/p/9010370.html