Character-level Convolutional Networks for Text Classification
Overall Structure of the Paper
Historical significance of this paper:
1. It constructed several large-scale text classification datasets, which helped drive progress in text classification research.
2. It proposed the CharTextCNN method; because the model uses only character-level information, it can be applied to many different languages.
1. Abstract (an experimental study of the effectiveness of character-level convolutional networks for text classification; the model achieves strong results)
    The abstract describes the paper's three main contributions: an empirical exploration of the effectiveness of character-level convolutional networks, the construction of several large-scale text classification datasets, and comparisons against competing models.
2. Introduction (character-level features can effectively extract features from raw signals such as images and speech, and are also commonly used in natural language tasks; this paper explores them for text classification)
    The introduction develops the background from two angles: convolutional networks and character-level features. It first reviews text classification, argues for the effectiveness of CNNs, and motivates this paper's use of character-level CNNs for text classification; it then surveys applications of text classification in natural language processing, and finally discusses prior uses of character-level information.
3. Character-level Convolutional Networks (the character-level convolutional model and a data augmentation method):
This section presents the convolution formula and the quantization alphabet of 70 characters, then gives the network structure (6 convolutional layers followed by 3 fully connected layers) along with the model's hyperparameters; a worked check of the layer dimensions is sketched below.
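To make the architecture concrete, the following sketch (hypothetical helper names; it assumes only the paper's configuration of input length l0 = 1014, kernel sizes 7, 7, 3, 3, 3, 3 with stride 1, and max pooling of size 3 after convolutional layers 1, 2 and 6) traces the sequence length through the network. It reproduces the paper's (l0 - 96) / 27 = 34 output frames, i.e. the 34 * 256 = 8704 inputs to the first fully connected layer:

def conv_out(length, kernel):   # valid 1-D convolution, stride 1
    return length - kernel + 1

def pool_out(length):           # max pooling, kernel 3, stride 3
    return length // 3

length = 1014                   # l0: number of characters per sample
kernel_sizes = [7, 7, 3, 3, 3, 3]
pooled = {0, 1, 5}              # pooling follows conv layers 1, 2 and 6

for i, k in enumerate(kernel_sizes):
    length = conv_out(length, k)
    if i in pooled:
        length = pool_out(length)

print(length)        # 34, matching (1014 - 96) / 27
print(length * 256)  # 8704 inputs to the first fully connected layer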
The data augmentation part enlarges the training data by synonym replacement: the synonyms of each word are ranked by semantic similarity, and replacements are sampled according to a probability distribution (see the sketch after this paragraph).
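In the paper, both the number of words to replace and the rank of the synonym that replaces each of them are drawn from geometric distributions. A minimal sketch of that scheme, assuming a hypothetical synonyms dictionary that maps each word to its similarity-ranked synonym list (the paper uses a WordNet-derived thesaurus):

import random

# Hypothetical thesaurus: each word maps to a list of synonyms,
# ranked by semantic similarity.
synonyms = {
    "good": ["fine", "decent", "solid"],
    "movie": ["film", "picture"],
}

def sample_geometric(p):
    # Sample k = 0, 1, 2, ... with P[k] proportional to p**k.
    k = 0
    while random.random() < p:
        k += 1
    return k

def augment(words, p=0.5, q=0.5):
    replaceable = [i for i, w in enumerate(words) if w in synonyms]
    random.shuffle(replaceable)
    r = min(sample_geometric(p), len(replaceable))  # number of words to replace
    out = list(words)
    for i in replaceable[:r]:
        candidates = synonyms[out[i]]
        s = min(sample_geometric(q), len(candidates) - 1)  # rank of replacement synonym
        out[i] = candidates[s]
    return out

print(augment("this is a good movie".split()))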
Strengths and weaknesses of the CharTextCNN model:
Weaknesses:
1. Character-level sequences are very long, so the model is ill-suited to classifying long documents.
2. Only character-level information is used, so the model captures relatively little semantic information.
3. Performance is poor on small corpora.
Strengths:
1. The model structure is simple (6 convolutional layers plus 3 fully connected layers) and works well on large corpora.
2. It can be applied to any language and requires no word segmentation.
3. It performs well on noisy text, since out-of-vocabulary (OOV) problems essentially do not arise.
4. Comparison Models:
    This section introduces the baseline classification models, covering both traditional bag-of-words models and deep-learning-based models; a minimal sketch of a bag-of-words baseline follows.
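For reference, a sketch of the kind of bag-of-words baseline the paper compares against (hypothetical variable names; the paper's version counts the 50,000 most frequent training words and classifies with multinomial logistic regression):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Bag-of-words counts over the most frequent words, classified with
# multinomial logistic regression.
bow_baseline = make_pipeline(
    CountVectorizer(max_features=50000),
    LogisticRegression(max_iter=1000),
)
bow_baseline.fit(train_texts, train_labels)         # hypothetical lists of strings / ints
print(bow_baseline.score(test_texts, test_labels))  # hypothetical held-out split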
5. Datasets and Results:
    This section describes the several text classification datasets (including their sizes) and then reports the experimental results on each of them.
6. Discussion:
    Discusses the experimental results and some parameter settings. The discussion centers on comparisons between the models, in particular how performance varies with dataset size and with whether the task is semantic or syntactic in nature.
7. Conclusion and Outlook
    Summary of the paper and directions for future work.
Key points of the paper:
1. Convolutional networks can effectively extract salient features.
2. Character-level features are effective for natural language processing.
3. The CharTextCNN model itself.
Innovations:
1. A new text classification model, CharTextCNN.
2. Several new large-scale text classification datasets.
3. State-of-the-art or highly competitive results on multiple text classification datasets.
Takeaways:
1. Text classification with convolutional networks requires no knowledge of a language's syntactic or semantic structure.
2. The experiments show that no single machine learning model performs best across every dataset.
3. The paper analyzes, from an experimental standpoint, the applicability of character-level convolutional networks to text classification.
8. Code Implementation
""" 數(shù)據(jù)預(yù)處理 """# encoding = 'utf-8'import os import torch import json import csvf = open("./data/AG/train.csv") datas = csv.reader(f,delimiter=',',quotechar='"') datas = list(datas)label,data,lowercase = [],[],Truefor row in datas:label.append(int(row[0])-1)text = " ".join(row[1:])if lowercase:text = text.lower()data.append(text)print(label[0:5]) print(data[0:5])[2, 2, 2, 2, 2] ["wall st. bears claw back into the black (reuters) reuters - short-sellers, wall street's dwindling\\band of ultra-cynics, are seeing green again.", 'carlyle looks toward commercial aerospace (reuters) reuters - private investment firm carlyle group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.', "oil and economy cloud stocks' outlook (reuters) reuters - soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.", 'iraq halts oil exports from main southern pipeline (reuters) reuters - authorities have halted oil export\\flows from the main pipeline in southern iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on saturday.', 'oil prices soar to all-time record, posing new menace to us economy (afp) afp - tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the us presidential elections.']with open("./data/alphabet.json") as f:alphabet = "".join(json.load(f))def char2Index(char):return alphabet.find(char)l0 = 1014 def ontHotEncode(idx):X = torch.zeros(l0,len(alphabet))for index_char,char in enumerate(data[idx]):if char2Index(char)!=-1:X[index_char][char2Index(char)] = 1.0return X模型理論細(xì)節(jié):
""" 模型代碼 """import torch import torch.nn as nn import numpy as npclass CharTextCNN(nn.Module):def __init__(self,config):super(CharTextCNN,self).__init__()in_features = [config.char_num] + config.features[0:-1]out_features = config.featureskernel_sizes = config.kernel_sizesself.convs = []self.conv1 = nn.Sequential(nn.Conv1d(in_features[0], out_features[0], kernel_size=kernel_sizes[0], stride=1), # 一維卷積nn.BatchNorm1d(out_features[0]), # bn層nn.ReLU(), # relu激活函數(shù)層nn.MaxPool1d(kernel_size=3, stride=3) #一維池化層) # 卷積+bn+relu+pooling模塊self.conv2 = nn.Sequential(nn.Conv1d(in_features[1], out_features[1], kernel_size=kernel_sizes[1], stride=1),nn.BatchNorm1d(out_features[1]),nn.ReLU(),nn.MaxPool1d(kernel_size=3, stride=3))self.conv3 = nn.Sequential(nn.Conv1d(in_features[2], out_features[2], kernel_size=kernel_sizes[2], stride=1),nn.BatchNorm1d(out_features[2]),nn.ReLU())self.conv4 = nn.Sequential(nn.Conv1d(in_features[3], out_features[3], kernel_size=kernel_sizes[3], stride=1),nn.BatchNorm1d(out_features[3]),nn.ReLU())self.conv5 = nn.Sequential(nn.Conv1d(in_features[4], out_features[4], kernel_size=kernel_sizes[4], stride=1),nn.BatchNorm1d(out_features[4]),nn.ReLU())self.conv6 = nn.Sequential(nn.Conv1d(in_features[5], out_features[5], kernel_size=kernel_sizes[5], stride=1),nn.BatchNorm1d(out_features[5]),nn.ReLU(),nn.MaxPool1d(kernel_size=3, stride=3))self.fc1 = nn.Sequential(nn.Linear(8704, 1024), # 全連接層 #((l0-96)/27)*256nn.ReLU(),nn.Dropout(p=config.dropout) # dropout層) # 全連接+relu+dropout模塊self.fc2 = nn.Sequential(nn.Linear(1024, 1024),nn.ReLU(),nn.Dropout(p=config.dropout))self.fc3 = nn.Linear(1024, config.num_classes)def forward(self, x):x = self.conv1(x)x = self.conv2(x)x = self.conv3(x)x = self.conv4(x)x = self.conv5(x)x = self.conv6(x)x = x.view(x.size(0), -1) x = self.fc1(x)x = self.fc2(x)x = self.fc3(x)return xclass config:def __init__(self):self.char_num = 70 # 字符的個數(shù)self.features = [256,256,256,256,256,256] # 每一層特征個數(shù)self.kernel_sizes = [7,7,3,3,3,3] # 每一層的卷積核尺寸self.dropout = 0.5 # dropout大小self.num_classes = 4 # 數(shù)據(jù)的類別個數(shù)config = config() chartextcnn = CharTextCNN(config) test = torch.zeros([64,70,1014]) out = chartextcnn(test)from torchsummary import summarysummary(chartextcnn, input_size=(70,1014))----------------------------------------------------------------Layer (type) Output Shape Param # ================================================================Conv1d-1 [-1, 256, 1008] 125,696BatchNorm1d-2 [-1, 256, 1008] 512ReLU-3 [-1, 256, 1008] 0MaxPool1d-4 [-1, 256, 336] 0Conv1d-5 [-1, 256, 330] 459,008BatchNorm1d-6 [-1, 256, 330] 512ReLU-7 [-1, 256, 330] 0MaxPool1d-8 [-1, 256, 110] 0Conv1d-9 [-1, 256, 108] 196,864BatchNorm1d-10 [-1, 256, 108] 512ReLU-11 [-1, 256, 108] 0Conv1d-12 [-1, 256, 106] 196,864BatchNorm1d-13 [-1, 256, 106] 512ReLU-14 [-1, 256, 106] 0Conv1d-15 [-1, 256, 104] 196,864BatchNorm1d-16 [-1, 256, 104] 512ReLU-17 [-1, 256, 104] 0Conv1d-18 [-1, 256, 102] 196,864BatchNorm1d-19 [-1, 256, 102] 512ReLU-20 [-1, 256, 102] 0MaxPool1d-21 [-1, 256, 34] 0Linear-22 [-1, 1024] 8,913,920ReLU-23 [-1, 1024] 0Dropout-24 [-1, 1024] 0Linear-25 [-1, 1024] 1,049,600ReLU-26 [-1, 1024] 0Dropout-27 [-1, 1024] 0Linear-28 [-1, 4] 4,100 ================================================================ Total params: 11,342,852 Trainable params: 11,342,852 Non-trainable params: 0 ---------------------------------------------------------------- Input size (MB): 0.27 Forward/backward pass size (MB): 11.29 Params size (MB): 43.27 Estimated Total Size (MB): 54.83 
""" Training """
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim
from model import CharTextCNN
from data import AG_Data
from tqdm import tqdm
import numpy as np
import config as argumentparser

config = argumentparser.ArgumentParser()  # read in the hyperparameter settings
config.features = list(map(int, config.features.split(",")))          # split features on "," and convert to int
config.kernel_sizes = list(map(int, config.kernel_sizes.split(",")))  # split kernel_sizes on "," and convert to int

# load the training set
training_set = AG_Data(data_path="/AG/train.csv", l0=config.l0)
training_iter = torch.utils.data.DataLoader(dataset=training_set,
                                            batch_size=config.batch_size,
                                            shuffle=True,
                                            num_workers=0)

# load the test set
test_set = AG_Data(data_path="/AG/test.csv", l0=config.l0)
test_iter = torch.utils.data.DataLoader(dataset=test_set,
                                        batch_size=config.batch_size,
                                        shuffle=False,
                                        num_workers=0)

model = CharTextCNN(config)        # initialize the model
criterion = nn.CrossEntropyLoss()  # build the loss
optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)  # build the optimizer
loss = -1

def get_test_result(data_iter, data_set):
    # evaluate accuracy on a held-out set
    model.eval()
    true_sample_num = 0
    for data, label in data_iter:
        if config.cuda and torch.cuda.is_available():
            data = data.cuda()
            label = label.cuda()
        else:
            data = torch.autograd.Variable(data).float()
        out = model(data)
        true_sample_num += np.sum((torch.argmax(out, 1) == label).cpu().numpy())  # correct predictions in this batch
    acc = true_sample_num / data_set.__len__()
    return acc

for epoch in range(config.epoch):
    model.train()
    process_bar = tqdm(training_iter)
    for data, label in process_bar:
        if config.cuda and torch.cuda.is_available():
            data = data.cuda()   # if using a GPU, move the data onto it
            label = label.cuda()
        else:
            data = torch.autograd.Variable(data).float()
            label = torch.autograd.Variable(label).squeeze()
        out = model(data)
        loss_now = criterion(out, autograd.Variable(label.long()))
        if loss == -1:
            loss = loss_now.data.item()
        else:
            loss = 0.95 * loss + 0.05 * loss_now.data.item()  # exponential smoothing of the running loss
        process_bar.set_postfix(loss=loss_now.data.item())    # show the current loss on the progress bar
        process_bar.update()
        optimizer.zero_grad()  # clear gradients, backpropagate, and update
        loss_now.backward()
        optimizer.step()
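As written, the script defines get_test_result but never calls it; one plausible use (a sketch, not part of the original script) is to evaluate at the end of each pass through the epoch loop:

    # at the end of each epoch, inside the loop over epochs:
    test_acc = get_test_result(test_iter, test_set)
    print("epoch %d, test acc: %.4f" % (epoch, test_acc))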