Kaggle Competition: Quora Insincere Questions Classification (Summary and Reflections)


In this Quora text-classification competition, my solo entry finished only in the top 20% of the leaderboard out of roughly 4,000 teams. Partly this was because it was my first NLP competition and I was a complete beginner, and partly because I slacked off at several points along the way. So I want to write down some technical and non-technical takeaways as a reminder to myself.

The task is to predict, from the question text in the training set, whether a question posted on Quora is sincere or insincere. Competition link: https://www.kaggle.com/c/quora-insincere-questions-classification

Technical takeaways:

By studying the kernels of stronger competitors during the competition, and looking up references whenever I only half-understood something, I built up a rough picture of NLP text classification: tokenization, language models, and word vectors; n-gram language models and pretrained word embeddings; the commonly used LSTM and GRU networks; various text-preprocessing methods; and model components such as attention layers.
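As a purely illustrative example of two of these ideas, the snippet below extracts word n-grams from a sentence and looks a word up in a toy embedding table; the tiny vectors here are made up for illustration and are not from the competition data:

def word_ngrams(sentence, n=2):
    # slide a window of n words over the tokenized sentence
    tokens = sentence.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("Why do people ask insincere questions", n=2))
# -> [('why', 'do'), ('do', 'people'), ('people', 'ask'), ('ask', 'insincere'), ('insincere', 'questions')]

# a toy word-embedding lookup: each word maps to a dense vector
# (real pretrained embeddings such as GloVe use 300 dimensions)
toy_embeddings = {"why": [0.1, 0.3], "do": [0.2, -0.1], "people": [0.0, 0.5]}
vector = toy_embeddings.get("people")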

Non-technical issues:

An NLP competition is quite different from an ordinary data-mining competition. In ordinary data mining the most important thing is to dig out good features, and choosing a suitable model comes second; NLP puts much more weight on the model itself, which is why deep learning models are so widely used in NLP. I was also running another data-mining competition at the same time, and splitting my attention made me less focused. Once I got stuck at a certain point I started to slack off, which is another habit I need to correct.

Reflections and takeaways:

最重要的收獲是感覺自己NLP終于入了門,同時了解了各種前沿的論文對于NLP建模的影響,需要的論文,因為基本上好的nlp模型都是從現有論文中衍生的(當然很多大神是通過比賽驗證自己的模型然后再發Paper),這和從數據中衍生的數據挖掘出的特征真的是有很大的區別。同時這次比賽借鑒了很多別人的方案,在以后更需要做的是站在巨人的肩膀上做出自己的一些想法。在這次比賽后,發現nlp真是一個巨大無比的坑,還有太多需要學習的地方,繼續加油,保持危機感。

The competition code, with notes, follows.

Source code: https://github.com/yyhhlancelot/Kaggle_Quora_Insincere_Question_Classification

First, import the packages we need:

import os
import time
import math
import gc

import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn import metrics
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler

import tensorflow as tf
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, CuDNNLSTM, Embedding, Dropout, Activation, CuDNNGRU, Conv1D
from keras.layers import Bidirectional, MaxPooling1D, GlobalMaxPool1D, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.layers import Conv2D, MaxPool2D, Reshape, Flatten, Concatenate, SpatialDropout1D, BatchNormalization, PReLU
from keras.layers import concatenate, add
from keras.optimizers import Adam
from keras.models import Model
from keras import backend as K
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.callbacks import *

Preprocessing stage:

Cleaning punctuation:

def clean_text(x):
    puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=',
              '#', '*', '+', '\\', '?', '~', '@', '£', '·', '_', '{', '}', '?', '^', '?', '`', '<', '→', '°', '€',
              '?', '?', '?', '←', '×', '§', '″', '′', '?', '█', '?', 'à', '…', '“', '★', '”', '–', '●', 'a', '?',
              '?', '¢', '2', '?', '?', '?', '↑', '±', '?', '?', '═', '|', '║', '―', '¥', '▓', '—', '?', '─', '?',
              ':', '?', '⊕', '▼', '?', '?', '■', '’', '?', '¨', '▄', '?', '☆', 'é', 'ˉ', '?', '¤', '▲', 'è', '?',
              '?', '?', '?', '‘', '∞', '?', ')', '↓', '、', '│', '(', '?', ',', '?', '╩', '╚', '3', '?', '╦', '╣',
              '╔', '╗', '?', '?', '?', '?', '1', '≤', '?', '√', ]
    x = str(x)
    # split words joined by slashes, hyphens and apostrophes
    for punct in "/-'":
        x = x.replace(punct, ' ')
    # keep '&' as a separate token
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    # drop the most common ASCII punctuation entirely
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
    # pad the remaining special characters with spaces so they become tokens
    for punct in puncts:
        x = x.replace(punct, f' {punct} ')
    return x

Cleaning numbers with regular expressions:

import re

def clean_numbers(x):
    # mask digit runs with '#' of matching length (5 or more digits become '#####')
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x
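A quick check of what this does (my own example, not from the original write-up): long digit runs are collapsed to hash masks while single digits are left alone, presumably so that numbers line up with the hash-masked number tokens that the pretrained embeddings tend to contain.

print(clean_numbers("In 2018 I paid 50000 dollars for 3 phones"))
# -> "In #### I paid ##### dollars for 3 phones"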

Cleaning common misspellings and contractions:

def _get_mispell(mispell_dict):
    mispell_re = re.compile('(%s)' % '|'.join(mispell_dict.keys()))
    return mispell_dict, mispell_re


mispell_dict = {
    'colour': 'color', 'centre': 'center', 'didnt': 'did not', 'doesnt': 'does not', 'isnt': 'is not',
    'shouldnt': 'should not', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling',
    'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization',
    'wwii': 'world war 2', 'citicise': 'criticize',
    'instagram': 'social medium', 'whatsapp': 'social medium', 'snapchat': 'social medium',
    "ain't": "is not", "aren't": "are not", "can't": "cannot", "'cause": "because", "could've": "could have",
    "couldn't": "could not", "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not",
    "hasn't": "has not", "haven't": "have not", "he'd": "he would", "he'll": "he will", "he's": "he is",
    "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
    "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have", "I'm": "I am", "I've": "I have",
    "i'd": "i would", "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have", "i'm": "i am", "i've": "i have",
    "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have",
    "it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have",
    "mightn't": "might not", "mightn't've": "might not have", "must've": "must have", "mustn't": "must not",
    "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have", "o'clock": "of the clock",
    "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not",
    "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will",
    "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not",
    "shouldn't've": "should not have", "so've": "so have", "so's": "so as", "this's": "this is",
    "that'd": "that would", "that'd've": "that would have", "that's": "that is",
    "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is",
    "they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have",
    "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not",
    "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have",
    "we're": "we are", "we've": "we have", "weren't": "were not",
    "what'll": "what will", "what'll've": "what will have", "what're": "what are", "what's": "what is",
    "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did",
    "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have",
    "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have",
    "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not",
    "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would",
    "y'all'd've": "you all would have", "y'all're": "you all are", "y'all've": "you all have",
    "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
    "you're": "you are", "you've": "you have",
    'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling',
    'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor',
    'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ',
    'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do',
    'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do',
    'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation',
    'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum',
    'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota',
    'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp',
    'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization'}

mispellings, mispellings_re = _get_mispell(mispell_dict)


def replace_typical_misspell(text):
    def replace(match):
        return mispellings[match.group(0)]
    return mispellings_re.sub(replace, text)
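A quick sanity check of the spelling replacement (my own example sentence, not from the original post):

print(replace_typical_misspell("Whta is demonitization and how does it affect bigdata?"))
# -> "What is demonetization and how does it affect big data?"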

Text preprocessing:

train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")
print("Train shape : ", train_df.shape)
print("Test shape : ", test_df.shape)

embed_size = 300      # word-vector dimension
max_features = 95000  # vocabulary size
max_len = 70          # input sequence length

# lower-case
train_df['question_text'] = train_df['question_text'].apply(lambda x: x.lower())
test_df['question_text'] = test_df['question_text'].apply(lambda x: x.lower())

# clean the text
train_df["question_text"] = train_df["question_text"].apply(lambda x: clean_text(x))
test_df["question_text"] = test_df["question_text"].apply(lambda x: clean_text(x))

# clean numbers
train_df["question_text"] = train_df["question_text"].apply(lambda x: clean_numbers(x))
test_df["question_text"] = test_df["question_text"].apply(lambda x: clean_numbers(x))

# clean spellings
train_df['question_text'] = train_df['question_text'].apply(lambda x: replace_typical_misspell(x))
test_df['question_text'] = test_df['question_text'].apply(lambda x: replace_typical_misspell(x))

# fill up the missing values
train_X = train_df['question_text'].fillna("_##_").values
test_X = test_df['question_text'].fillna("_##_").values

# tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
test_X = tokenizer.texts_to_sequences(test_X)

# pad the sentences
train_X = pad_sequences(train_X, maxlen=max_len)
test_X = pad_sequences(test_X, maxlen=max_len)

# the target values
train_y = train_df['target'].values

# shuffle the training data
np.random.seed(666)
trn_idx = np.random.permutation(len(train_X))
train_X = train_X[trn_idx]
train_y = train_y[trn_idx]

Loading the pretrained word embeddings:

def load_glove(word_index):
    # EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
    EMBEDDING_FILE = 'J:/Code/kaggle/Quora_Insincere_Question_Classfication/glove.840B.300d/glove.840B.300d.txt'

    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')

    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, 'r', encoding='UTF-8'))

    all_embs = np.stack(embeddings_index.values())
    emb_mean, emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= max_features:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix


def load_fasttext(word_index):
    # EMBEDDING_FILE = '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'
    EMBEDDING_FILE = 'J:/Code/kaggle/Quora_Insincere_Question_Classfication/wiki-news-300d-1M/wiki-news-300d-1M.vec'

    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')

    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, 'r', encoding='UTF-8') if len(o) > 100)

    all_embs = np.stack(embeddings_index.values())
    emb_mean, emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= max_features:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix


def load_para(word_index):
    # EMBEDDING_FILE = '../input/embeddings/paragram_300_sl999/paragram_300_sl999.txt'
    EMBEDDING_FILE = 'J:/Code/kaggle/Quora_Insincere_Question_Classfication/paragram_300_sl999/paragram_300_sl999.txt'

    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')

    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8", errors='ignore') if len(o) > 100)

    all_embs = np.stack(embeddings_index.values())
    emb_mean, emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= max_features:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

Attention layer:

class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        """
        Keras Layer that implements an Attention mechanism for temporal data.
        Supports Masking.
        Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756]
        # Input shape
            3D tensor with shape: `(samples, steps, features)`.
        # Output shape
            2D tensor with shape: `(samples, features)`.
        :param kwargs:
        Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
        The dimensions are inferred based on the output shape of the RNN.
        Example:
            model.add(LSTM(64, return_sequences=True))
            model.add(Attention())
        """
        self.supports_masking = True
        # self.init = initializations.get('glorot_uniform')
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight((input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight((input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        # eij = K.dot(x, self.W) TF backend doesn't support it
        # features_dim = self.W.shape[0]
        # step_dim = x._keras_shape[1]
        features_dim = self.features_dim
        step_dim = self.step_dim

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                              K.reshape(self.W, (features_dim, 1))), (-1, step_dim))

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        a = K.exp(eij)

        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())

        # in some cases especially in the early stages of training the sum may be almost zero
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        # print(weighted_input.shape)
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        # return input_shape[0], input_shape[-1]
        return input_shape[0], self.features_dim

Capsule network:

def squash(x, axis=-1):
    # s_squared_norm is really small
    # s_squared_norm = K.sum(K.square(x), axis, keepdims=True) + K.epsilon()
    # scale = K.sqrt(s_squared_norm) / (0.5 + s_squared_norm)
    # return scale * x
    s_squared_norm = K.sum(K.square(x), axis, keepdims=True)
    scale = K.sqrt(s_squared_norm + K.epsilon())
    return x / scale


# A Capsule implementation in pure Keras
class Capsule(Layer):
    def __init__(self, num_capsule, dim_capsule, routings=3, kernel_size=(9, 1), share_weights=True,
                 activation='default', **kwargs):
        super(Capsule, self).__init__(**kwargs)
        self.num_capsule = num_capsule
        self.dim_capsule = dim_capsule
        self.routings = routings
        self.kernel_size = kernel_size
        self.share_weights = share_weights
        if activation == 'default':
            self.activation = squash
        else:
            self.activation = Activation(activation)

    def build(self, input_shape):
        super(Capsule, self).build(input_shape)
        input_dim_capsule = input_shape[-1]
        if self.share_weights:
            self.W = self.add_weight(name='capsule_kernel',
                                     shape=(1, input_dim_capsule,
                                            self.num_capsule * self.dim_capsule),
                                     # shape=self.kernel_size,
                                     initializer='glorot_uniform',
                                     trainable=True)
        else:
            input_num_capsule = input_shape[-2]
            self.W = self.add_weight(name='capsule_kernel',
                                     shape=(input_num_capsule,
                                            input_dim_capsule,
                                            self.num_capsule * self.dim_capsule),
                                     initializer='glorot_uniform',
                                     trainable=True)

    def call(self, u_vecs):
        if self.share_weights:
            u_hat_vecs = K.conv1d(u_vecs, self.W)
        else:
            u_hat_vecs = K.local_conv1d(u_vecs, self.W, [1], [1])

        batch_size = K.shape(u_vecs)[0]
        input_num_capsule = K.shape(u_vecs)[1]
        u_hat_vecs = K.reshape(u_hat_vecs, (batch_size, input_num_capsule,
                                            self.num_capsule, self.dim_capsule))
        u_hat_vecs = K.permute_dimensions(u_hat_vecs, (0, 2, 1, 3))
        # final u_hat_vecs.shape = [None, num_capsule, input_num_capsule, dim_capsule]

        b = K.zeros_like(u_hat_vecs[:, :, :, 0])  # shape = [None, num_capsule, input_num_capsule]
        for i in range(self.routings):
            b = K.permute_dimensions(b, (0, 2, 1))  # shape = [None, input_num_capsule, num_capsule]
            c = K.softmax(b)
            c = K.permute_dimensions(c, (0, 2, 1))
            b = K.permute_dimensions(b, (0, 2, 1))
            outputs = self.activation(tf.keras.backend.batch_dot(c, u_hat_vecs, [2, 2]))
            if i < self.routings - 1:
                b = tf.keras.backend.batch_dot(outputs, u_hat_vecs, [2, 3])

        return outputs

    def compute_output_shape(self, input_shape):
        return (None, self.num_capsule, self.dim_capsule)


def capsule():
    K.clear_session()
    inp = Input(shape=(max_len,))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix], trainable=False)(inp)
    x = SpatialDropout1D(rate=0.2)(x)
    x = Bidirectional(CuDNNGRU(100, return_sequences=True,
                               kernel_initializer=initializers.glorot_normal(seed=12300),
                               recurrent_initializer=initializers.orthogonal(gain=1.0, seed=10000)))(x)
    x = Capsule(num_capsule=10, dim_capsule=10, routings=4, share_weights=True)(x)
    x = Flatten()(x)
    x = Dense(100, activation="relu", kernel_initializer=initializers.glorot_normal(seed=12300))(x)
    x = Dropout(0.12)(x)
    x = BatchNormalization()(x)
    x = Dense(1, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer=Adam())
    return model


def f1_smart(y_true, y_pred):
    # sort the predictions and sweep every candidate threshold in one pass,
    # returning the best F1 and the corresponding threshold
    args = np.argsort(y_pred)
    tp = y_true.sum()
    fs = (tp - np.cumsum(y_true[args[:-1]])) / np.arange(y_true.shape[0] + tp - 1, tp, -1)
    res_idx = np.argmax(fs)
    return 2 * fs[res_idx], (y_pred[args[res_idx]] + y_pred[args[res_idx + 1]]) / 2

Modelling:

By comparing the models shared by stronger competitors, I picked out a few that work particularly well.

First, a bidirectional LSTM/GRU model with attention:

def model_lstm_atten(embedding_matrix):
    inp = Input(shape=(max_len,))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix], trainable=False)(inp)
    x = SpatialDropout1D(0.1)(x)
    x = Bidirectional(CuDNNLSTM(40, return_sequences=True))(x)
    y = Bidirectional(CuDNNGRU(40, return_sequences=True))(x)

    atten_1 = Attention(max_len)(x)  # attention over the LSTM outputs
    atten_2 = Attention(max_len)(y)  # attention over the GRU outputs
    avg_pool = GlobalAveragePooling1D()(y)
    max_pool = GlobalMaxPooling1D()(y)

    conc = concatenate([atten_1, atten_2, avg_pool, max_pool])
    conc = Dense(16, activation="relu")(conc)
    conc = Dropout(0.1)(conc)
    outp = Dense(1, activation="sigmoid")(conc)

    model = Model(inputs=inp, outputs=outp)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[f1])
    return model
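These models compile with a custom f1 metric that is not defined in this excerpt. A common batch-wise approximation used in Keras kernels looks roughly like the sketch below; this is my own reconstruction, not necessarily the exact function from the original source.

def f1(y_true, y_pred):
    # batch-wise F1 approximation: threshold predictions at 0.5, then
    # compute precision and recall from true and predicted positives
    y_pred_bin = K.round(K.clip(y_pred, 0, 1))
    true_positives = K.sum(K.round(K.clip(y_true * y_pred_bin, 0, 1)))
    predicted_positives = K.sum(y_pred_bin)
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())
    return 2 * (precision * recall) / (precision + recall + K.epsilon())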

A bidirectional LSTM/GRU model with both attention and a capsule layer:

def model_atten_capsule(embedding_matrix):
    '''0.7'''
    inp_x = Input(shape=(max_len,))
    inp_features = Input(shape=(6,))  # additional feature input of width 6 (built elsewhere in the full source)
    x = Embedding(max_features, embed_size, weights=[embedding_matrix], trainable=False)(inp_x)
    x = SpatialDropout1D(0.1)(x)
    lstm = Bidirectional(CuDNNLSTM(60, return_sequences=True,
                                   kernel_initializer=initializers.glorot_normal(seed=12300),
                                   recurrent_initializer=initializers.orthogonal(gain=1.0, seed=10000)))(x)
    gru = Bidirectional(CuDNNGRU(60, return_sequences=True,
                                 kernel_initializer=initializers.glorot_normal(seed=12300),
                                 recurrent_initializer=initializers.orthogonal(gain=1.0, seed=10000)))(lstm)
    # x = Bidirectional(CuDNNLSTM(64, return_sequences=True))(x)

    content3 = Capsule(num_capsule=10, dim_capsule=10, routings=4, share_weights=True)(gru)
    content3 = Dropout(0.1)(content3)
    # content3 = Reshape(-1, )(content3)
    content3 = Flatten()(content3)
    content3 = Dense(1, activation="relu", kernel_initializer=initializers.glorot_normal(seed=12300))(content3)
    ### modified the content3 branch

    atten_lstm = Attention(max_len)(lstm)
    atten_gru = Attention(max_len)(gru)
    avg_pool = GlobalAveragePooling1D()(gru)
    max_pool = GlobalMaxPooling1D()(gru)

    conc = concatenate([atten_lstm, atten_gru, content3, avg_pool, max_pool, inp_features])
    #### modified the dense layer
    conc = Dense(16, activation="relu", kernel_initializer=initializers.glorot_normal(seed=12300))(conc)
    x = BatchNormalization()(conc)
    x = Dropout(0.1)(x)
    outp = Dense(1, activation="sigmoid")(x)  # sigmoid output to match the binary cross-entropy loss

    model = Model(inputs=[inp_x, inp_features], outputs=outp)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[f1])
    return model

A CNN model:

def model_cnn(embedding_matrix):
    filter_sizes = [1, 2, 3, 5]
    num_filters = 36

    inp = Input(shape=(max_len,))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
    x = Reshape((max_len, embed_size, 1))(x)

    maxpool_pool = []
    for i in range(len(filter_sizes)):
        conv = Conv2D(num_filters, kernel_size=(filter_sizes[i], embed_size),
                      kernel_initializer='he_normal', activation='elu')(x)
        maxpool_pool.append(MaxPool2D(pool_size=(max_len - filter_sizes[i] + 1, 1))(conv))

    z = Concatenate(axis=1)(maxpool_pool)
    z = Flatten()(z)
    z = Dropout(0.1)(z)

    outp = Dense(1, activation="sigmoid")(z)

    model = Model(inputs=inp, outputs=outp)
    model.summary()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

I also found online the DPCNN model previously proposed by Tencent and made some targeted modifications to it:

def model_dpcnn(embedding_matrix):
    filter_nr = 64
    filter_size = 3
    max_pool_size = 3
    max_pool_strides = 2
    dense_nr = 256
    spatial_dropout = 0.1
    dense_dropout = 0.2
    train_embed = False
    conv_kern_reg = regularizers.l2(0.00001)
    conv_bias_reg = regularizers.l2(0.00001)

    inp = Input(shape=(max_len,))
    emb_comment = Embedding(max_features, embed_size, weights=[embedding_matrix], trainable=False)(inp)
    # emb_comment = SpatialDropout1D(0.1)(emb_comment)

    # block1
    block1 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(emb_comment)
    block1 = BatchNormalization()(block1)
    block1 = PReLU()(block1)
    block1 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block1)
    block1 = BatchNormalization()(block1)
    block1 = PReLU()(block1)

    # we pass the embedded comment through a conv1d with filter size 1 because it needs to have the same shape as the block output
    # if you choose filter_nr = embed_size (300 in this case) you don't have to do this part and can add emb_comment directly to block1_output
    resize_emb = Conv1D(filter_nr, kernel_size=1, padding='same', activation='linear',
                        kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(emb_comment)
    resize_emb = PReLU()(resize_emb)

    block1_output = add([block1, resize_emb])
    block1_output = MaxPooling1D(pool_size=max_pool_size, strides=max_pool_strides)(block1_output)

    # block2
    block2 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block1_output)
    block2 = BatchNormalization()(block2)
    block2 = PReLU()(block2)
    block2 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block2)
    block2 = BatchNormalization()(block2)
    block2 = PReLU()(block2)

    block2_output = add([block2, block1_output])
    block2_output = MaxPooling1D(pool_size=max_pool_size, strides=max_pool_strides)(block2_output)

    # block3
    block3 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block2_output)
    block3 = BatchNormalization()(block3)
    block3 = PReLU()(block3)
    block3 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block3)
    block3 = BatchNormalization()(block3)
    block3 = PReLU()(block3)

    block3_output = add([block3, block2_output])
    block3_output = MaxPooling1D(pool_size=max_pool_size, strides=max_pool_strides)(block3_output)

    # block4
    block4 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block3_output)
    block4 = BatchNormalization()(block4)
    block4 = PReLU()(block4)
    block4 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block4)
    block4 = BatchNormalization()(block4)
    block4 = PReLU()(block4)

    block4_output = add([block4, block3_output])
    block4_output = MaxPooling1D(pool_size=max_pool_size, strides=max_pool_strides)(block4_output)

    # block5
    block5 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block4_output)
    block5 = BatchNormalization()(block5)
    block5 = PReLU()(block5)
    block5 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block5)
    block5 = BatchNormalization()(block5)
    block5 = PReLU()(block5)

    block5_output = add([block5, block4_output])
    block5_output = MaxPooling1D(pool_size=max_pool_size, strides=max_pool_strides)(block5_output)

    # #block6
    # block6 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
    #                 kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block5_output)
    # block6 = BatchNormalization()(block6)
    # block6 = PReLU()(block6)
    # block6 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
    #                 kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block6)
    # block6 = BatchNormalization()(block6)
    # block6 = PReLU()(block6)
    # block6_output = add([block6, block5_output])
    # block6_output = MaxPooling1D(pool_size=max_pool_size, strides=max_pool_strides)(block6_output)

    # block7
    block7 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block5_output)
    block7 = BatchNormalization()(block7)
    block7 = PReLU()(block7)
    block7 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block7)
    block7 = BatchNormalization()(block7)
    block7 = PReLU()(block7)

    block7_output = add([block7, block5_output])
    outp = GlobalMaxPooling1D()(block7_output)
    # output = block7_output

    outp = Dense(dense_nr, activation='linear')(outp)
    outp = BatchNormalization()(outp)
    outp = Dropout(0.1)(outp)
    outp = Dense(1, activation='sigmoid')(outp)

    model = Model(inputs=inp, outputs=outp)
    model.summary()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

There are a few more models along the same lines, also combining attention, capsule layers and bidirectional LSTM/GRU, just with different architectures; I will not list them here.

Word-embedding processing and training:

embedding_matrix_1 = load_glove(tokenizer.word_index)
# embedding_matrix_2 = load_fasttext(tokenizer.word_index)
embedding_matrix_3 = load_para(tokenizer.word_index)

# element-wise average of the GloVe and Paragram matrices
embedding_matrix = np.mean([embedding_matrix_1, embedding_matrix_3], axis=0)
# embedding_matrix = np.mean([embedding_matrix_1, embedding_matrix_2], axis=0)
# np.shape(embedding_matrix)

del embedding_matrix_1, embedding_matrix_3
gc.collect()

Searching for the best threshold:

def threshold_search(y_true, y_prob):
    best_thresh = 0
    best_score = 0
    for thresh in np.arange(0.1, 0.701, 0.01):
        thresh = np.round(thresh, 2)
        score = metrics.f1_score(y_true, (y_prob >= thresh).astype(int))
        print("F1 score at threshold {} is {}".format(thresh, score))
        if score > best_score:
            best_score = score
            best_thresh = thresh
    return best_thresh


def train_pred(model, dev_X, dev_y, val_X, val_y, test_X, dev_features=None, val_features=None, epochs=None, callback=None):
    if dev_features is None:
        model.fit(dev_X, dev_y, batch_size=512, epochs=epochs,
                  validation_data=(val_X, val_y), callbacks=callback, verbose=0)
        pred_test_y_temp = model.predict(test_X, batch_size=1024)
        # pred_test_y_temp = model.predict(np.concatenate((test_X, test_features), axis=1), batch_size=1024)
    else:
        # models with an extra feature input expect test_features as a global (built elsewhere in the full source)
        model.fit([dev_X, dev_features], dev_y, batch_size=512, epochs=epochs,
                  validation_data=([val_X, val_features], val_y), callbacks=callback, verbose=0)
        pred_test_y_temp = model.predict([test_X, test_features], batch_size=1024)
    return pred_test_y_temp

Here I used 4-fold stratified cross-validation and trained with one of the models above:

## ADDITIONAL TRAIN: lstm_atten
num_splits = 4
skf = StratifiedKFold(n_splits=num_splits, shuffle=True, random_state=2333)

pred_test_y = 0
thresh_use = 0
val_score = 0

for dev_index, val_index in skf.split(train_X, train_y):
    dev_X, val_X = train_X[dev_index, :], train_X[val_index, :]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    # dev_features, val_features = train_features[dev_index, :], train_features[val_index, :]

    model = model_lstm_atten(embedding_matrix)
    # 'clr' is a learning-rate callback (e.g. cyclical LR) defined elsewhere in the full source
    pred_test_y_temp = train_pred(model, dev_X, dev_y, val_X, val_y, test_X,
                                  dev_features=None, val_features=None, epochs=2, callback=[clr, ])

    pred_val_y = model.predict(val_X, batch_size=1024)
    best_thresh = threshold_search(val_y, pred_val_y)
    val_score_temp = metrics.f1_score(val_y, (pred_val_y > best_thresh).astype(int))
    print("val temp best f1 score is {0} and best thresh is {1}".format(val_score_temp, best_thresh))

    thresh_use += best_thresh
    pred_test_y += pred_test_y_temp
    val_score += val_score_temp
    keras.backend.clear_session()

# average the predictions, thresholds and scores over the folds
pred_test_y /= num_splits
thresh_use /= num_splits
val_score /= num_splits

# 'output' collects (predictions, threshold, val score, description) for each model configuration;
# it is assumed to be initialized earlier in the full script (e.g. output = [])
output.append([pred_test_y, thresh_use, val_score, 'lstm atten glove+para'])

Submission:

sub = pd.read_csv('../input/sample_submission.csv')
sub.prediction = (pred_test_y > thresh_use).astype(int)
sub.to_csv("submission.csv", index=False)

Because submissions for this competition had to be made through a Kaggle Kernel, total runtime could not exceed two hours. That makes time budgeting part of the strategy: you may need to adjust the number of cross-validation folds, the number of epochs, and so on to fit within the limit. That is essentially my whole pipeline for this competition; I hope to do better next time.
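A minimal sketch of how one might guard that two-hour budget inside the training loop (my own illustration; the constant names and the safety margin are assumptions, not part of the original code):

import time

KERNEL_BUDGET = 2 * 60 * 60   # Kaggle kernel limit: two hours, in seconds (assumed constant name)
SAFETY_MARGIN = 15 * 60       # reserve time for inference and writing the submission file
start = time.time()

for fold, (dev_index, val_index) in enumerate(skf.split(train_X, train_y)):
    if time.time() - start > KERNEL_BUDGET - SAFETY_MARGIN:
        print("Stopping early after {} folds to stay inside the runtime limit".format(fold))
        break
    # ... train and evaluate this fold as in the loop above ...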
