Kaggle: Quora Insincere Questions Classification

Published: 2023/12/16

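Problem Description:

An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on so that its platform remains a place where users can feel safe sharing their knowledge with the world.

Quora is a platform that lets people learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions: those founded upon false premises, or those that intend to make a statement rather than look for helpful answers.

In this competition, Kagglers will develop models that identify and flag insincere questions. To date, Quora has used both machine learning and manual review to address this problem. With your help, they can develop more scalable methods to detect toxic and misleading content.

Here is your chance to combat online trolls at scale. Help Quora uphold its "Be Nice, Be Respectful" policy and continue to be a place for sharing and growing the world's knowledge.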

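Important Note:
Be aware that this is being run as a Kernels Only Competition, which requires that all submissions be made via a Kernel output. Please read the Kernels FAQ and the data page carefully to fully understand how this is designed.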

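Data Description

In this competition you will predict whether a question asked on Quora is sincere or not.

An insincere question is defined as a question intended to make a statement rather than look for helpful answers. Some characteristics that can signify that a question is insincere:

Has a non-neutral tone
- Has an exaggerated tone to underscore a point about a group of people
- Is rhetorical and meant to imply a statement about a group of people
Is disparaging or inflammatory
- Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
- Makes disparaging attacks/insults against a specific person or group of people
- Is based on an outlandish premise about a group of people
- Disparages a characteristic that is not fixable and not measurable
Isn't grounded in reality
- Is based on false information, or contains absurd assumptions
Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers

The training data includes the question that was asked and whether it was identified as insincere (target = 1). The ground-truth labels contain some noise: they are not guaranteed to be perfect.

Note that the distribution of questions in the dataset should not be taken as representative of the distribution of questions asked on Quora. This is, in part, because of a combination of sampling procedures and sanitization measures that have been applied to the final dataset.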

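Data fields

qid - unique question identifier
question_text - Quora question text
target - a question labeled "insincere" has a value of 1, otherwise 0

This is a Kernels-only competition. The files in this Data section are downloadable for reference in Stage 1. Stage 2 files will only be available in Kernels and cannot be downloaded.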

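What will be available in the second stage of the competition?

In the second stage of the competition, we will re-run your selected Kernels. The following files will be swapped with new data:

test.csv - This will be swapped with the complete public and private test dataset. The file has ~56k rows in stage 1 and ~376k rows in stage 2. The public leaderboard data remains the same for both versions. The file name will be the same (both test.csv) to ensure that your code will run.
sample_submission.csv - Similar to test.csv, this will change from ~56k rows in stage 1 to ~376k rows in stage 2. The file name will remain the same.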

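Embeddings

External data sources are not allowed for this competition. We are, however, providing a number of word embeddings along with the dataset that can be used in the models. These are as follows: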

GoogleNews-vectors-negative300 - https://code.google.com/archive/p/word2vec/
glove.840B.300d - https://nlp.stanford.edu/projects/glove/
paragram_300_sl999 - https://cogcomp.org/page/resource_view/106
wiki-news-300d-1M - https://fasttext.cc/docs/en/english-vectors.html

import os
import time
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
import math
from sklearn.model_selection import train_test_split
from sklearn import metrics

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, CuDNNGRU, Conv1D
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")
print("Train shape : ", train_df.shape)
print("Test shape : ", test_df.shape)

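The next steps are as follows:

Split the training dataset into train and val samples. Cross validation is a time-consuming process, so let us do a simple train/val split.
Fill the missing values in the text column with '_na_'.
Tokenize the text column and convert it into sequences of word indices.
Pad the sequences as needed: if the number of words in a text is greater than 'maxlen', truncate it to 'maxlen'; if it is smaller than 'maxlen', fill the remaining positions with zeros.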

## split to train and val
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=2018)

## some config values
embed_size = 300  # how big is each word vector
max_features = 50000  # how many unique words to use (i.e. num rows in embedding matrix)
maxlen = 100  # max number of words in a question to use

## fill up the missing values
train_X = train_df["question_text"].fillna("_na_").values
val_X = val_df["question_text"].fillna("_na_").values
test_X = test_df["question_text"].fillna("_na_").values

## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
val_X = tokenizer.texts_to_sequences(val_X)
test_X = tokenizer.texts_to_sequences(test_X)

## Pad the sentences
train_X = pad_sequences(train_X, maxlen=maxlen)
val_X = pad_sequences(val_X, maxlen=maxlen)
test_X = pad_sequences(test_X, maxlen=maxlen)

## Get the target values
train_y = train_df['target'].values
val_y = val_df['target'].values
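To make the padding step concrete, here is a minimal pure-Python sketch of what pad_sequences does to a single sequence with its default settings (padding and truncation both applied at the front). The helper name pad_pre and the toy sequences are made up for illustration; the real work in this kernel is done by Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the default pad_sequences behaviour for one sequence:
    keep the last `maxlen` tokens, then left-pad with `value`."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    seq = list(seq)[-maxlen:]                     # front truncation drops the earliest tokens
    return [value] * (maxlen - len(seq)) + seq    # front padding puts the zeros first


short = pad_pre([5, 8, 2], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 5, 8, 2]
print(long_)  # [3, 4, 5, 6, 7]
```

Because the zeros go in front, the actual words of a short question always sit at the end of the sequence.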

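Without Pretrained Embeddings:

Now that we are done with all the necessary preprocessing steps, we can first train a Bidirectional GRU model. We will not use any pretrained word embeddings for this model; the embeddings will be learned from scratch. Please check out the model summary for details of the layers used.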

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size)(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())

Train the model using the train sample and monitor the metric on the validation sample. This is just a sample model that runs for 2 epochs. Changing the epochs, batch_size, and model parameters might give us a better model.

## Train the model
model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

Now let us get the validation sample predictions and find the best threshold for the F1 score.

pred_noemb_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_noemb_val_y>thresh).astype(int))))
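The loop above prints one F1 score per threshold and leaves picking the best one to the reader. As a rough sketch of the same scan that returns the best threshold directly, the pure-Python version below may help; f1_at and best_threshold are hypothetical helper names, and the four labels and probabilities are invented toy values, not results from this kernel.

```python
def f1_at(y_true, y_prob, thresh):
    """F1 of the positive class when predicting 1 for y_prob > thresh."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_prob):
        pred = 1 if p > thresh else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, y_prob, lo=0.10, hi=0.50, step=0.01):
    """Scan thresholds lo, lo+step, ..., hi and return the first one with the highest F1."""
    n = int(round((hi - lo) / step)) + 1
    candidates = [round(lo + i * step, 2) for i in range(n)]
    return max(candidates, key=lambda t: f1_at(y_true, y_prob, t))


# Invented toy data: two sincere (0) and two insincere (1) questions.
y_true = [0, 0, 1, 1]
y_prob = [0.05, 0.20, 0.30, 0.80]
best = best_threshold(y_true, y_prob)
print(best, f1_at(y_true, y_prob, best))
```

On real validation predictions the same idea applies unchanged; only the inputs differ.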

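Now let us get the test set predictions and save them.

pred_noemb_test_y = model.predict([test_X], batch_size=1024, verbose=1)

Now that model building is done, it is probably a good idea to free some memory before we go on to the next step.

del model, inp, x
import gc; gc.collect()
time.sleep(10)

So we have a baseline GRU model without pretrained embeddings. Now let us use the provided embeddings and rebuild the model to compare performance.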

!ls ../input/embeddings/

We have four different types of embeddings.

GoogleNews-vectors-negative300 - https://code.google.com/archive/p/word2vec/
glove.840B.300d - https://nlp.stanford.edu/projects/glove/
paragram_300_sl999 - https://cogcomp.org/page/resource_view/106
wiki-news-300d-1M - https://fasttext.cc/docs/en/english-vectors.html

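A very good explanation of the different types of embeddings is given in this kernel. Please refer to it for more details.

Glove Embeddings:
In this section, let us use the GloVe embeddings and rebuild the GRU model.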

EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))

# list() is needed because recent NumPy versions reject a dict view in np.stack
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
# random init, so words missing from the embedding file still get a vector
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

pred_glove_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_glove_val_y>thresh).astype(int))))

The results seem to be better than those of the model without pretrained embeddings.

pred_glove_test_y = model.predict([test_X], batch_size=1024, verbose=1)

del word_index, embeddings_index, all_embs, embedding_matrix, model, inp, x
import gc; gc.collect()
time.sleep(10)
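The embedding-matrix construction used in this section can be hard to follow at full scale, so here is a toy-sized sketch of the same idea in pure Python: parse "word v1 v2 ..." lines, start from random vectors, and overwrite the rows of words that have a pretrained vector. The two 3-dimensional vectors, the word_index dictionary, and the fixed mean and standard deviation are all made up for illustration; the kernel itself uses NumPy and the real embedding files.

```python
import io
import random

# Two made-up 3-dimensional vectors in the "word v1 v2 ..." text format.
fake_file = io.StringIO("hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")


def get_coefs(word, *arr):
    """Split one parsed line into (word, vector)."""
    return word, [float(v) for v in arr]


embeddings_index = dict(get_coefs(*line.rstrip().split(" ")) for line in fake_file)

word_index = {"hello": 1, "quora": 2, "world": 3}  # hypothetical tokenizer output
max_features = 3   # rows 0..2 only, so "world" (index 3) is skipped
embed_size = 3

# Random initialisation (mean 0, std 1 assumed here) covers words with no pretrained vector.
rng = random.Random(0)
embedding_matrix = [
    [rng.gauss(0.0, 1.0) for _ in range(embed_size)] for _ in range(max_features)
]

for word, i in word_index.items():
    if i >= max_features:
        continue
    vec = embeddings_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec  # overwrite the random row with the pretrained vector

print(embedding_matrix[1])  # [0.1, 0.2, 0.3]
```

Row 1 ends up holding the pretrained vector for "hello", while row 0 (the padding index) and row 2 ("quora", absent from the toy file) keep their random initial values.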

Wiki News FastText Embeddings:

Now let us use the FastText embeddings trained on the Wiki News corpus in place of the GloVe embeddings and rebuild the model.

EMBEDDING_FILE = '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

# len(o) > 100 skips the short header line of the .vec file
embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE) if len(o)>100)

# list() is needed because recent NumPy versions reject a dict view in np.stack
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

pred_fasttext_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_fasttext_val_y>thresh).astype(int))))

pred_fasttext_test_y = model.predict([test_X], batch_size=1024, verbose=1)

del word_index, embeddings_index, all_embs, embedding_matrix, model, inp, x
import gc; gc.collect()
time.sleep(10)

Paragram Embeddings:

In this section, we use the Paragram embeddings to build the model and make predictions.

EMBEDDING_FILE = '../input/embeddings/paragram_300_sl999/paragram_300_sl999.txt'

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

# errors='ignore' drops bytes that are not valid UTF-8; len(o) > 100 skips short lines
embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8", errors='ignore') if len(o)>100)

# list() is needed because recent NumPy versions reject a dict view in np.stack
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

pred_paragram_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_paragram_val_y>thresh).astype(int))))

pred_paragram_test_y = model.predict([test_X], batch_size=1024, verbose=1)

del word_index, embeddings_index, all_embs, embedding_matrix, model, inp, x
import gc; gc.collect()
time.sleep(10)

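Observations:

Overall, pretrained embeddings seem to give better results than the model trained without them.
The performance of the different pretrained embeddings is almost the same.

Final Blend:

Though the results of the models with different pretrained embeddings are similar, there is a good chance that they capture different types of information from the data. So let us blend these three models by averaging their predictions.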

pred_val_y = 0.33*pred_glove_val_y + 0.33*pred_fasttext_val_y + 0.34*pred_paragram_val_y
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_val_y>thresh).astype(int))))

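The result seems to be better than that of any individual pretrained model, so let us create the submission file using this model blend.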

pred_test_y = 0.33*pred_glove_test_y + 0.33*pred_fasttext_test_y + 0.34*pred_paragram_test_y
pred_test_y = (pred_test_y>0.35).astype(int)
out_df = pd.DataFrame({"qid":test_df["qid"].values})
out_df['prediction'] = pred_test_y
out_df.to_csv("submission.csv", index=False)
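For readers who want to see the blend-then-threshold step in isolation, the following pure-Python sketch reproduces it on two invented questions. The helper names blend and write_submission, the qid values, and the probabilities are all hypothetical; the weights 0.33/0.33/0.34 and the 0.35 cut-off are the ones used in the code above.

```python
import csv
import io


def blend(preds, weights):
    """Weighted average of several prediction lists of equal length."""
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, preds)) / total
        for i in range(len(preds[0]))
    ]


def write_submission(qids, probs, thresh, fh):
    """Write qid,prediction rows, predicting 1 when prob > thresh."""
    writer = csv.writer(fh, lineterminator="\n")
    writer.writerow(["qid", "prediction"])
    for qid, p in zip(qids, probs):
        writer.writerow([qid, int(p > thresh)])


# Invented probabilities from three hypothetical models for two questions.
glove = [0.10, 0.90]
fasttext = [0.20, 0.70]
paragram = [0.30, 0.50]
blended = blend([glove, fasttext, paragram], [0.33, 0.33, 0.34])

buf = io.StringIO()  # stands in for submission.csv
write_submission(["q1", "q2"], blended, 0.35, buf)
print(buf.getvalue())
```

Dividing by the sum of the weights keeps the blend a proper average even if the weights do not add up to exactly 1.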
