Recommender System Algorithm Summary (3): FM, DNN, and DeepFM
Source: https://blog.csdn.net/qq_23269761/article/details/81366939. If anything here is inappropriate, please feel free to get in touch, thanks~
0. A blog post I cannot recommend enough
The past and present of FM:
https://tracholar.github.io/machine-learning/2017/03/10/factorization-machine.html#%E7%BB%BC%E8%BF%B0
1. The relationship between FM, DNNs, and embeddings
先來復(fù)習(xí)一下FM?
?
?
對(duì)FM模型進(jìn)行求解后,對(duì)于每一個(gè)特征xi都能夠得到對(duì)應(yīng)的隱向量vi,那么這個(gè)vi到底是什么呢?
想一想Google提出的word2vec,word2vec是word embedding方法的一種,word embedding的意思就是,給出一個(gè)文檔,文檔就是一個(gè)單詞序列,比如 “A B A C B F G”, 希望對(duì)文檔中每個(gè)不同的單詞都得到一個(gè)對(duì)應(yīng)的向量(往往是低維向量)表示。比如,對(duì)于這樣的“A B A C B F G”的一個(gè)序列,也許我們最后能得到:A對(duì)應(yīng)的向量為[0.1 0.6 -0.5],B對(duì)應(yīng)的向量為[-0.2 0.9 0.7] 。
所以結(jié)論就是:?
FM算法是一個(gè)特征組合以及降維的工具,它能夠?qū)⒃疽驗(yàn)閛ne-hot編碼產(chǎn)生的稀疏特征,進(jìn)行兩兩組合后還能做一個(gè)降維!!降到多少維呢?就是FM中隱因子的個(gè)數(shù)k
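To make the latent vectors concrete, here is a minimal numpy sketch of FM scoring a sparse one-hot sample (all names and numbers are invented for illustration):

```python
import numpy as np

np.random.seed(0)
n, k = 10, 4                       # n one-hot features, k latent factors
w0, w = 0.1, np.random.randn(n)    # global bias and first-order weights
V = np.random.randn(n, k)          # one k-dimensional latent vector per feature

x = np.zeros(n)
x[[1, 5, 7]] = 1.0                 # a sparse one-hot-encoded sample

# second-order term: sum over pairs i < j of <v_i, v_j> * x_i * x_j
pairwise = sum(V[i] @ V[j] * x[i] * x[j]
               for i in range(n) for j in range(i + 1, n))
y_hat = w0 + w @ x + pairwise
print(y_hat)
```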
2. FNN
FNN uses FM as pre-training to obtain the embeddings, then trains a DNN on top (architecture figure omitted).
Such a model takes high-order features into account, but the final sigmoid output ignores the low-order features themselves.
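Here is a minimal numpy sketch of an FNN-style forward pass (shapes and names are invented for illustration; in a real FNN the table V would come from a pre-trained FM):

```python
import numpy as np

np.random.seed(0)
n, k, F = 100, 8, 5                # n one-hot features, k factors, F fields
V = np.random.randn(n, k)          # stand-in for FM-pretrained latent vectors

feat_index = np.array([3, 27, 41, 66, 90])   # one active feature per field

# embedding layer initialized from FM: concatenate the field embeddings
h = V[feat_index].reshape(-1)      # shape (F*k,)

# a small MLP on top (weights random here; trained in practice)
W1, b1 = np.random.randn(F * k, 32), np.zeros(32)
W2, b2 = np.random.randn(32, 1), np.zeros(1)
h = np.maximum(h @ W1 + b1, 0)                # ReLU hidden layer
p = 1 / (1 + np.exp(-(h @ W2 + b2)))          # sigmoid: only high-order signal
print(p)
```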
3. DeepFM
Given the above, many of the newer deep-learning CTR models consider the wide and deep sides (i.e., low-order and high-order features) simultaneously to further improve generalization; DeepFM is one example.
Reference blog: https://blog.csdn.net/zynash2/article/details/79348540
(figure: DeepFM architecture, omitted)
As you can see, the model splits into two parts overall: FM and DNN. Briefly, the flow is: borrowing FNN's idea, FM is used for the embedding, and the wide and deep parts then share the embedded result. The DNN input is exactly the same as in FNN (except there is no pre-training here; the embedding layer is treated directly as one NN layer), while on the wide side a particular combination makes the model exactly reproduce FM (the paper does not derive why in detail; the derivation is given later in this post). Finally the DNN and FM outputs are combined and passed through the output activation.
需要著重強(qiáng)調(diào)理解的時(shí)模型中關(guān)于FM的部分,究竟時(shí)如何搭建網(wǎng)絡(luò)計(jì)算2階特征的?
**劃重點(diǎn):**embedding層對(duì)于DNN來說時(shí)在提取特征,對(duì)于FM來說就是他的2階特征啊!!!!只不過FM和DNN共享embedding層而已。
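The sharing can be sketched in a few lines of numpy (invented shapes, randomly initialized weights; a sketch of the idea, not the repo's code):

```python
import numpy as np

np.random.seed(1)
n, k, F = 50, 4, 3
V = np.random.randn(n, k)          # the ONE embedding table both parts share
w = np.random.randn(n)             # FM first-order weights

feat_index = np.array([2, 17, 33])       # one active feature per field
feat_value = np.array([1.0, 1.0, 0.7])   # 1 for categorical, raw for numeric

emb = V[feat_index] * feat_value[:, None]        # v_i * x_i, shape (F, k)

# FM part: first order + (sum-square minus square-sum) second order
y_fm = w[feat_index] @ feat_value \
     + 0.5 * np.sum(emb.sum(axis=0) ** 2 - (emb ** 2).sum(axis=0))

# Deep part: the very same emb, flattened, fed into an MLP
W1, b1 = np.random.randn(F * k, 16), np.zeros(16)
W2, b2 = np.random.randn(16, 1), np.zeros(1)
h = np.maximum(emb.reshape(-1) @ W1 + b1, 0)
y_dnn = (h @ W2 + b2)[0]

p = 1 / (1 + np.exp(-(y_fm + y_dnn)))            # combined output
print(p)
```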
4. DeepFM code walkthrough
The code:
https://github.com/ChenglongChen/tensorflow-DeepFM
The data:
https://www.kaggle.com/c/porto-seguro-safe-driver-prediction
4.0 項(xiàng)目目錄
data: the training and test data
output/fig: outputs and training curves
config: parameters for data loading and feature engineering
DataReader: feature engineering; builds the feature set actually used for training
main: the program entry point
metrics: defines the normalized Gini coefficient used as the evaluation metric (see the sketch after this list)
DeepFM: the model definition
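For reference, a sketch of the normalized Gini in the usual Kaggle style (an assumption about its contents; not necessarily byte-for-byte what the repo's metrics.py does):

```python
import numpy as np

def gini(actual, pred):
    a = np.asarray(actual, dtype=float)
    # sort by prediction (descending), breaking ties by original order
    order = np.lexsort((np.arange(len(a)), -np.asarray(pred, dtype=float)))
    a = a[order]
    # cumulative captured positives vs. the expectation of a random ordering
    g = np.cumsum(a).sum() / a.sum() - (len(a) + 1) / 2.0
    return g / len(a)

def gini_norm(actual, pred):
    # normalize by the Gini of a perfect ranking
    return gini(actual, pred) / gini(actual, actual)

# e.g. gini_norm([0, 0, 1, 1], [0.1, 0.4, 0.3, 0.9]) -> 0.5
```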
4.1 Overall flow
First, a recommended EDA write-up for this dataset; reading it gives a good picture of the data as a whole:
https://blog.csdn.net/qq_37195507/article/details/78553581
- 1._load_data()
```python
import numpy as np
import pandas as pd

import config

def _load_data():
    dfTrain = pd.read_csv(config.TRAIN_FILE)
    dfTest = pd.read_csv(config.TEST_FILE)

    def preprocess(df):
        cols = [c for c in df.columns if c not in ["id", "target"]]
        # count of -1 entries per row (this dataset marks missing values as -1)
        df["missing_feat"] = np.sum((df[cols] == -1).values, axis=1)
        df["ps_car_13_x_ps_reg_03"] = df["ps_car_13"] * df["ps_reg_03"]
        return df

    dfTrain = preprocess(dfTrain)
    dfTest = preprocess(dfTest)

    cols = [c for c in dfTrain.columns if c not in ["id", "target"]]
    cols = [c for c in cols if c not in config.IGNORE_COLS]

    X_train = dfTrain[cols].values
    y_train = dfTrain["target"].values
    X_test = dfTest[cols].values
    ids_test = dfTest["id"].values
    cat_features_indices = [i for i, c in enumerate(cols) if c in config.CATEGORICAL_COLS]

    return dfTrain, dfTest, X_train, y_train, X_test, ids_test, cat_features_indices
```
First the raw data files TRAIN_FILE and TEST_FILE are read.
preprocess(df) adds two features: missing_feat [the number of missing entries per row] and ps_car_13_x_ps_reg_03 [the product of two features].
It returns:
dfTrain, dfTest: DataFrames with all the features present
X_train, X_test: ndarrays with the IGNORE_COLS dropped [X_test is never actually used afterwards]
y_train: the labels
ids_test: the test-set ids, as an ndarray
cat_features_indices: the indices of the categorical features
- X_train and y_train are split with stratified K-fold cross-validation (a minimal sketch follows below)
- The DeepFM parameters are set
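A minimal sketch of that split (the parameter values here are made up; the repo reads them from config):

```python
from sklearn.model_selection import StratifiedKFold

folds = list(StratifiedKFold(n_splits=3, shuffle=True,
                             random_state=2017).split(X_train, y_train))
```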
- 2._run_base_model_dfm
```python
def _run_base_model_dfm(dfTrain, dfTest, folds, dfm_params):
    fd = FeatureDictionary(dfTrain=dfTrain, dfTest=dfTest,
                           numeric_cols=config.NUMERIC_COLS,
                           ignore_cols=config.IGNORE_COLS)
    data_parser = DataParser(feat_dict=fd)
    Xi_train, Xv_train, y_train = data_parser.parse(df=dfTrain, has_label=True)
    Xi_test, Xv_test, ids_test = data_parser.parse(df=dfTest)

    dfm_params["feature_size"] = fd.feat_dim
    dfm_params["field_size"] = len(Xi_train[0])

    y_train_meta = np.zeros((dfTrain.shape[0], 1), dtype=float)
    y_test_meta = np.zeros((dfTest.shape[0], 1), dtype=float)
    _get = lambda x, l: [x[i] for i in l]
    gini_results_cv = np.zeros(len(folds), dtype=float)
    gini_results_epoch_train = np.zeros((len(folds), dfm_params["epoch"]), dtype=float)
    gini_results_epoch_valid = np.zeros((len(folds), dfm_params["epoch"]), dtype=float)

    for i, (train_idx, valid_idx) in enumerate(folds):
        Xi_train_, Xv_train_, y_train_ = _get(Xi_train, train_idx), _get(Xv_train, train_idx), _get(y_train, train_idx)
        Xi_valid_, Xv_valid_, y_valid_ = _get(Xi_train, valid_idx), _get(Xv_train, valid_idx), _get(y_train, valid_idx)

        dfm = DeepFM(**dfm_params)
        dfm.fit(Xi_train_, Xv_train_, y_train_, Xi_valid_, Xv_valid_, y_valid_)

        y_train_meta[valid_idx, 0] = dfm.predict(Xi_valid_, Xv_valid_)
        y_test_meta[:, 0] += dfm.predict(Xi_test, Xv_test)

        gini_results_cv[i] = gini_norm(y_valid_, y_train_meta[valid_idx])
        gini_results_epoch_train[i] = dfm.train_result
        gini_results_epoch_valid[i] = dfm.valid_result

    y_test_meta /= float(len(folds))   # average the per-fold test predictions

    # save result
    if dfm_params["use_fm"] and dfm_params["use_deep"]:
        clf_str = "DeepFM"
    elif dfm_params["use_fm"]:
        clf_str = "FM"
    elif dfm_params["use_deep"]:
        clf_str = "DNN"
    print("%s: %.5f (%.5f)" % (clf_str, gini_results_cv.mean(), gini_results_cv.std()))
    filename = "%s_Mean%.5f_Std%.5f.csv" % (clf_str, gini_results_cv.mean(), gini_results_cv.std())
    _make_submission(ids_test, y_test_meta, filename)
    _plot_fig(gini_results_epoch_train, gini_results_epoch_valid, clf_str)

    return y_train_meta, y_test_meta
```
經(jīng)過?
DataReader中的FeatureDictionary?
這個(gè)對(duì)象中有一個(gè)self.feat_dict屬性,長(zhǎng)下面這個(gè)樣子:
- ?
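Judging from how FeatureDictionary assigns indices (a numeric column takes a single index, a categorical column takes one index per distinct value), a hypothetical excerpt looks like:

```python
# hypothetical excerpt of fd.feat_dict (indices made up for illustration)
feat_dict = {
    "ps_reg_03": 0,                              # numeric col -> one index
    "ps_car_13": 1,                              # numeric col -> one index
    "ps_ind_02_cat": {1: 2, 2: 3, 3: 4, 4: 5},   # categorical -> index per value
    # ... one entry per remaining (non-ignored) column
}
```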
Next, DataParser in DataReader:
```python
class DataParser(object):
    def __init__(self, feat_dict):
        self.feat_dict = feat_dict  # a FeatureDictionary instance

    def parse(self, infile=None, df=None, has_label=False):
        assert not ((infile is None) and (df is None)), "infile or df at least one is set"
        assert not ((infile is not None) and (df is not None)), "only one can be set"
        if infile is None:
            dfi = df.copy()
        else:
            dfi = pd.read_csv(infile)
        if has_label:
            y = dfi["target"].values.tolist()
            dfi.drop(["id", "target"], axis=1, inplace=True)
        else:
            ids = dfi["id"].values.tolist()
            dfi.drop(["id"], axis=1, inplace=True)
        # dfi for feature index
        # dfv for feature value which can be either binary (1/0) or float (e.g., 10.24)
        dfv = dfi.copy()
        for col in dfi.columns:
            if col in self.feat_dict.ignore_cols:
                dfi.drop(col, axis=1, inplace=True)
                dfv.drop(col, axis=1, inplace=True)
                continue
            if col in self.feat_dict.numeric_cols:
                # numeric column: fixed feature index; dfv keeps the raw value
                dfi[col] = self.feat_dict.feat_dict[col]
            else:
                # categorical column: map each raw value to its feature index,
                # and set the value to 1 (the "hot" position of the one-hot)
                dfi[col] = dfi[col].map(self.feat_dict.feat_dict[col])
                dfv[col] = 1.

        # dfi.to_csv('dfi.csv')
        # dfv.to_csv('dfv.csv')

        # list of lists of feature indices for each sample in the dataset
        Xi = dfi.values.tolist()
        # list of lists of feature values for each sample in the dataset
        Xv = dfv.values.tolist()
        if has_label:
            return Xi, Xv, y
        else:
            return Xi, Xv, ids
```
Here Xi and Xv are both two-dimensional lists. You can dump dfi and dfv to csv files to see what they look like; the format looks odd at first [presumably it is what the model needs~].
dfi: the values are feature indices, i.e. the values stored in the feat_dict attribute above.
dfv: numeric variables keep their original value; categorical variables get the value 1.
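A tiny hypothetical example: suppose ps_reg_03 is numeric with index 0, and ps_ind_02_cat is categorical with the mapping {1: 2, 2: 3}. For a row with ps_reg_03 = 0.7 and ps_ind_02_cat = 2, parse() would produce:

```python
Xi_row = [0, 3]      # numeric col keeps its fixed index; category 2 maps to index 3
Xv_row = [0.7, 1.0]  # numeric col keeps its raw value; categorical becomes 1.0
```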
4.2 Model architecture
```python
def _init_graph(self):
    self.graph = tf.Graph()
    with self.graph.as_default():
        tf.set_random_seed(self.random_seed)

        self.feat_index = tf.placeholder(tf.int32, shape=[None, None],
                                         name="feat_index")  # None * F
        self.feat_value = tf.placeholder(tf.float32, shape=[None, None],
                                         name="feat_value")  # None * F
        self.label = tf.placeholder(tf.float32, shape=[None, 1], name="label")  # None * 1
        self.dropout_keep_fm = tf.placeholder(tf.float32, shape=[None], name="dropout_keep_fm")
        self.dropout_keep_deep = tf.placeholder(tf.float32, shape=[None], name="dropout_keep_deep")
        self.train_phase = tf.placeholder(tf.bool, name="train_phase")

        self.weights = self._initialize_weights()

        # model
        self.embeddings = tf.nn.embedding_lookup(self.weights["feature_embeddings"],
                                                 self.feat_index)  # None * F * K
        # self.weights["feature_embeddings"] has shape [259, 8]: n*k, one latent vector per feature
        # self.embeddings has shape [?, 39, 8]: F*K, one latent vector looked up per field
        # [this is not FFM's per-field vectors; looking up one per field just picks the
        #  non-zero features, which saves computation]
        feat_value = tf.reshape(self.feat_value, shape=[-1, self.field_size, 1])
        # feat_value has shape [?, 39, 1]: the 39 feature values of one sample
        self.embeddings = tf.multiply(self.embeddings, feat_value)
        # tf.multiply broadcasts when one of the dimensions differs;
        # self.embeddings now has shape [?, 39, 8]
        # After this multiply the tensor holds v_i * x_i, so the later computation can use
        # <v_i, v_j> * x_i * x_j = <v_i * x_i, v_j * x_j>; FM is then simplified into the
        # (sum_square part - square_sum part) form, which the multiply form above fits nicely.

        # ---------- first order term ----------
        self.y_first_order = tf.nn.embedding_lookup(self.weights["feature_bias"], self.feat_index)  # None * F * 1
        self.y_first_order = tf.reduce_sum(tf.multiply(self.y_first_order, feat_value), 2)  # None * F
        self.y_first_order = tf.nn.dropout(self.y_first_order, self.dropout_keep_fm[0])  # None * F

        # ---------- second order term ---------------
        # sum_square part
        self.summed_features_emb = tf.reduce_sum(self.embeddings, 1)  # None * K
        self.summed_features_emb_square = tf.square(self.summed_features_emb)  # None * K
        # square_sum part
        self.squared_features_emb = tf.square(self.embeddings)
        self.squared_sum_features_emb = tf.reduce_sum(self.squared_features_emb, 1)  # None * K
        # second order
        self.y_second_order = 0.5 * tf.subtract(self.summed_features_emb_square, self.squared_sum_features_emb)  # None * K
        self.y_second_order = tf.nn.dropout(self.y_second_order, self.dropout_keep_fm[1])  # None * K

        # ---------- Deep component ----------
        self.y_deep = tf.reshape(self.embeddings, shape=[-1, self.field_size * self.embedding_size])  # None * (F*K)
        self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep[0])
        for i in range(0, len(self.deep_layers)):
            self.y_deep = tf.add(tf.matmul(self.y_deep, self.weights["layer_%d" % i]), self.weights["bias_%d" % i])  # None * layer[i]
            if self.batch_norm:
                self.y_deep = self.batch_norm_layer(self.y_deep, train_phase=self.train_phase, scope_bn="bn_%d" % i)
            self.y_deep = self.deep_layers_activation(self.y_deep)
            self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep[1 + i])  # dropout at each Deep layer

        # ---------- DeepFM ----------
        if self.use_fm and self.use_deep:
            concat_input = tf.concat([self.y_first_order, self.y_second_order, self.y_deep], axis=1)
        elif self.use_fm:
            concat_input = tf.concat([self.y_first_order, self.y_second_order], axis=1)
        elif self.use_deep:
            concat_input = self.y_deep
        self.out = tf.add(tf.matmul(concat_input, self.weights["concat_projection"]), self.weights["concat_bias"])
```
At first it is not obvious why this code makes FM look so complicated, but the complexity has a reason: it avoids ever materializing the huge matrix that one-hot encoding would produce.
In essence, the embedding layer is the latent-vector matrix [feature_size * k] that the Deep part and FM share.
So the heart of this implementation is the embedding layer: everything is driven by the two much smaller matrices Xi and Xv [n * field]. Note that field here is not the F of FFM; it is the number of features before one-hot encoding.
根據(jù)內(nèi)積的公式我們可以得到