Customer Preferences in the Age of the Platform Business with the Help of AI
Marketing and product teams are tasked with understanding customers. To do so, they look at customer preferences — motivations, expectations and inclinations — which in combination with customer needs drive their purchasing decisions.
In my years as a data scientist I learned that customers — their preferences and needs — rarely (or never?) fall into simple objective buckets or segmentations we use to make sense of them. Instead, customer preferences and needs are complex, intertwined and constantly changing.
While understanding customers is already challenging enough, many modern digital businesses don’t know much about their products either. They operate digital platforms to facilitate the exchange between producers and consumers. The digital platform business model creates markets and communities with network effects that allow their users to interact and transact. Platform businesses do not control their inventory via a supply chain the way linear businesses do.
Image: Mohamed Hassan from Pixabay

A good way to describe the platform business is that they do not own the means of production; instead, they create the means of connection. Examples of platform businesses are Amazon, Facebook, YouTube, Twitter, eBay, Airbnb, property portals like Zillow, and aggregator businesses such as travel booking websites. Over the last few decades, platform businesses have come to dominate the economy.
How can we use AI to make sense of our customers and products in the age of the platform business?
This blog post is a continuation of my previous discussion on the new gold standard of behavioural data in Marketing:
In this blog post we use a more advanced Deep Neural Network to model customers and products.
The Neural Network Architecture
Icons: ProSymbols, lastspark, Juan Pablo Bravo

We use a deep Neural Network with the following elements:
Encoder: takes input data describing products or customers and maps it into Feature Embeddings. (An embedding is defined as a projection of some input into another more convenient representation space)
Comparator: combines customer and product feature embeddings into a Preferences Tensor.
Predictor: turns the preferences into a predicted purchase propensity.
We use product purchases as the neural network’s prediction target because we know that purchase decisions are driven by a customer’s preferences and needs. Training on this target teaches the encoders to extract those preferences and needs from customer behavioural data and from customer and product attributes.
We can analyse and cluster the learned customer and product features to derive a data-driven segmentation. More on this later.
Photo: Morning Brew on Unsplash

TensorFlow Implementation
The following code uses TensorFlow 2 and Keras to implement our Neural Network architecture:
The code creates TensorFlow feature columns and can use both numerical and categorical features. We use the Keras functional API to define our customer preference neural network, which is compiled with the Adam optimiser and binary cross-entropy as the loss function.
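The original code embed did not survive the export, so below is a minimal sketch of the architecture under stated assumptions: the input features, layer sizes and the element-wise product used as the Comparator are illustrative, and plain Keras Input layers stand in for the feature columns described above:

import tensorflow as tf

# Hypothetical customer and product inputs (a real model would build these
# from TensorFlow feature columns as described above)
customer_inputs = {
    "customer_age": tf.keras.Input(shape=(1,), name="customer_age"),
    "days_since_last_visit": tf.keras.Input(shape=(1,), name="days_since_last_visit"),
}
product_inputs = {
    "product_price": tf.keras.Input(shape=(1,), name="product_price"),
    "product_popularity": tf.keras.Input(shape=(1,), name="product_popularity"),
}

def encoder(inputs, name):
    # Encoder: maps raw attributes into a dense feature embedding
    x = tf.keras.layers.Concatenate()(list(inputs.values()))
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    return tf.keras.layers.Dense(16, name=name)(x)

customer_features = encoder(customer_inputs, "customer_features")
product_features = encoder(product_inputs, "product_features")

# Comparator: element-wise product of the two embeddings gives the Preferences Tensor
preferences = tf.keras.layers.Multiply(name="preferences")([customer_features, product_features])

# Predictor: maps the preferences to a purchase propensity between 0 and 1
propensity = tf.keras.layers.Dense(1, activation="sigmoid", name="propensity")(preferences)

model = tf.keras.Model(
    inputs=list(customer_inputs.values()) + list(product_inputs.values()),
    outputs=propensity,
)
model.compile(optimizer="adam", loss="binary_crossentropy")

The layer names customer_features, product_features and preferences matter: they are the handles we use later to extract the intermediary outputs for clustering.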
Training Data with Spark
We will need training data for our customer preference model. As a platform business, your raw data will fall into the Big Data category. To prepare terabytes of raw data from click streams, product searches and transactions, we use Spark. The challenge is to bridge the two technologies and feed the training data from Spark into TensorFlow.
The best format for large amounts of TensorFlow training data is the TFRecord file format, TensorFlow’s own binary storage format based on Protocol Buffers. The binary format greatly improves the performance of loading data and feeding it into model training. If you were to use, for example, CSV files, you would spend significant compute resources on loading and parsing your data rather than on training your neural network. The TFRecord file format makes sure your data pipeline does not bottleneck your neural network training.
The Spark-TensorFlow connector allows us to save TFRecords with Spark. Simply add it as a JAR to a new Spark session as follows:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .appName(app_name)
    .config("spark.submit.deployMode", "cluster")
    .config("spark.jars.packages", "org.tensorflow:spark-tensorflow-connector_2.11:1.15.0")
    .getOrCreate()
)
and write a Spark DataFrame to TFRecords as follows:
(
    training_feature_df
    .write.mode("overwrite")
    .format("tfrecords")
    .option("recordType", "Example")
    .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
    .save(path)
)
To load the TFRecords with TensorFlow, you define the schema of your records and parse the data set into an iterator of Python dictionaries using the TensorFlow dataset API:
SCHEMA = {"col_name1": tf.io.FixedLenFeature([], tf.string, default_value="Null"),
"col_name2: tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}data = (
tf.data.TFRecordDataset(list_of_file_paths, compression_type="GZIP")
.map(
lambda record: tf.io.parse_single_example(record, SCHEMA),
num_parallel_calls=num_of_workers
)
.batch(num_of_records)
.prefetch(num_of_batches)
)
Batch Scoring with Spark and PandasUDFs
After training our Neural Network there are obvious real-time scoring applications, for example, scoring search results in a product search to address choice paralysis on platforms with thousands or even millions of products.
But there is also an advanced analytics use-case: analysing the product and customer features and preferences for insights, and creating a data-driven segmentation to help with product development. For this, we score our entire customer base and product catalogue and capture the outputs of our model’s Encoders and Comparator for clustering.
To capture the output of intermediary neural network layers, we can reshape our trained TensorFlow model as follows:
trained_customer_preference_model = tf.keras.models.load_model(path)

customer_feature_model = tf.keras.Model(
    inputs=trained_customer_preference_model.input,
    outputs=trained_customer_preference_model.get_layer("customer_features").output
)
We score our users with Spark, using a PandasUDF to score a batch of users at a time for performance reasons:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import numpy as np
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Broadcast the wrapped model so each Spark worker deserialises its own copy
customerFeatureModelWrapper = CustomerFeatureModelWrapper(path)
CUSTOMER_FEATURE_MODEL = spark.sparkContext.broadcast(customerFeatureModelWrapper)

# FEATURE_COL_NAMES and model_input_cols are defined elsewhere to match the model's inputs
@F.pandas_udf("array<float>", F.PandasUDFType.SCALAR)
def customer_features_udf(*cols):
    model_input = dict(zip(FEATURE_COL_NAMES, cols))
    model_output = CUSTOMER_FEATURE_MODEL.value.model.predict(model_input)
    return pd.Series([np.array(v) for v in model_output.tolist()])

(
    customer_df
    .withColumn(
        "customer_features",
        customer_features_udf(*model_input_cols)
    )
)
We have to wrap our TensorFlow model in a wrapper class to allow serialisation, broadcasting across the Spark cluster, and de-serialisation of the model on all workers. I use MLflow to track model artifacts, but you could simply store them on any cloud storage without MLflow. Implement a download function that fetches model artifacts from S3 or wherever you store your model.
class CustomerFeatureModelWrapper(object):
    def __init__(self, model_path):
        self.model_path = model_path
        self.model = self._build(model_path)

    def __getstate__(self):
        # Pickle only the path; the model itself is not serialisable
        return self.model_path

    def __setstate__(self, model_path):
        # Rebuild the model from its artifacts after deserialisation on a worker
        self.model_path = model_path
        self.model = self._build(model_path)

    def _build(self, model_path):
        local_path = download(model_path)
        return tf.keras.models.load_model(local_path)
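The download function itself is not shown in the original; the following is a minimal sketch assuming artifacts live on S3 under a path like s3://bucket/prefix and that boto3 credentials are configured (the helper name and path layout are assumptions, and MLflow’s artifact API would work equally well):

import os
import tempfile

import boto3

def download(model_path):
    # Copy every object under s3://bucket/prefix into a local temp directory
    bucket, _, prefix = model_path.replace("s3://", "").partition("/")
    local_dir = tempfile.mkdtemp()
    s3 = boto3.client("s3")
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("/"):  # skip directory markers
                continue
            target = os.path.join(local_dir, os.path.relpath(obj["Key"], prefix))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3.download_file(bucket, obj["Key"], target)
    return local_dir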
You can read more about how MLflow can help you with your Data Science Projects in my previous article:
Clustering and Segmentation
After scoring our customer base and product inventory with Spark we have a dataframe with feature and preference vectors as follows:
+-----------+---------------------------------------------------+
|product_id |product_features                                   |
+-----------+---------------------------------------------------+
|product_1  |[-0.28878614, 2.026503, 2.352102, -2.010809, ...   |
|product_2  |[0.39889023, -0.06328985, 1.634547, 3.3479023, ... |
+-----------+---------------------------------------------------+
As a first step, we have to create a representative but much smaller sample of customers and products to use in clustering. It is important that you stratify your sample with approximately equal numbers of customers and products per stratum. Commonly, we have many anonymous customers with few customer attributes (such as demographics) to stratify on. In that situation, we can stratify customers by the product attributes of the products they interact with, as a proxy. This follows our general assumption that preferences and needs drive purchase decisions. In Spark, you create a new column with the strata key, get the total counts of customers and products by stratum, and calculate the sampling fraction per stratum so that you sample approximately even counts per stratum. You can use Spark’s DataFrameStatFunctions.sampleBy(col_with_strata_keys, dict_of_sample_fractions, seed) to create a stratified sample.
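A sketch of the fraction calculation, assuming a hypothetical interactions_df with a strata_key column and a target of roughly 5,000 rows per stratum:

n_per_stratum = 5000  # assumed target sample size per stratum

# Count rows per stratum and derive a sampling fraction capped at 1.0
strata_counts = interactions_df.groupBy("strata_key").count().collect()
fractions = {
    row["strata_key"]: min(1.0, n_per_stratum / row["count"])
    for row in strata_counts
}

stratified_sample = interactions_df.stat.sampleBy("strata_key", fractions, seed=42)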
To create our segmentation, we use T-SNE to visualise the high-dimensional feature vectors of our stratified data sample. T-SNE is a stochastic ML algorithm that reduces dimensionality for visualisation purposes in a way that makes similar customers and products cluster together; this is also called a neighbour embedding. We can use additional product attributes to colour the T-SNE results and help interpret the clusters as part of our analysis to generate insights. After we obtain the results from T-SNE, we run DBSCAN on the T-SNE neighbour embeddings to find our clusters.
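A minimal sketch of this step with scikit-learn, assuming the stratified sample has been collected into a pandas DataFrame named products; the perplexity, eps and min_samples values are illustrative and need tuning on real data:

import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

# Stack the sampled feature vectors into a single matrix
features = np.vstack(products["product_features"].values)

# Reduce to 2D neighbour embeddings in which similar items sit close together
embedding_2d = TSNE(n_components=2, perplexity=30).fit_transform(features)

# Density-based clustering on the T-SNE output; the label -1 marks noise points
products["cluster"] = DBSCAN(eps=2.0, min_samples=25).fit_predict(embedding_2d)

This produces the cluster labels used for the centroid calculation below.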
With the cluster labels from the DBSCAN output we can calculate cluster centroids:
centroids = products[["product_features", "cluster"]].groupby(["cluster"])["product_features"].apply(
    lambda x: np.mean(np.vstack(x), axis=0)
)

cluster
0    [0.5143338, 0.56946456, -0.26320028, 0.4439753...
1    [0.42414477, 0.012167327, -0.662183, 1.2258132...
2    [-0.0057945233, 1.2221531, -0.22178105, 1.2349...
...
Name: product_features, dtype: object
After we have obtained our cluster centroids, we assign our entire customer base and product catalogue to their representative clusters, because so far we have only worked with a stratified sample of perhaps 50,000 customers and products.
We use Spark again to assign all our customers and products to their closest cluster centroid. We use the L1 norm (or taxicab distance), d(x, y) = Σᵢ |xᵢ − yᵢ|, to calculate the distance of customers/products to the cluster centroids, because summing absolute differences emphasises per-feature alignment.
from pyspark.sql import Window
from pyspark.sql.types import FloatType

distance_udf = F.udf(
    lambda x, y, i: float(np.linalg.norm(np.array(x) - np.array(y), axis=0, ord=i)),
    FloatType()
)

customer_centroids = spark.read.parquet(path)

customer_clusters = (
    customer_dataframe
    .crossJoin(
        F.broadcast(customer_centroids)
    )
    .withColumn("distance", distance_udf("customer_centroid", "customer_features", F.lit(1)))
    .withColumn("distance_order", F.row_number().over(Window.partitionBy("customer_id").orderBy("distance")))
    .filter("distance_order = 1")
    .select("customer_id", "cluster", "distance")
)

+-----------+-------+---------+
|customer_id|cluster| distance|
+-----------+-------+---------+
| customer_1|      4|13.234212|
| customer_2|      4| 8.194665|
| customer_3|      1|  8.00042|
| customer_4|      3|14.705576|
We can then summarise our customer base to get the cluster prominence:
total_customers = customer_clusters.count()

(
    customer_clusters
    .groupBy("cluster")
    .agg(
        F.count("customer_id").alias("customers"),
        F.avg("distance").alias("avg_distance")
    )
    .withColumn("pct", F.col("customers") / F.lit(total_customers))
)

+-------+---------+------------------+-----+
|cluster|customers|      avg_distance|  pct|
+-------+---------+------------------+-----+
|      0|     xxxx|12.882028355869513| xxxx|
|      5|     xxxx|10.084179072882444| xxxx|
|      1|     xxxx|13.966814632296622| xxxx|
This completes all the steps needed to derive a data-driven segmentation from our Neural Network embeddings.
Read more about segmentation and ways to extract insights from our model in my previous article:
Real-time Scoring
To learn more about how to deploy a model for real-time scoring I recommend my previous article on the topic:
General Notes and Advice
Compared to the collaborative filtering approach in the linked article, the Neural Network learns to generalise, and a trained model can be used with new customers and new products. The Neural Network has no cold-start problem.
If you use at least some behavioural data as input for your customers in addition to historic purchases and other customer profile data, your trained model can make purchase propensity predictions even for new customers without any transactional or customer profile data.
The learned product feature embeddings will cluster into a larger number of distinct clusters than your customer feature embeddings. It is not unusual for most customers to fall into one big cluster. This does NOT mean that 90% of your customers are alike. As described in the introduction, most of your customers have complex, intertwined and changing preferences and needs, which means they cannot be separated into distinct groups; it does not mean they are the same. The simplification of a cluster cannot capture this, which only reiterates the need for machine learning to make sense of customers.
While many stakeholders will love the insights and segmentation the model can produce, the real value of the model is in its ability to predict a purchase propensity.
Jan is a successful thought leader and consultant in the data transformation of companies and has a track record of bringing data science into commercial production usage at scale. He has recently been recognised by dataIQ as one of the 100 most influential data and analytics practitioners in the UK.
Connect on LinkedIn: https://www.linkedin.com/in/janteichmann/
Read other articles: https://medium.com/@jan.teichmann
Original article: https://towardsdatascience.com/customer-preferences-in-the-age-of-the-platform-business-with-the-help-of-ai-98b0eabf42d9