PinnerSage Model
Aditya Pal | Applied Science, Chantat Eksombatchai | Applied Science, Yitong Zhou | User Understanding, Bo Zhao | User Understanding, Charles Rosenberg | Applied Science, Jure Leskovec | Applied Science
As we build a visual discovery engine that powers 2B+ Pins, it’s crucial to understand user interests and preferences in order to serve relevant content. One standard approach to encoding user preferences is an embedding-based representation in a high-dimensional space. Most prior methods tried at Pinterest infer a single high-dimensional embedding for each user that is compatible with the content embeddings. This is a good starting point, but it falls short of delivering a full understanding of the user.
In this work, we postulate that a single embedding is not sufficient for encoding the multiple facets of a user’s interests, which might have no obvious linkage between them. These interests can evolve, with some persisting long term while others span a short time period. Recommended items are also represented in the same embedding space. A good user embedding must encode a user’s multiple tastes, interests, styles, etc., whereas a recommended item (a video, an image, a news article, a house listing, a Pin, etc.) typically has only a single focus. Hence it becomes important to represent a user with multiple embeddings, with each embedding capturing a specific aspect of their interests.
PinnerSage Model
In order to better understand our users’ preferences, we developed PinnerSage, a highly scalable, flexible, and extensible recommender system that internally represents each user with multiple embeddings. Figure 1 provides an end-to-end overview of the PinnerSage recommendation model. The starting point for our model is to organize the repins and clicks of a user into multiple interest clusters by running the Ward clustering model, and then to generate a summary of each of those clusters using a medoid, an embedding, and a cluster importance score. Next, the online cluster selection step picks a subset of these clusters and uses a nearest-neighbor index to generate recommendations for the user. Users’ actions are processed in real time to update the interest clusters. In order for PinnerSage to provide relevant recommendations to our 400M+ monthly active users and adapt in real time, we made several model design choices that we describe next.
Design Choice 1: Pin Embeddings are Fixed
The interest clusters in Figure 1 are generated by clustering the embeddings of a user’s repins and clicks. These embeddings are trained via the PinSage model, which optimizes for contextual and visual similarity between Pins using a graph convolutional model. Since our goal is to project users into the same space as the Pin embeddings, we consider the Pin embeddings to be fixed. This design choice simplifies our models considerably and allows us to run inference pipelines in parallel for each user.
Joint embedding inference models, where both user and Pin embeddings are inferred together, can be too complex and hard to scale. Moreover, we posit that in practice they compromise recommendation relevance, as some spurious connections between pins can be established via the users. To see this point, consider the example in Figure 2.
Figure 2: Three interests of a given user.
In the example in Figure 2, a user is interested in painting, shoes, and sci-fi. Jointly learned user and Pin embeddings would bring the Pin embeddings on these disparate topics closer together, which can compromise the relevance of a nearest-neighbor-based recommender. Pin embeddings should only operate on the underlying principle of bringing similar pins closer while keeping the rest of the pins as far apart as possible. For this reason, we use PinSage, which precisely achieves this objective without any dilution.
Design Choice 2: Unlimited User Embeddings
Prior work either fixes the number of embeddings to a small number or puts an upper bound on them. At best, such restrictions hinder developing a full understanding of the users and, at worst, they merge different concepts together, leading to bad recommendations. For example, merging embeddings could yield an embedding that lies in a very different region. Figure 2 shows that a merger of three disparate pin embeddings results in an embedding that is best represented by the concept "energy boosting breakfast". Needless to say, recommendations based on such a merger can be problematic.
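The merging problem can be reproduced with a few toy vectors. The sketch below is only an illustration: the 3-dimensional vectors, the catalog, and the helper names `cosine` and `nearest` are invented for this example and are not real PinSage embeddings. It averages three unrelated interest vectors and looks up the nearest catalog item by cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Toy, hand-picked vectors (assumption: not real PinSage embeddings).
catalog = {
    "painting": (1.0, 0.0, 0.0),
    "shoes": (0.0, 1.0, 0.0),
    "sci-fi": (0.0, 0.0, 1.0),
    "energy boosting breakfast": (0.6, 0.6, 0.5),
}

def nearest(query, items):
    """Name of the catalog item most similar to the query vector."""
    return max(items, key=lambda name: cosine(query, items[name]))

# Merge the user's three interests into a single averaged embedding.
user_pins = [catalog["painting"], catalog["shoes"], catalog["sci-fi"]]
merged = tuple(sum(axis) / len(user_pins) for axis in zip(*user_pins))

print(nearest(merged, catalog))  # energy boosting breakfast
```

The averaged vector is closer to an item the user never engaged with than to any of the three actual interests, which is the failure mode that a single merged embedding risks.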
PinnerSage generates as many interest clusters as the underlying data supports. This is achieved by clustering users’ actions into conceptually coherent clusters via a hierarchical agglomerative clustering algorithm (Ward). A light user might get represented by 3–5 clusters, whereas a heavy user might get represented by 75–100 clusters.
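To make the "as many clusters as the data supports" idea concrete, here is a minimal sketch of Ward-style agglomerative clustering that stops on a merge-cost threshold instead of a fixed cluster count. The function names, the `max_merge_cost` parameter, and the 2-d toy points are assumptions for illustration, and the naive pairwise search is far slower than what a production system would need.

```python
from itertools import combinations

def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(axis) / len(points) for axis in zip(*points)]

def ward_cost(a, b):
    """Increase in within-cluster squared error if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist

def ward_clusters(embeddings, max_merge_cost):
    """Merge the cheapest pair repeatedly until every merge is too costly."""
    clusters = [[e] for e in embeddings]
    while len(clusters) > 1:
        (i, j), cost = min(
            (((i, j), ward_cost(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if cost > max_merge_cost:  # hypothetical stopping threshold
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters

# Toy 2-d "embeddings": the number of clusters follows the data.
light_user = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
heavy_user = light_user + [(9.0, 0.0), (9.1, 0.1), (0.0, 9.0)]
print(len(ward_clusters(light_user, max_merge_cost=1.0)))  # 2
print(len(ward_clusters(heavy_user, max_merge_cost=1.0)))  # 4
```

Because the stopping rule depends only on the merge cost, a user with more distinct interests ends up with more clusters without any per-user tuning.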
Design Choice 3: Medoid-based Cluster Representation
Typically, a cluster is represented by its centroid, which requires storing an embedding. Additionally, a centroid can be sensitive to outliers in the cluster. To compactly represent a cluster, we pick a cluster member Pin, called the medoid. The medoid, by definition, is a member of the set of Pins the user originally interacted with. Hence it avoids the pitfall of topic drift and is robust to outliers. From a systems perspective, a medoid is a concise way of representing a cluster, as it requires storing only the medoid’s Pin ID, and it enables cross-user and even cross-application cache sharing. It also allows our system to be compatible with other non-embedding-based recommendation systems such as Pixie.
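The difference between a centroid and a medoid can be seen in a small sketch. The pin IDs and 2-d vectors below are made up, and the use of summed Euclidean distance as the selection criterion is an assumption of this example rather than a statement of the production definition.

```python
import math

def medoid(cluster):
    """Pin id with the smallest total distance to all members of the cluster."""
    def total_distance(pin_id):
        return sum(math.dist(cluster[pin_id], other) for other in cluster.values())
    return min(cluster, key=total_distance)

def centroid(cluster):
    """Coordinate-wise mean; generally not the embedding of any actual pin."""
    points = list(cluster.values())
    return tuple(sum(axis) / len(points) for axis in zip(*points))

# Hypothetical cluster: pin id -> toy embedding, with one outlier.
cluster = {
    "pin_a": (0.0, 0.0),
    "pin_b": (1.0, 0.0),
    "pin_c": (2.0, 0.0),
    "pin_d": (3.0, 0.0),
    "pin_outlier": (10.0, 0.0),
}
print(medoid(cluster))    # pin_c
print(centroid(cluster))  # (3.2, 0.0)
```

The centroid is dragged toward the outlier and matches no real pin, so it has to be stored as a vector. The medoid stays inside the dense part of the cluster and can be stored as just `"pin_c"`.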
Design Choice 4: Medoid Sampling for Candidate Retrieval
PinnerSage provides a rich representation of a user via cluster medoids. However, in practice we cannot use all the medoids simultaneously for candidate retrieval due to cost concerns. Additionally, the user would be bombarded with too many different items. To address these concerns, we sample 3 medoids proportional to their importance scores and recommend their nearest neighboring pins. The importance scores of medoids are updated daily, so they can adapt to the user’s changing tastes.
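Importance-proportional sampling can be sketched as follows. The function name `sample_medoids`, the example scores, and the choice to sample without replacement (so the same medoid is never picked twice) are assumptions of this sketch, since the text only says that 3 medoids are sampled proportional to importance.

```python
import random

def sample_medoids(importance, k=3, rng=None):
    """Draw up to k distinct medoid ids, each draw weighted by importance.

    Assumes at least one remaining medoid has a positive score at every draw.
    """
    rng = rng or random.Random()
    remaining = dict(importance)
    picked = []
    while remaining and len(picked) < k:
        ids = list(remaining)
        weights = [remaining[i] for i in ids]
        choice = rng.choices(ids, weights=weights, k=1)[0]
        picked.append(choice)
        del remaining[choice]  # without replacement
    return picked

# Hypothetical importance scores per medoid pin id.
scores = {"pin_shoes": 5.0, "pin_scifi": 3.0, "pin_paint": 1.5, "pin_food": 0.5}
print(sample_medoids(scores, k=3, rng=random.Random(0)))
```

Each sampled medoid would then be used as a query against the nearest-neighbor index. High-importance clusters are picked most often, while low-importance ones still surface occasionally.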
Design Choice 5: Two-Pronged Approach for Handling Real-Time Updates
It is important for a recommender system to adapt to the current needs of its users. At the same time, an accurate representation of users requires looking at their past 60–90 days of activity. The sheer size of the data and the speed at which it grows make it hard to consider both aspects together. We address this issue by combining two methods: (a) a daily batch inference job that infers multiple medoids per user based on their long-term interaction history, and (b) an online version of the same model that infers medoids based on the users’ interactions on the current day. As new activity comes in, only the online version is updated. At the end of the day, the batch version consumes the current day’s activities and resolves any inconsistencies. This approach ensures that our system adapts quickly to the users’ current needs and at the same time does not compromise their long-term interests.
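The interplay between the two paths can be sketched with a toy state machine. Everything here is hypothetical: the class `TwoProngedInference`, the stand-in model `first_pin_per_topic`, and the rule for combining batch and online medoids at serving time are assumptions for illustration, and the 60–90 day history window is not modeled.

```python
class TwoProngedInference:
    """Toy model of a daily batch job plus a same-day online updater."""

    def __init__(self, infer_medoids):
        self.infer_medoids = infer_medoids  # hypothetical: actions -> medoid ids
        self.history = []         # long-term actions, owned by the batch job
        self.today = []           # current-day actions, owned by the online path
        self.batch_medoids = []
        self.online_medoids = []

    def on_action(self, action):
        """Online path: only the current-day state is recomputed."""
        self.today.append(action)
        self.online_medoids = self.infer_medoids(self.today)

    def run_daily_batch(self):
        """Batch path: absorb today's actions, then reset the online state."""
        self.history.extend(self.today)
        self.batch_medoids = self.infer_medoids(self.history)
        self.today = []
        self.online_medoids = []

    def medoids(self):
        """Serve batch medoids plus any online medoids not already present."""
        merged = list(self.batch_medoids)
        merged += [m for m in self.online_medoids if m not in merged]
        return merged


def first_pin_per_topic(actions):
    """Stand-in for the real model: one 'medoid' per topic, the first pin seen."""
    seen = {}
    for pin_id, topic in actions:
        seen.setdefault(topic, pin_id)
    return list(seen.values())


state = TwoProngedInference(first_pin_per_topic)
state.on_action(("p1", "shoes"))
state.run_daily_batch()
state.on_action(("p2", "sci-fi"))
state.on_action(("p3", "shoes"))
print(state.medoids())  # ['p1', 'p2', 'p3']
state.run_daily_batch()
print(state.medoids())  # ['p1', 'p2']
```

During the day the online path reacts immediately to the new sci-fi interest, but it also produces a second shoes medoid because it cannot see the long-term history. The nightly batch run folds the day's actions into that history and removes the duplicate.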
A/B Tests
PinnerSage is currently deployed in production and used by many products within Pinterest, including Homefeed, Related Pins, Ads, Shopping, and Creators, in both their retrieval and ranking ML models. Our wins in the initial A/B tests on two surfaces are highlighted in Table 1.
Table 1 shows that PinnerSage provides significant gains in both overall engagement volume (repins and clicks) and engagement propensity (repins and clicks per user). We attribute these gains to the increased quality and diversity of PinnerSage recommendations.
Table 1: A/B test of PinnerSage vs. current production, which includes a single embedding model.
Conclusion
We proposed an end-to-end system, called PinnerSage, that powers personalized recommendations at Pinterest. In contrast to prior production systems that are based on a single embedding-based user representation, PinnerSage uses a multi-embedding-based user representation scheme. Our proposed clustering scheme ensures that we get full insight into the needs of a user and understand them better. To make this happen, we adopt several design choices that allow our system to run efficiently and effectively. Our large A/B tests show that PinnerSage provides significant gains in user engagement. Many of the improvements delivered by our model can be attributed to its better understanding of user interests and its quick response to their needs.
Appendix
The PinnerSage paper is to appear at KDD 2020. Read more details about the paper here: https://arxiv.org/abs/2007.03634
Acknowledgements
We would like to extend our appreciation to the Homefeed and Shopping teams for helping set up online A/B experiments. Our special thanks go to the embedding infrastructure team for powering embedding nearest-neighbor search.
Translated from: https://medium.com/pinterest-engineering/pinnersage-multi-modal-user-embedding-framework-for-recommendations-at-pinterest-bfd116b49475