當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

前置交换机数据交换_我们的数据科学交换所

發布時間：2023/11/29 编程问答 24 豆豆

生活随笔收集整理的這篇文章主要介紹了前置交换机数据交换_我们的数据科学交换所小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

前置交換機數據交換

The DNC Data Science team builds and manages dozens of models that support a broad range of campaign activities. Campaigns rely on these model scores to optimize contactability, volunteer recruitment, get-out-the-vote, and many other pieces of modern campaigning. One of our responsibilities is to deliver the best available model scores in an accessible, actionable form.

DNC數據科學團隊構建和管理數十種模型，以支持廣泛的競選活動。競選活動依靠這些模型評分來優化聯系能力，志愿者招募，投票表決和現代競選活動的許多其他方面。我們的職責之一是以可訪問，可操作的形式提供最佳的可用模型評分。

As part of Phoenix, the DNC’s data warehouse, we developed infrastructure that keeps our focus on delivering products to win elections instead of on ever-growing technical complexity. In this post, we’ll walk through the infrastructure that manages over 70 billion (and counting!) model scores for the country’s 200+ million registered voters.

作為DNC數據倉庫Phoenix的一部分，我們開發了基礎架構，使我們始終專注于交付贏得選舉的產品，而不是不斷增長的技術復雜性。在這篇文章中，我們將介紹為該國200億以上注冊選民管理的700億(甚至更多)模型評分的基礎架構。

挑戰 (The Challenge)

At this point in the 2020 cycle, we’re managing about 20 different models. These models come from a mix of sources (our internal modeling infrastructure and multiple vendor syncs), and are a mix of regression, binary classification, and multi-class classification model types. Across those 20 models, we have around 80 distinct model versions that comprise more than 70 billion point estimates.

在2020年周期的這一點上，我們正在管理約20種不同的模型。這些模型來自多種來源(我們的內部建?；A結構和多個供應商同步)，并且是回歸，二進制分類和多分類分類模型類型的混合。在這20個模型中，我們有大約80個不同的模型版本，其中包括超過700億個點估計。

So, how do we get from the complexity of mixing so many model sources, model types, and versions-per-model to the clean, accessible set of scores our users can seamlessly integrate into their campaign programs?

那么，如何從混合這么多模型源，模型類型和每個模型版本的復雜性，到用戶可以無縫地集成到他們的廣告系列計劃中的干凈，可訪問的分數集，如何變得復雜呢？

模型分數交換所 (A Clearinghouse for Model Scores)

Our solution is a pair of carefully designed tables for model versions and model scores, and an accompanying code base to cleanly manage model and score life-cycles.

我們的解決方案是為模型版本和模型評分精心設計的一對表格，以及用于干凈地管理模型和評分生命周期的隨附代碼庫。

Together, these tables are a clearinghouse for model score publication. Just as a financial clearinghouse ensures a clean exchange between parties in a transaction, our model score clearinghouse sits between a model’s source data and its downstream pipelines to ensure a clean hand-off from one to the other.

這些表格一起構成了模型評分發布的交換所 。就像金融票據交換所確保交易雙方之間的干凈交換一樣，我們的模型分數票據交換所也位于模型的源數據及其下游管道之間，以確保從一個人到另一個人的徹底交接。

模型版本分類帳 (Model Version Ledger)

The first table of our clearinghouse, model_versions, keeps metadata on model versions. Vendor-sourced model version metadata is merged to this table as part of loading processes. Models we score in-house have their versions checked against and merged into this table as part of every scoring job. With many models and versions spread across our small data science team, we are thrilled to have this bookkeeping maintained programmatically.

我們的票據交換所的第一個表model_versions保留了模型版本的元數據。供應商來源的模型版本元數據在加載過程中會合并到此表中。我們內部評分的模型會對照其版本進行檢查，并作為每次評分工作的一部分合并到此表中。我們的小型數據科學團隊擁有許多模型和版本，我們很高興以編程方式維護此簿記。

權威的模型預測表 (An Authoritative Table of Model Predictions)

The second table, scores, holds, well, scores. As we load incoming model scores, we tag both the model version and scoring job that generated them. For multi-class classification, we store scores in a normalized form and note the predicted class label in the score_name column.

第二張表， scores ，保持得分。加載傳入的模型評分時，我們會同時標記模型版本和生成評分的評分工作。對于多類別分類，我們以標準化形式存儲分數，并在score_name列中注明預測的類別標簽。

A key part of this table’s architecture is that it’s partitioned by the model version’s date. This keeps all of the scores for a given model version on the same partition, allowing us to query just the few gigabytes of data for the scores we need instead of scanning the entire multi-terabyte table.

該表的體系結構的關鍵部分是按模型版本的日期進行分區。這樣可以將給定模型版本的所有分數保留在同一分區上，從而使我們可以僅查詢幾GB數據以獲得所需的分數，而不用掃描整個多TB的表。

Second table of our clearinghouse, scores.我們交換所的第二張桌子，分數。

Once the new scores have passed automated checks, their current_score_flag is flipped to TRUE and their datetime_approved field is set to the current timestamp. In the same step, previous scores of the same model_version that overlap with the new scores by external_id are flipped to FALSE and have their datetime_deprecated set to the current timestamp.

新分數通過自動檢查后，它們的current_score_flag會翻轉為TRUE并且datetime_approved字段會設置為當前時間戳。在相同的步驟，相同的分數以前model_version與新成績通過重疊external_id翻轉到FALSE ，并有自己的datetime_deprecated設置為當前的時間戳。

This operation makes the current_score_flag an authoritative marker for which scores are current within the model_run_id, a key assumption for when it’s time to materialize the scores for downstream use.

此操作使current_score_flag成為權威性標記，其分數在model_run_id內為當前分數，這是何時將分數具體化以供下游使用的關鍵假設。

最后-簡化模型版本的操作 (Finally — Simplified Operations on Model Versions)

All of the bookkeeping in the tables above pays huge dividends when it comes time to query our model scores. This short script extracts the current scores for the current production model version of “My Model” with only the name of the model as an input!

上表中的所有簿記都是在查詢我們的模型分數時要付出的巨大努力。這個簡短的腳本僅使用模型名稱作為輸入來提取當前生產模型版本“My Model”的當前分數！

DECLARE CURRENT_MODEL_VERSION_DATE DATE; DECLARE CURRENT_MODEL_RUN_ID STRING;-- check the model version ledger for the current model metadata SET (CURRENT_MODEL_VERSION_DATE, CURRENT_MODEL_RUN_ID) = (SELECT AS STRUCT model_version_date, model_run_idFROM `modeling.model_versions`WHERE model_name = "My Model"AND current_model_flag is TRUE );-- pull the model's current scores SELECT external_id, score_name, score_value, datetime_approvedFROM `modeling.scores`WHERE model_version_date = CURRENT_MODEL_VERSION_DATEAND model_run_id = CURRENT_MODEL_RUN_IDAND current_score_flag is TRUE ;

我們如何使用它 (How We Use It)

Having a strong, consolidated schema for model versions and scores makes downstream use cases much cleaner and simpler. Here are a few examples of how this infrastructure is used in our work in the 2020 cycle:

具有用于模型版本和評分的強大，統一的架構，可以使下游用例更加簡潔。以下是在2020年周期的工作中如何使用此基礎架構的一些示例：

采購發布管道 (Sourcing Publishing Pipelines)

The most mission-critical use of this infrastructure is to serve the most current scores of each model to downstream pipelines in a consistent location and format. Our scoring and loading pipelines run a “materialize model” task that writes a query similar to the one above to a dataset holding one “current” table per model.

此基礎架構最關鍵的用途是以一致的位置和格式將每個模型的最新分數提供給下游管道。我們的計分和加載流水線運行“物化模型”任務，該任務將與上面類似的查詢寫入到每個模型包含一個“當前”表的數據集中。

With our model and score version bookkeeping managed upstream, our downstream code can always find the model’s current scores in the same place. But more importantly, this approach insulates downstream processes from modeling issues: If a scoring job fails for some reason, or if newly loaded scores do not pass automated quality checks, the model version will not be re-materialized with the problematic scores.

通過在上游管理我們的模型和分數版本簿記，我們的下游代碼始終可以在同一位置找到模型的當前分數。但更重要的是，這種方法將下游流程與建模問題隔離開來：如果計分工作由于某種原因而失敗，或者如果新加載的分數未通過自動質量檢查，則模型版本將不會與有問題的分數重新實現。

模型生命周期的面包屑 (Breadcrumbs for a Model’s Life Cycle)

With Election Day right around the corner, we have to respond quickly and confidently to potential problems with our models. We maintain a complete picture of a model and its scores by linking the model_run_id and scoring_run_id fields to the metadata and artifacts we store in an MLflow tracking instance.

即將到來的選舉日，我們必須對我們模型的潛在問題做出Swift而自信的回應。通過將model_run_id和scoring_run_id字段鏈接到我們存儲在MLflow跟蹤實例中的元數據和工件，我們可以維護模型及其分數的完整圖片。

This allows us to trace a published score back to its scoring job, its training job, and, through our other metadata systems, the exact training data that built the underlying model. This piece is critical for diagnosing issues and anomalies users encounter in the field.

這使我們可以將已發布的分數追溯到評分工作，培訓工作，以及通過我們的其他元數據系統，來構建基礎模型的確切培訓數據。這對于診斷用戶在現場遇到的問題和異?，F象至關重要。

It’s also a gift to our future selves: when it’s time to revisit our models and improve their future versions, we’ll be working with a complete understanding of their inputs and outputs.

這也是我們未來自我的禮物：當需要重新審視我們的模型并改進其未來版本時，我們將全面了解其輸入和輸出。

模型還原 (Model Reversions)

Sometimes, we detect a problem with a model after it’s been published. When this happens, we can adjust the current_model_flag in the model_versions table to toggle a model version out of production, or simply suppress scores from a problematic scoring job. After adjusting the metadata in the clearinghouse, we can then re-materialize the model for downstream pipelines and address the lingering issues.

有時，我們會在模型發布后檢測出問題。發生這種情況時，我們可以調整model_versions表中的current_model_flag來切換模型版本的停產狀態，或僅抑制評分工作有問題的分數。在票據交換所中調整了元數據之后，我們可以為下游管道重新實現模型并解決長期存在的問題。

時間點快照 (Point-In-Time Snapshots)

Political data folks are fanatics for historical analysis, and model estimates can be a key part of that. With the datetime_approved and datetime_deprecated fields, we can reconstruct a model version’s ‘current’ scores from any point in time just as we materialize ‘current’ tables. This can give us a quick view into a model’s predictions over time, or even a snapshot of predictions as of a past Election Day.

政治數據人員是歷史分析的狂熱者，模型估計可能是其中的關鍵部分。使用datetime_approved和datetime_deprecated字段，我們可以在實現“當前”表的同時從任何時間點重建模型版本的“當前”分數。這可以使我們快速查看模型隨時間變化的預測，甚至可以追溯到過去選舉日的預測快照。

結論 (Conclusion)

Our investment in this infrastructure has allowed us to develop and deliver data science products that are more transparent and sustainable than ever before, and we’ll continue building on this infrastructure for the 2022 election cycle and beyond.

我們在基礎設施方面的投資使我們能夠開發和交付比以往任何時候都更加透明和可持續的數據科學產品，并且我們將在2022年選舉周期及以后的基礎設施上繼續建設。

Interested in making a difference? Join our team.

有興趣改變嗎？加入我們的團隊。

翻譯自: https://medium.com/democratictech/our-data-science-clearinghouse-e9f12fd4a86

前置交換機數據交換

總結

以上是生活随笔為你收集整理的前置交换机数据交换_我们的数据科学交换所的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇：女人梦到自己大肚子怀孕了是什么意思
下一篇：量子相干与量子纠缠_量子分类