Automating Your ML Models Like a Pro Using Airflow, SAS Viya, and Docker
Ok, here’s a scenario: You’re the lone data scientist/ML Engineer in your consumer-focused unicorn startup, and you have to build a bunch of models for a variety of different business use cases. You don’t have time to sit around and sulk about the nitty-gritty details of any one model. So you’ve got choices to make. Decisions. Decisions that make you move fast, learn faster, and yet build for resilience all while gaining a unique appreciation for walking the talk. IF you do this right (even partly), you end up becoming walking gold for your company. A unicorn at a unicorn 😃. Why? Because, you put the customer feedback you observed through their data-trail back to work for your company, instead of letting it rot in the dark rooms of untapped logs and data dungeons (a.k.a. databases). These micro-decisions you enable matter. They eventually add up to push your company beyond the inflection point that is needed for exponential growth.
So, that is where we start from. And build. We’ll assume we can choose tech that simplifies everything for us while still letting us automate all we want. When in doubt, we’ll simplify and remove until the effort is rationalized against the outcome, so we avoid over-engineering stuff. That is exactly what I’ve done for us here, so we don’t get stuck in analysis/choice paralysis.
Note, everything we’ll use here will be assumed to be running on Docker unless mentioned otherwise. So, based on that, we’ll use …
Apache Airflow for orchestrating our workflow: Airflow has quickly become the de-facto standard for authoring, scheduling, monitoring, and managing workflows — especially in data pipelines. We know that today at least 350 companies in the broader tech industry use Airflow, along with a variety of executors and operators, including Kubernetes and Docker.
The usual suspects in the Python ecosystem: for glue code, data engineering, etc. The one notable addition is vaex, for quickly processing large Parquet files and doing some data prep work (a short sketch follows this rundown).
Viya in a container & Viya as an Enterprise Analytics Platform (EAP): SAS Viya is an exciting technology platform that can be used to quickly build business-focused capabilities on top of the foundational analytical and AI models that SAS produces. We’ll use two flavors of SAS Viya — one as a container for building and running our models, and another running on virtual machine(s), which acts as the enterprise analytics platform the rest of our organization uses to perform analytics, consume reports, track and monitor models, etc. For our specific use case, we’ll use the SAS platform’s autoML capabilities via the DataSciencePilot action set so that we can go full auto-mode on our problem.
SAS Model Manager to inventory, track, and deploy models: This is the model management component on the Viya Enterprise Analytics Platform that we’ll use to eventually push the model out into the wild for scoring.
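Since vaex is probably the least familiar item in that rundown, here’s a quick feel for it. This is a minimal sketch under assumed inputs; the file path and column names are hypothetical, not from the actual project.

```python
# Minimal vaex sketch: the file path and column names are hypothetical.
import vaex

# vaex memory-maps the Parquet file, so opening is near-instant even for huge files
df = vaex.open("customer_events.parquet")

# Filtering and derived columns are lazy; nothing materializes until needed
df = df[df.monthly_spend > 0]
df["tenure_years"] = df.tenure_months / 12.0

# Export the prepped data for hand-off (e.g., upload into CAS for training)
df.export_csv("churn_prepped.csv")
```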
Now that we’ve lined up all the basic building blocks, let’s address the business problem: We’re required to build a churn detection service so that our fictitious unicorn can detect potential churners and follow up with some remedial course of action to keep them engaged, instead of trying to reactivate them after the window of opportunity lapses. Because we plan to use Viya’s DataSciencePilot action set for training our model, we can simply prep the data and pass it off to the dsautoml action, which, as it turns out, is just a regular method call using the python-swat package. If you have access to Viya, you should try this out if you haven’t already.
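For context, wiring up that CAS session with python-swat looks roughly like this. A minimal sketch, assuming a local Viya container; the host, port, credentials, and file name are placeholders, not the project’s actual configuration.

```python
# Sketch of the CAS session setup. Host, credentials, and the data file
# are placeholder values; substitute your own container's settings.
import swat

sess = swat.CAS("localhost", 5570, "sasdemo", "sasdemo")
sess.loadactionset("dataSciencePilot")  # exposes the dsautoml action

# Upload the churn data into CAS; 'churn.csv' is a hypothetical file name.
# The returned CASTable ('out') is what gets passed to dsautoml below.
out = sess.upload_file("churn.csv", casout=dict(name="churn", replace=True))
```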
Also, if you didn’t pick up on it yet, we’re trying to automate everything, including the (re-)training process for models developed with autoML. We want to do this at a particular cadence, as we fully expect to create fresh models whenever possible to keep up with the changing data. So: automating autoML. Like Inception…😎
Anyway, remember: you’re the lone warrior in the effort to spawn artificial intelligence and release it into the back-office services clan that attacks emerging customer data and provides relevant micro-decisions. So there’s not much time to waste. Let’s start.
We’ll use a little Makefile to start our containers (see below) — it just runs a little script that starts up the containers by setting up the right params and triggering the right flags when ‘docker run’ is called. Nothing extraordinary, but it gets the job done.
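If you’re curious what that looks like, here’s a hedged sketch of such a Makefile. The image name, ports, and environment variable are assumptions for illustration, not the repo’s exact contents.

```makefile
# Hedged sketch: image name, ports, and env vars are assumptions.
run:
	docker run -d --name sas-viya \
		-p 8888:8888 -p 5570:5570 \
		-e CASENV_CAS_VIRTUAL_HOST=localhost \
		sas-viya-programming:latest

stop:
	docker stop sas-viya && docker rm sas-viya
```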
[Image: Start the containers for model development]

Now, just like that, we’ve got our containers live and kicking. Once we have our notebook environment, we can call autoML via the dsautoml action after loading our data. The syntactic specifics of this action are available here. Very quickly, a sample method call looks like this:
```python
# sess is the session context for the CAS session.
sess.datasciencepilot.dsautoml(
    table=out,
    target="CHURN",
    inputs=effect_vars,
    transformationPolicy={"missing": True, "cardinality": True,
                          "entropy": True, "iqv": True,
                          "skewness": True, "kurtosis": True,
                          "outlier": True},
    modelTypes=["decisionTree", "GRADBOOST"],
    objective="AUC",
    sampleSize=20,
    topKPipelines=10,
    kFolds=2,
    transformationout=dict(name="TRANSFORMATION_OUT", replace=True),
    featureout=dict(name="FEATURE_OUT", replace=True),
    pipelineout=dict(name="PIPELINE_OUT", replace=True),
    savestate=dict(modelNamePrefix="churn_model", replace=True),
)
```
I’ve placed the entire notebook in this repo for you to take a look, so worry not! This particular post isn’t about the specifics of dsautoml. If you are looking for a good intro to automl, you can head over here. You’ll be convinced.
As you will see, SAS DataSciencePilot (autoML) provides fully automated capabilities spanning multiple actions, including automatic feature generation via the feature machine, which automatically resolves the transformations needed and then uses those features to construct multiple pipelines in a full-on leaderboard challenge. Additionally, the dsautoml method call produces two binary files: one capturing the feature transformations that are performed, and another for the top model. This means we get the score code for the champion and the feature transformations, so we can deploy them easily into production. This is VERY important. In a commercial use case such as this one, model deployment is more important than development.
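If you want those binaries on disk outside of CAS, the astore action set can download them. A hedged sketch follows; the astore table name is an assumption based on the modelNamePrefix used above, so check your session’s table list for the names dsautoml actually produced.

```python
# Sketch: persist a dsautoml-produced analytic store (astore) to disk.
# The table name below is assumed from the 'churn_model' prefix; verify
# the real names with sess.table.tableinfo() before running.
sess.loadactionset("astore")
result = sess.astore.download(rstore="CHURN_MODEL_1")  # assumed astore name
with open("churn_model.astore", "wb") as f:
    f.write(result["blob"])
```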
If your models don’t get deployed, even the best of them perish doing nothing. And when that is the deal, even a 1-year-old will pick something over nothing.
[Image: What your response SHOULDN’T be to “How many ML models do you actually deploy?”]

This mandates us to always choose tools and techniques that meet the ask, and potentially increase the range of deployable options while avoiding re-work. In other words, your tool and model should be able to meet the acceptable scoring SLA of the workload for the business case. And you should know this before you start writing a single line of code. If this doesn’t happen, then any code we write is wasteful and serves no purpose other than satisfying personal fancies.
So, now that we have a way to automatically train these models on our data, let’s get this autoML process deployed for automatic retraining. This is where Airflow will help us immensely. Why? When we hand off “retraining” to production, a bunch of new requirements pop up, such as:
- Error handling: how many times to retry? What happens if there is a failure?
- Quick and easy access to consolidated logs
- Task status tracking
- The ability to re-process historic data due to upstream changes
- Execution dependencies on other processes: for example, process Y needs to run after process X, but what if X doesn’t finish on time?
- Tracing changes in the automation process definition files
Airflow handles all of the above elegantly. And not just that! We can quickly set up Airflow on containers and run it using docker-compose with this repo. Obviously, you can edit the Dockerfile or the compose file as you see fit. Once again, I’ve edited these files to suit my needs and dropped them in this repo so you can follow along if you need to. At this point, when you run docker-compose, you should see Postgres and the Airflow web server running.
Next, let’s look at the Directed Acyclic Graph (DAG) we’ll use to automatically rebuild this churn detection model weekly. Don’t worry, this DAG is also provided in the same repo.
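To give a feel for its shape before we look at the graph view, here’s a minimal sketch of a weekly retraining DAG. The task callables are hypothetical stand-ins for the real steps (connect to CAS, run dsautoml, register the champion with Model Manager); the repo has the actual DAG.

```python
# Minimal weekly-retraining DAG sketch (Airflow 1.x style imports).
# The callables are placeholders for the real CAS/autoML/Model Manager steps.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def train_churn_model():
    """Placeholder: connect to CAS and run datasciencepilot.dsautoml."""


def register_champion():
    """Placeholder: register the champion astore with SAS Model Manager."""


default_args = {
    "owner": "airflow",
    "retries": 3,                          # error handling: retry on failure
    "retry_delay": timedelta(minutes=10),
    "email_on_failure": True,
    "email": ["ml-team@example.com"],      # hypothetical alert address
}

with DAG(
    dag_id="weekly_churn_automl",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="@weekly",           # the weekly retraining cadence
    catchup=False,                         # don't backfill missed runs
) as dag:
    train = PythonOperator(task_id="train_model",
                           python_callable=train_churn_model)
    register = PythonOperator(task_id="register_model",
                              python_callable=register_champion)
    train >> register                      # register only after training succeeds
```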
[Image: ML DAG set up to run weekly]

Now, we’ll click into the graph view and understand what the DAG is trying to accomplish, step by step.
[Image: Airflow DAG for automating autoML and registering models to SAS Model Manager]

And that’s it! Our process is ready to be put to the test!
[Image: A sample post-run Gantt chart]

When the process finishes successfully, all the tasks should report success, and the Gantt chart view in Airflow should resolve to something like the one above (execution times will obviously be different). And just like that, we’ve gotten incredibly close to the finish line.
We’ve just automated our entire training process! That includes saving our models for deployment and sending emails out whenever the DAG is run. If you look back, our original goal was to deploy these models as consumable services. We could’ve easily automated that part as well, but our choice of technology (SAS Model Manager in this case) allows us to add additional touch points, if you so desire. It normally makes sense to have a human-in-the-middle “push button” step before engaging in model-publish activities, because it builds in a buffer if upstream processes go wonky for reasons like crappy data, sudden changes in baseline distributions, etc. More importantly, pushing models to production should actively bring conscious human mindfulness to the activity. Surely, we wouldn’t want an out-of-sight process impacting the business wildly. This ‘human-in-the-middle’ step also significantly reduces the need to engage in post-hoc explanations, since backtesting comes to the fore.
Ok, let’s see how all of this works real quick:
[Image: Deploying our autoML model]

Notice that SAS Model Manager is able to take the model artifacts and publish them out as a module in the Micro Analytic Service, where models can be consumed using scoring endpoints. And just like that, you’re able to flip the switch on your models and make them respond to requests for inference.
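To make that concrete, here’s a hedged sketch of calling a published module’s scoring endpoint over REST. The host, module ID, token, step ID (‘score’), and input variable names are all assumptions; substitute the values Model Manager shows for your published module.

```python
# Sketch of scoring against a Micro Analytic Service endpoint. Host, module
# ID, token, step ID, and input variable names are placeholders.
import requests

VIYA_HOST = "https://viya.example.com"   # hypothetical Viya host
MODULE_ID = "churn_model"                # hypothetical published module ID
TOKEN = "..."                            # OAuth token obtained via /SASLogon

payload = {"inputs": [
    {"name": "TENURE_MONTHS", "value": 12},
    {"name": "MONTHLY_SPEND", "value": 79.5},
]}

resp = requests.post(
    f"{VIYA_HOST}/microanalyticScore/modules/{MODULE_ID}/steps/score",
    headers={"Authorization": f"Bearer {TOKEN}",
             "Content-Type": "application/json"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # scored outputs, e.g., the predicted churn probability
```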
There’s obviously no CI/CD component here just yet. That’s intentional. I didn’t want to overcomplicate this post, since all we have here is just a model. I’ll come back and write a follow-up on that topic another day, with another app. But for now, let’s rejoice in how much we’ve managed to get done automagically with Airflow & SAS Viya in containers.
Through thoughtful, intelligent automation of mundane routines, using properly selected technology components, you can now free yourself up to focus on more exciting, cooler, higher-order projects, while still making an ongoing impact in your unicorn organization through your models. Your best life is now. So why wait, when you can automate? 🤖
Connect with Sathish on LinkedIn
Translated from: https://medium.com/swlh/automating-your-ml-models-like-a-pro-using-airflow-sas-viya-docker-6abe324d9072