4 Reasons Why Data Scientists Should Version Data

Published: 2023/11/29

In software projects it is standard practice to put code under version control from day one, and the benefits are well known to the software community: every modification to the code in a repository is tracked. If a mistake is made, developers can go back in time and compare earlier versions of the code to fix the problem while minimizing disruption to the rest of the team. Code is a software project’s most precious asset and, for that reason, must be protected at all costs!

In Data Science projects, data can also be considered the crown jewels, so why don’t we, as Data Scientists, treat it as the most precious thing on earth by putting it under version control?

If you are familiar with Git, you might be thinking: “Git cannot handle large files and directories, at least not with the same performance it offers for small code files. So how can I version control my data the same way I version control code?” Well, this is now possible, and it is almost as easy as typing git clone and seeing the data files and ML model files appear in the workspace. All this magic can be achieved with DVC.

Quick start with DVC

First things first: we have to get DVC installed on our machines. It’s pretty straightforward, and you can do it by following the installation steps in the DVC documentation.
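
As a minimal sketch of that setup, assuming pip and git are on the PATH and the current directory is already the root of a Git repository, the steps could be wrapped like this. The helper name setup_dvc is made up for illustration.

```shell
# Hypothetical helper: install DVC and initialise it inside an existing Git repository.
# Assumes pip and git are available and the current directory is the repository root.
setup_dvc() {
    pip install dvc &&                 # installs the dvc command-line tool
    dvc init &&                        # creates .dvc/ and .dvcignore and stages them in Git
    git commit -m "Initialize DVC"     # records the DVC scaffolding in Git history
}
```

Calling setup_dvc once per repository should be enough. Cloud backends usually need an extra, for example pip install "dvc[s3]" for S3.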

As I’ve already mentioned, data version control tools such as DVC make it possible to build large projects while keeping their pipelines reproducible. With DVC it’s very simple to add datasets to a Git repository, and by simple I mean as easy as typing the line below:

dvc add path/to/dataset
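
In practice the command above is usually followed by a Git commit, because dvc add writes a small pointer file (path/to/dataset.dvc) and a .gitignore entry that Git should version. A possible sketch of the whole step, with track_dataset as a hypothetical helper name:

```shell
# Hypothetical helper: track a dataset with DVC, then version its pointer file in Git.
# Usage: track_dataset path/to/dataset
track_dataset() {
    target="${1%/}"                                   # drop a trailing slash, if any
    dvc add "$target" &&                              # caches the data, writes "$target.dvc"
    git add "$target.dvc" "$(dirname "$target")/.gitignore" &&
    git commit -m "Track $target with DVC"
}
```

The data itself stays out of Git; only the pointer file and the ignore rule are committed.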

Regardless of the size of the dataset, it is now tracked: DVC stores the data in its cache and leaves a small .dvc pointer file for Git to version. Assuming that we also want to push the dataset to the cloud, that is possible with the command below:

dvc push path/to/dataset.dvc
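
dvc push needs a remote to upload to, which the article does not show being configured. A minimal sketch, assuming a remote named storage (an arbitrary name) and a bucket URL passed in by the caller; push_to_remote is a hypothetical helper:

```shell
# Hypothetical helper: register a default DVC remote and upload tracked data to it.
# Usage: push_to_remote s3://example-bucket/dvc-store   (the URL is a placeholder)
push_to_remote() {
    remote_url="$1"
    dvc remote add -d storage "$remote_url" &&   # -d marks "storage" as the default remote
    git add .dvc/config &&                       # the remote definition lives in .dvc/config
    git commit -m "Configure DVC remote" &&
    dvc push                                     # uploads cached data to the default remote
}
```

Committing .dvc/config means collaborators who clone the repository get the same remote definition. Credentials are expected to come from the environment rather than from the repository.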

Out of the box, DVC supports many cloud storage services, such as S3, Google Cloud Storage, Azure Blob Storage, and Google Drive. And since the dataset was pushed to the cloud through the version control system, if I clone the project onto another machine, I can download the data, or any other artifact, using the following command:

dvc pull
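
Put together, getting a project onto a new machine could look like the sketch below. It assumes the repository already has a default DVC remote configured and that the machine has credentials for it; fetch_project is a hypothetical helper name.

```shell
# Hypothetical helper: clone a DVC-enabled project and download its tracked data.
# Usage: fetch_project <git-url> <target-dir>
fetch_project() (
    git clone "$1" "$2" &&   # brings the code and the small .dvc pointer files
    cd "$2" &&
    dvc pull                 # downloads the data those pointers refer to
)
```

The function body runs in a subshell, so the caller’s working directory is left unchanged.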

Now that you know how to get started with DVC, I suggest you explore the tool, or similar ones, further. Version control should be your best friend as a Data Scientist: these tools let you not only version datasets but also build reproducible pipelines, while keeping every development traceable.
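
To give an idea of what a reproducible pipeline can look like, here is a sketch that assumes DVC 2.0 or later. The stage name and the files train.py, data/dataset.csv, and model.pkl are placeholders, not part of the original article.

```shell
# Hypothetical helper: declare a one-stage training pipeline and run it.
# train.py, data/dataset.csv and model.pkl are placeholder names.
define_and_run_pipeline() {
    dvc stage add -n train \
        -d train.py -d data/dataset.csv \
        -o model.pkl \
        python train.py &&     # records the stage in dvc.yaml
    dvc repro                  # runs stages whose dependencies changed
}
```

After a run, committing dvc.yaml and dvc.lock to Git ties the code, the data version, and the resulting model together.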

If this hasn’t convinced you yet, here is why you should start version controlling your data!

Why should I start using data version control?

1. Save and reproduce all of your data experiments

As Data Scientists, we know that developing a Machine Learning model is not only about code, but also about data and the right parameters. Finding the perfect match often requires experimentation, which makes the process highly iterative and makes it extremely important to keep track of the changes made and their impact on the end results. This becomes even more important in a complex environment where multiple data scientists collaborate. If we can take a snapshot of the data used to develop a certain version of the model and version it, iteration and model development become not only easier but also trackable.
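
With DVC, one way to return to such a snapshot is to check out the Git revision of the experiment and let DVC restore the matching data. A minimal sketch, assuming the dataset was tracked with dvc add and committed at that revision; restore_experiment is a hypothetical helper name.

```shell
# Hypothetical helper: restore the code and the matching data snapshot of a past experiment.
# Usage: restore_experiment <git-revision>   (a tag, branch, or commit hash)
restore_experiment() {
    git checkout "$1" &&   # restores code and .dvc pointer files as of that revision
    dvc checkout           # syncs workspace data with those pointers, using the local cache
}
```

Tagging each experiment in Git (for example with git tag) gives the revision a readable name. If the data for that revision is not in the local cache, dvc pull can fetch it from the remote instead.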

2. Debugging and testing

When playing around in Kaggle competitions, we often do not see the real challenges inherent in developing an ML-based solution for production systems. In fact, one of the biggest challenges is dealing with the variety of data sources and the amount of data available. Reproducing the results of an experiment can be daunting if we cannot even retrieve the exact dataset that was used. Data version control can ease these issues and make the development of machine learning solutions much simpler, more organized, and reproducible.
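
When two runs disagree, a useful first question is whether they saw the same data. As a sketch, DVC can list what changed between two Git revisions; compare_data is a hypothetical helper name and the revisions are whatever tags or commits the experiments were recorded under.

```shell
# Hypothetical helper: list DVC-tracked files that were added, modified,
# or deleted between two Git revisions.
# Usage: compare_data <old-revision> <new-revision>
compare_data() {
    dvc diff "$1" "$2"
}
```

An empty result suggests the data was identical and the difference lies in code or parameters.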

3. Compliance and auditing

Privacy regulations such as GDPR already require companies and organizations to demonstrate compliance and to document the history of their data sources. The ability to track data versions, provided by version control tools, is the first step toward getting a company’s data sources ready for compliance, and an essential step in maintaining a strong and robust audit trail and risk management process around data.
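
Because every data version is represented by a pointer file committed to Git, the Git history of that file is one ingredient of such an audit trail. This is only a sketch of that idea, not a compliance solution; dataset_history is a hypothetical helper name.

```shell
# Hypothetical helper: print who changed a dataset's pointer file, when, and why.
# Usage: dataset_history path/to/dataset.dvc
dataset_history() {
    git log --follow --date=short --format='%h %ad %an %s' -- "$1"
}
```

Each output line shows the short commit hash, date, author, and commit message of one change to the dataset.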

4. Align software and data science teams

Getting Data Science and Software teams to speak the same language can be quite challenging, and it depends heavily on the profiles of the people involved in the interactions between the teams. Bringing some of the good practices of software development into data science processes can help not only to align the work of the teams involved, but also to accelerate the development and integration of solutions.

Conclusions

Data science is hard to productize, and one of the main reasons is that it has too many mutable elements, such as data. Versioning in data science applications can mean many things, from model versioning to data versioning. This article covered the importance and benefits of versioning data for data science teams, but there are many more aspects we should pay attention to as Data Scientists. In the end, keeping an eye on continuous delivery principles is very important for the success of ML-based solutions!

Fabiana Clemente is CDO at YData.

Improved data for AI

YData provides a data-centric development platform for Data Scientists to work with high-quality and synthetic data.

Translated from: https://medium.com/swlh/4-reasons-why-data-scientists-should-version-data-672aca5bbd0b