Data governance, master data, metadata: What we got wrong about data governance
Data governance is top of mind for many of my customers, particularly in light of GDPR, CCPA, COVID-19, and any number of other acronyms that speak to the increasing importance of data management when it comes to protecting user data.
Over the past several years, data catalogs have emerged as a powerful tool for data governance, and I couldn’t be happier. As companies digitize and their data operations democratize, it’s important for all elements of the data stack, from warehouses to business intelligence platforms, and now, catalogs, to participate in compliance best practices.
But are data catalogs all we need to build a robust data governance program?
Data catalogs for data governance?
Analogous to a physical library catalog, data catalogs serve as an inventory of metadata and give users the information necessary to evaluate data accessibility, health, and location. Companies like Alation, Collibra, and Informatica tout solutions that not only keep tabs on your data, but also integrate with machine learning and automation to make data more discoverable, more collaborative, and, now, compliant with organizational, industry-wide, or even government regulations.
Because data catalogs provide a single source of truth about a company’s data sources, they are a natural place to manage the data in your pipelines. Data catalogs can be used to store metadata that gives stakeholders a better understanding of a specific source’s lineage, thereby instilling greater trust in the data itself. Additionally, data catalogs make it easy to keep track of where personally identifiable information (PII) is housed, where it sprawls downstream, and who in the organization has permission to access it across the pipeline.
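To make the idea concrete, here is a minimal sketch of what a catalog record might hold. It is not modeled on any particular product: `CatalogEntry`, `pii_tables`, `can_read`, and all table, column, and role names are hypothetical, chosen only to show lineage, PII tags, and access permissions living side by side in one record.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one table in a data catalog."""
    name: str
    owner: str
    upstream: list = field(default_factory=list)      # lineage: direct source tables
    pii_columns: set = field(default_factory=set)     # columns tagged as PII
    allowed_roles: set = field(default_factory=set)   # roles permitted to read the table


def pii_tables(catalog):
    """Return the sorted names of tables that hold at least one PII column."""
    return sorted(name for name, entry in catalog.items() if entry.pii_columns)


def can_read(catalog, table, role):
    """Check whether a role is permitted to read a table."""
    return role in catalog[table].allowed_roles


# Illustrative data only.
catalog = {
    "raw.users": CatalogEntry(
        name="raw.users",
        owner="platform-team",
        pii_columns={"email", "phone"},
        allowed_roles={"data-engineer"},
    ),
    "analytics.daily_orders": CatalogEntry(
        name="analytics.daily_orders",
        owner="analytics-team",
        upstream=["raw.orders"],
        allowed_roles={"data-engineer", "analyst"},
    ),
}

print(pii_tables(catalog))                        # tables that hold PII
print(can_read(catalog, "raw.users", "analyst"))  # is this role allowed?
```

Keeping the PII tags and the permitted roles in the same record is what lets a single query answer both "where is the PII?" and "who can see it?".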
What’s right for my organization?
So, what type of data catalog makes the most sense for your organization? To make your life a little easier, I spoke with data teams in the field to learn about their data catalog solutions and grouped them into three distinct categories: in-house, third-party, and open source.
In-house
Some B2C companies (I’m talking the Airbnbs, Netflixes, and Ubers of the world) build their own data catalogs to ensure data compliance with state, country, and even economic union (I’m looking at you, GDPR) level regulations. The biggest perk of in-house solutions is the ability to quickly spin up customizable dashboards, pulling out the fields your team needs the most.
Uber’s Databook lets data scientists easily search for tables. Image courtesy of Uber Engineering.
While in-house tools make for quick customization, over time, such hacks can lead to a lack of visibility and collaboration, particularly when it comes to understanding data lineage. In fact, one data leader I spoke with at a food delivery startup noted that what was clearly missing from her in-house data catalog was a “single pane of glass.” If she had a single source of truth that could provide insight into how her team’s tables were being leveraged by other parts of the business, ensuring compliance would be easy.
On top of these tactical considerations, spending engineering time and resources building a multi-million dollar data catalog just doesn’t make sense for the vast majority of companies.
Third-party
Since its founding in 2012, Alation has largely paved the way for the rise of the automated data catalog. Now, there is a whole host of ML-powered data catalogs on the market, including Collibra, Informatica, and others, many with pay-for-play workflow and repository-oriented compliance management integrations. Some cloud providers, like Google, AWS, and Azure, also offer data governance tooling integration at an additional cost.
In my conversations with data leaders, one downside of these solutions came up time and again: usability. While nearly all of these tools have strong collaboration features, one Data Engineering VP I spoke with specifically called out his third-party catalog’s unintuitive UI.
If data tools aren’t easy to use, how can we expect users to understand or even care whether they’re compliant?
Open source
In 2019, Lyft became an industry leader by open sourcing its data discovery and metadata engine, Amundsen, named after the famed Antarctic explorer. Other open source tools, such as Apache Atlas, Magda, and CKAN, provide similar functionality, and all three make it easy for development-savvy teams to fork an instance of the software and get started.
Amundsen, an open source data catalog, gives users insight into schema usage. Image courtesy of Mikhail Ivanov.
While some of these tools allow teams to tag metadata within the catalog to control user access, this is an intensive and often manual process that most teams just don’t have the time to tackle. In fact, a product manager at a leading transportation company shared that his team specifically chose not to use an open source data catalog because the available options lacked off-the-shelf support for all the data sources and data management tooling in their stack, making data governance extra challenging. In short, open source solutions just weren’t comprehensive enough.
Still, there’s something critical to compliance that even the most advanced catalog can’t account for: data downtime.
The missing link: data downtime
Recently, I developed a simple metric for a customer that helps measure data downtime, that is, periods of time when your data is partial, erroneous, missing, or otherwise inaccurate. When applied to data governance, data downtime gives you a holistic picture of your organization’s data reliability. Without data reliability to power full discoverability, it’s impossible to know whether or not your data is fully compliant and usable.
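The metric itself isn’t spelled out here, so the following is only one plausible way to quantify data downtime, not the formula referenced above. It assumes each incident is recorded as a (start, end) pair, running from the moment the data went bad to the moment it was fixed, and that all incidents fall inside the reporting window. The function names and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def total_downtime(incidents):
    """Sum incident durations, counting overlapping incidents only once."""
    total = timedelta(0)
    current_start = current_end = None
    for start, end in sorted(incidents):
        if current_end is None or start > current_end:
            # A gap before this incident: close out the previous merged interval.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # This incident overlaps the current interval: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def downtime_fraction(incidents, window_start, window_end):
    """Share of the reporting window during which data was unreliable."""
    return total_downtime(incidents) / (window_end - window_start)


# Illustrative incidents: the first two overlap and merge into one 10-hour outage.
incidents = [
    (datetime(2020, 1, 1, 0), datetime(2020, 1, 1, 6)),
    (datetime(2020, 1, 1, 4), datetime(2020, 1, 1, 10)),
    (datetime(2020, 1, 3, 12), datetime(2020, 1, 3, 14)),
]

downtime = total_downtime(incidents)
fraction = downtime_fraction(incidents, datetime(2020, 1, 1), datetime(2020, 1, 11))
print(downtime, fraction)
```

Merging overlapping incidents before summing avoids double-counting hours in which several tables were broken at once. A per-table breakdown would need a separate incident list for each table.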
Data catalogs solve some, but not all, of your data governance problems. To start, mitigating governance gaps is a monumental undertaking, and it’s impossible to prioritize these without a full understanding of which data assets are actually being accessed by your company. Data reliability fills this gap and allows you to unlock your data ecosystem’s full potential.
Additionally, without real-time lineage, it’s impossible to know how PII or other regulated data sprawls. Think about it for a second: even if you’re using the fanciest data catalog on the market, your governance is only as good as your knowledge about where that data goes. If your pipelines aren’t reliable, neither is your data catalog.
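As a rough illustration of why lineage matters for PII, the sketch below walks a table-level lineage graph to find every table that could inherit regulated data from a given source. The graph, the table names, and the `downstream_of` helper are all hypothetical. A real system would need column-level lineage to avoid flagging tables that drop the sensitive columns along the way.

```python
from collections import deque


def downstream_of(lineage, sources):
    """Breadth-first search over table -> consumers edges.

    Returns every table reachable from `sources`, excluding the sources themselves.
    """
    seen = set(sources)
    queue = deque(sources)
    while queue:
        table = queue.popleft()
        for child in lineage.get(table, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen - set(sources)


# Illustrative lineage: each key feeds the tables listed in its value.
lineage = {
    "raw.users": ["staging.users"],
    "raw.orders": ["analytics.orders_enriched"],
    "staging.users": ["analytics.orders_enriched", "ml.churn_features"],
    "analytics.orders_enriched": ["bi.revenue_dashboard"],
}

# Every table that may carry PII originating in raw.users.
print(sorted(downstream_of(lineage, ["raw.users"])))
```

If the lineage graph is stale because a pipeline changed or failed silently, this traversal returns the wrong set. In that sense the catalog is only as reliable as the pipelines behind it.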
Owing to their complementary features, data catalogs and data reliability solutions work hand-in-hand to provide an engineering approach to data governance, no matter which acronyms you need to comply with.
Personally, I’m excited for what the next wave of data catalogs has in store. And trust me: it’s more than just data.
If you want to learn more, reach out to Barr Moses.
Translated from: https://towardsdatascience.com/what-we-got-wrong-about-data-governance-365555993048