
by Simon Späti

Use these open-source tools for Data Warehousing

These days, everyone talks about open-source software. However, this is still not common in the Data Warehousing (DWH) field. Why is this?

For this post, I chose some open-source technologies and used them together to build a full data architecture for a Data Warehouse system.

I went with Apache Druid for data storage, Apache Superset for querying, and Apache Airflow as a task orchestrator.

Druid — the data store

Druid is an open-source, column-oriented, distributed data store written in Java. It's designed to quickly ingest massive quantities of event data and to provide low-latency queries on top of that data.
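
To make the ingestion side concrete, here is a minimal sketch of submitting a native batch ingestion task to Druid's Overlord via its task API. It assumes a recent Druid version; the host, the "events" datasource, the dimension names, and the input URI are hypothetical placeholders.

```python
import requests

# Native batch ingestion spec; the field layout follows recent Druid versions.
ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "events",  # hypothetical datasource name
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "event_type"]},
            "granularitySpec": {
                "segmentGranularity": "day",
                "queryGranularity": "hour",
                "rollup": True,
            },
        },
        "ioConfig": {
            "type": "index_parallel",
            # Placeholder input location; could also be local files, S3, etc.
            "inputSource": {"type": "http", "uris": ["https://example.com/events.json"]},
            "inputFormat": {"type": "json"},
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}

# Submit the task to the Overlord's task API (default port 8090).
resp = requests.post(
    "http://localhost:8090/druid/indexer/v1/task",
    json=ingestion_spec,
)
print(resp.json())  # contains the task id on success
```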

Why use Druid?

Druid has many key features, including sub-second OLAP queries, real-time streaming ingestion, scalability, and cost effectiveness.

With the comparison of modern OLAP technologies in mind, I chose Druid over ClickHouse, Pinot, and Apache Kylin. Recently, Microsoft announced they will add Druid to their Azure HDInsight 4.0.

Why not Druid?

Carter Shanklin wrote a detailed post about Druid's limitations at Hortonworks.com. The main issues are its limited support for SQL joins and advanced SQL capabilities.

The Architecture of Druid

Druid is scalable due to its cluster architecture. You have three different node types: the Middle Manager node, the Historical node, and the Broker node.

The great thing is that you can add as many nodes as you want in the specific area that fits best for you. If you have many queries to run, you can add more Brokers. Or, if a lot of data needs to be batch-ingested, you would add more Middle Managers, and so on.

A simple architecture is shown below. You can read more about Druid's design here.

Apache Superset — the UI

The easiest way to query against Druid is through a lightweight, open-source tool called Apache Superset.

It is easy to use and has all common chart types like Bubble Chart, Word Count, Heatmaps, Boxplot and many more.

Druid provides a REST API and, in the newest version, also a SQL query API. This makes it easy to use with any tool, whether standard SQL, an existing BI tool, or a custom application.
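
As a small illustration of that SQL query API, here is a sketch that sends a SQL statement to the Broker's /druid/v2/sql/ endpoint. The Broker host/port and the "events" datasource are assumptions carried over from the ingestion example above.

```python
import requests

# POST a SQL statement to Druid's SQL endpoint on the Broker (default port 8082).
resp = requests.post(
    "http://localhost:8082/druid/v2/sql/",
    json={"query": "SELECT event_type, COUNT(*) AS cnt FROM events GROUP BY event_type"},
)

# Druid returns a JSON array of result rows.
for row in resp.json():
    print(row)
```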

Apache Airflow — the Orchestrator

As mentioned in Orchestrators — Scheduling and monitoring workflows, this is one of the most critical decisions.

In the past, ETL tools like Microsoft SQL Server Integration Services (SSIS) and others were widely used. They were where your data transformation, cleaning and normalisation took place.

In more modern architectures, these tools aren't enough anymore.

Moreover, code and data transformation logic are much more valuable to other data-savvy people in the company.

I highly recommend you read a blog post from Maxime Beauchemin about Functional Data Engineering — a modern paradigm for batch data processing. It goes much deeper into how modern data pipelines should look.

Also consider reading The Downfall of the Data Engineer, where Max explains the breaking of "data silos" and much more.

Why use Airflow?

Apache Airflow is a very popular tool for task orchestration. Airflow is written in Python, and workflows are defined as Directed Acyclic Graphs (DAGs), which are also written in Python.
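
Here is a minimal sketch of what such a DAG looks like, using Airflow 1.x-style imports (matching the era of this post). The DAG id and the two placeholder shell tasks are made up for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2018, 11, 1),
    "retries": 1,
}

# A daily pipeline with two placeholder tasks: extract, then load.
dag = DAG(
    dag_id="example_dwh_pipeline",  # hypothetical name
    default_args=default_args,
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

extract >> load  # load runs only after extract succeeds
```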

Instead of encapsulating your critical transformation logic somewhere in a tool, you place it where it belongs: inside the orchestrator.

Another advantage is using plain Python. There is no need to encapsulate other dependencies or requirements for tasks like fetching files from an FTP server, copying data from A to B, or writing a batch file. You do that, and everything else, in the same place, as the sketch below shows.
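
As an example of the "fetch from an FTP" case, here is a sketch of a plain-Python function wired into the DAG from the previous sketch as a PythonOperator. The host, credentials, and paths are placeholders.

```python
from ftplib import FTP  # standard library, no extra dependency

from airflow.operators.python_operator import PythonOperator

def fetch_from_ftp():
    # Download one file from a (hypothetical) FTP server to local disk.
    ftp = FTP("ftp.example.com")
    ftp.login("user", "secret")  # placeholder credentials
    with open("/tmp/export.csv", "wb") as f:
        ftp.retrbinary("RETR /exports/export.csv", f.write)
    ftp.quit()

fetch = PythonOperator(
    task_id="fetch_from_ftp",
    python_callable=fetch_from_ftp,
    dag=dag,  # the DAG object defined in the earlier sketch
)
```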

Features of Airflow

Moreover, you get a fully functional overview of all current tasks in one place.

Other relevant features of Airflow are that you write workflows as if you were writing programs, and that external jobs like Databricks, Spark, etc. are no problem.

Job testing goes through Airflow itself. That includes passing parameters to other jobs downstream, or verifying what is running on Airflow and seeing the actual code. The log files and other metadata are accessible through the web GUI.
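
For instance, passing a value from one task to a downstream task goes through Airflow's XCom mechanism. Here is a sketch, again in Airflow 1.x style and attached to the same dag object as before, with made-up task ids and values:

```python
from airflow.operators.python_operator import PythonOperator

def produce_batch_id(**kwargs):
    # A returned value is automatically pushed to XCom under "return_value".
    return "batch-2018-11-29"

def consume_batch_id(**kwargs):
    # Pull the upstream task's return value from XCom via the task instance.
    batch_id = kwargs["ti"].xcom_pull(task_ids="produce_batch_id")
    print("processing", batch_id)

produce = PythonOperator(
    task_id="produce_batch_id",
    python_callable=produce_batch_id,
    provide_context=True,  # exposes the task instance ("ti") in kwargs
    dag=dag,
)
consume = PythonOperator(
    task_id="consume_batch_id",
    python_callable=consume_batch_id,
    provide_context=True,
    dag=dag,
)

produce >> consume
```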

(Re)running only parts of the workflow, together with the dependent tasks, is a crucial feature that comes out of the box when you create your workflows with Airflow. Jobs/tasks run in a context, the scheduler passes in the necessary details, and the work gets distributed across your cluster at the task level, not at the DAG level.

For many more features, visit the full list.

ETL with Apache Airflow

If you want to start with Apache Airflow as your new ETL tool, start with the ETL best practices with Airflow guide. It has simple ETL examples with plain SQL, with HIVE, with Data Vault, Data Vault 2, and Data Vault with big data processes. It gives you an excellent overview of what's possible and how you would approach it.

At the same time, there is a Docker container that you can use, meaning you don't even have to set up any infrastructure. You can pull the container from here.

For the GitHub repo, follow the link to etl-with-airflow.

Conclusion

If you're searching for an open-source data architecture, you cannot ignore Druid for speedy OLAP responses, Apache Airflow as an orchestrator that keeps your data lineage and schedules in line, and an easy-to-use dashboard tool like Apache Superset.

My experience so far is that Druid is bloody fast and a perfect fit for OLAP cube replacements in a traditional way, but it still needs a smoother setup experience for installing clusters, ingesting data, viewing logs, etc. If you need that, have a look at Imply, which was created by the founders of Druid. It builds all the services around Druid that you need. Unfortunately, though, it's not open-source.

Apache Airflow and its capabilities as an orchestrator are something that has not yet taken hold in traditional Business Intelligence environments. I believe this change comes very naturally when you start using open-source and newer technologies.

And Apache Superset is an easy and fast way to get up and running and to show data from Druid. There are better tools, like Tableau, but they are not free. That's why Superset fits well in this ecosystem if you're already using the above open-source technologies. But as an enterprise company, you might want to spend some money in that category, because the dashboard is what users see at the end of the day.

Related Links:

  • Understanding Apache Airflow's key concepts
  • How Druid enables analytics at Airbnb
  • Google launches Cloud Composer, a new workflow automation tool for developers
  • A fully managed workflow orchestration service built on Apache Airflow
  • Integrating Apache Airflow and Databricks: Building ETL pipelines with Apache Spark
  • ETL with Apache Airflow
  • What is Data Engineering and the future of Data Warehousing
  • Imply — Managed Druid platform (closed-source)
  • Ultra-fast OLAP Analytics with Apache Hive and Druid

Originally published at www.sspaeti.com on November 29, 2018.

Translated from: https://www.freecodecamp.org/news/open-source-data-warehousing-druid-apache-airflow-superset-f26d149c9b7/
