The Difference Between Big Data MapReduce and MPP Databases

Published: 2024/1/1 | Database | 豆豆

In short, MR (MapReduce) is a programming model. You can use it to implement, by hand, the same things an MPP system does.

MPP, by contrast, is a SQL compute engine.

What is the difference between MR's divide-and-conquer strategy and Massively Parallel Processor databases (MPP databases, typified by AWS Redshift, Teradata, and Microsoft's Azure SQL Data Warehouse)?

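An MPP database hands data to different nodes at storage time, according to a specified distribution method. At query time each node processes its own portion with its own independent compute resources, and the partial results are then gathered on a leader node (also called a control node). Query optimization closely resembles that of a traditional relational database and involves concepts such as indexes and statistics. MPP systems come in shared everything, shared disk, and shared nothing variants.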

An example shows the difference:

Take a sales table with a product category column. We want the total sales for each product category.

Category a	1
Category a	2
Category b	3
Category b	1
Category c	4

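MR approach: in the map phase, the rows are handed to different worker (slave) nodes, which emit the sales figures keyed by category. The shuffle then sends all values for the same category over the network to the same reduce node, and the reduce phase sums them to produce the result.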
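The MR flow just described can be sketched in a single process. This is only an illustration under stated assumptions: two mappers, two reducers, and a toy partitioner based on character codes. The names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical and do not belong to any real framework.

```python
from collections import defaultdict

# Sample rows from the sales table above: (category, units sold).
SALES = [("a", 1), ("a", 2), ("b", 3), ("b", 1), ("c", 4)]


def map_phase(split):
    """Emit one (key, value) pair per input row."""
    return [(category, amount) for category, amount in split]


def shuffle(mapped_outputs, num_reducers):
    """Route every pair to a reducer chosen by its key and group values per key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped_outputs:
        for key, value in pairs:
            # Toy deterministic partitioner (an assumption, not a real hash).
            reducer = sum(map(ord, key)) % num_reducers
            partitions[reducer][key].append(value)
    return partitions


def reduce_phase(partition):
    """Sum the grouped values of each key; keys are visited in sorted order."""
    return {key: sum(values) for key, values in sorted(partition.items())}


def run_job(rows, num_mappers=2, num_reducers=2):
    # Input splits: each mapper sees an arbitrary slice of the rows.
    splits = [rows[i::num_mappers] for i in range(num_mappers)]
    mapped = [map_phase(split) for split in splits]
    totals = {}
    for partition in shuffle(mapped, num_reducers):
        totals.update(reduce_phase(partition))
    return totals


totals = run_job(SALES)
print(totals)
```

Note that every emitted pair passes through `shuffle`, which stands in for the network transfer between map and reduce nodes.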

MPP approach: each node computes the per-category totals over the data it stores locally and sends these partial results to the leader node. The leader merges them, which in this example amounts to a few additions.
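A minimal sketch of the MPP approach, assuming the five sample rows are already stored on two compute nodes. The placement in `NODE_STORAGE` and the function names are made up for illustration.

```python
from collections import Counter

# Rows as they might already sit on two compute nodes (placement is an assumption).
NODE_STORAGE = {
    "node1": [("a", 1), ("b", 3), ("c", 4)],
    "node2": [("a", 2), ("b", 1)],
}


def local_aggregate(rows):
    """Runs on each compute node: total sales per category for local rows only."""
    totals = Counter()
    for category, amount in rows:
        totals[category] += amount
    return totals


def leader_merge(partials):
    """Runs on the leader node: add up the small partial results."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values instead of replacing
    return dict(merged)


partials = [local_aggregate(rows) for rows in NODE_STORAGE.values()]
result = leader_merge(partials)
print(result)  # {'a': 3, 'b': 4, 'c': 4}
```

Only the per-node partial totals travel to the leader, never the raw rows.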

In this scenario MPP is considerably more efficient than MR: each MPP node works directly on the data it already stores, whereas MR has to redistribute data over the network.

In practice MPP is indeed more efficient for this kind of work, so for structured big data MPP remains the first choice.

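As stated at the beginning, the significance of MR is that it is a programming model. Compared with MPP, it can handle more complex problems.

With MR you write the Map and Reduce code yourself, so you are not restricted by SQL syntax. Compared with SQL's handful of aggregate functions (sum, avg, max, rank...), MR lets you implement whatever data processing logic you need and spread the data across different nodes for computation.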
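As an illustration of custom reduce logic, the sketch below splits each user's events into sessions, a computation that goes beyond a single built-in aggregate. The event data, the 30-unit gap, and the function names are all hypothetical; the grouping loop stands in for the shuffle.

```python
from collections import defaultdict

# Hypothetical click events: (user, timestamp in minutes).
EVENTS = [("u1", 0), ("u2", 5), ("u1", 10), ("u1", 100), ("u2", 50), ("u1", 120)]


def map_event(event):
    """Map: key each event by user."""
    user, timestamp = event
    return user, timestamp


def reduce_sessions(timestamps, max_gap=30):
    """Reduce: start a new session whenever the gap exceeds max_gap."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Stand-in for the shuffle: group mapped values by key.
grouped = defaultdict(list)
for key, value in map(map_event, EVENTS):
    grouped[key].append(value)

sessions_by_user = {user: reduce_sessions(ts) for user, ts in grouped.items()}
print(sessions_by_user)
```

The reducer is ordinary code, so any per-key logic that fits in a function can be plugged in the same way.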

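The following is reposted:

Shared Everything: generally applies to a single host, which shares CPU, memory, and I/O completely and transparently. Its parallel processing capability is the weakest of the three. A typical example is SQL Server.

Shared Disk: each processing unit has its own private CPU and memory but shares the disk system. A typical example is Oracle RAC. Because the data is shared, parallel processing capability can be raised by adding nodes, and scalability is fairly good. It resembles the SMP (symmetric multiprocessing) model, but once the storage interface is saturated, adding nodes no longer yields higher performance.

Shared Nothing: each processing unit has its own private CPU, memory, and disk, and no resources are shared. It resembles the MPP (massively parallel processing) model. Processing units communicate through a protocol, and both parallel processing and scalability are better. Typical examples are DB2 DPF and Hadoop: the nodes are independent of one another, each processes its own data, and the results may be aggregated upward or passed between nodes.

What we usually call sharding is in fact a Shared Nothing architecture. A table is split horizontally at the physical storage level and assigned to multiple servers (or multiple instances). Each server can work independently and all of them share a common schema, as with MySQL Proxy and various Google architectures. Processing power and capacity can be increased simply by adding servers.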

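First, MPP must eliminate the work of partitioning data by hand. This is the main limitation of MySQL in internet applications.

In addition, the partitioning in MPP must stay even at all times, otherwise some nodes will take noticeably longer to process their data than others.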

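As for whether the workload should be spread evenly, there are homogeneous and heterogeneous approaches. Homogeneous means that all nodes load data at the same time. Heterogeneous means that some nodes can be designated specifically for loading data (logically, not physically), while all other nodes handle queries. Aster Data and Greenplum both take this approach. Neither approach has a clear advantage. With a homogeneous workload, the software vendor has to guarantee that the load on all nodes is balanced. With a heterogeneous workload, you can manually assign more nodes to loading when you find data loading slow. The difference is really one of automation versus manual control and comes down to personal preference.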

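Another question is how a query is initiated. For example, to find the 10 best-selling products, each node first computes its own top 10 and then passes the result upward for aggregation. During aggregation, some nodes inevitably do more work than others.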
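The top-N pattern above can be sketched as follows. It relies on one important assumption that the text does not state: each product's sales live on exactly one node (for example, the table is distributed by product), so merging local top-N lists gives the exact global answer. The node names and totals are invented.

```python
import heapq

# Hypothetical per-node sales totals; every product appears on one node only.
NODE_SALES = {
    "node1": {"p1": 50, "p2": 20, "p3": 75},
    "node2": {"p4": 60, "p5": 10},
    "node3": {"p6": 90, "p7": 30, "p8": 5},
}


def local_top_n(totals, n):
    """Runs on each node: the n best sellers among locally stored products."""
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])


def global_top_n(node_totals, n):
    """Runs on the aggregating node: merge at most n candidates per node."""
    candidates = []
    for totals in node_totals.values():
        candidates.extend(local_top_n(totals, n))
    return heapq.nlargest(n, candidates, key=lambda kv: kv[1])


top3 = global_top_n(NODE_SALES, 3)
print(top3)  # [('p6', 90), ('p3', 75), ('p4', 60)]
```

If a product's rows were spread over several nodes, the nodes would first have to exchange or pre-aggregate per-product totals, since a local top-N could miss a product that only ranks highly in total.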

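The above is only a simple single-table query. A join between two tables also raises the question of how intermediate results are passed between nodes. One option is to distribute both the large table and the small table evenly and then aggregate the results computed on each node (possibly in two rounds). The other is to distribute the large table evenly and send the small table's data to every node, so that only one round of aggregation is needed. (One special case is the Oracle Partition Wise Join.) It is hard to say which execution plan is better, and data volume may affect them differently. Some vendors have specifically optimized such execution plans, for example EMC Greenplum and HP Vertica. Many trade-offs are involved, such as the data distribution model, the cost of redistributing data, the speed of the network cards that exchange intermediate data, the read/write speed of the storage media, and the data volume (the computation generally stores intermediate results in temporary tables).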
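The two join plans can be compared with a small sketch. Everything here is illustrative: the tables, the toy `hash_node` function, and the function names are assumptions, and real engines choose between such plans with a cost-based optimizer.

```python
def hash_node(key, num_nodes):
    """Toy deterministic hash used to pick a node for a join key."""
    return sum(map(ord, str(key))) % num_nodes


def local_join(orders, products):
    """Join on product id using only rows present on one node."""
    names = dict(products)
    return [(pid, qty, names[pid]) for pid, qty in orders if pid in names]


def broadcast_join(orders_by_node, products):
    """Plan A: copy the small table to every node, then join locally."""
    joined = []
    for orders in orders_by_node:
        joined.extend(local_join(orders, products))
    return joined


def repartition_join(orders_by_node, products_by_node):
    """Plan B: move rows of both tables so matching keys meet on one node."""
    n = len(orders_by_node)
    orders_moved = [[] for _ in range(n)]
    products_moved = [[] for _ in range(n)]
    for rows in orders_by_node:
        for row in rows:
            orders_moved[hash_node(row[0], n)].append(row)
    for rows in products_by_node:
        for row in rows:
            products_moved[hash_node(row[0], n)].append(row)
    joined = []
    for node in range(n):
        joined.extend(local_join(orders_moved[node], products_moved[node]))
    return joined


# Hypothetical data: a larger orders table and a small products table.
ORDERS_BY_NODE = [[("p1", 2), ("p2", 1)], [("p1", 5), ("p3", 4)]]
PRODUCTS = [("p1", "pen"), ("p2", "ink"), ("p3", "pad")]
PRODUCTS_BY_NODE = [PRODUCTS[:2], PRODUCTS[2:]]

plan_a = broadcast_join(ORDERS_BY_NODE, PRODUCTS)
plan_b = repartition_join(ORDERS_BY_NODE, PRODUCTS_BY_NODE)
print(sorted(plan_a) == sorted(plan_b))
```

Both plans return the same rows. Plan A copies the whole small table to every node, while Plan B moves rows of both tables, which is why table sizes and network speed decide which one is cheaper.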

Original sources of the reposted sections:

The difference between Shared Everything and share-nothing (seteor's column, CSDN blog)

MPP in data warehouse technology (fengyuruhui123's blog, CSDN blog)

The following passage describes how an MPP system (Azure Data Warehouse) distributes a large table across nodes (Dedicated SQL pool (formerly SQL DW) architecture - Azure Synapse Analytics | Microsoft Docs):

Hash-distributed tables (reposter's note: distribute by hash on a column that is frequently used in filters or joins)

A hash distributed table can deliver the highest query performance for joins and aggregations on large tables.

To shard data into a hash-distributed table, SQL Data Warehouse uses a hash function to deterministically assign each row to one distribution. In the table definition, one of the columns is designated as the distribution column. The hash function uses the values in the distribution column to assign each row to a distribution.

The following diagram illustrates how a full (non-distributed table) gets stored as a hash-distributed table.

Each row belongs to one distribution.
A deterministic hash algorithm assigns each row to one distribution.
The number of table rows per distribution varies as shown by the different sizes of tables.

There are performance considerations for the selection of a distribution column, such as distinctness, data skew, and the types of queries that run on the system.
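A rough sketch of deterministic hash distribution and of how the choice of distribution column affects skew. The SHA-256 based hash, the four distributions, and the sample rows are assumptions made for illustration; they do not reflect the hash function or distribution count the real product uses.

```python
import hashlib
from collections import Counter

NUM_DISTRIBUTIONS = 4  # illustrative value only


def distribution_for(value, num_distributions=NUM_DISTRIBUTIONS):
    """Deterministically map a distribution-column value to a distribution."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_distributions


def rows_per_distribution(rows, column):
    """Count how many rows land in each distribution for a given column."""
    counts = Counter(distribution_for(row[column]) for row in rows)
    return [counts.get(d, 0) for d in range(NUM_DISTRIBUTIONS)]


# Hypothetical table: order_id is highly distinct, country is heavily skewed.
rows = [{"order_id": i, "country": "US" if i % 10 else "NZ"} for i in range(1000)]

by_order_id = rows_per_distribution(rows, "order_id")
by_country = rows_per_distribution(rows, "country")
print("by order_id:", by_order_id)
print("by country: ", by_country)
```

Distributing on the distinct `order_id` column spreads rows fairly evenly, while distributing on `country` puts at least 900 of the 1000 rows into a single distribution, which is the data skew the passage warns about.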

Round-robin distributed tables (reposter's note: rows are placed without regard to their values, effectively at random)

A round-robin table is the simplest table to create and delivers fast performance when used as a staging table for loads.

A round-robin distributed table distributes data evenly across the table but without any further optimization. A distribution is first chosen at random and then buffers of rows are assigned to distributions sequentially. It is quick to load data into a round-robin table, but query performance can often be better with hash distributed tables. Joins on round-robin tables require reshuffling data and this takes additional time.
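The placement rule in the paragraph above (pick a random starting distribution, then assign buffers of rows sequentially) can be sketched like this. The buffer size of 2, the three distributions, and the function name are assumptions for illustration only.

```python
import random


def round_robin_assign(rows, num_distributions, buffer_size=2, rng=None):
    """Assign buffers of rows to distributions in turn, from a random start."""
    if rng is None:
        rng = random.Random()
    start = rng.randrange(num_distributions)  # first distribution chosen at random
    placement = [[] for _ in range(num_distributions)]
    for buffer_index, offset in enumerate(range(0, len(rows), buffer_size)):
        target = (start + buffer_index) % num_distributions
        placement[target].extend(rows[offset:offset + buffer_size])
    return placement


placement = round_robin_assign(list(range(12)), num_distributions=3,
                               rng=random.Random(0))
for index, bucket in enumerate(placement):
    print(index, bucket)
```

The row counts come out even, but rows with equal join keys end up in unrelated distributions, which is why joins on such tables need the extra reshuffling step mentioned above.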

Replicated Tables (reposter's note: this is similar to the distributed cache in Hadoop)

A replicated table provides the fastest query performance for small tables.

A table that is replicated caches a full copy of the table on each compute node. Consequently, replicating a table removes the need to transfer data among compute nodes before a join or aggregation. Replicated tables are best utilized with small tables. Extra storage is required and there are additional overheads that are incurred when writing data which make large tables impractical.

The following diagram shows a replicated table. For SQL Data Warehouse, the replicated table is cached on the first distribution on each compute node.

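The article below explains MR well. From the standpoint of underlying principles, MapReduce can be seen as the inverse process of binary search. Because the number of compute nodes is limited, however, both map and reduce are preceded by a partitioning step. Binary search requires sorted data, so between Map and Reduce there is a shuffle step that sorts the Map output. The input to Reduce is therefore sorted.

Big data databases: MPP vs MapReduce (Dreamy_LIN's blog, CSDN blog)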
