Calculating PageRanks with Apache Hadoop

Published: 2023/12/3

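I am currently following the Coursera course "Mining Massive Datasets". I have been interested in MapReduce and Apache Hadoop for some time, and with this course I hope to get more insight into when and how MapReduce can help solve real-world business problems (I described another approach elsewhere). The course focuses mainly on the theory behind the algorithms and less on the coding itself. The first week covers PageRank and how Google used it to rank pages. Luckily, there is a lot of material about this topic in combination with Hadoop. I ended up at one existing implementation and decided to take a closer look at its code.

What I did was take this code (I forked it) and rewrite it a little. I created unit tests for the mappers and reducers in the way I have described before. As a test case I used the example from the course. We have three web pages that link to each other and/or to themselves: A links to Y and M, Y links to A and to itself, and M links only to itself.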

This linking scheme should resolve to the following page ranks:

Y 7/33
A 5/33
M 21/33
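
These values can be checked independently of Hadoop with a few lines of plain Java. The sketch below is not part of the original project. It assumes the link structure used in the unit tests (A links to Y and M, Y links to A and Y, M links to M) and a taxation parameter of 0.8, which is an assumption on my side: it is the value for which power iteration reproduces the fractions above. The class name `PageRankSketch` is hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the original project.
class PageRankSketch {

    // Column-stochastic transition matrix, pages in the order Y, A, M.
    // Assumed links: Y -> {Y, A}, A -> {Y, M}, M -> {M}.
    static final double[][] M = {
        {0.5, 0.5, 0.0},   // into Y: half of Y's rank, half of A's rank
        {0.5, 0.0, 0.0},   // into A: half of Y's rank
        {0.0, 0.5, 1.0}    // into M: half of A's rank, all of M's rank
    };

    // Power iteration with taxation: r = beta * M * r + (1 - beta) / n.
    static double[] powerIterate(double[][] m, double beta, int iterations) {
        int n = m.length;
        double[] r = new double[n];
        Arrays.fill(r, 1.0 / n);              // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    sum += m[i][j] * r[j];
                }
                next[i] = beta * sum + (1.0 - beta) / n;
            }
            r = next;
        }
        return r;
    }

    public static void main(String[] args) {
        double[] r = powerIterate(M, 0.8, 100);   // beta = 0.8 is an assumption
        System.out.printf("Y=%.4f A=%.4f M=%.4f%n", r[0], r[1], r[2]);
    }
}
```

After 100 iterations the vector is very close to 7/33, 5/33 and 21/33 (roughly 0.2121, 0.1515 and 0.6364). Page M ends up with most of the rank because it only links to itself.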

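Since the MapReduce example code expects "Wiki page" XML as input, I created a test set of three small page documents, one per page; they reappear as input strings in the mapper unit test.

The original page already explains well how the job works globally, so I will only describe the unit tests I created. With the original explanation and my unit tests you should be able to work through the code and understand what happens.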

As described there, the total job is divided into three parts:

Parsing

Calculating

Ordering

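In the parsing part, the raw XML is taken, split into pages and mapped so that the output has the page as key and the pages it links to as values. The input for the unit test is therefore the three "Wiki" page XML documents mentioned above. The expected output is the "title" of each page paired with each page it links to. The unit test looks like this: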

package net.pascalalma.hadoop.job1;
...
public class WikiPageLinksMapperTest {

    MapDriver<LongWritable, Text, Text, Text> mapDriver;

    // Page A links to Y and M
    String testPageA = " <page>\n" +
            " <title>A</title>\n" +
            " ..." +
            " <text xml:space=\"preserve\" bytes=\"6523\">[[Y]] [[M]]</text>\n" +
            " </revision>";

    // Page Y links to A and to itself
    String testPageY = " <page>\n" +
            " <title>Y</title>\n" +
            " ..." +
            " <text xml:space=\"preserve\" bytes=\"6523\">[[A]] [[Y]]</text>\n" +
            " </revision>\n" +
            " </page>";

    // Page M links only to itself
    String testPageM = " <page>\n" +
            " <title>M</title>\n" +
            " ..." +
            " <text xml:space=\"preserve\" bytes=\"6523\">[[M]]</text>\n" +
            " </revision>\n" +
            " </page>";

    @Before
    public void setUp() {
        WikiPageLinksMapper mapper = new WikiPageLinksMapper();
        mapDriver = MapDriver.newMapDriver(mapper);
    }

    @Test
    public void testMapper() throws IOException {
        mapDriver.withInput(new LongWritable(1), new Text(testPageA));
        mapDriver.withInput(new LongWritable(2), new Text(testPageM));
        mapDriver.withInput(new LongWritable(3), new Text(testPageY));

        // One (page, linked page) pair per outgoing link
        mapDriver.withOutput(new Text("A"), new Text("Y"));
        mapDriver.withOutput(new Text("A"), new Text("M"));
        mapDriver.withOutput(new Text("Y"), new Text("A"));
        mapDriver.withOutput(new Text("Y"), new Text("Y"));
        mapDriver.withOutput(new Text("M"), new Text("M"));

        // false: the order of the output records is not checked
        mapDriver.runTest(false);
    }
}
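
The mapper under test is not shown in this article. As a rough illustration of what the test implies, the sketch below extracts the title and every `[[link]]` from a page with two regular expressions and returns one title/link pair per link. It is plain Java without Hadoop types, and the class name, method name and patterns are my own assumptions, not the code of `WikiPageLinksMapper`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the parsing logic; not the project's mapper.
class WikiLinkExtractSketch {

    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");
    private static final Pattern LINK = Pattern.compile("\\[\\[(.+?)\\]\\]");

    // Returns "title<TAB>link" strings, one per [[link]] found in the page XML.
    static List<String> extract(String pageXml) {
        List<String> pairs = new ArrayList<>();
        Matcher title = TITLE.matcher(pageXml);
        if (!title.find()) {
            return pairs;                      // no title, nothing to emit
        }
        Matcher link = LINK.matcher(pageXml);
        while (link.find()) {
            pairs.add(title.group(1) + "\t" + link.group(1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String page = "<page>\n<title>A</title>\n<text>[[Y]] [[M]]</text>\n</page>";
        System.out.println(extract(page));
    }
}
```

In a real mapper each pair would be written to the context as a key and a value instead of being collected in a list.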

The output of the mapper is the input for our reducer. The unit test for that one looks like this:

package net.pascalalma.hadoop.job1;
...
public class WikiLinksReducerTest {

    ReduceDriver<Text, Text, Text, Text> reduceDriver;

    @Before
    public void setUp() {
        WikiLinksReducer reducer = new WikiLinksReducer();
        reduceDriver = ReduceDriver.newReduceDriver(reducer);
    }

    @Test
    public void testReducer() throws IOException {
        // All pages that page A links to
        List<Text> valuesA = new ArrayList<Text>();
        valuesA.add(new Text("M"));
        valuesA.add(new Text("Y"));

        reduceDriver.withInput(new Text("A"), valuesA);

        // Expected: initial rank 1.0, a tab, then the comma-separated links
        reduceDriver.withOutput(new Text("A"), new Text("1.0\tM,Y"));
        reduceDriver.runTest();
    }
}

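As the unit test shows, we expect the reducer to reduce the input to the "initial" page rank of 1.0, concatenated with all pages that the (key) page has outgoing links to. That is the output of this phase, and it is used as input for the "calculating" phase.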
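
The record format that this phase produces is simple enough to sketch in plain Java. The helper below is hypothetical and only mirrors what the test expects (initial rank, a tab, then the links joined with commas); it is not the code of `WikiLinksReducer`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that mirrors the record format checked by the test.
class InitialRankSketch {

    static final String INITIAL_RANK = "1.0";

    // Builds "1.0<TAB>link1,link2,..." for the links of one page.
    static String reduce(List<String> links) {
        return INITIAL_RANK + "\t" + String.join(",", links);
    }

    public static void main(String[] args) {
        // Same values as in the unit test for page A
        System.out.println(reduce(Arrays.asList("M", "Y")));
    }
}
```

Together with the key, the line for page A becomes `A`, a tab, `1.0`, a tab and `M,Y`, which is the input format used by the next phase.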
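
In the calculating part, the incoming page ranks are recalculated to implement the "power iteration" method. This step is executed multiple times to obtain an acceptable page rank for the given set of pages. As said before, the output of the previous step is the input of this step, as we can see in the unit test for this mapper: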

package net.pascalalma.hadoop.job2;
...
public class RankCalculateMapperTest {

    MapDriver<LongWritable, Text, Text, Text> mapDriver;

    @Before
    public void setUp() {
        RankCalculateMapper mapper = new RankCalculateMapper();
        mapDriver = MapDriver.newMapDriver(mapper);
    }

    @Test
    public void testMapper() throws IOException {
        // Input lines: page <TAB> rank <TAB> comma-separated outgoing links
        mapDriver.withInput(new LongWritable(1), new Text("A\t1.0\tM,Y"));
        mapDriver.withInput(new LongWritable(2), new Text("M\t1.0\tM"));
        mapDriver.withInput(new LongWritable(3), new Text("Y\t1.0\tY,A"));

        // Output values: either "source page <TAB> rank <TAB> number of links",
        // the original link list prefixed with '|', or the marker "!"
        mapDriver.withOutput(new Text("M"), new Text("A\t1.0\t2"));
        mapDriver.withOutput(new Text("A"), new Text("Y\t1.0\t2"));
        mapDriver.withOutput(new Text("Y"), new Text("A\t1.0\t2"));
        mapDriver.withOutput(new Text("A"), new Text("|M,Y"));
        mapDriver.withOutput(new Text("M"), new Text("M\t1.0\t1"));
        mapDriver.withOutput(new Text("Y"), new Text("Y\t1.0\t2"));
        mapDriver.withOutput(new Text("A"), new Text("!"));
        mapDriver.withOutput(new Text("M"), new Text("|M"));
        mapDriver.withOutput(new Text("M"), new Text("!"));
        mapDriver.withOutput(new Text("Y"), new Text("|Y,A"));
        mapDriver.withOutput(new Text("Y"), new Text("!"));

        // false: the order of the output records is not checked
        mapDriver.runTest(false);
    }
}
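
To make the three kinds of output records easier to read, here is a minimal plain-Java sketch of a mapping step that produces the same records for one input line. It is reconstructed from the expected test output only; the class and method names are hypothetical, and the real `RankCalculateMapper` may be organized differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; reproduces the record shapes seen in the test.
class RankShareMapSketch {

    // Takes "page<TAB>rank<TAB>link1,link2" and returns {key, value} pairs.
    static List<String[]> map(String line) {
        String[] parts = line.split("\t");
        String page = parts[0];
        String rank = parts[1];
        String linkList = parts.length > 2 ? parts[2] : "";

        List<String[]> out = new ArrayList<>();
        out.add(new String[] {page, "!"});            // marks the page as existing
        if (!linkList.isEmpty()) {
            String[] links = linkList.split(",");
            for (String target : links) {
                // Each target learns the source page, its rank and its link count
                out.add(new String[] {target, page + "\t" + rank + "\t" + links.length});
            }
        }
        out.add(new String[] {page, "|" + linkList}); // keeps the links for the next round
        return out;
    }

    public static void main(String[] args) {
        for (String[] kv : map("A\t1.0\tM,Y")) {
            System.out.println(kv[0] + " => " + kv[1].replace("\t", " "));
        }
    }
}
```

For the line of page A this yields four records: the marker for A, one rank share each for M and Y, and the preserved link list for A.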

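The output here is explained in the source page. The "extra" items with '!' and '|' are needed by the reduce step for the calculation. The unit test for the reducer looks like this: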

package net.pascalalma.hadoop.job2;
...
public class RankCalculateReduceTest {

    ReduceDriver<Text, Text, Text, Text> reduceDriver;

    @Before
    public void setUp() {
        RankCalculateReduce reducer = new RankCalculateReduce();
        reduceDriver = ReduceDriver.newReduceDriver(reducer);
    }

    @Test
    public void testReducer() throws IOException {
        // Records grouped under key M
        List<Text> valuesM = new ArrayList<Text>();
        valuesM.add(new Text("A\t1.0\t2"));
        valuesM.add(new Text("M\t1.0\t1"));
        valuesM.add(new Text("|M"));
        valuesM.add(new Text("!"));
        reduceDriver.withInput(new Text("M"), valuesM);

        // Records grouped under key A
        List<Text> valuesA = new ArrayList<Text>();
        valuesA.add(new Text("Y\t1.0\t2"));
        valuesA.add(new Text("|M,Y"));
        valuesA.add(new Text("!"));
        reduceDriver.withInput(new Text("A"), valuesA);

        // Records grouped under key Y
        List<Text> valuesY = new ArrayList<Text>();
        valuesY.add(new Text("Y\t1.0\t2"));
        valuesY.add(new Text("|Y,A"));
        valuesY.add(new Text("!"));
        valuesY.add(new Text("A\t1.0\t2"));
        reduceDriver.withInput(new Text("Y"), valuesY);

        // Expected: new rank <TAB> unchanged link list
        reduceDriver.withOutput(new Text("A"), new Text("0.6\tM,Y"));
        reduceDriver.withOutput(new Text("M"), new Text("1.4000001\tM"));
        reduceDriver.withOutput(new Text("Y"), new Text("1.0\tY,A"));

        // false: the order of the output records is not checked
        reduceDriver.runTest(false);
    }
}

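As shown, the output of the mapper is recreated as input, and we check that the output of the reducer matches the first iteration of the page rank calculation. Each iteration produces the same output format, but possibly with different page rank values.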
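
The expected values in the reducer test (0.6, 1.4 and 1.0) are consistent with the formula new rank = 0.8 × (sum of rank / link count over all incoming records) + 0.2. The sketch below applies that formula to the same input values. The damping factor 0.8 is inferred from the test numbers and is an assumption, as are the class and method names; the sketch uses `double`, so it does not reproduce the `float` rounding visible in `1.4000001`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the rank update implied by the reducer test.
class RankReduceSketch {

    static final double DAMPING = 0.8;   // assumption, inferred from the test values

    // Values are "source<TAB>rank<TAB>linkCount", "|links" or "!".
    static double newRank(List<String> values) {
        double sumOfShares = 0.0;
        for (String value : values) {
            if (value.equals("!") || value.startsWith("|")) {
                continue;                 // marker and link-list records carry no rank
            }
            String[] parts = value.split("\t");
            double rank = Double.parseDouble(parts[1]);
            int linkCount = Integer.parseInt(parts[2]);
            sumOfShares += rank / linkCount;   // the source spreads its rank over its links
        }
        return DAMPING * sumOfShares + (1.0 - DAMPING);
    }

    public static void main(String[] args) {
        List<String> valuesM = Arrays.asList("A\t1.0\t2", "M\t1.0\t1", "|M", "!");
        System.out.println("M: " + newRank(valuesM));
    }
}
```

Note that these ranks are not normalized: they add up to the number of pages (3) rather than to 1, so they have to be divided by 3 before they can be compared with fractions such as 21/33.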
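
The final step is the "ordering" part. It is quite straightforward, and so is the unit test. This part contains only a mapper, which takes the output of the previous step and "reformats" it into the wanted format: page rank plus page, ordered by page rank. The sorting by key is done by the Hadoop framework when the mapper result is passed to the reducer step, so the ordering is not reflected in the mapper unit test. The code for this unit test is: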

package net.pascalalma.hadoop.job3;
...
public class RankingMapperTest {

    MapDriver<LongWritable, Text, FloatWritable, Text> mapDriver;

    @Before
    public void setUp() {
        RankingMapper mapper = new RankingMapper();
        mapDriver = MapDriver.newMapDriver(mapper);
    }

    @Test
    public void testMapper() throws IOException {
        mapDriver.withInput(new LongWritable(1), new Text("A\t0.454545\tM,Y"));
        mapDriver.withInput(new LongWritable(2), new Text("M\t1.90\tM"));
        mapDriver.withInput(new LongWritable(3), new Text("Y\t0.68898\tY,A"));

        // Please note that we cannot check for ordering here because that is
        // done by Hadoop after the Map phase
        mapDriver.withOutput(new FloatWritable(0.454545f), new Text("A"));
        mapDriver.withOutput(new FloatWritable(1.9f), new Text("M"));
        mapDriver.withOutput(new FloatWritable(0.68898f), new Text("Y"));

        mapDriver.runTest(false);
    }
}

So here we only check that the mapper takes the input and formats the output correctly.
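
Because the ordering itself happens inside Hadoop, it can help to see the whole step in one place. The sketch below swaps each line into a rank/page pair, as the mapper test expects, and then sorts the pairs by rank in ascending order as a stand-in for the framework's sort on the map output key. It is plain Java with hypothetical names, and the ascending order is my assumption about the default sort rather than something stated in this article.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the ordering step, without Hadoop.
class RankOrderSketch {

    // Turns "page<TAB>rank<TAB>links" into {rank, page}.
    static String[] toRankAndPage(String line) {
        String[] parts = line.split("\t");
        return new String[] {parts[1], parts[0]};
    }

    // Returns the page names ordered by ascending rank.
    static List<String> order(List<String> lines) {
        List<String[]> records = new ArrayList<>();
        for (String line : lines) {
            records.add(toRankAndPage(line));
        }
        // Stand-in for the sort by key that Hadoop performs after the map phase
        records.sort(Comparator.comparingDouble((String[] r) -> Double.parseDouble(r[0])));
        List<String> pages = new ArrayList<>();
        for (String[] record : records) {
            pages.add(record[1]);
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("A\t0.454545\tM,Y");
        lines.add("M\t1.90\tM");
        lines.add("Y\t0.68898\tY,A");
        System.out.println(order(lines));
    }
}
```

With the three input lines from the test, the pages come out as A, Y, M, from the lowest rank to the highest.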

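That concludes the unit test examples. With this project you should be able to test the code yourself and get a deeper understanding of how the original code works. It certainly helped me to understand it!

The complete version of the code, including the unit tests, is available in the author's fork of the project.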

Original article: https://www.javacodegeeks.com/2015/02/calculate-pageranks-apache-hadoop.html
