當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

Hadoop学习笔记(8) ——实战做个倒排索引

發(fā)布時間：2025/5/22 编程问答 29 豆豆

生活随笔收集整理的這篇文章主要介紹了 Hadoop学习笔记(8) ——实战做个倒排索引小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

Hadoop學習筆記(8)

——實戰(zhàn) 做個倒排索引

倒排索引是文檔檢索系統(tǒng)中最常用數(shù)據(jù)結(jié)構(gòu)。根據(jù)單詞反過來查在文檔中出現(xiàn)的頻率，而不是根據(jù)文檔來，所以稱倒排索引(Inverted Index)。結(jié)構(gòu)如下:

這張索引表中，每個單詞都對應著一系列的出現(xiàn)該單詞的文檔，權(quán)表示該單詞在該文檔中出現(xiàn)的次數(shù)?，F(xiàn)在我們假定輸入的是以下的文件清單：

T1 ： hello world hello china

T2 : hello hadoop

T3 ： bye world bye hadoop bye bye

輸入這些文件，我們最終將會得到這樣的索引文件：

bye????T3:4;

china????T1:1;

hadoop????T2:1;T3:1;

hello????T1:2;T2:1;

world????T1:1;T3:1;

接下來，我們就是要想辦法利用hadoop來把這個輸入，變成輸出。從上一章中，其實也就是分析如何將hadoop中的步驟個性化，讓其工作。整個步驟中，最主要的還是map和reduce過程，其它的都可稱之為配角，所以我們先來分析下map和reduce的過程將會是怎樣？

首先是Map的過程。Map的輸入是文本輸入，一條條的行記錄進入。輸出呢？應該包含：單詞、所在文件、單詞數(shù)。 Map的輸入是key-value。那這三個信息誰是key，誰是value呢？數(shù)量是需要累計的，單詞數(shù)肯定在value里，單詞在key中，文件呢？不同文件內(nèi)的相同單詞也不能累加的，所以這個文件應該在key中。這樣key中就應該包含兩個值：單詞和文件，value則是默認的數(shù)量1，用于后面reduce來進行合并。

所以Map后的結(jié)果應該是這樣的：

Key value

Hello;T1 1

Hello:T1 1

World:T1 1

China:T1 1

Hello:T2 1

…

即然這個key是復合的，所以常歸的類型已經(jīng)不能滿足我們的要求了，所以得設(shè)置一個復合健。復合健的寫法在上一章中描述到了。所以這里我們就直接上代碼：

public static class MyType implements WritableComparable<MyType>{

??????public MyType(){

??????}

??????private String word;

??????public String Getword(){return word;}

??????public void Setword(String value){ word = value;}

??????private String filePath;

??????public String GetfilePath(){return filePath;}

??????public void SetfilePath(String value){ filePath = value;}

??????@Override

??????public void write(DataOutput out) throws IOException {

?????????out.writeUTF(word);

?????????out.writeUTF(filePath);

??????}

??????@Override

??????public void readFields(DataInput in) throws IOException {

?????????word = in.readUTF();

?????????filePath = in.readUTF();

??????}

??????@Override

??????public int compareTo(MyType arg0) {

????????????if (word != arg0.word)

???????????????return word.compareTo(arg0.word);

?????????return filePath.compareTo(arg0.filePath);

??????}

}

有了這個復合健的定義后，這個Map函數(shù)就好寫了：

public static class InvertedIndexMapper extends

?????????Mapper<Object, Text, MyType, Text> {

??????public void map(Object key, Text value, Context context)

????????????throws InterruptedException, IOException {

?????????FileSplit split = (FileSplit) context.getInputSplit();

?????????StringTokenizer itr = new StringTokenizer(value.toString());

?????????while (itr.hasMoreTokens()) {

????????????MyType keyInfo = new MyType();

????????????keyInfo.Setword(itr.nextToken());

????????????keyInfo.SetfilePath(split.getPath().toUri().getPath().replace("/user/zjf/in/", ""));

????????????context.write(keyInfo, new Text("1"));

?????????}

??????}

???}

注意：第13行，路徑是全路徑的，為了看起來方便，我們把目錄替換掉，直接取文件名。

有了Map，接下來就可以考慮Recude了，以及在Map之后的Combine。Map的輸出的Key類型是MyType，所以Reduce以及Combine的輸入就必須是MyType了。

如果直接將Map的結(jié)果送到Reduce后，發(fā)現(xiàn)還需要做大量的工作來將Key中的單詞再重排一下。所以我們考慮在Reduce前加一個Combine，先將數(shù)量進行一輪合并。

這個Combine將會輸入下面的值：

Key value

bye????T3:4;

china????T1:1;

hadoop????T2:1;

hadoop????T3:1;

hello????T1:2;

hello????T2:1;

world????T1:1;

world????T3:1;

代碼如下：

public static class InvertedIndexCombiner extends

?????????Reducer<MyType, Text, MyType, Text> {

??????public void reduce(MyType key, Iterable<Text> values, Context context)

????????????throws InterruptedException, IOException {

?????????int sum = 0;

?????????for (Text value : values) {

????????????sum += Integer.parseInt(value.toString());

?????????}

?????????context.write(key, new Text(key.GetfilePath()+ ":" + sum));

??????}

???}

有了上面Combine后的結(jié)果，再進行Reduce就容易了，只需要將value結(jié)果進行合并處理：

public static class InvertedIndexReducer extends

?????????Reducer<MyType, Text, Text, Text> {

??????public void reduce(MyType key, Iterable<Text> values, Context context)

????????????throws InterruptedException, IOException {

?????????Text result = new Text();

?????????String fileList = new String();

?????????for (Text value : values) {

????????????fileList += value.toString() + ";";

?????????}

?????????result.set(fileList);

?????????context.write(new Text(key.Getword()), result);

??????}

???}

經(jīng)過這個Reduce處理，就得到了下面的結(jié)果：

bye????T3:4;

china????T1:1;

hadoop????T2:1;T3:1;

hello????T1:2;T2:1;

world????T1:1;T3:1;

最后，MapReduce函數(shù)都寫完后，就可以掛在Job中運行了。

public static void main(String[] args) throws IOException,

?????????InterruptedException, ClassNotFoundException {

??????Configuration conf = new Configuration();

??????System.out.println("url:" + conf.get("fs.default.name"));

??????Job job = new Job(conf, "InvertedIndex");

??????job.setJarByClass(InvertedIndex.class);

??????job.setMapperClass(InvertedIndexMapper.class);

??????job.setMapOutputKeyClass(MyType.class);

??????job.setMapOutputValueClass(Text.class);

??????job.setCombinerClass(InvertedIndexCombiner.class);

??????job.setReducerClass(InvertedIndexReducer.class);

??????job.setOutputKeyClass(Text.class);

??????job.setOutputValueClass(Text.class);

??????Path path = new Path("out");

??????FileSystem hdfs = FileSystem.get(conf);

??????if (hdfs.exists(path))

?????????hdfs.delete(path, true);

??????FileInputFormat.addInputPath(job, new Path("in"));

??????FileOutputFormat.setOutputPath(job, new Path("out"));

??????job.waitForCompletion(true);

}

注：這里為了調(diào)試方便，我們把in和out都寫死，不用傳入執(zhí)行參數(shù)了，并且，每次執(zhí)行前，判斷out文件夾是否存在，如果存在則刪除。

轉(zhuǎn)載于:https://www.cnblogs.com/zjfstudio/p/3913549.html

總結(jié)

以上是生活随笔為你收集整理的Hadoop学习笔记(8) ——实战做个倒排索引的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇： android:imeOptions属性
下一篇： SCOM 2012知识分享-26：分布式

编程问答

Hadoop学习笔记(8) ——实战 做个倒排索引

總結(jié)

Hadoop学习笔记(8) ——实战做个倒排索引