
Notes on "Hadoop in Action", Part 2: Hadoop Input and Output

Published: 2023/12/2

from: http://book.douban.com/annotation/17068812/

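The book only touches on this topic briefly in Chapter 3, where it covers reading from and writing to HDFS. That is enough to make the point, but on the principle that a first reading should make a book "thicker", I think it is worth expanding on myself. Besides, every program ultimately comes down to "input --> compute --> output".

Input first. By default, Hadoop treats each line of the input as one record: the key is the line's byte offset within the file and the value is the line's content. That will not fit every business need, so on the one hand Hadoop ships several other input formats, and on the other hand, for more freedom, it lets you define your own.

A few concepts first:

■ InputFiles: self-explanatory.
■ InputFormat: also simple, but worth a few words. This Java interface determines what data, that is, what kind of key-value pairs, a Mapper instance receives from the Hadoop framework.
■ InputSplit: applications do not touch this directly, but the concept is worth understanding. YDN has this passage:

(Note: the passage below is marked as a quotation only to make it stand out in these notes; it is not the book's own wording. My apologies to the author and to readers; if there is a copyright problem, please point it out.)

Another important job of the InputFormat is to divide the input data sources (e.g., input files) into fragments that make up the inputs to individual map tasks. These fragments are called "splits" and are encapsulated in instances of the InputSplit interface.

Generally speaking, an InputSplit determines the data set each Mapper processes, while the InputFormat determines the data format and structure inside each split. I am not sure that makes it entirely clear, but roughly you can think of InputSplit as the physical side of the input and InputFormat as the logical side.

Hadoop provides the following formats (the same note about quoted text applies), listed as key : value:

■ TextInputFormat: byte offset of the line in the file : the whole line
■ KeyValueTextInputFormat: the text before the first "\t" : the rest of the line
■ SequenceFileInputFormat: a binary file format, so the key and value types are specified by the user
■ NLineInputFormat: same key and value as TextInputFormat; the difference is that each split contains exactly N lines

The standard InputFormat interface is as follows: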

public interface InputFormat<K, V> {
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

    RecordReader<K, V> getRecordReader(InputSplit split,
                                       JobConf job,
                                       Reporter reporter) throws IOException;
}
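To make the key and value conventions of the line-oriented formats concrete, here is a minimal plain-Java sketch that does not use Hadoop at all. The class name RecordModelSketch and both method names are made up for illustration; the sketch only models how TextInputFormat-style and KeyValueTextInputFormat-style records are derived from lines, assuming UTF-8 text with "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plain-Java model of how two built-in formats turn lines into (key, value) records. */
final class RecordModelSketch {

    /** TextInputFormat style: key = byte offset of the line start, value = the line itself. */
    static List<String[]> textRecords(String data) {
        List<String[]> records = new ArrayList<>();
        long offset = 0;
        for (String line : data.split("\n")) {
            records.add(new String[] {Long.toString(offset), line});
            // +1 accounts for the '\n' terminator that follows each line
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    /** KeyValueTextInputFormat style: split at the first tab; without a tab the value is empty. */
    static String[] keyValueRecord(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, tab), line.substring(tab + 1)};
    }
}
```

For the input "ab\ncde\nf" the first method yields the keys 0, 3 and 7, which is why offsets rather than line numbers show up as keys in a Mapper that reads plain text.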

To customize the input, you implement this interface. The two methods serve the following purposes:

■ Identify all the files used as input data and divide them into input splits. Each map task is assigned one split.
■ Provide an object (RecordReader) to iterate through records in a given split, and to parse each record into key and value of predefined types.

The book advises that if you really must customize the input, you should subclass FileInputFormat rather than implement the InputFormat interface directly, because FileInputFormat already implements getSplits() well enough for the vast majority of real-world needs.

Here is an example. Suppose your input data looks like this:

ball, 3.5, 12.7, 9.0
car, 15, 23.76, 42.23
device, 0.0, 12.4, -67.1

Each line holds the name of a point, followed by its coordinates in the coordinate system.

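/* Only getRecordReader() is implemented; getSplits() is inherited from FileInputFormat.
   The imports cover both classes in this example (old org.apache.hadoop.mapred API). */
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class ObjectPositionInputFormat extends FileInputFormat<Text, Point3D> {

    @Override
    public RecordReader<Text, Point3D> getRecordReader(InputSplit input, JobConf job, Reporter reporter)
            throws IOException {
        reporter.setStatus(input.toString());
        return new ObjPosRecordReader(job, (FileSplit) input);
    }
}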
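Point3D is not a Hadoop class and these notes do not define it, so the following is only a guess at a minimal shape for it: three public float fields plus the two serialization methods that Hadoop's Writable interface requires. To keep the sketch compilable without Hadoop on the classpath, the `implements org.apache.hadoop.io.Writable` clause is left out; in a real job the class would declare it.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/** Hypothetical value type: three float coordinates with Writable-style serialization. */
final class Point3D {
    public float x;
    public float y;
    public float z;

    /** A no-argument constructor is needed so the framework can create empty instances. */
    public Point3D() {
        this(0f, 0f, 0f);
    }

    public Point3D(float x, float y, float z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    /** Same signature as Writable.write: serialize the fields in a fixed order. */
    public void write(DataOutput out) throws IOException {
        out.writeFloat(x);
        out.writeFloat(y);
        out.writeFloat(z);
    }

    /** Same signature as Writable.readFields: read the fields back in the same order. */
    public void readFields(DataInput in) throws IOException {
        x = in.readFloat();
        y = in.readFloat();
        z = in.readFloat();
    }

    @Override
    public String toString() {
        return x + "," + y + "," + z;
    }
}
```

The fields are public only because the record reader in this example assigns value.x, value.y and value.z directly.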

/* ObjPosRecordReader: wraps a LineRecordReader and parses each line into (name, Point3D).
   Relies on java.io.IOException, org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text and
   FileSplit, JobConf, LineRecordReader, RecordReader from org.apache.hadoop.mapred. */
class ObjPosRecordReader implements RecordReader<Text, Point3D> {

    private final LineRecordReader lineReader;
    private final LongWritable lineKey;   // byte offset of the current line (unused beyond reading)
    private final Text lineValue;         // raw text of the current line

    public ObjPosRecordReader(JobConf job, FileSplit split) throws IOException {
        lineReader = new LineRecordReader(job, split);
        lineKey = lineReader.createKey();
        lineValue = lineReader.createValue();
    }

    @Override
    public boolean next(Text key, Point3D value) throws IOException {
        // Read the next raw line; false means the split is exhausted.
        if (!lineReader.next(lineKey, lineValue)) {
            return false;
        }

        // Expected layout: name, x, y, z
        String[] pieces = lineValue.toString().split(",");
        if (pieces.length != 4) {
            throw new IOException("Invalid record received");
        }

        float fx, fy, fz;
        try {
            fx = Float.parseFloat(pieces[1].trim());
            fy = Float.parseFloat(pieces[2].trim());
            fz = Float.parseFloat(pieces[3].trim());
        } catch (NumberFormatException nfe) {
            throw new IOException("Error parsing floating point value in record");
        }

        key.set(pieces[0].trim());
        value.x = fx;
        value.y = fy;
        value.z = fz;
        return true;
    }

    @Override
    public Text createKey() {
        return new Text("");
    }

    @Override
    public Point3D createValue() {
        return new Point3D();
    }

    @Override
    public long getPos() throws IOException {
        return lineReader.getPos();
    }

    @Override
    public void close() throws IOException {
        lineReader.close();
    }

    @Override
    public float getProgress() throws IOException {
        return lineReader.getProgress();
    }
}
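The parsing step inside next() can be tried out without a cluster. The sketch below pulls the same logic (split on commas, require four fields, trim, parse three floats) into a plain-Java helper; ObjectRecordParser and ParsedRecord are names invented here for illustration and are not part of Hadoop or of the original example.

```java
import java.io.IOException;

/** Hypothetical stand-alone version of the line parsing done in the record reader's next(). */
final class ObjectRecordParser {

    /** Simple holder for one parsed line: the object name and its three coordinates. */
    static final class ParsedRecord {
        final String name;
        final float x;
        final float y;
        final float z;

        ParsedRecord(String name, float x, float y, float z) {
            this.name = name;
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    /** Parses "name, x, y, z"; throws IOException for a wrong field count or a bad number. */
    static ParsedRecord parse(String line) throws IOException {
        String[] pieces = line.split(",");
        if (pieces.length != 4) {
            throw new IOException("Invalid record received");
        }
        try {
            return new ParsedRecord(
                    pieces[0].trim(),
                    Float.parseFloat(pieces[1].trim()),
                    Float.parseFloat(pieces[2].trim()),
                    Float.parseFloat(pieces[3].trim()));
        } catch (NumberFormatException nfe) {
            throw new IOException("Error parsing floating point value in record", nfe);
        }
    }
}
```

Calling ObjectRecordParser.parse("ball, 3.5, 12.7, 9.0") gives a record named "ball" with coordinates 3.5, 12.7 and 9.0, while a line with fewer than four fields is rejected with an IOException, matching what the record reader does.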

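As for output, it usually comes down to controlling the output format, for example emitting XML or JSON. I will skip this part and save myself some typing, because overall it works much like the input side.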
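Since the notes skip the output side, here is a rough plain-Java sketch of what "controlling the output format" might look like for JSON. It is not Hadoop code: JsonLineWriterSketch is a made-up class that only imitates the write-then-close shape of a record writer over a java.io.Writer. In an actual job, logic like this would presumably live in a RecordWriter returned by a custom OutputFormat (for example a FileOutputFormat subclass).

```java
import java.io.IOException;
import java.io.Writer;

/** Hypothetical stand-in for a record writer that emits one JSON object per line. */
final class JsonLineWriterSketch {
    private final Writer out;

    JsonLineWriterSketch(Writer out) {
        this.out = out;
    }

    /** Writes one record as a single-line JSON object; NaN and infinities are not handled. */
    void write(String name, float x, float y, float z) throws IOException {
        out.write("{\"name\":\"" + escape(name) + "\",\"x\":" + x
                + ",\"y\":" + y + ",\"z\":" + z + "}\n");
    }

    /** Closes the underlying writer once all records have been written. */
    void close() throws IOException {
        out.close();
    }

    /** Escapes only backslashes and double quotes; control characters are left untouched. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Writing to a java.io.StringWriter makes it easy to inspect the result: one call to write("ball", 3.5f, 12.7f, 9.0f) produces a single line with the name and the three coordinates.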
