
Using ToolRunner to Run Hadoop Programs: An Analysis of the Basic Principles



To simplify running jobs from the command line, Hadoop ships with a few helper classes. GenericOptionsParser is a class that interprets the common Hadoop command-line options and, where required, sets the corresponding values on a Configuration object. You usually do not use GenericOptionsParser directly; the more convenient approach is to implement the Tool interface and run the application through ToolRunner, which calls GenericOptionsParser internally.



I. The Classes and Interfaces Involved
(I) The classes involved and how they relate to one another are as follows:



About the typical way to use ToolRunner:
1. Define a class (MyClass in the diagram above) that extends Configured and implements the Tool interface.
2. In main(), invoke that class's run(String[]) method via ToolRunner.run(...).
See the example in Part III; a minimal skeleton is also sketched below.
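For reference, here is a minimal sketch of that pattern. It is not taken from the original article: the class name MyClass comes from the diagram referenced above and is purely illustrative, and the actual job logic inside run() is left out.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Illustrative skeleton only: MyClass is a placeholder name.
public class MyClass extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() returns the Configuration that ToolRunner has already
        // populated with any generic options (-D, -conf, -fs, -jt, ...).
        Configuration conf = getConf();
        // ... set up and submit the job here ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic options, calls setConf() on the Tool,
        // and then invokes run() with the remaining application arguments.
        int exitCode = ToolRunner.run(new MyClass(), args);
        System.exit(exitCode);
    }
}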
(II) About ToolRunner
1. ToolRunner has no inheritance or implementation relationship with the classes and interfaces in the diagram above; it extends only Object and implements no interface.
2. ToolRunner makes it convenient to run classes that implement the Tool interface (by calling their run(String[]) method) and, through GenericOptionsParser, to handle the Hadoop command-line arguments. Its Javadoc describes it as follows:

A utility to help run Tools.

ToolRunner can be used to run classes implementing Tool interface. It works in conjunction with GenericOptionsParser to parse the generic hadoop command line arguments and modifies the Configuration of the Tool. The application-specific options are passed along without being modified.

3. Apart from an empty constructor, ToolRunner has only the run() method, which comes in the following two forms:

run

public static int run(Configuration conf, Tool tool, String[] args) throws Exception
Runs the given Tool by Tool.run(String[]), after parsing with the given generic arguments. Uses the given Configuration, or builds one if null. Sets the Tool's configuration with the possibly modified version of the conf.
Parameters:
conf - Configuration for the Tool.
tool - Tool to run.
args - command-line arguments to the tool.
Returns:
exit code of the Tool.run(String[]) method.
Throws:
Exception

run

public static int run(Tool tool, String[] args) throws Exception
Runs the Tool with its Configuration. Equivalent to run(tool.getConf(), tool, args).
Parameters:
tool - Tool to run.
args - command-line arguments to the tool.
Returns:
exit code of the Tool.run(String[]) method.
Throws:
Exception


Both forms are static methods, so they can be called directly on the class.

(1) public static int run(Configuration conf, Tool tool, String[] args)
This form calls the tool's run(String[]) method, using the parameters in conf together with those in args; args normally comes from the command line.
(2) public static int run(Tool tool, String[] args)
This form calls the tool's run method using the tool's own Configuration; it is equivalent to run(tool.getConf(), tool, args). A small sketch of both forms is given below.
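As a rough, hedged sketch (reusing the MyClass skeleton from earlier, which is not part of the original article), the only difference between the two forms is whether you hand ToolRunner an explicit Configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class RunFormsDemo {
    public static void main(String[] args) throws Exception {
        // Form (1): supply an explicit Configuration; ToolRunner parses the
        // generic options into it and then calls setConf() on the tool.
        Configuration conf = new Configuration();
        int exitCode = ToolRunner.run(conf, new MyClass(), args);

        // Form (2): equivalent to run(tool.getConf(), tool, args); a new
        // Configuration is built if the tool's own one is null.
        // int exitCode = ToolRunner.run(new MyClass(), args);

        System.exit(exitCode);
    }
}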

Besides these, there is one more method:

static void printGenericCommandUsage(PrintStream out)
          Prints generic command-line arguments and usage information.
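As a hedged illustration (not from the original article), the run() method of the MyClass skeleton sketched earlier could use this method to show the generic options when it receives the wrong arguments; the two-argument check is purely illustrative:

@Override
public int run(String[] args) throws Exception {
    // Illustrative check: assume this tool expects exactly two arguments.
    if (args.length != 2) {
        System.err.println("Usage: MyClass <input> <output>");
        // Also print the generic Hadoop options (-conf, -D, -fs, -jt, ...).
        ToolRunner.printGenericCommandUsage(System.err);
        return -1;
    }
    // ... normal job setup would go here ...
    return 0;
}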

4. ToolRunner does two things:

(1) It provides the Tool with a Configuration object (building one if none was supplied).

(2) It lets the program read its parameter settings conveniently.

The complete source code of ToolRunner is as follows:

package org.apache.hadoop.util;

import java.io.PrintStream;

import org.apache.hadoop.conf.Configuration;

/**
 * A utility to help run {@link Tool}s.
 *
 * <p><code>ToolRunner</code> can be used to run classes implementing
 * <code>Tool</code> interface. It works in conjunction with
 * {@link GenericOptionsParser} to parse the
 * <a href="{@docRoot}/org/apache/hadoop/util/GenericOptionsParser.html#GenericOptions">
 * generic hadoop command line arguments</a> and modifies the
 * <code>Configuration</code> of the <code>Tool</code>. The
 * application-specific options are passed along without being modified.
 * </p>
 *
 * @see Tool
 * @see GenericOptionsParser
 */
public class ToolRunner {

  /**
   * Runs the given <code>Tool</code> by {@link Tool#run(String[])}, after
   * parsing with the given generic arguments. Uses the given
   * <code>Configuration</code>, or builds one if null.
   *
   * Sets the <code>Tool</code>'s configuration with the possibly modified
   * version of the <code>conf</code>.
   *
   * @param conf <code>Configuration</code> for the <code>Tool</code>.
   * @param tool <code>Tool</code> to run.
   * @param args command-line arguments to the tool.
   * @return exit code of the {@link Tool#run(String[])} method.
   */
  public static int run(Configuration conf, Tool tool, String[] args)
      throws Exception {
    if (conf == null) {
      conf = new Configuration();
    }
    GenericOptionsParser parser = new GenericOptionsParser(conf, args);
    //set the configuration back, so that Tool can configure itself
    tool.setConf(conf);

    //get the args w/o generic hadoop args
    String[] toolArgs = parser.getRemainingArgs();
    return tool.run(toolArgs);
  }

  /**
   * Runs the <code>Tool</code> with its <code>Configuration</code>.
   *
   * Equivalent to <code>run(tool.getConf(), tool, args)</code>.
   *
   * @param tool <code>Tool</code> to run.
   * @param args command-line arguments to the tool.
   * @return exit code of the {@link Tool#run(String[])} method.
   */
  public static int run(Tool tool, String[] args) throws Exception {
    return run(tool.getConf(), tool, args);
  }

  /**
   * Prints generic command-line argurments and usage information.
   *
   * @param out stream to write usage information to.
   */
  public static void printGenericCommandUsage(PrintStream out) {
    GenericOptionsParser.printGenericCommandUsage(out);
  }
}




(III) About Configuration
1. By default, Hadoop loads the parameters defined in core-default.xml and core-site.xml.

Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order from the classpath:

  • core-default.xml: Read-only defaults for hadoop.
  • core-site.xml: Site-specific configuration for a given hadoop installation.
This behavior comes from the following static block in Configuration.java:

static {
    //print deprecation warning if hadoop-site.xml is found in classpath
    ClassLoader cL = Thread.currentThread().getContextClassLoader();
    if (cL == null) {
        cL = Configuration.class.getClassLoader();
    }
    if (cL.getResource("hadoop-site.xml") != null) {
        LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
            "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, "
            + "mapred-site.xml and hdfs-site.xml to override properties of " +
            "core-default.xml, mapred-default.xml and hdfs-default.xml " +
            "respectively");
    }
    addDefaultResource("core-default.xml");
    addDefaultResource("core-site.xml");
}

That is, this static initializer loads the parameters from core-default.xml and core-site.xml. It also checks whether a hadoop-site.xml is still on the classpath and, if so, warns that this configuration file is deprecated.

How the two files are located (see the hadoop command's shell script):
(1) HADOOP_CONF_DIR is resolved; this is the alternate conf dir, defaulting to ${HADOOP_HOME}/conf.
(2) HADOOP_CONF_DIR is added to the classpath: CLASSPATH="${HADOOP_CONF_DIR}".
(3) The two files can then be found directly on the CLASSPATH, as the sketch below illustrates.
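As a quick, hedged way to confirm which core-site.xml a program actually picks up, one can ask the classloader directly. This check is illustrative only and not part of Hadoop itself; it simply mirrors how the static block above resolves its default resources:

import java.net.URL;

public class WhichConfig {
    public static void main(String[] args) {
        // Look up core-site.xml on the classpath, the same way the
        // Configuration static block finds its default resources.
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = WhichConfig.class.getClassLoader();
        }
        URL url = cl.getResource("core-site.xml");
        System.out.println(url != null ? url : "core-site.xml not on classpath");
    }
}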

2. When a program runs, its parameters can be overridden from the command line (see the -D and -conf examples in Part II below).
3. The Configuration class has a large number of add*, set*, and get* methods for setting and reading parameters.
4. Configuration implements Iterable<Map.Entry<String,String>>, so its contents can be traversed as follows (see also the sketch after this list):

for (Entry<String, String> entry : conf) {
    .....
}
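Here is a minimal sketch, not from the original article, that combines the set/get methods from point 3 with the iteration from point 4; the property name demo.color is invented purely for illustration:

import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;

public class ConfDemo {
    public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml from the classpath.
        Configuration conf = new Configuration();

        // set*/get* methods: demo.color is a made-up property name.
        conf.set("demo.color", "green");
        System.out.println("demo.color = " + conf.get("demo.color", "unset"));

        // Iterate over every key/value pair the Configuration holds.
        for (Entry<String, String> entry : conf) {
            System.out.printf("%s=%s%n", entry.getKey(), entry.getValue());
        }
    }
}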
(IV) About Tool
1. The source of the Tool interface is as follows:

package org.apache.hadoop.util;

import org.apache.hadoop.conf.Configurable;

public interface Tool extends Configurable {
    int run(String[] args) throws Exception;
}

As the code shows, Tool itself declares only one method, run(String[]), and it inherits the two methods of Configurable.
(V) About Configurable and Configured
1. The source of Configurable is as follows:

package org.apache.hadoop.conf;

public interface Configurable {
    void setConf(Configuration conf);
    Configuration getConf();
}

It declares a set and a get method for a Configuration.

2. The source of Configured is as follows:

package org.apache.hadoop.conf;

public class Configured implements Configurable {

    private Configuration conf;

    public Configured() {
        this(null);
    }

    public Configured(Configuration conf) {
        setConf(conf);
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    public Configuration getConf() {
        return conf;
    }
}

It has two constructors, one taking a Configuration argument and one taking none, and it implements the set and get methods for Configuration declared by Configurable.


II. Example Program 1: Printing All Parameters
Here is a simple program:

package org.jediael.hadoopdemo.toolrunnerdemo;

import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ToolRunnerDemo extends Configured implements Tool {
    static {
        //Configuration.addDefaultResource("hdfs-default.xml");
        //Configuration.addDefaultResource("hdfs-site.xml");
        //Configuration.addDefaultResource("mapred-default.xml");
        //Configuration.addDefaultResource("mapred-site.xml");
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        for (Entry<String, String> entry : conf) {
            System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
        }
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new ToolRunnerDemo(), args);
        System.exit(exitCode);
    }
}

The program above prints the properties defined in the XML files described earlier.
1. Run the program directly:
[root@jediael project]# hadoop jar toolrunnerdemo.jar org.jediael.hadoopdemo.toolrunnerdemo.ToolRunnerDemo
io.seqfile.compress.blocksize=1000000
keep.failed.task.files=false
mapred.disk.healthChecker.interval=60000
dfs.df.interval=60000
dfs.datanode.failed.volumes.tolerated=0
mapreduce.reduce.input.limit=-1
mapred.task.tracker.http.address=0.0.0.0:50060
mapred.used.genericoptionsparser=true
mapred.userlog.retain.hours=24
dfs.max.objects=0
mapred.jobtracker.jobSchedulable=org.apache.hadoop.mapred.JobSchedulable
mapred.local.dir.minspacestart=0
hadoop.native.lib=true
......................
2. Specify a new parameter with -D:
[root@jediael project]# hadoop org.jediael.hadoopdemo.toolrunnerdemo.ToolRunnerDemo -D color=yello | grep color
color=yello
3. Add a configuration file with -conf
(1) Number of parameters before:
[root@jediael project]# hadoop jar toolrunnerdemo.jar org.jediael.hadoopdemo.toolrunnerdemo.ToolRunnerDemo | wc
     67      67    2994
(2) Number of parameters after adding a configuration file:
[root@jediael project]# hadoop jar toolrunnerdemo.jar org.jediael.hadoopdemo.toolrunnerdemo.ToolRunnerDemo -conf /opt/jediael/hadoop-1.2.0/conf/mapred-site.xml | wc
     68      68    3028
The contents of mapred-site.xml are:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

As you can see, the file defines only one property, so the parameter count goes from 67 to 68.
4. Add resources in code, as in the statements commented out in the program above:

static {
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
    Configuration.addDefaultResource("mapred-default.xml");
    Configuration.addDefaultResource("mapred-site.xml");
}

For more options, see the discussion of Configuration above.


III. Example Program 2: Typical Usage (a Modified WordCount)
This modifies the classic WordCount program; see the introductory WordCount example for the original.

package org.jediael.hadoopdemo.toolrunnerdemo;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

    public static class WordCountMap extends
            Mapper<LongWritable, Text, Text, IntWritable> {

        private final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer token = new StringTokenizer(line);
            while (token.hasMoreTokens()) {
                word.set(token.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class WordCountReduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        // Use the Configuration injected by ToolRunner so that generic
        // options such as -D and -conf actually reach the job.
        Configuration conf = getConf();
        Job job = new Job(conf);
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(WordCountMap.class);
        job.setReducerClass(WordCountReduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return (job.waitForCompletion(true) ? 0 : -1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCount(), args);
        System.exit(exitCode);
    }
}

Run the program:
[root@jediael project]# hadoop fs -mkdir wcin2
[root@jediael project]# hadoop fs -copyFromLocal /opt/jediael/apache-nutch-2.2.1/CHANGES.txt wcin2
[root@jediael project]# hadoop jar wordcount2.jar org.jediael.hadoopdemo.toolrunnerdemo.WordCount wcin2 wcout2

As shown above, the typical way to use ToolRunner is:
1. Define a class that extends Configured and implements the Tool interface. Configured supplies the getConf() and setConf() methods, while Tool supplies run().
2. In main(), call that class's run(String[]) method through ToolRunner.run(...).

IV. Summary
1. Using ToolRunner.run(...) makes it easier to work with the Hadoop command-line options.
2. ToolRunner.run(...) runs a Hadoop program by calling the Tool's run(String[]) method, and by default the parameters in core-default.xml and core-site.xml are loaded.
