

How ToolRunner Runs Hadoop Jobs: Principles and Usage

Published: 2024/1/23


@(HADOOP)[hadoop, big data]

  • How ToolRunner Runs Hadoop Jobs: Principles and Usage
  • I. Example Program One: Printing All Parameters
      • 1. Running the program directly
      • 2. Specifying new parameters with -D
      • 3. Adding a new configuration file with -conf
      • 4. Adding parameters in code, as in the commented-out lines of the program
  • II. Example Program Two: Typical Usage (modifying the wordcount program)
  • III. Related Classes and Interfaces
    • (1) Related classes and their relationships
    • (2) About ToolRunner
    • (3) About Configuration
    • (4) About Tool
    • (5) About Configurable and Configured
  • IV. Summary

To simplify running jobs from the command line and make it easy to adjust runtime parameters, Hadoop ships with some helper classes. GenericOptionsParser is a class that interprets the common Hadoop command-line options and sets the corresponding values on a Configuration object as needed. GenericOptionsParser is usually not used directly; the more convenient approach is to implement the Tool interface and run the application via ToolRunner, which calls GenericOptionsParser internally.

I. Example Program One: Printing All Parameters

Below is a simple program:

package org.jediael.hadoopdemo.toolrunnerdemo;

import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ToolRunnerDemo extends Configured implements Tool {

    static {
        //Configuration.addDefaultResource("hdfs-default.xml");
        //Configuration.addDefaultResource("hdfs-site.xml");
        //Configuration.addDefaultResource("mapred-default.xml");
        //Configuration.addDefaultResource("mapred-site.xml");
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        for (Entry<String, String> entry : conf) {
            System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
        }
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new ToolRunnerDemo(), args);
        System.exit(exitCode);
    }
}

The program above prints the properties loaded into the job's configuration.

1. Running the program directly

[root@jediael project]# hadoop jar toolrunnerdemo.jar org.jediael.hadoopdemo.toolrunnerdemo.ToolRunnerDemo
io.seqfile.compress.blocksize=1000000
keep.failed.task.files=false
mapred.disk.healthChecker.interval=60000
dfs.df.interval=60000
dfs.datanode.failed.volumes.tolerated=0
mapreduce.reduce.input.limit=-1
mapred.task.tracker.http.address=0.0.0.0:50060
mapred.used.genericoptionsparser=true
mapred.userlog.retain.hours=24
dfs.max.objects=0
mapred.jobtracker.jobSchedulable=org.apache.hadoop.mapred.JobSchedulable
mapred.local.dir.minspacestart=0
hadoop.native.lib=true
......................

2. Specifying new parameters with -D

[root@jediael project]# hadoop org.jediael.hadoopdemo.toolrunnerdemo.ToolRunnerDemo -D color=yello | grep color
color=yello

Of course, the more common use is to override a default parameter, e.g.:

-Ddfs.df.interval=30000

3. Adding a new configuration file with -conf

(1) The original number of parameters

[root@jediael project]# hadoop jar toolrunnerdemo.jar org.jediael.hadoopdemo.toolrunnerdemo.ToolRunnerDemo | wc
     67      67    2994

(2) The number of parameters after adding a configuration file

[root@jediael project]# hadoop jar toolrunnerdemo.jar org.jediael.hadoopdemo.toolrunnerdemo.ToolRunnerDemo -conf /opt/jediael/hadoop-1.2.0/conf/mapred-site.xml | wc
     68      68    3028

The content of mapred-site.xml is as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

This file contains only one property, so the parameter count goes from 67 to 68.
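The jump from 67 to 68 can be checked mechanically: each <property> element in a -conf file contributes one configuration entry. A minimal sketch of such a count, using plain JDK DOM parsing rather than Hadoop's own loader (PropertyCounter is a hypothetical helper, not part of Hadoop):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

public class PropertyCounter {

    // Counts <property> elements in a Hadoop-style *-site.xml document.
    public static int countProperties(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getElementsByTagName("property").getLength();
    }

    public static void main(String[] args) throws Exception {
        // Same shape as the mapred-site.xml shown above.
        String mapredSite =
            "<?xml version=\"1.0\"?>" +
            "<configuration>" +
            "  <property><name>mapred.job.tracker</name><value>localhost:9001</value></property>" +
            "</configuration>";
        System.out.println(countProperties(mapredSite)); // prints 1
    }
}
```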

4. Adding parameters in code, as in the commented-out lines of the program above

static {
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
    Configuration.addDefaultResource("mapred-default.xml");
    Configuration.addDefaultResource("mapred-site.xml");
}

For more options, see the section on Configuration below.

II. Example Program Two: Typical Usage (modifying the wordcount program)

Here we modify the classic wordcount program; for background, see: Hadoop入門經典:WordCount

package org.jediael.hadoopdemo.toolrunnerdemo;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

    public static class WordCountMap extends
            Mapper<LongWritable, Text, Text, IntWritable> {

        private final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer token = new StringTokenizer(line);
            while (token.hasMoreTokens()) {
                word.set(token.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class WordCountReduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        // getConf() returns the Configuration populated by ToolRunner,
        // so generic options such as -D take effect in the job.
        Configuration conf = getConf();
        Job job = new Job(conf);
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(WordCountMap.class);
        job.setReducerClass(WordCountReduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : -1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCount(), args);
        System.exit(exitCode);
    }
}

Run the program:

[root@jediael project]# hadoop fs -mkdir wcin2
[root@jediael project]# hadoop fs -copyFromLocal /opt/jediael/apache-nutch-2.2.1/CHANGES.txt wcin2
[root@jediael project]# hadoop jar wordcount2.jar org.jediael.hadoopdemo.toolrunnerdemo.WordCount wcin2 wcout2

As shown above, the typical usage of ToolRunner is:
1. Define a class that extends Configured and implements the Tool interface. Configured provides the getConf() and setConf() methods, while Tool declares the run() method.
2. In main(), call that class's run(String[]) method via ToolRunner.run(...).
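The two-step pattern can be seen in isolation with minimal stand-ins for Hadoop's interfaces. The Configuration, Tool, Configured, and ToolRunner classes below are simplified, hypothetical sketches of the real Hadoop API shapes, not the real classes:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for org.apache.hadoop.conf.Configuration.
class Configuration {
    private final Map<String, String> props = new HashMap<>();
    String get(String k) { return props.get(k); }
    void set(String k, String v) { props.put(k, v); }
}

// Tool extends Configurable in Hadoop; here the two are merged for brevity.
interface Tool {
    int run(String[] args) throws Exception;
    void setConf(Configuration conf);
    Configuration getConf();
}

// Configured supplies the getConf()/setConf() pair.
class Configured {
    private Configuration conf;
    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
}

class ToolRunner {
    // Mirrors run(conf, tool, args): build a Configuration if null,
    // hand it to the tool, then invoke the tool's own run().
    static int run(Configuration conf, Tool tool, String[] args) throws Exception {
        if (conf == null) {
            conf = new Configuration();
        }
        tool.setConf(conf);
        return tool.run(args);
    }
}

public class MyTool extends Configured implements Tool {
    @Override
    public int run(String[] args) {
        // The tool reads its configuration via getConf(), as in the
        // wordcount example above; exit code 0 means the key was found.
        return getConf().get("mapred.job.tracker") != null ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "localhost:9001");
        System.exit(ToolRunner.run(conf, new MyTool(), args));
    }
}
```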

III. Related Classes and Interfaces

(1) Related classes and their relationships

The upper half of the figure shows the typical way to use ToolRunner:
1. Define a class (MyClass in the figure) that extends Configured and implements the Tool interface.
2. Since Tool extends Configurable, and Configurable carries a Configuration object, user code can simply call getConf() to obtain the job's configuration.

(2) About ToolRunner

1. ToolRunner has no inheritance or implementation relationship with the classes and interfaces in the figure; it extends only Object and implements no interfaces.
2. ToolRunner conveniently runs classes that implement the Tool interface (by calling their run(String[]) method), and via GenericOptionsParser it conveniently handles the generic hadoop command-line arguments.

A utility to help run Tools.

ToolRunner can be used to run classes implementing Tool interface. It works in conjunction with GenericOptionsParser to parse the generic hadoop command line arguments and modifies the Configuration of the Tool. The application-specific options are passed along without being modified.
3. Besides an empty constructor, ToolRunner's central method is run(), which comes in the following two forms:

public static int run(Configuration conf, Tool tool, String[] args) throws Exception

    Runs the given Tool by Tool.run(String[]), after parsing with the given generic arguments. Uses the given Configuration, or builds one if null. Sets the Tool's configuration with the possibly modified version of the conf.
    Parameters:
        conf - Configuration for the Tool.
        tool - Tool to run.
        args - command-line arguments to the tool.
    Returns:
        exit code of the Tool.run(String[]) method.
    Throws:
        Exception

public static int run(Tool tool, String[] args) throws Exception

    Runs the Tool with its Configuration. Equivalent to run(tool.getConf(), tool, args).
    Parameters:
        tool - Tool to run.
        args - command-line arguments to the tool.
    Returns:
        exit code of the Tool.run(String[]) method.
    Throws:
        Exception

Both are static methods, callable through the class name.
(1) public static int run(Configuration conf, Tool tool, String[] args)
This form calls the tool's run(String[]) method, using the parameters in conf together with those in args, where args typically come from the command line.
(2) public static int run(Tool tool, String[] args)
This form calls the tool's run method using the tool's own configuration, i.e. it is equivalent to run(tool.getConf(), tool, args).
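What GenericOptionsParser contributes in the first form can be sketched as follows. MiniGenericParser is a hypothetical, simplified parser that handles only "-D key=value" pairs; the real GenericOptionsParser also handles -conf, -fs, -jt, -files and other generic options:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MiniGenericParser {

    final Map<String, String> conf = new HashMap<>();
    private final List<String> remaining = new ArrayList<>();

    // Consume "-D key=value" pairs into the configuration map; everything
    // else is left over for the tool's own run(String[]).
    MiniGenericParser(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if ("-D".equals(args[i]) && i + 1 < args.length) {
                String[] kv = args[++i].split("=", 2);
                conf.put(kv[0], kv.length > 1 ? kv[1] : "");
            } else {
                remaining.add(args[i]);
            }
        }
    }

    String[] getRemainingArgs() {
        return remaining.toArray(new String[0]);
    }

    public static void main(String[] args) {
        MiniGenericParser p = new MiniGenericParser(
            new String[] { "-D", "color=yello", "wcin2", "wcout2" });
        System.out.println(p.conf.get("color"));                     // yello
        System.out.println(String.join(" ", p.getRemainingArgs()));  // wcin2 wcout2
    }
}
```

This is why the generic options are invisible to the application: by the time Tool.run(String[]) is invoked, only the application-specific arguments remain.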

In addition, there is one more method:

static void printGenericCommandUsage(PrintStream out)
    Prints generic command-line arguments and usage information.

4. ToolRunner performs two functions:

(1) It creates a Configuration object for the Tool.

(2) It lets the program read its parameter configuration conveniently.

The complete source code of ToolRunner is as follows:

package org.apache.hadoop.util;

import java.io.PrintStream;

import org.apache.hadoop.conf.Configuration;

/**
 * A utility to help run {@link Tool}s.
 *
 * <p><code>ToolRunner</code> can be used to run classes implementing
 * <code>Tool</code> interface. It works in conjunction with
 * {@link GenericOptionsParser} to parse the
 * <a href="{@docRoot}/org/apache/hadoop/util/GenericOptionsParser.html#GenericOptions">
 * generic hadoop command line arguments</a> and modifies the
 * <code>Configuration</code> of the <code>Tool</code>. The
 * application-specific options are passed along without being modified.
 * </p>
 *
 * @see Tool
 * @see GenericOptionsParser
 */
public class ToolRunner {

  /**
   * Runs the given <code>Tool</code> by {@link Tool#run(String[])}, after
   * parsing with the given generic arguments. Uses the given
   * <code>Configuration</code>, or builds one if null.
   *
   * Sets the <code>Tool</code>'s configuration with the possibly modified
   * version of the <code>conf</code>.
   *
   * @param conf <code>Configuration</code> for the <code>Tool</code>.
   * @param tool <code>Tool</code> to run.
   * @param args command-line arguments to the tool.
   * @return exit code of the {@link Tool#run(String[])} method.
   */
  public static int run(Configuration conf, Tool tool, String[] args)
    throws Exception {
    if (conf == null) {
      conf = new Configuration();
    }
    GenericOptionsParser parser = new GenericOptionsParser(conf, args);
    // set the configuration back, so that Tool can configure itself
    tool.setConf(conf);

    // get the args w/o generic hadoop args
    String[] toolArgs = parser.getRemainingArgs();
    return tool.run(toolArgs);
  }

  /**
   * Runs the <code>Tool</code> with its <code>Configuration</code>.
   *
   * Equivalent to <code>run(tool.getConf(), tool, args)</code>.
   *
   * @param tool <code>Tool</code> to run.
   * @param args command-line arguments to the tool.
   * @return exit code of the {@link Tool#run(String[])} method.
   */
  public static int run(Tool tool, String[] args)
    throws Exception {
    return run(tool.getConf(), tool, args);
  }

  /**
   * Prints generic command-line arguments and usage information.
   *
   * @param out stream to write usage information to.
   */
  public static void printGenericCommandUsage(PrintStream out) {
    GenericOptionsParser.printGenericCommandUsage(out);
  }
}

(3) About Configuration

1. By default, hadoop loads the parameters in core-default.xml and core-site.xml.

Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order from the classpath:
    core-default.xml: Read-only defaults for hadoop.
    core-site.xml: Site-specific configuration for a given hadoop installation.

See the following code:

static {
  // print deprecation warning if hadoop-site.xml is found in classpath
  ClassLoader cL = Thread.currentThread().getContextClassLoader();
  if (cL == null) {
    cL = Configuration.class.getClassLoader();
  }
  if (cL.getResource("hadoop-site.xml") != null) {
    LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
        "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, "
        + "mapred-site.xml and hdfs-site.xml to override properties of " +
        "core-default.xml, mapred-default.xml and hdfs-default.xml " +
        "respectively");
  }
  addDefaultResource("core-default.xml");
  addDefaultResource("core-site.xml");
}

The source of Configuration.java contains the code above: a static block loads the parameters from core-default.xml and core-site.xml.
It also checks whether hadoop-site.xml is still present; if so, it logs a warning reminding you that this configuration file is deprecated.
How are those two files found? (See the hadoop command script.)
(1) Locate HADOOP_CONF_DIR, the alternate conf dir; the default is $HADOOP_HOME/conf.
(2) HADOOP_CONF_DIR is added to the classpath: CLASSPATH="${HADOOP_CONF_DIR}".
(3) The two files can then be found directly on the CLASSPATH.

2. At run time, parameters can be modified on the command line, as documented here:

https://hadoop.apache.org/docs/r2.6.3/hadoop-project-dist/hadoop-common/CommandsManual.html#Generic_Options

3. The Configuration class has a large number of add*, set*, and get* methods for setting and retrieving parameters.

4. Configuration implements Iterable<Map.Entry<String, String>>, which is what allows the first example to walk the configuration with a for-each loop.
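Point 4 is what makes the loop in ToolRunnerDemo possible. A simplified sketch of an iterable configuration follows; IterableConf is a hypothetical stand-in, not Hadoop's class, which additionally merges all loaded resources before iterating:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class IterableConf implements Iterable<Map.Entry<String, String>> {

    private final Map<String, String> props = new LinkedHashMap<>();

    public void set(String k, String v) { props.put(k, v); }
    public String get(String k) { return props.get(k); }

    // Implementing Iterable is all that a for-each loop requires.
    @Override
    public Iterator<Map.Entry<String, String>> iterator() {
        return props.entrySet().iterator();
    }

    public static void main(String[] args) {
        IterableConf conf = new IterableConf();
        conf.set("dfs.df.interval", "60000");
        conf.set("hadoop.native.lib", "true");
        // Same loop shape as ToolRunnerDemo.run() above.
        for (Map.Entry<String, String> entry : conf) {
            System.out.printf("%s=%s%n", entry.getKey(), entry.getValue());
        }
    }
}
```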

(4) About Tool

1. The source of the Tool interface is as follows:

package org.apache.hadoop.util;

import org.apache.hadoop.conf.Configurable;

public interface Tool extends Configurable {
  int run(String[] args) throws Exception;
}

As can be seen, Tool itself declares only one method, run(String[]), and inherits the two methods of Configurable.

(5) About Configurable and Configured
1. The source of Configurable is as follows:

package org.apache.hadoop.conf;

public interface Configurable {
  void setConf(Configuration conf);
  Configuration getConf();
}

It declares the set and get methods for a Configuration.

2. The source of Configured is as follows:

package org.apache.hadoop.conf;

public class Configured implements Configurable {

  private Configuration conf;

  public Configured() {
    this(null);
  }

  public Configured(Configuration conf) {
    setConf(conf);
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

It has two constructors, one taking a Configuration argument and one taking none, and it implements Configurable's set and get methods for the Configuration.

IV. Summary

1. Using ToolRunner.run(...) makes it more convenient to work with the hadoop command-line options.
2. ToolRunner.run(...) runs the hadoop program by calling the Tool class's run(String[]) method, and by default loads the parameters in core-default.xml and core-site.xml.
