A Classic Hadoop Primer: WordCount
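The following program was tested successfully on Hadoop 1.2.1.
This example first presents the source code, then walks through the steps to run it, and finally analyzes the source code and the execution.
I. Source code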
package org.jediael.hadoopdemo.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token in a line.
    public static class WordCountMap extends
            Mapper<LongWritable, Text, Text, IntWritable> {

        private final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer token = new StringTokenizer(line);
            while (token.hasMoreTokens()) {
                word.set(token.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class WordCountReduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(WordCountMap.class);
        job.setReducerClass(WordCountReduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

II. Running the program
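1. Export the project from Eclipse as wordcount.jar and upload it to the Hadoop server. In this example, the program is uploaded to /home/jediael/project.
2. Install Hadoop in pseudo-distributed mode (see the Hadoop 1.2.1 pseudo-distributed mode installation guide). This example runs in a Hadoop pseudo-distributed environment.
3. Create the directory wcinput in HDFS to serve as the input directory, and copy the files to be analyzed into it.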
[root@jediael conf]# hadoop fs -mkdir wcinput
[root@jediael conf]# hadoop fs -copyFromLocal * wcinput
[root@jediael conf]# hadoop fs -ls wcinput
Found 26 items
-rw-r--r-- 1 root supergroup 1524 2014-08-20 12:29 /user/root/wcinput/automaton-urlfilter.txt
-rw-r--r-- 1 root supergroup 1311 2014-08-20 12:29 /user/root/wcinput/configuration.xsl
-rw-r--r-- 1 root supergroup 131090 2014-08-20 12:29 /user/root/wcinput/domain-suffixes.xml
-rw-r--r-- 1 root supergroup 4649 2014-08-20 12:29 /user/root/wcinput/domain-suffixes.xsd
-rw-r--r-- 1 root supergroup 824 2014-08-20 12:29 /user/root/wcinput/domain-urlfilter.txt
-rw-r--r-- 1 root supergroup 3368 2014-08-20 12:29 /user/root/wcinput/gora-accumulo-mapping.xml
-rw-r--r-- 1 root supergroup 3279 2014-08-20 12:29 /user/root/wcinput/gora-cassandra-mapping.xml
-rw-r--r-- 1 root supergroup 3447 2014-08-20 12:29 /user/root/wcinput/gora-hbase-mapping.xml
-rw-r--r-- 1 root supergroup 2677 2014-08-20 12:29 /user/root/wcinput/gora-sql-mapping.xml
-rw-r--r-- 1 root supergroup 2993 2014-08-20 12:29 /user/root/wcinput/gora.properties
-rw-r--r-- 1 root supergroup 983 2014-08-20 12:29 /user/root/wcinput/hbase-site.xml
-rw-r--r-- 1 root supergroup 3096 2014-08-20 12:29 /user/root/wcinput/httpclient-auth.xml
-rw-r--r-- 1 root supergroup 3948 2014-08-20 12:29 /user/root/wcinput/log4j.properties
-rw-r--r-- 1 root supergroup 511 2014-08-20 12:29 /user/root/wcinput/nutch-conf.xsl
-rw-r--r-- 1 root supergroup 42610 2014-08-20 12:29 /user/root/wcinput/nutch-default.xml
-rw-r--r-- 1 root supergroup 753 2014-08-20 12:29 /user/root/wcinput/nutch-site.xml
-rw-r--r-- 1 root supergroup 347 2014-08-20 12:29 /user/root/wcinput/parse-plugins.dtd
-rw-r--r-- 1 root supergroup 3016 2014-08-20 12:29 /user/root/wcinput/parse-plugins.xml
-rw-r--r-- 1 root supergroup 857 2014-08-20 12:29 /user/root/wcinput/prefix-urlfilter.txt
-rw-r--r-- 1 root supergroup 2484 2014-08-20 12:29 /user/root/wcinput/regex-normalize.xml
-rw-r--r-- 1 root supergroup 1736 2014-08-20 12:29 /user/root/wcinput/regex-urlfilter.txt
-rw-r--r-- 1 root supergroup 18969 2014-08-20 12:29 /user/root/wcinput/schema-solr4.xml
-rw-r--r-- 1 root supergroup 6020 2014-08-20 12:29 /user/root/wcinput/schema.xml
-rw-r--r-- 1 root supergroup 1766 2014-08-20 12:29 /user/root/wcinput/solrindex-mapping.xml
-rw-r--r-- 1 root supergroup 1044 2014-08-20 12:29 /user/root/wcinput/subcollections.xml
-rw-r--r-- 1 root supergroup 1411 2014-08-20 12:29 /user/root/wcinput/suffix-urlfilter.txt

4. Run the program.
[root@jediael project]# hadoop org.jediael.hadoopdemo.wordcount.WordCount wcinput wcoutput3
14/08/20 12:50:25 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/08/20 12:50:26 INFO input.FileInputFormat: Total input paths to process : 26
14/08/20 12:50:26 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/08/20 12:50:26 WARN snappy.LoadSnappy: Snappy native library not loaded
14/08/20 12:50:26 INFO mapred.JobClient: Running job: job_201408191134_0005
14/08/20 12:50:27 INFO mapred.JobClient: map 0% reduce 0%
14/08/20 12:50:38 INFO mapred.JobClient: map 3% reduce 0%
14/08/20 12:50:39 INFO mapred.JobClient: map 7% reduce 0%
14/08/20 12:50:50 INFO mapred.JobClient: map 15% reduce 0%
14/08/20 12:50:57 INFO mapred.JobClient: map 19% reduce 0%
14/08/20 12:50:58 INFO mapred.JobClient: map 23% reduce 0%
14/08/20 12:51:00 INFO mapred.JobClient: map 23% reduce 5%
14/08/20 12:51:04 INFO mapred.JobClient: map 30% reduce 5%
14/08/20 12:51:06 INFO mapred.JobClient: map 30% reduce 10%
14/08/20 12:51:11 INFO mapred.JobClient: map 38% reduce 10%
14/08/20 12:51:16 INFO mapred.JobClient: map 38% reduce 11%
14/08/20 12:51:18 INFO mapred.JobClient: map 46% reduce 11%
14/08/20 12:51:19 INFO mapred.JobClient: map 46% reduce 12%
14/08/20 12:51:22 INFO mapred.JobClient: map 46% reduce 15%
14/08/20 12:51:25 INFO mapred.JobClient: map 53% reduce 15%
14/08/20 12:51:31 INFO mapred.JobClient: map 53% reduce 17%
14/08/20 12:51:32 INFO mapred.JobClient: map 61% reduce 17%
14/08/20 12:51:39 INFO mapred.JobClient: map 69% reduce 17%
14/08/20 12:51:40 INFO mapred.JobClient: map 69% reduce 20%
14/08/20 12:51:45 INFO mapred.JobClient: map 73% reduce 20%
14/08/20 12:51:46 INFO mapred.JobClient: map 76% reduce 23%
14/08/20 12:51:52 INFO mapred.JobClient: map 80% reduce 23%
14/08/20 12:51:53 INFO mapred.JobClient: map 84% reduce 23%
14/08/20 12:51:55 INFO mapred.JobClient: map 84% reduce 25%
14/08/20 12:51:59 INFO mapred.JobClient: map 88% reduce 25%
14/08/20 12:52:00 INFO mapred.JobClient: map 92% reduce 25%
14/08/20 12:52:02 INFO mapred.JobClient: map 92% reduce 29%
14/08/20 12:52:06 INFO mapred.JobClient: map 96% reduce 29%
14/08/20 12:52:07 INFO mapred.JobClient: map 100% reduce 29%
14/08/20 12:52:11 INFO mapred.JobClient: map 100% reduce 30%
14/08/20 12:52:15 INFO mapred.JobClient: map 100% reduce 100%
14/08/20 12:52:17 INFO mapred.JobClient: Job complete: job_201408191134_0005
14/08/20 12:52:18 INFO mapred.JobClient: Counters: 29
14/08/20 12:52:18 INFO mapred.JobClient:   Job Counters
14/08/20 12:52:18 INFO mapred.JobClient:     Launched reduce tasks=1
14/08/20 12:52:18 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=192038
14/08/20 12:52:18 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/08/20 12:52:18 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/08/20 12:52:18 INFO mapred.JobClient:     Launched map tasks=26
14/08/20 12:52:18 INFO mapred.JobClient:     Data-local map tasks=26
14/08/20 12:52:18 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=95814
14/08/20 12:52:18 INFO mapred.JobClient:   File Output Format Counters
14/08/20 12:52:18 INFO mapred.JobClient:     Bytes Written=123950
14/08/20 12:52:18 INFO mapred.JobClient:   FileSystemCounters
14/08/20 12:52:18 INFO mapred.JobClient:     FILE_BYTES_READ=352500
14/08/20 12:52:18 INFO mapred.JobClient:     HDFS_BYTES_READ=247920
14/08/20 12:52:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=2177502
14/08/20 12:52:18 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=123950
14/08/20 12:52:18 INFO mapred.JobClient:   File Input Format Counters
14/08/20 12:52:18 INFO mapred.JobClient:     Bytes Read=244713
14/08/20 12:52:18 INFO mapred.JobClient:   Map-Reduce Framework
14/08/20 12:52:18 INFO mapred.JobClient:     Map output materialized bytes=352650
14/08/20 12:52:18 INFO mapred.JobClient:     Map input records=7403
14/08/20 12:52:18 INFO mapred.JobClient:     Reduce shuffle bytes=352650
14/08/20 12:52:18 INFO mapred.JobClient:     Spilled Records=45210
14/08/20 12:52:18 INFO mapred.JobClient:     Map output bytes=307281
14/08/20 12:52:18 INFO mapred.JobClient:     Total committed heap usage (bytes)=3398606848
14/08/20 12:52:18 INFO mapred.JobClient:     CPU time spent (ms)=14400
14/08/20 12:52:18 INFO mapred.JobClient:     Combine input records=0
14/08/20 12:52:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=3207
14/08/20 12:52:18 INFO mapred.JobClient:     Reduce input records=22605
14/08/20 12:52:18 INFO mapred.JobClient:     Reduce input groups=6749
14/08/20 12:52:18 INFO mapred.JobClient:     Combine output records=0
14/08/20 12:52:18 INFO mapred.JobClient:     Physical memory (bytes) snapshot=4799041536
14/08/20 12:52:18 INFO mapred.JobClient:     Reduce output records=6749
14/08/20 12:52:18 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=19545337856
14/08/20 12:52:18 INFO mapred.JobClient:     Map output records=22605

5. View the results.
[root@jediael project]# hadoop fs -ls wcoutput3
Found 3 items
-rw-r--r-- 1 root supergroup 0 2014-08-20 12:52 /user/root/wcoutput3/_SUCCESS
drwxr-xr-x - root supergroup 0 2014-08-20 12:50 /user/root/wcoutput3/_logs
-rw-r--r-- 1 root supergroup 123950 2014-08-20 12:52 /user/root/wcoutput3/part-r-00000
[root@jediael project]# hadoop fs -cat wcoutput3/part-r-00000
!!	2
!ci.*.*.us	1
!co.*.*.us	1
!town.*.*.us	1
"AS	22
"Accept"	1
"Accept-Language"	1
"License");	22
"NOW"	1
"WiFi"	1
"Z"	1
"all"	1
"content"	1
"delete	1
"delimiter"	1
………………

III. Program analysis
1. The WordCountMap class extends org.apache.hadoop.mapreduce.Mapper. Its four generic type parameters are, in order, the map function's input key type, input value type, output key type, and output value type.
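2. The WordCountReduce class extends org.apache.hadoop.mapreduce.Reducer. Its four generic type parameters have the same meaning as the Mapper's: the reduce function's input key type, input value type, output key type, and output value type.
3. The map output types must match the reduce input types. In most jobs the map output types are also the same as the reduce output types, in which case the reduce input types and output types are identical, as they are here (Text and IntWritable).
4. Hadoop determines the format of the input from the following code:
job.setInputFormatClass(TextInputFormat.class);
TextInputFormat is Hadoop's default input format, and it extends FileInputFormat. It divides the data set into smaller pieces called InputSplits, and each InputSplit is processed by one mapper. The InputFormat also provides a RecordReader implementation, which parses an InputSplit into <key, value> pairs and passes them to the map function:
key: the byte offset of the start of the line within the file, of type LongWritable.
value: the content of the line, of type Text.
Therefore, in this example, the input key/value types of the map function are LongWritable and Text.
5. Hadoop determines the format of the output from the following code:
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat is Hadoop's default output format. It writes each record as one line of a text file, with the key and the value separated by a tab, for example:
the	30
happy	23
……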
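The data flow described in items 1 to 5 can be imitated in plain Java without any Hadoop classes. The sketch below is only an illustration under stated assumptions, not how Hadoop implements these steps: the class WordCountFlowSketch and its methods readRecords, map, reduce and run are hypothetical names, a TreeMap stands in for the shuffle and sort phase, and lines are assumed to end with a single '\n'.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical, Hadoop-free imitation of the WordCount data flow.
public class WordCountFlowSketch {

    // Imitates a line-oriented record reader:
    // key = byte offset of the line start, value = the line text.
    // Assumption: every line ends with a single '\n'.
    public static Map<Long, String> readRecords(String text) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            records.put(offset, line);
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return records;
    }

    // Map step: (offset, line) in, one (word, 1) pair out per token.
    // The shuffle map groups the emitted values by key.
    public static void map(long key, String value,
                           Map<String, List<Integer>> shuffle) {
        StringTokenizer token = new StringTokenizer(value);
        while (token.hasMoreTokens()) {
            shuffle.computeIfAbsent(token.nextToken(), k -> new ArrayList<>()).add(1);
        }
    }

    // Reduce step: (word, [1, 1, ...]) in, the summed count out.
    public static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the whole flow and formats each result as "key<TAB>value".
    public static List<String> run(String text) {
        // TreeMap keeps keys sorted, standing in for the shuffle/sort phase.
        Map<String, List<Integer>> shuffle = new TreeMap<>();
        readRecords(text).forEach((k, v) -> map(k, v, shuffle));

        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : shuffle.entrySet()) {
            output.add(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        run("the happy dog\nthe cat\n").forEach(System.out::println);
    }
}
```

The type flow mirrors the generic parameters of the real classes: (Long, String) into map, (String, Integer) out of map and into reduce, and (String, Integer) out of reduce. Note that the TreeMap sorts by Java's String ordering, which agrees with Hadoop's byte-wise Text ordering for ASCII keys but is only an approximation in general.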