Hadoop Study Notes

Published: 2024/4/17

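I. Learning Hadoop

Hadoop is a mainstream big-data infrastructure framework. Its core consists of HDFS and MapReduce (Hadoop 2.x also includes YARN for resource management).

HDFS is Hadoop's distributed file system.

MapReduce is Hadoop's distributed computing model.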

II. Hadoop Environment Setup

Install and configure the JDK and Hadoop.

Link: https://blog.csdn.net/sujiangming/article/details/88047006

III. How MapReduce Works

map task

Based on the InputFormat, the program divides the input files into multiple splits. Each split is the input of one map task; the map task processes its split and passes the result on to reduce.
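
The way a file is cut into splits can be illustrated with a minimal sketch in plain Java. It assumes fixed-size splits and a hypothetical 300 MB input with a 128 MB split size; the class and method names are made up for illustration, and Hadoop's real FileInputFormat applies extra rules (for example around the size of the final split) that are ignored here.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    /** Returns {offset, length} pairs that cover a file of fileLength bytes. */
    static List<long[]> computeSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            // The last split may be shorter than splitSize.
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Hypothetical file size and split size, chosen only for illustration.
        for (long[] s : computeSplits(300 * mb, 128 * mb)) {
            System.out.println("split offset=" + s[0] / mb + "MB length=" + s[1] / mb + "MB");
        }
    }
}
```

Each element of the returned list would correspond to one map task in this simplified picture.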

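reduce task

Each map task, on completion, merges its output into a single final file that is divided into partitions, one per reduce task. The reduce side does not wait for every map task: as soon as any map task finishes, copier threads start fetching that task's output for the matching partition and keep merging what they fetch, preparing the input for reduce. Once all map tasks have finished and all data has been fetched and merged, the reduce function runs, and its output is finally written to HDFS.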
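
The partitioning and grouping described above can be sketched in plain Java. The modulo formula mirrors the one used by Hadoop's default HashPartitioner, but String.hashCode stands in for the hash of a Writable key, and the class name, method names, and sample pairs are hypothetical. This is a single-process illustration, not how the shuffle is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSketch {
    /** Picks the reduce task for a key: non-negative hash modulo the number of reducers. */
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Groups map output pairs by partition, then by sorted key, as each reducer would see them. */
    static List<Map<String, List<Integer>>> shuffle(List<String[]> mapOutput, int numReduceTasks) {
        List<Map<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new TreeMap<>());
        }
        for (String[] pair : mapOutput) {
            String key = pair[0];
            int value = Integer.parseInt(pair[1]);
            partitions.get(partitionFor(key, numReduceTasks))
                    .computeIfAbsent(key, k -> new ArrayList<>())
                    .add(value);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Hypothetical (key, value) pairs emitted by several map tasks.
        List<String[]> mapOutput = Arrays.asList(
                new String[] {"1001", "1"}, new String[] {"1002", "1"},
                new String[] {"1001", "1"}, new String[] {"1003", "1"});
        List<Map<String, List<Integer>>> partitions = shuffle(mapOutput, 2);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("reduce task " + i + " receives " + partitions.get(i));
        }
    }
}
```

The point of the sketch is that every value for a given key lands in the same partition, so a single reduce call can see all of them.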

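IV. Java Programming

Steps to write and run a Hadoop program in Java:

1. Write the Java program

Task: analyze an e-commerce platform's logs and count how many times each user viewed the product, where id is the user id.

The log lines are mainly product page clicks that differ only in the user id, so a regular expression can be used to extract that text, and the number of times each extracted value occurs can then be counted.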
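
As a quick illustration of the extraction step, here is a minimal plain-Java sketch that applies the same pattern used later in MyMapper to a single line. The log line is hypothetical, since the real log layout is not shown in these notes, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdRegexSketch {
    private static final Pattern ID_PATTERN = Pattern.compile("id=(.*?) HTTP");

    /** Returns every user id found in one log line. */
    static List<String> extractIds(String line) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID_PATTERN.matcher(line);
        while (m.find()) {
            // group(1) is only the text between "id=" and " HTTP".
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical log line, assumed for illustration only.
        String line = "192.168.1.10 - - [17/Apr/2024:10:00:00 +0800] "
                + "\"GET /product.jsp?id=1001 HTTP/1.1\" 200 512";
        System.out.println(extractIds(line));
    }
}
```

Note that the lazy group captures everything up to " HTTP", so if the id could be followed by other query parameters, a narrower pattern such as id=(\d+) might be a safer choice.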

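Implementation:

Import the jar packages (hadoop-2.7.3 is the directory produced by extracting the Hadoop archive):

Import the jars in the common, hdfs, and mapreduce folders under hadoop-2.7.3\share\hadoop, together with the jars in each folder's lib subfolder.

MyMapper class: implements the map step, emitting a (user id, 1) pair for each match, which is then passed on to MyReducer;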

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // Compile the pattern once instead of once per input line
        private static final Pattern ID_PATTERN = Pattern.compile("id=(.*?) HTTP");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Convert the Text value to a String
            String line = value.toString();
            // Use the regular expression to capture the text between "id=" and " HTTP"
            Matcher m = ID_PATTERN.matcher(line);
            while (m.find()) {
                // Pass the data to the next stage.
                // m.group(1) excludes the delimiters; m.group(0) would include them.
                context.write(new Text(m.group(1)), new IntWritable(1));
            }
        }
    }

MyReducer class: implements the reduce step, summing the counts for each key and passing the result to the next stage;

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        // All values that share the same key are handled by a single reduce() call
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int count = 0; // number of occurrences of this key (user id)
            for (IntWritable value : values) {
                // Add up the counts emitted for this key in the map stage
                count += value.get();
            }
            // Pass the data to the next stage
            context.write(new Text(key), new IntWritable(count));
        }
    }
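
The combined logic of the map and reduce steps can be checked locally, without a cluster, using a plain-Java sketch like the one below. It is only an approximation under stated assumptions: the log lines are hypothetical, the class and method names are made up, and a single in-memory map replaces the shuffle between the two stages.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocalCountSketch {
    private static final Pattern ID_PATTERN = Pattern.compile("id=(.*?) HTTP");

    /** Emulates map (emit (id, 1) per match) and reduce (sum per id) in one pass. */
    static Map<String, Integer> countViews(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = ID_PATTERN.matcher(line);
            while (m.find()) {
                // Equivalent to emitting (id, 1) and summing the 1s for each id.
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical log lines, assumed for illustration only.
        List<String> lines = Arrays.asList(
                "\"GET /product.jsp?id=1001 HTTP/1.1\" 200",
                "\"GET /product.jsp?id=1002 HTTP/1.1\" 200",
                "\"GET /product.jsp?id=1001 HTTP/1.1\" 200");
        countViews(lines).forEach((id, n) -> System.out.println(id + "\t" + n));
    }
}
```

Running something like this against a few sample lines is a cheap way to confirm the regular expression and the counting before packaging the real job.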

MyJob class: mainly sets the output types of the map and reduce steps and launches the job.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyJob {
        public static void main(String[] args)
                throws IOException, ClassNotFoundException, InterruptedException {
            // 1. Create the configuration object and the job (the program entry point)
            Configuration configuration = new Configuration();
            Job job = Job.getInstance(configuration);
            job.setJarByClass(MyJob.class);

            // 2. Specify the job's mapper and its output key/value types
            job.setMapperClass(MyMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);

            // 3. Specify the job's reducer and its output key/value types
            job.setReducerClass(MyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // 4. Specify the input file and the output path for the results
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // 5. Run the job and report success or failure through the exit code
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

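2. Package the program as a jar

Right-click the project name, click Export, and choose the JAR file format. Click Next, click Browse to set the jar's save location and file name, click Next twice more, then select the main class, which is MyJob.

3. Upload the jar to the CentOS system

Use WinSCP to upload it to the CentOS system. WinSCP can be downloaded from https://winscp.net

4. Upload the data to be processed to the CentOS system first, then to HDFS. (The CentOS local file system and HDFS are two separate file systems.)

The data is uploaded to CentOS the same way as the jar;

Command for uploading to HDFS:

hdfs dfs -put <file> <upload directory>

For example: hdfs dfs -put mm.txt /input/

If the /input directory does not exist, create it first: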

hdfs dfs -mkdir /input

5. Run the jar

Command for running the jar:

hadoop jar test.jar <test data> <result directory>

For example: hadoop jar test.jar /input/mm.txt /result/mytestresult/
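
With the default text output format, each line of a reducer output file (typically named like part-r-00000 inside the result directory) holds a key and a value separated by a tab. Assuming that layout, the following plain-Java sketch parses such lines after they have been fetched locally and picks the user with the most views. The class name, method names, and sample lines are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ResultParseSketch {
    /** Parses "userId<TAB>count" lines into a map, keeping the input order. */
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split("\t", 2);
            if (fields.length < 2) {
                throw new IllegalArgumentException("Expected key<TAB>value: " + line);
            }
            counts.put(fields[0], Integer.parseInt(fields[1].trim()));
        }
        return counts;
    }

    /** Returns the user id with the highest count, or null for empty input. */
    static String topUser(Map<String, Integer> counts) {
        String best = null;
        int bestCount = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical result lines, assumed for illustration only.
        Map<String, Integer> counts = parse(Arrays.asList("1001\t2", "1002\t1"));
        System.out.println("most active user: " + topUser(counts));
    }
}
```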

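Additional notes:

Related CentOS commands:

    mkdir <directory>                                    create a directory
    rm -rf <directory>                                   delete a directory and the files in it
    rm -f <file path>                                    delete the specified file
    ifconfig                                             show IP information
    hdfs dfs -put <local file> <HDFS directory>          upload a file to HDFS
    hdfs dfs -get <HDFS file> <local directory>          download a file from HDFS
    hadoop jar test.jar <test data> <result directory>   run the jar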

Reference: https://www.cnblogs.com/riordon/p/4605022.html

Reposted from: https://www.cnblogs.com/chenglin520/p/csdn.html
