
Improved Hadoop Experiments (Chinese Word-Segmentation Word Frequency and English Word Frequency) (4/4)

Published: 2023/12/18


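Notes:

1) This article was originally written by me, bitpeach. Please credit the source when reposting; infringement will be pursued.

2) The working environment for this small experiment is Baidu Cloud (online) accessed from Windows, plus hadoop1-2-1 on Ubuntu (configured in advance). If the setup is unclear, see 《Hadoop之詞頻統計小實驗初步配置》 (initial configuration for the Hadoop word-frequency experiment).

3) This write-up was too long to upload in one piece. For the adjacent related posts, see 《Hadoop的改進實驗（中文分詞詞頻統計及英文詞頻統計）博文目錄結構》 (the table of contents for this series), which links to the other three parts.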

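(5) English Word-Frequency Counting in Single-Machine Pseudo-Distributed Mode with Python & Streaming

Background on Python and Streaming

Python and Streaming

Background: Python programs can also run on Hadoop. They cannot use the native Java MapReduce API directly and instead go through the Streaming interface, which exists specifically for non-Java languages such as C or shell scripts.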

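1) Local single machine

Before Hadoop 0.21.0, the Hadoop Streaming tool supported only text data; from Hadoop 0.21.0 onward it also supports binary data. The invocation format for running non-Java programs through Hadoop Streaming is:

Usage: $HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar [options]

The main options are:

(1) -input: input file path

(2) -output: output file path

(3) -mapper: the user's own mapper program; it may be an executable or a script

(4) -reducer: the user's own reducer program; it may be an executable or a script

(5) -file: files to package into the submitted job, for example input files the mapper or reducer needs, such as configuration files or dictionaries

(6) -partitioner: a user-defined partitioner program

(7) -combiner: a user-defined combiner program (must be implemented in Java in older releases)

(8) -D: job properties (formerly set with -jobconf)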

For example, a concrete invocation could be:

$HADOOP_HOME/bin/hadoop jar \
    contrib/streaming/hadoop-0.20.2-streaming.jar \
    -input input \
    -output output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
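Since the command format is easy to get wrong by hand, here is a minimal sketch that assembles the same invocation as an argument list in Python. The helper name build_streaming_command and the install path /usr/local/hadoop are my own assumptions for illustration; only the option names come from the list above.

```python
import shlex


def build_streaming_command(hadoop_home, streaming_jar, input_path,
                            output_path, mapper, reducer, files=()):
    """Assemble the argv list for a Hadoop Streaming job (hypothetical helper)."""
    cmd = [
        hadoop_home.rstrip("/") + "/bin/hadoop", "jar", streaming_jar,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]
    # Every script or side file shipped with the job gets its own -file option.
    for path in files:
        cmd += ["-file", path]
    return cmd


cmd = build_streaming_command(
    hadoop_home="/usr/local/hadoop",  # assumed install location
    streaming_jar="contrib/streaming/hadoop-0.20.2-streaming.jar",
    input_path="input",
    output_path="output",
    mapper="python mapper.py",
    reducer="python reducer.py",
    files=["mapper.py", "reducer.py"],
)

# Print a shell-safe rendering so the command can be inspected before running.
print(" ".join(shlex.quote(part) for part in cmd))
```

Building a list rather than one long string avoids quoting mistakes around values such as "python mapper.py". On a machine with Hadoop installed, the list could be passed to subprocess.run(cmd, check=True).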

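2) Baidu Open Cloud

Baidu Open Cloud is convenient because it provides the Streaming interface out of the box. To get the same interface on a local machine you have to invoke the streaming.jar that ships with Hadoop yourself, and the command format is fiddly enough that it sometimes fails. Baidu Open Cloud is easier to use. There is a trade-off, though: Chinese text always shows up garbled on Baidu Open Cloud, so processing Chinese still requires a standalone Hadoop setup.

As on a single machine, you need to write at least two Python scripts, one acting as the mapper and one as the reducer, before moving on to the remaining steps.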
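To make the division of labour between the two scripts concrete, here is a minimal sketch that imitates the map, sort, reduce flow of a streaming job in plain Python, with no Hadoop involved. The function names and the sample lines are made up for illustration; the real scripts read from standard input and write to standard output instead.

```python
def mapper(lines):
    """Emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word, 1)


def reducer(sorted_records):
    """Sum the counts per word; each record is 'word<TAB>count'."""
    totals = {}
    for record in sorted_records:
        word, count = record.split("\t", 1)
        totals[word] = totals.get(word, 0) + int(count)
    return sorted(totals.items())


def run_locally(lines):
    """Stand-in for a streaming job: map, sort by key, then reduce."""
    mapped = list(mapper(lines))
    shuffled = sorted(mapped)  # Hadoop sorts mapper output before the reduce step
    return reducer(shuffled)


sample = ["hello world", "hello hadoop"]  # made-up input
result = run_locally(sample)
print(result)
```

The same dry run is commonly done in a shell with a pipeline such as cat input.txt | python mapper.py | sort | python reducer.py (input.txt being any local test file), which catches script errors before a job is submitted.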

The interface that Baidu Open Cloud provides is:
hadoop jar $hadoop_streaming -input Input -output Output -mapper "python mapper.py" -reducer "python reducer.py" -file mapper.py -file reducer.py

Once the environment is set up, it is easy to use and works on the first try.

Python English Word-Frequency Experiment

Experiment Procedure


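All of the steps below were carried out on Baidu Open Cloud. To run them on a local machine, the principle is the same and the commands are nearly identical.

1) Prepare the data

To start with simple text, I uploaded three short files of English words. The figure in the original post shows the contents of these files.

Next, prepare the Python scripts. The two scripts are listed below.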

#!/usr/bin/env python
# mapper.py
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # split the line into words; split() with no arguments ignores
    # leading/trailing whitespace and never yields empty strings
    for word in line.split():
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
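One thing to be aware of with whitespace splitting is that "Hello," and "hello" end up as different keys. The sketch below is a variant, not the mapper used in this experiment: it lowercases each line and keeps only letters, digits and apostrophes. The regular expression and the function name map_line are my own assumptions.

```python
import re

# Assumed token rule: runs of lowercase letters, digits or apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")


def map_line(line):
    """Return (word, 1) pairs for one input line after normalization."""
    return [(word, 1) for word in TOKEN.findall(line.lower())]


text = "Hello, hello World!"  # made-up sample line
raw_tokens = text.split()     # what plain whitespace splitting yields
pairs = map_line(text)

print(raw_tokens)
print(pairs)
```

Whether this normalization is wanted depends on the data; for the simple word lists used in this experiment, plain splitting is enough.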

#!/usr/bin/env python
# reducer.py
import sys
from operator import itemgetter

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    try:
        # parse the input we got from mapper.py: word<TAB>count
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int and accumulate
        word2count[word] = word2count.get(word, 0) + int(count)
    except ValueError:
        # the line was malformed or count was not a number,
        # so silently ignore/discard this line
        pass

# sort the words lexicographically;
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))
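The reducer above keeps every word in a dictionary until the input ends. Because Hadoop Streaming hands each reducer its input already sorted by key, an alternative is to sum one run of identical keys at a time. The following is a sketch of that variant using itertools.groupby, not the script used in this experiment; the function names and the sample records are made up.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs, skipping blank or malformed lines."""
    for line in lines:
        word, _, count = line.rstrip("\n").partition("\t")
        try:
            yield word, int(count)
        except ValueError:
            continue  # count missing or not a number


def reduce_sorted(lines):
    """Sum counts per word, assuming the input is already sorted by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# Made-up, already sorted records in the form the mapper emits.
sorted_input = ["hadoop\t1\n", "hello\t1\n", "hello\t1\n", "not a record\n", "world\t1\n"]
totals = list(reduce_sorted(sorted_input))
print(totals)
```

In a real reducer script the loop would read sys.stdin and print each word and total separated by a tab. Memory use then stays constant no matter how many distinct words there are, but the result is only correct when the input really is sorted.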

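Next, upload the two scripts and run the command:

hadoop jar $hadoop_streaming -input Input -output Output -mapper "python mapper.py" -reducer "python reducer.py" -file mapper.py -file reducer.py

The job's running status is shown in a figure in the original post.

Finally the results appear, also shown in a figure in the original post.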
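The job output is plain text with one word and its count per line, separated by a tab, so it is easy to post-process. As a small sketch, the function below ranks the most frequent words from such lines. The sample lines are made up, not the experiment's actual results, and the name top_words is hypothetical.

```python
def top_words(lines, n=3):
    """Return the n most frequent (word, count) pairs from 'word<TAB>count' lines."""
    counts = []
    for line in lines:
        word, _, count = line.rstrip("\n").partition("\t")
        if count.isdigit():  # skip blank or malformed lines
            counts.append((word, int(count)))
    # Highest count first; ties are broken alphabetically.
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    return counts[:n]


sample_output = ["hadoop\t1\n", "hello\t2\n", "world\t1\n"]  # made-up lines
top = top_words(sample_output, n=2)
print(top)
```

With real results, the lines would come from the job's output files (typically named like part-00000) after copying them out of the Output directory.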

This concludes the English word-frequency experiment in Streaming mode.


Reposted from: https://www.cnblogs.com/bitpeach/p/3756172.html
