Hadoop program development --- Python
Collected and organized by 生活随笔. This article uses word count as the running example.
1. First, create mapper.py
```
mkdir /usr/local/hadoop-python
cd /usr/local/hadoop-python
vim mapper.py
```
mapper.py:
```python
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py;
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
```
After saving the file, remember to adjust its permissions accordingly:
```
chmod a+x /usr/local/hadoop-python/mapper.py
```
2. Create reducer.py
```
vim reducer.py
```
reducer.py:
```python
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
```
After saving the file, remember to adjust its permissions accordingly:
```
chmod a+x /usr/local/hadoop-python/reducer.py
```
You can first test the code on the local machine, so that any problems surface early:
```
# echo "foo foo quux labs foo bar quux" | /usr/local/hadoop-python/mapper.py
foo     1
foo     1
quux    1
labs    1
foo     1
bar     1
quux    1
```
Then run the full pipeline including reducer.py:
```
echo "foo foo quux labs foo bar quux" | /usr/local/hadoop-python/mapper.py | sort -k1,1 | /usr/local/hadoop-python/reducer.py
bar     1
foo     3
labs    1
quux    2
```
3. Run the Python code on Hadoop
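The shell pipeline above can also be sketched in pure Python, which makes it easy to see why the `sort -k1,1` step matters: the reducer's running-count logic only works when identical words arrive adjacently. This is an illustrative sketch only, not one of the tutorial's files; `map_words` and `reduce_counts` are hypothetical names mirroring mapper.py and reducer.py.

```python
def map_words(lines):
    # emit (word, 1) for every word, like mapper.py does
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_counts(pairs):
    # pairs must be sorted by word, like the output of sort -k1,1;
    # this mirrors the IF-switch in reducer.py
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield (current_word, current_count)

lines = ["foo foo quux labs foo bar quux"]
pairs = sorted(map_words(lines))  # stands in for: | sort -k1,1
print(dict(reduce_counts(pairs)))
# {'bar': 1, 'foo': 3, 'labs': 1, 'quux': 2}
```

If the `sorted` call is removed, repeated words would be emitted more than once, which is exactly why Hadoop shuffles and sorts map output before the reduce phase.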
Preparation: download the text files, then upload the books to the HDFS file system:
```
# create an input folder under this user's directory on HDFS
hdfs dfs -mkdir /input
# upload the document into the input folder on HDFS
hdfs dfs -put /usr/local/hadoop-python/input/pg20417.txt /input
```
Next, locate your streaming jar file. Note that since version 2.6 it lives under the share directory; you can search the Hadoop installation directory for it:
```
cd $HADOOP_HOME
find ./ -name "*streaming*.jar"
```
This turns up the hadoop-streaming*.jar files in the share folder:
```
./share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar
./share/hadoop/tools/sources/hadoop-streaming-2.8.4-test-sources.jar
./share/hadoop/tools/sources/hadoop-streaming-2.8.4-sources.jar
```
Because the full path under /usr/local/hadoop-2.8.4/share/hadoop/tools/lib is long, we can store it in an environment variable:
```
vim /etc/profile
export STREAM=/usr/local/hadoop-2.8.4/share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar
```
Because the command run through the streaming interface is long, create a shell script named run.sh to run it:
```
vim run.sh
```
run.sh:
```
hadoop jar $STREAM \
    -files /usr/local/hadoop-python/mapper.py,/usr/local/hadoop-python/reducer.py \
    -mapper /usr/local/hadoop-python/mapper.py \
    -reducer /usr/local/hadoop-python/reducer.py \
    -input /input/pg20417.txt \
    -output /output1
```
Here $STREAM expands to the jar path exported above; you can equally write the full path out explicitly (hadoop jar /usr/local/hadoop-2.8.4/share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar ...).
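To keep all the flags in one place, the command in run.sh can also be assembled programmatically. This is a hypothetical helper (`streaming_cmd` is my own name, not part of the tutorial); it only builds the argument list, and actually launching it still requires a working Hadoop installation:

```python
# Hypothetical helper: build the hadoop-streaming argument list
# used by run.sh. Paths below mirror the tutorial's layout.
def streaming_cmd(jar, mapper, reducer, hdfs_input, hdfs_output):
    files = ",".join([mapper, reducer])  # ship both scripts to the cluster
    return ["hadoop", "jar", jar,
            "-files", files,
            "-mapper", mapper,
            "-reducer", reducer,
            "-input", hdfs_input,
            "-output", hdfs_output]

cmd = streaming_cmd(
    "/usr/local/hadoop-2.8.4/share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar",
    "/usr/local/hadoop-python/mapper.py",
    "/usr/local/hadoop-python/reducer.py",
    "/input/pg20417.txt",
    "/output1",
)
print(" ".join(cmd))
```

On a machine with Hadoop on the PATH, one could then run the job with subprocess.run(cmd), matching what run.sh does.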