My First Hadoop Program

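Requirement: there are 100 files (each about 10 GB, 3 million samples). From each sample we can extract its category, attribute, and attribute value. Count how many times each attribute value occurs.

This is similar to wordcount.

Here, each "word" is the triple: category (cat1-cat3), attribute, attribute value.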
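To make the key format concrete, here is a minimal sketch that builds the keys for a single record. The sample record and the helper name `build_keys` are hypothetical; the field names `l_cat`, `brand` and `d_attr` are the ones the mapper reads. The attribute filter is left out to keep the example short.

```python
import json


def build_keys(record_json):
    """Return the 'category,attribute,value' keys for one JSON record."""
    obj = json.loads(record_json)
    # Category: first-level and third-level category joined by '-'.
    cat = obj["l_cat"][0] + "-" + obj["l_cat"][2]
    keys = [
        cat + ",cat_3," + cat,
        cat + ",brand," + obj["brand"],
    ]
    for att, val in obj["d_attr"].items():
        keys.append(cat + "," + att + "," + val)
    return keys


# Hypothetical record, for illustration only.
sample = json.dumps({
    "l_cat": ["phone", "mobile", "smartphone"],
    "brand": "acme",
    "d_attr": {"color": "black"},
})

for key in build_keys(sample):
    print(key + "\t1")
```

Each printed line is one `key<TAB>1` pair, which is exactly what a wordcount-style reducer expects.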

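#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Mapper (Python 2.7): emits "category,attribute,value<TAB>1" for each record.
import sys
import json

# Python 2 only: make UTF-8 the default codec so the unicode values returned
# by json.loads can be mixed with the byte-string literals below.
reload(sys)
sys.setdefaultencoding("utf-8")

# Attributes that are not counted: gross weight, product name, product ID,
# item number, shop.
SKIPPED_ATTRS = ('商品毛重', '商品名稱', '商品編號', '貨號', '店鋪')

for line in sys.stdin:
    # The JSON payload is the second tab-separated field.
    s = line.split('\t')[1]
    obj = json.loads(s)
    # Category: first-level and third-level category joined by '-'.
    cat = obj['l_cat'][0] + '-' + obj['l_cat'][2]
    word = cat + ',' + 'cat_3' + ',' + cat
    print '%s\t%s' % (word, 1)
    word = cat + ',' + 'brand' + ',' + obj['brand']
    print '%s\t%s' % (word, 1)
    for att, val in obj['d_attr'].items():
        # Skip excluded attributes and values longer than 10 characters.
        if att in SKIPPED_ATTRS or len(val) > 10:
            continue
        word = cat + ',' + att + ',' + val
        print '%s\t%s' % (word, 1)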

reducer.py

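#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Reducer (Python 2.7): sums the counts of consecutive lines that share a key.
# It relies on the input being sorted by key.
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Ignore lines whose count is not an integer.
        continue
    if current_word == word:
        current_count += count
    else:
        # The key changed: emit the total for the previous key.
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# Emit the total for the last key (nothing is printed for empty input).
if current_word is not None:
    print '%s\t%s' % (current_word, current_count)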
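The reducer only works because all lines for one key arrive next to each other. A minimal sketch of the same counting idea using `itertools.groupby`, offered as an alternative formulation rather than the script above; `reduce_counts` is a hypothetical helper name.

```python
from itertools import groupby


def reduce_counts(lines):
    """Sum the counts of adjacent lines that share a key.

    `lines` must already be sorted by key; groupby only merges neighbours.
    """
    pairs = []
    for line in lines:
        word, _, count = line.strip().partition("\t")
        try:
            pairs.append((word, int(count)))
        except ValueError:
            continue  # skip lines whose count is not an integer
    return [
        (word, sum(c for _, c in group))
        for word, group in groupby(pairs, key=lambda p: p[0])
    ]


print(reduce_counts(["a\t1", "a\t1", "b\t1", "b\tx", "c\t2"]))
# -> [('a', 2), ('b', 1), ('c', 2)]
```

If the input is not sorted, the same key shows up more than once in the output: `reduce_counts(["a\t1", "b\t1", "a\t1"])` returns three entries instead of two. This is why a local test has to sort the mapper output before it reaches the reducer.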

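Local test: part-00000-sample is the input, and the output is written to out.

Check the test results.

Submit the job to Hadoop only after the local test succeeds.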
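A local test follows the same three stages as a streaming job: map, sort by key, reduce. The sketch below runs those stages in-process on a tiny made-up input, assuming the input format used here (an ID, a tab, then a JSON record). The function names and sample records are hypothetical, and the mapper is simplified (no `cat_3` key, no attribute filter).

```python
import json


def map_line(line):
    """Simplified mapper: yield 'key<TAB>1' lines for one input line."""
    obj = json.loads(line.split("\t")[1])
    cat = obj["l_cat"][0] + "-" + obj["l_cat"][2]
    yield cat + ",brand," + obj["brand"] + "\t1"
    for att, val in obj["d_attr"].items():
        yield cat + "," + att + "," + val + "\t1"


def reduce_sorted(lines):
    """Reducer: sum counts over consecutive lines with the same key."""
    current, total = None, 0
    for line in lines:
        word, count = line.split("\t", 1)
        if word == current:
            total += int(count)
        else:
            if current is not None:
                yield current + "\t" + str(total)
            current, total = word, int(count)
    if current is not None:
        yield current + "\t" + str(total)


def run_local(input_lines):
    """Map every line, sort the output by key, then reduce."""
    mapped = [out for line in input_lines for out in map_line(line)]
    mapped.sort(key=lambda l: l.split("\t", 1)[0])
    return list(reduce_sorted(mapped))


# Hypothetical records, for illustration only.
records = [
    {"l_cat": ["phone", "mobile", "smartphone"], "brand": "acme",
     "d_attr": {"color": "black"}},
    {"l_cat": ["phone", "mobile", "smartphone"], "brand": "acme",
     "d_attr": {"color": "white"}},
]
input_lines = ["id%d\t%s" % (i, json.dumps(r)) for i, r in enumerate(records)]

for line in run_local(input_lines):
    print(line)
```

With these two records the brand key is counted twice and each color once. Comparing such a small, hand-checkable result against the real scripts' output is a cheap way to catch mistakes before using cluster time.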

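Submit the job to Hadoop through a script.

Important: if the Python version on the Hadoop cluster is too old, the Python environment has to be packaged and shipped to Hadoop.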


Script: run_hadoop.sh

#!/bin/bash
hadoop="/home/linzhiwei02/tools/hadoop-client-heng/hadoop/bin/hadoop"
$hadoop streaming \
    -D mapred.job.queue.name="nlp" \
    -D mapred.job.name="nlp-linzhiwei02-count-item" \
    -D mapred.job.priority=NORMAL \
    -D mapred.map.tasks=400 \
    -D mapred.reduce.tasks=100 \
    -file ./reducer.py \
    -reducer "python27/bin/python reducer.py" \
    -partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner"
