The raw data is as follows (one user per line, followed by that user's comma-separated items):
u1 a,d,b,c
u2 a,a,c
u3 b,d
u4 a,d,c
u5 a,b,c
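The similarity formula is: sim = |u(i)∩u(j)| / |u(i)∪u(j)|
where |u(i)∪u(j)| = |u(i)| + |u(j)| - |u(i)∩u(j)|. Judging from the scripts in this post, u(i) is the set of users whose list contains item i, so |u(i)| is the 'norm' count and |u(i)∩u(j)| is the 'dot' count.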
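As a quick sanity check of the formula, the similarity can be computed directly in memory on the sample data. This is only a minimal Python 3 sketch, not part of the MR scripts; the names `RAW`, `users_by_item` and `jaccard` are made up for illustration, and it assumes u(i) means "the set of users that have item i".

```python
from collections import defaultdict
from itertools import combinations

# Sample data from the post: "<user> <comma-separated items>"
RAW = """u1 a,d,b,c
u2 a,a,c
u3 b,d
u4 a,d,c
u5 a,b,c"""

def users_by_item(raw):
    """Build item -> set of users; duplicates such as 'a,a,c' collapse."""
    index = defaultdict(set)
    for line in raw.splitlines():
        user, item_str = line.split()
        for item in item_str.split(','):
            index[item].add(user)
    return index

def jaccard(s, t):
    """|s ∩ t| / (|s| + |t| - |s ∩ t|), i.e. intersection over union."""
    inter = len(s & t)
    return inter / (len(s) + len(t) - inter)

index = users_by_item(RAW)
sims = {(i, j): jaccard(index[i], index[j])
        for i, j in combinations(sorted(index), 2)}
for pair, sim in sorted(sims.items()):
    print(pair, sim)
```

The values in `sims` can serve as a reference when checking the output of the two-round job on the same input.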
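The original Hadoop implementation needed five rounds of MR; after optimization, two rounds are enough.
The earlier version needed so many rounds mainly because computing |u(i)∪u(j)| required changing the key several times, not because the computation itself is heavy. By changing only the key that is passed between stages, the whole job fits into two rounds.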
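The key change can be sketched in memory. The following is a hypothetical Python 3 simulation, not the streaming scripts themselves: `group_by_key` stands in for Hadoop's shuffle/sort, and `round1` / `round2` are illustrative names. Round 1 keys every record by i1 and yields norm(i1) and dot(i1, i2); round 2 re-keys each pair record by i2 so that norm(i2) joins it and the similarity can be computed.

```python
from collections import defaultdict

def group_by_key(records):
    """Stand-in for the shuffle: group (key, value) pairs by key."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups

def round1(user_items):
    """Key by i1: count users per item (norm) and co-occurrences (dot)."""
    records = []
    for items in user_items.values():
        items = sorted(set(items))  # dedupe, fix pair order
        for idx, i1 in enumerate(items):
            records.append((i1, ('norm', None)))
            for i2 in items[idx + 1:]:
                records.append((i1, ('dot', i2)))
    out = []
    for i1, values in group_by_key(records).items():
        norm = sum(1 for tag, _ in values if tag == 'norm')
        dots = defaultdict(int)
        for tag, i2 in values:
            if tag == 'dot':
                dots[i2] += 1
        out.append(('norm', i1, norm))
        # Each pair record carries norm(i1) along with the dot count.
        out.extend(('dot', i1, i2, dot, norm) for i2, dot in dots.items())
    return out

def round2(round1_output):
    """Re-key pair records by i2, then join with norm(i2)."""
    records = []
    for rec in round1_output:
        key = rec[1] if rec[0] == 'norm' else rec[2]  # i2 for pair records
        records.append((key, rec))
    sims = {}
    for i2, values in group_by_key(records).items():
        norm_i2 = next(rec[2] for rec in values if rec[0] == 'norm')
        for rec in values:
            if rec[0] == 'dot':
                _, i1, _, dot, norm_i1 = rec
                sims[(i1, i2)] = dot / (norm_i1 + norm_i2 - dot)
    return sims

data = {'u1': ['a', 'd', 'b', 'c'], 'u2': ['a', 'a', 'c'], 'u3': ['b', 'd'],
        'u4': ['a', 'd', 'c'], 'u5': ['a', 'b', 'c']}
sims = round2(round1(data))
print(sorted(sims.items()))
```

The point of the design is that norm(i1) rides along with each pair record, so only one extra re-keying step is needed to bring in norm(i2).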
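mapper_1.py
#!/usr/bin/python
#-*-coding:utf-8-*-
import sys
for line in sys.stdin:
    # Each input line: "<user> <comma-separated items>", e.g. "u1 a,d,b,c"
    user, item_str = line.strip().split()
    # Deduplicate and sort so each item pair is emitted once, smaller item first
    item_list = sorted(set(item_str.split(',')))
    # Debug output goes to stderr so it does not end up in the mapper output
    sys.stderr.write("item_str: %s item_list: %s\n" % (item_str, item_list))
    for i in range(len(item_list)):
        i1 = item_list[i]
        # One 'norm' record per item: counts the users that have i1
        print("%s 1 norm" % i1)
        for i2 in item_list[i+1:]:
            # One 'dot' record per item pair: counts co-occurrences of i1 and i2
            print("%s %s 1 dot" % (i1, i2))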
reducer_1.py
#!/usr/bin/python
#-*-coding:utf-8-*-
import sys

def printout():
    # Emit norm(i1), then one record per pair (i1, i2) carrying dot and norm(i1)
    i1 = old_key
    print("%s %s norm" % (i1, old_dict['norm']))
    for i2 in old_dict['dot']:
        print("%s-%s %s %s dot-norm_i1" % (i1, i2, old_dict['dot'][i2], old_dict['norm']))

old_key = ""
old_dict = {'norm': 0, 'dot': {}}
for line in sys.stdin:
    sp = line.strip().split()
    if sp[-1] == 'norm':
        key, value = sp[:2]
        if key == old_key:
            old_dict['norm'] += int(value)
        else:
            # New key: flush the previous one and start a fresh accumulator
            if old_key != "":
                printout()
            old_key = key
            # notice: norm part should be int(value)
            old_dict = {'norm': int(value), 'dot': {}}
    elif sp[-1] == 'dot':
        key, i2, value = sp[:3]
        if key == old_key:
            if i2 not in old_dict['dot']:
                old_dict['dot'][i2] = 0
            old_dict['dot'][i2] += int(value)
        else:
            if old_key != "":
                printout()
            old_key = key
            old_dict = {'norm': 0, 'dot': {i2: int(value)}}
# Flush the last key
if old_key != "":
    printout()
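mapper_2.py
#!/usr/bin/python
#-*-coding:utf-8-*-
import sys
for line in sys.stdin:
    sp = line.strip().split()
    if sp[-1] == 'norm':
        # Pass norm records through unchanged; the key is already the item
        print(line.strip())
    elif sp[-1] == "dot-norm_i1":
        key, dot, norm_i1 = sp[:3]
        # Re-key the pair record by i2 so that it meets norm(i2) in the reducer.
        # This split assumes item names contain no '-'.
        i1, i2 = key.split('-')
        print("%s %s %s %s dot-norm_i1" % (i2, i1, dot, norm_i1))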
reducer_2.py
#!/usr/bin/python
#-*-coding:utf-8-*-
import sys

def gensim(norm_i1, norm_i2, dot):
    # sim = intersection / union, with union = norm_i1 + norm_i2 - dot
    return float(dot) / (int(norm_i1) + int(norm_i2) - int(dot))

def printout():
    i2 = old_key
    norm_i2 = old_dict['norm']
    for i1 in old_dict['dot']:
        dot, norm_i1 = old_dict['dot'][i1]
        sim = gensim(norm_i1, norm_i2, dot)
        print("%s-%s %s %s %s %s dot,norm_i1,norm_i2,sim" % (i1, i2, dot, norm_i1, norm_i2, sim))

old_key = ""
old_dict = {'norm': 0, 'dot': {}}
for line in sys.stdin:
    sp = line.strip().split()
    if sp[-1] == 'norm':
        key, value = sp[:2]
        if key == old_key:
            old_dict['norm'] = value
        else:
            # New key: flush the previous one and start a fresh accumulator
            if old_key != "":
                printout()
            old_key = key
            old_dict = {'norm': value, 'dot': {}}
    elif sp[-1] == 'dot-norm_i1':
        key, i1, dot, norm_i1 = sp[:4] # key is i2.
        if key == old_key:
            if i1 not in old_dict['dot']:
                old_dict['dot'][i1] = (dot, norm_i1)
        else:
            if old_key != "":
                printout()
            old_key = key
            old_dict = {'norm': 0, 'dot': {i1: (dot, norm_i1)}}
# Flush the last key
if old_key != "":
    printout()
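Run script t.sh:
#!/bin/bash
# d.m.1 and d.m.2 are presumably the key-sorted outputs of the round-1 and round-2 mappers;
# both reducers rely on records with the same key being adjacent.
cat d.m.1 |./reducer_1.py > d.r.1
cat d.m.2 |./reducer_2.py > d.r.2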