Optimizing the Hadoop Implementation of ItemCF (Python)

The raw data is as follows (one user per line, followed by that user's comma-separated items):

u1 a,d,b,c
u2 a,a,c
u3 b,d
u4 a,d,c
u5 a,b,c

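The similarity formula is: sim = |u(i)∩u(j)| / |u(i)∪u(j)|

where |u(i)∪u(j)| = |u(i)| + |u(j)| - |u(i)∩u(j)|, and u(i) is the set of users who have item i. In the code below, |u(i)| is called the norm and |u(i)∩u(j)| is called the dot.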
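As a reference point, the formula can be checked directly on the sample data without any MapReduce. The following is a minimal sketch; the names raw, users_of and jaccard are illustrative and do not appear in the original scripts.

```python
from itertools import combinations

# Sample data from above: user -> comma-separated items.
raw = {
    'u1': 'a,d,b,c',
    'u2': 'a,a,c',
    'u3': 'b,d',
    'u4': 'a,d,c',
    'u5': 'a,b,c',
}

# Invert to item -> set of users; set() drops duplicates such as "a,a".
users_of = {}
for user, items in raw.items():
    for item in set(items.split(',')):
        users_of.setdefault(item, set()).add(user)

def jaccard(i, j):
    """sim = |u(i) & u(j)| / (|u(i)| + |u(j)| - |u(i) & u(j)|)."""
    inter = len(users_of[i] & users_of[j])
    union = len(users_of[i]) + len(users_of[j]) - inter
    return inter / float(union)

for i, j in combinations(sorted(users_of), 2):
    print(i, j, jaccard(i, j))
```

For example, items a and c are both held by u1, u2, u4 and u5, so their similarity is 4 / (4 + 4 - 4) = 1.0, while b and d share two users out of four in total, giving 0.5.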

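The original Hadoop implementation needed five rounds of MapReduce; the optimized version needs only two.

The earlier version needed so many rounds not because the computation is heavy, but because computing the size of u(i)∪u(j) required changing the key several times. By changing which key is passed along, the whole job fits in two rounds.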
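The two-round key flow can be sketched in memory before reading the streaming scripts. This is only an illustration under assumptions: the names DATA, round1 and round2 are hypothetical, and plain dictionaries stand in for Hadoop's shuffle and sort. Round 1 is keyed by i1 and attaches norm(i1) to each pair; round 2 re-keys each pair by i2 so that it meets norm(i2).

```python
from collections import defaultdict

DATA = """u1 a,d,b,c
u2 a,a,c
u3 b,d
u4 a,d,c
u5 a,b,c"""

def round1(lines):
    """Key = i1: sum norm(i1) and dot(i1, i2), attach norm(i1) to each pair."""
    norm = defaultdict(int)
    dot = defaultdict(lambda: defaultdict(int))
    for line in lines:
        _user, item_str = line.split()
        items = sorted(set(item_str.split(',')))
        for idx, i1 in enumerate(items):
            norm[i1] += 1
            for i2 in items[idx + 1:]:
                dot[i1][i2] += 1
    out = []
    for i1 in norm:
        out.append((i1, norm[i1], 'norm'))
        for i2, d in dot[i1].items():
            out.append((i1 + '-' + i2, d, norm[i1], 'dot-norm_i1'))
    return out

def round2(records):
    """Key = i2: join each pair record with norm(i2), then apply the formula."""
    norm = {}
    pairs = defaultdict(list)
    for rec in records:
        if rec[-1] == 'norm':
            item, n, _tag = rec
            norm[item] = n
        else:
            key, d, norm_i1, _tag = rec
            i1, i2 = key.split('-')
            pairs[i2].append((i1, d, norm_i1))
    sims = {}
    for i2, rows in pairs.items():
        for i1, d, norm_i1 in rows:
            sims[(i1, i2)] = float(d) / (norm_i1 + norm[i2] - d)
    return sims

sims = round2(round1(DATA.splitlines()))
for pair in sorted(sims):
    print(pair, sims[pair])
```

The point of the design is that each pair record only ever needs two lookups, norm(i1) and norm(i2), and each lookup costs one grouping step, so two rounds suffice.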

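#!/usr/bin/python
# -*- coding: utf-8 -*-
# Round-1 mapper: for each user, emit one 'norm' record per item
# and one 'dot' record per item pair.
from __future__ import print_function
import sys

for line in sys.stdin:
    user, item_str = line.strip().split()
    # De-duplicate and sort so every pair is emitted as (i1, i2) with i1 < i2.
    item_list = sorted(set(item_str.split(',')))
    # Debug output goes to stderr so it does not mix with the records.
    print("item_str:", item_str, "item_list:", item_list, file=sys.stderr)
    for i in range(len(item_list)):
        i1 = item_list[i]
        print(i1, 1, 'norm')
        for i2 in item_list[i+1:]:
            print(i1, i2, 1, 'dot')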
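To see what the round-1 mapper emits for a single input line, here is a small sketch that returns the records as a list instead of printing them. The function name emit_round1 is hypothetical and used only for illustration.

```python
def emit_round1(line):
    """Return the round-1 mapper records for one 'user items' line."""
    _user, item_str = line.strip().split()
    items = sorted(set(item_str.split(',')))  # drop duplicates, fix the order
    out = []
    for idx, i1 in enumerate(items):
        out.append('%s 1 norm' % i1)
        for i2 in items[idx + 1:]:
            out.append('%s %s 1 dot' % (i1, i2))
    return out

print(emit_round1('u2 a,a,c'))
# ['a 1 norm', 'a c 1 dot', 'c 1 norm']
```

The duplicate a is counted once, and the pair appears only as a c, never as c a. Because the item list is sorted, every unordered pair has a single canonical key, which is what lets the reducer sum its counts in one place.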

reducer_1.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Round-1 reducer: input is sorted by i1. Sums norm(i1) and dot(i1, i2),
# then attaches norm(i1) to every pair record.
from __future__ import print_function
import sys

def printout():
    i1 = old_key
    print(i1, old_dict['norm'], 'norm')
    for i2 in old_dict['dot']:
        print(i1 + "-" + i2, old_dict['dot'][i2], old_dict['norm'], 'dot-norm_i1')

old_key = ""
old_dict = {}
for line in sys.stdin:
    sp = line.strip().split()
    if sp[-1] == 'norm':
        key, value = sp[:2]
        if key == old_key:
            old_dict['norm'] += int(value)
        else:
            if old_key != "":
                printout()
            old_key = key
            # notice: norm part should be int(value)
            old_dict = {'norm': int(value), 'dot': {}}
    elif sp[-1] == 'dot':
        key, i2, value = sp[:3]
        if key == old_key:
            if i2 not in old_dict['dot']:
                old_dict['dot'][i2] = 0
            old_dict['dot'][i2] += int(value)
        else:
            if old_key != "":
                printout()
            old_key = key
            old_dict = {'norm': 0, 'dot': {i2: int(value)}}
if old_key != "":
    printout()
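The old_key / old_dict bookkeeping in a reducer like this can also be written with itertools.groupby, which groups consecutive records that share a key. The following is an alternative sketch, not the author's code; reduce_round1 and mapper_output are hypothetical names, and the input is assumed to be sorted by its first field, as it would be after the shuffle.

```python
from itertools import groupby

def reduce_round1(sorted_lines):
    """Yield round-1 reducer records from mapper lines sorted by i1."""
    records = (line.split() for line in sorted_lines if line.strip())
    for i1, group in groupby(records, key=lambda sp: sp[0]):
        norm = 0
        dot = {}
        for sp in group:
            if sp[-1] == 'norm':
                norm += int(sp[1])
            elif sp[-1] == 'dot':
                dot[sp[1]] = dot.get(sp[1], 0) + int(sp[2])
        yield '%s %d norm' % (i1, norm)
        for i2 in sorted(dot):  # sorted only to make the output deterministic
            yield '%s-%s %d %d dot-norm_i1' % (i1, i2, dot[i2], norm)

# Mapper records for the two lines "u2 a,a,c" and "u3 b,d", sorted as text.
mapper_output = sorted([
    'a 1 norm', 'a c 1 dot', 'c 1 norm',
    'b 1 norm', 'b d 1 dot', 'd 1 norm',
])
result = list(reduce_round1(mapper_output))
for rec in result:
    print(rec)
```

The trade-off is that groupby buffers nothing beyond the current group and removes the duplicated "flush the previous key" branches, at the cost of hiding the explicit state that the hand-written version makes visible.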

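#!/usr/bin/python
# -*- coding: utf-8 -*-
# Round-2 mapper: pass 'norm' records through unchanged and
# re-key each pair record by i2 so it meets norm(i2) in the reducer.
from __future__ import print_function
import sys

for line in sys.stdin:
    sp = line.strip().split()
    if sp[-1] == 'norm':
        print(line.strip())
    elif sp[-1] == "dot-norm_i1":
        key, dot, norm_i1 = sp[:3]
        i1, i2 = key.split('-')
        print(i2, i1, dot, norm_i1, 'dot-norm_i1')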

reducer_2.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Round-2 reducer: input is sorted by i2. Joins norm(i2) with every
# (i1, dot, norm_i1) record for that i2 and computes the similarity.
from __future__ import print_function
import sys

def gensim(norm_i1, norm_i2, dot):
    return float(dot) / (int(norm_i1) + int(norm_i2) - int(dot))

def printout():
    i2 = old_key
    norm_i2 = old_dict['norm']
    for i1 in old_dict['dot']:
        dot, norm_i1 = old_dict['dot'][i1]
        sim = gensim(norm_i1, norm_i2, dot)
        print(i1 + "-" + i2, dot, norm_i1, norm_i2, sim, 'dot,norm_i1,norm_i2,sim')

old_key = ""
old_dict = {}
for line in sys.stdin:
    sp = line.strip().split()
    if sp[-1] == 'norm':
        key, value = sp[:2]
        if key == old_key:
            old_dict['norm'] = value
        else:
            if old_key != "":
                printout()
            old_key = key
            old_dict = {'norm': value, 'dot': {}}
    elif sp[-1] == 'dot-norm_i1':
        key, i1, dot, norm_i1 = sp[:4]  # key is i2.
        if key == old_key:
            if i1 not in old_dict['dot']:
                old_dict['dot'][i1] = (dot, norm_i1)
        else:
            if old_key != "":
                printout()
            old_key = key
            old_dict = {'norm': 0, 'dot': {i1: (dot, norm_i1)}}
if old_key != "":
    printout()

Run script t.sh:

#!/bin/bash
cat d.m.1 |./reducer_1.py > d.r.1
cat d.m.2 |./reducer_2.py > d.r.2
