Optimizing the Hadoop Implementation of ItemCF (Python)

2021-06-28 14:18:56 · 3412 characters · 4897 reads

The raw data (each line is a user followed by the items they interacted with) looks like this:

```
u1 a,d,b,c
u2 a,a,c
u3 b,d
u4 a,d,c
u5 a,b,c
```

The similarity formula is the Jaccard coefficient, where u(i) is the set of users who interacted with item i:

sim(i, j) = |u(i) ∩ u(j)| / |u(i) ∪ u(j)|

with the union size expanded as:

|u(i) ∪ u(j)| = |u(i)| + |u(j)| - |u(i) ∩ u(j)|
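As a quick sanity check of the formula on the sample data above (a standalone sketch, not part of the MapReduce job itself):

```python
# Compute Jaccard similarity directly from the sample data.
# Duplicate items per user (e.g. u2's "a,a,c") are deduplicated with set().
data = {
    "u1": "a,d,b,c", "u2": "a,a,c", "u3": "b,d",
    "u4": "a,d,c", "u5": "a,b,c",
}
users = {}  # item -> set of users who interacted with it
for u, items in data.items():
    for it in set(items.split(',')):
        users.setdefault(it, set()).add(u)

def sim(i, j):
    inter = len(users[i] & users[j])
    return inter / (len(users[i]) + len(users[j]) - inter)

print(sim('a', 'd'))  # |{u1,u4}| / (4 + 3 - 2) = 0.4
```

Items a and d are shared by u1 and u4 only, so their similarity is 2 / (4 + 3 - 2) = 0.4; the MapReduce job below should reproduce exactly these numbers.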

The original Hadoop implementation needed 5 MapReduce rounds; after optimization, two rounds suffice.

The excess rounds came from computing |u(i) ∪ u(j)|, which required re-keying the data several times; the cost was in the repeated key changes and shuffles, not in the computation itself. By changing what is passed along as the key, the whole job can be done in two rounds.

mapper_1.py

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys

for line in sys.stdin:
    user, item_str = line.strip().split()
    # dedupe (e.g. u2's "a,a,c") and sort so each pair is emitted as (min, max)
    item_list = sorted(set(item_str.split(',')))
    # debug output goes to stderr; printing it to stdout would corrupt the record stream
    print("item_str:", item_str, "item_list:", item_list, file=sys.stderr)
    for i, i1 in enumerate(item_list):
        print(i1, 1, 'norm')
        for i2 in item_list[i + 1:]:
            print(i1, i2, 1, 'dot')
```
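For instance, feeding u1's line through the mapper's logic yields one `norm` record per item plus one `dot` record per ordered pair (an illustration, not the streaming script itself):

```python
# Records mapper_1.py emits for the single input line "u1 a,d,b,c".
line = "u1 a,d,b,c"
user, item_str = line.split()
items = sorted(set(item_str.split(',')))  # ['a', 'b', 'c', 'd']
records = []
for i, i1 in enumerate(items):
    records.append((i1, 1, 'norm'))
    for i2 in items[i + 1:]:
        records.append((i1, i2, 1, 'dot'))
for r in records:
    print(*r)
```

Four items give 4 `norm` records and C(4,2) = 6 `dot` records, 10 in total; after the shuffle sort, all records for the same leading item land on the same reducer consecutively.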

reducer_1.py

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys

def printout():
    i1 = old_key
    print(i1, old_dict['norm'], 'norm')
    for i2 in old_dict['dot']:
        print(i1 + "-" + i2, old_dict['dot'][i2], old_dict['norm'], 'dot-norm_i1')

old_key = ""
old_dict = {'norm': 0, 'dot': {}}

for line in sys.stdin:
    sp = line.strip().split()
    if sp[-1] == 'norm':
        key, value = sp[:2]
        if key == old_key:
            old_dict['norm'] += int(value)
        else:
            if old_key != "":
                printout()
            old_key = key
            # notice: the norm part must be stored as int(value)
            old_dict = {'norm': int(value), 'dot': {}}
    elif sp[-1] == 'dot':
        key, i2, value = sp[:3]
        if key == old_key:
            if i2 not in old_dict['dot']:
                old_dict['dot'][i2] = 0
            old_dict['dot'][i2] += int(value)
        else:
            if old_key != "":
                printout()
            old_key = key
            old_dict = {'norm': 0, 'dot': {i2: int(value)}}

if old_key != "":
    printout()
```
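As a cross-check, reducer_1's expected output for item 'a' on the sample data can be recomputed directly. Since 'a' sorts before every other item, all of its pairs are keyed on 'a' (this recomputation is only valid for the smallest item):

```python
from collections import Counter

# User baskets from the sample data, deduplicated.
baskets = [set("abcd"), set("ac"), set("bd"), set("acd"), set("abc")]
norm_a = sum('a' in b for b in baskets)  # users who have item a
dot = Counter()
for b in baskets:
    if 'a' in b:
        for other in b - {'a'}:          # every co-item sorts after 'a'
            dot[other] += 1

print('a', norm_a, 'norm')               # a 4 norm
for i2 in sorted(dot):
    print('a-' + i2, dot[i2], norm_a, 'dot-norm_i1')
```

So the reducer should emit `a 4 norm`, then `a-b 2 4`, `a-c 4 4`, and `a-d 2 4`, each tagged `dot-norm_i1`.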

mapper_2.py

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys

for line in sys.stdin:
    sp = line.strip().split()
    if sp[-1] == 'norm':
        # pass norm records through unchanged
        print(line.strip())
    elif sp[-1] == 'dot-norm_i1':
        key, dot, norm_i1 = sp[:3]
        i1, i2 = key.split('-')
        # re-key on i2 so that round 2 groups by i2 and can join in norm_i2
        print(i2, i1, dot, norm_i1, 'dot-norm_i1')
```
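The re-keying is the entire trick: each `dot-norm_i1` record already carries the pair count and |u(i1)|, so round 2 only needs to group by i2 to pick up |u(i2)|. On a single record it looks like this (illustration only):

```python
# One intermediate record from reducer_1's output: pair a-d, dot=2, norm_a=4.
line = "a-d 2 4 dot-norm_i1"
key, dot, norm_i1 = line.split()[:3]
i1, i2 = key.split('-')
rekeyed = (i2, i1, dot, norm_i1, 'dot-norm_i1')
print(*rekeyed)  # d a 2 4 dot-norm_i1
```

After the shuffle, this record sorts next to the `d 3 norm` record, giving reducer_2 everything it needs for sim(a, d).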

reducer_2.py

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys

def gensim(norm_i1, norm_i2, dot):
    # Jaccard: dot / (norm_i1 + norm_i2 - dot)
    return float(dot) / (int(norm_i1) + int(norm_i2) - int(dot))

def printout():
    i2 = old_key
    norm_i2 = old_dict['norm']
    for i1 in old_dict['dot']:
        dot, norm_i1 = old_dict['dot'][i1]
        sim = gensim(norm_i1, norm_i2, dot)
        print(i1 + "-" + i2, dot, norm_i1, norm_i2, sim, 'dot,norm_i1,norm_i2,sim')

old_key = ""
old_dict = {'norm': 0, 'dot': {}}

for line in sys.stdin:
    sp = line.strip().split()
    if sp[-1] == 'norm':
        key, value = sp[:2]
        if key == old_key:
            old_dict['norm'] = int(value)
        else:
            if old_key != "":
                printout()
            old_key = key
            old_dict = {'norm': int(value), 'dot': {}}
    elif sp[-1] == 'dot-norm_i1':
        key, i1, dot, norm_i1 = sp[:4]  # key is i2
        if key == old_key:
            if i1 not in old_dict['dot']:
                old_dict['dot'][i1] = (dot, norm_i1)
        else:
            if old_key != "":
                printout()
            old_key = key
            old_dict = {'norm': 0, 'dot': {i1: (dot, norm_i1)}}

if old_key != "":
    printout()
```

The run script t.sh (here d.m.1 and d.m.2 are the sorted mapper outputs; locally they can be produced with `./mapper_1.py | sort` and `./mapper_2.py | sort`):

```shell
#!/bin/bash
cat d.m.1 | ./reducer_1.py > d.r.1
cat d.m.2 | ./reducer_2.py > d.r.2
```
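To check the whole pipeline without Hadoop, the two rounds can be simulated in plain Python (a standalone sketch that mirrors the mapper/reducer logic above, not the streaming scripts themselves):

```python
from collections import Counter

# Round 1: per user basket, count item frequencies (norm) and
# ordered-pair co-occurrences (dot), exactly as mapper_1/reducer_1 do.
data = ["u1 a,d,b,c", "u2 a,a,c", "u3 b,d", "u4 a,d,c", "u5 a,b,c"]
norm = Counter()
dot = Counter()
for line in data:
    user, item_str = line.split()
    items = sorted(set(item_str.split(',')))
    for i, i1 in enumerate(items):
        norm[i1] += 1
        for i2 in items[i + 1:]:
            dot[(i1, i2)] += 1

# Round 2: join in norm_i2 and compute the Jaccard similarity.
sims = {}
for (i1, i2), d in dot.items():
    sims[(i1, i2)] = d / (norm[i1] + norm[i2] - d)

print(sims[('a', 'd')])  # 2 / (4 + 3 - 2) = 0.4
```

The result matches the hand-computed value from the formula section: sim(a, d) = 0.4 and sim(a, c) = 1.0, since a and c are held by exactly the same four users.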
