
【Hadoop】Mahout recommenders: the co-occurrence matrix on Hadoop - RowSimilarityJob


Generating the co-occurrence matrix is the second major step of RecommenderJob; the first step, which generates the preference matrix, is analyzed in a separate post. Before diving into the analysis, it is worth skimming Mahout's short (English) introduction to RowSimilarityJob. There is also a mailing-list discussion between a user and one of the developers that I personally found quite helpful.

This step takes the preference matrix produced by PreparePreferenceMatrixJob in the previous step as its input and processes it further into the co-occurrence matrix. It is itself split into three sub-steps, each implemented by one mapper and one reducer. The sub-steps are analyzed one by one below.
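For orientation, RowSimilarityJob can also be launched on its own as a standard Hadoop Tool. Below is a minimal invocation sketch; the option names follow my reading of the Mahout 0.x sources, and the paths and column count are placeholders, so treat the details as assumptions rather than a recipe:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;

public class RunRowSimilarity {
  public static void main(String[] args) throws Exception {
    // paths are placeholders; --numberOfColumns must equal the number of users,
    // i.e. the column dimension of the preference matrix
    ToolRunner.run(new Configuration(), new RowSimilarityJob(), new String[] {
        "--input", "/tmp/preparePreferenceMatrix/ratingMatrix",
        "--output", "/tmp/similarityMatrix",
        "--numberOfColumns", "3",
        "--similarityClassname", "SIMILARITY_COOCCURRENCE",
        "--excludeSelfSimilarity", "true"
    });
  }
}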

Sub-step 1:

The mapper, VectorNormMapper, regroups the preference-matrix vectors. It takes the vectors keyed by itemId with userId entries, (itemId, VectorWritable<userId, pref>), and turns them into vectors keyed by userId with itemId entries, (userId, VectorWritable<itemId, pref>). Oddly, one of PreparePreferenceMatrixJob's sub-steps had already produced vectors in the (userId, VectorWritable<itemId, pref>) format, only to convert them into (itemId, VectorWritable<userId, pref>), and here they are converted back again. I honestly don't understand why this round trip is needed.
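To make the regrouping concrete, here is a minimal local sketch of the mapper's core emission loop. It assumes mahout-math is on the classpath and a Mahout version where iterateNonZero() still exists (as in the code quoted below); the IDs and preference values are made up:

import java.util.Iterator;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class TransposeSketch {
  public static void main(String[] args) {
    // one row of the preference matrix: itemId 7, rated by user 1 (5.0) and user 3 (3.0)
    int itemId = 7;
    Vector itemRow = new RandomAccessSparseVector(Integer.MAX_VALUE);
    itemRow.setQuick(1, 5.0);
    itemRow.setQuick(3, 3.0);

    // emit one partial column vector per non-zero entry, keyed by userId,
    // just as VectorNormMapper does: (userId, {itemId: pref})
    Iterator<Vector.Element> it = itemRow.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      Vector partialColumn = new RandomAccessSparseVector(Integer.MAX_VALUE);
      partialColumn.setQuick(itemId, e.get());
      System.out.println("key=" + e.index() + " value=" + partialColumn);
    }
  }
}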

Note that the mapper writes output not only in map() but also in its cleanup() method.

protected void map(IntWritable row, VectorWritable vectorWritable, Context ctx)
        throws IOException, InterruptedException {
      //(itemId, VectorWritable<userId, pref>)
      Vector rowVector = similarity.normalize(vectorWritable.get());

      int numNonZeroEntries = 0;
      double maxValue = Double.MIN_VALUE;

      Iterator<Vector.Element> nonZeroElements = rowVector.iterateNonZero();
      while (nonZeroElements.hasNext()) {
        Vector.Element element = nonZeroElements.next();
        RandomAccessSparseVector partialColumnVector = new RandomAccessSparseVector(Integer.MAX_VALUE);
        partialColumnVector.setQuick(row.get(), element.get());
        ctx.write(new IntWritable(element.index()), new VectorWritable(partialColumnVector));

        numNonZeroEntries++;
        if (maxValue < element.get()) {
          maxValue = element.get();
        }
      }

      if (threshold != NO_THRESHOLD) {
        nonZeroEntries.setQuick(row.get(), numNonZeroEntries);
        maxValues.setQuick(row.get(), maxValue);
      }
      norms.setQuick(row.get(), similarity.norm(rowVector));

      ctx.getCounter(Counters.ROWS).increment(1);
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      super.cleanup(ctx);
      // dirty trick: ship the accumulated norms / counts / max values as extra marker rows
      ctx.write(new IntWritable(NORM_VECTOR_MARKER), new VectorWritable(norms));
      ctx.write(new IntWritable(NUM_NON_ZERO_ENTRIES_VECTOR_MARKER), new VectorWritable(nonZeroEntries));
      ctx.write(new IntWritable(MAXVALUE_VECTOR_MARKER), new VectorWritable(maxValues));
    }

The reducer, MergeVectorsReducer, merges the mapper output (userId, VectorWritable<itemId, pref>): all partial vectors for the same userId are merged into a single vector, which is then written out. A combiner (which performs the same vector merge) also runs before the reducer. When writing, the reducer dispatches on the key: as the mapper code above shows, besides the regular rows it emits several special vectors under reserved marker keys (negative constants chosen so they cannot collide with real user IDs), and the reducer writes the norms, max-values, and non-zero-count vectors to their own side files instead of the main output.

protected void reduce(IntWritable row, Iterable<VectorWritable> partialVectors, Context ctx)
        throws IOException, InterruptedException {
      Vector partialVector = Vectors.merge(partialVectors);

      if (row.get() == NORM_VECTOR_MARKER) {
        Vectors.write(partialVector, normsPath, ctx.getConfiguration());
      } else if (row.get() == MAXVALUE_VECTOR_MARKER) {
        Vectors.write(partialVector, maxValuesPath, ctx.getConfiguration());
      } else if (row.get() == NUM_NON_ZERO_ENTRIES_VECTOR_MARKER) {
        Vectors.write(partialVector, numNonZeroEntriesPath, ctx.getConfiguration(), true);
      } else {
        ctx.write(row, new VectorWritable(partialVector));
      }
    }
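For intuition, the merge just overlays all partial vectors that arrived under one key. A hand-rolled sketch of the idea follows; the real logic lives in Mahout's Vectors.merge, and this toy version leans on the fact that, for a given user, each itemId shows up in exactly one partial vector, so a plain setQuick is enough:

import java.util.Iterator;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class MergeSketch {
  // overlay all partial vectors into a single vector
  static Vector merge(Iterable<VectorWritable> partials) {
    Vector merged = new RandomAccessSparseVector(Integer.MAX_VALUE);
    for (VectorWritable partial : partials) {
      Iterator<Vector.Element> it = partial.get().iterateNonZero();
      while (it.hasNext()) {
        Vector.Element e = it.next();
        merged.setQuick(e.index(), e.get());
      }
    }
    return merged;
  }
}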

Sub-step 2:

The mapper, CooccurrencesMapper, processes the previous reducer's output (userId, VectorWritable<itemId, pref>): it loops over each vector, pairs up every two elements in it, and emits (itemId_index, Vector<itemId_index, value>). How value is computed depends on the similarity measure chosen for the recommender; with a count-based strategy such as CountbasedMeasure, whenever two items have been seen by the same user, the emitted value is simply 1.

 protected void map(IntWritable column, VectorWritable occurrenceVector, Context ctx)
        throws IOException, InterruptedException {
      //(userId, VectorWritable<itemId,pref>)
      Vector.Element[] occurrences = Vectors.toArray(occurrenceVector);
      Arrays.sort(occurrences, BY_INDEX);

      int cooccurrences = 0;
      int prunedCooccurrences = 0;
      // The inner loop starts at m = n, so occurrenceA is also paired with itself;
      // that self pair really is emitted here. It is only removed later, in
      // SimilarityReducer, when excludeSelfSimilarity is set (see below).
      for (int n = 0; n < occurrences.length; n++) {
        Vector.Element occurrenceA = occurrences[n];
        Vector dots = new RandomAccessSparseVector(Integer.MAX_VALUE);
        for (int m = n; m < occurrences.length; m++) {
          Vector.Element occurrenceB = occurrences[m];
          if (threshold == NO_THRESHOLD || consider(occurrenceA, occurrenceB)) {
        	// with CountbasedMeasure, aggregate() always returns 1
            dots.setQuick(occurrenceB.index(), similarity.aggregate(occurrenceA.get(), occurrenceB.get()));
            cooccurrences++;
          } else {
            prunedCooccurrences++;
          }
        }
        // emit itemA's partial vector: every item this user pairs with itemA, plus the aggregated values
        ctx.write(new IntWritable(occurrenceA.index()), new VectorWritable(dots));
      }
      ctx.getCounter(Counters.COOCCURRENCES).increment(cooccurrences);
      ctx.getCounter(Counters.PRUNED_COOCCURRENCES).increment(prunedCooccurrences);
    }

The input looks like this:

column1: row1, row2, row3
column2: row1, row3
column3: row2

The output will be:

For column1:

(row1,row2)
(row1,row3)
(row2,row3)

For column2:

(row1,row3)

For column3 there is nothing to emit.
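Summing the partial vectors per leading row then gives the counts the next reducer starts from: (row1,row3) is produced by both column1 and column2, so row1's merged vector becomes {row2: 1, row3: 2}, and row2's becomes {row3: 1}.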

The reducer, SimilarityReducer, aggregates these partial results, computes the similarity between each pair of items, and produces the final co-occurrence (similarity) matrix.

protected void reduce(IntWritable row, Iterable<VectorWritable> partialDots, Context ctx)
        throws IOException, InterruptedException {
    	//(itemIdA, VectorWritable<itemIdB,1>)
      Iterator<VectorWritable> partialDotsIterator = partialDots.iterator();
      Vector dots = partialDotsIterator.next().get();
      while (partialDotsIterator.hasNext()) {
        Vector toAdd = partialDotsIterator.next().get();
        Iterator<Vector.Element> nonZeroElements = toAdd.iterateNonZero();
        while (nonZeroElements.hasNext()) {
          Vector.Element nonZeroElement = nonZeroElements.next();
          // accumulate the partial scores per co-occurring itemId for this row
          dots.setQuick(nonZeroElement.index(), dots.getQuick(nonZeroElement.index()) + nonZeroElement.get());
        }
      }

      Vector similarities = dots.like();
      double normA = norms.getQuick(row.get());
      Iterator<Vector.Element> dotsWith = dots.iterateNonZero();
      while (dotsWith.hasNext()) {
        Vector.Element b = dotsWith.next();
        double similarityValue = similarity.similarity(b.get(), normA, norms.getQuick(b.index()), numberOfColumns);
        if (similarityValue >= treshold) { // sic: the field is spelled "treshold" in the Mahout source
          similarities.set(b.index(), similarityValue);
        }
      }
      if (excludeSelfSimilarity) {
        similarities.setQuick(row.get(), 0);
      }
      ctx.write(row, new VectorWritable(similarities));
    }
  }
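To ground the similarity() call above in a concrete measure: with cosine similarity the accumulated dot product is divided by the product of the two norms precomputed in the first sub-step (this matches my reading of Mahout's CosineSimilarity, but treat the exact formula as an assumption; CountbasedMeasure, by contrast, essentially returns the summed co-occurrence count as-is). A minimal sketch:

public class CosineSketch {
  // what a cosine-style VectorSimilarityMeasure boils down to:
  // dot(A, B) / (||A|| * ||B||), where normA and normB are the values
  // the first sub-step wrote to the norms side file
  static double cosine(double dots, double normA, double normB) {
    return dots / (normA * normB);
  }

  public static void main(String[] args) {
    // e.g. A = (3, 5), B = (1, 1): dot = 8, ||A|| = sqrt(34), ||B|| = sqrt(2)
    System.out.println(cosine(8.0, Math.sqrt(34.0), Math.sqrt(2.0))); // ~0.9701
  }
}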
