
【Hadoop】Mahout recommenders: the co-occurrence matrix on Hadoop - RowSimilarityJob


Generating the co-occurrence matrix is the second major step of RecommenderJob; the first step, which generates the preference matrix, is analyzed in a separate post. Before diving into the analysis, it is worth skimming Mahout's short (English) introduction to RowSimilarityJob. There is also a mailing-list discussion between a user and one of the developers that I personally found quite helpful.

This step takes the preference matrix produced by PreparePreferenceMatrixJob in the previous step as its input and processes it further into the co-occurrence matrix. It is itself split into three sub-steps, each implemented by one mapper and one reducer. The sub-steps are analyzed one by one below.
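For orientation, RowSimilarityJob can also be launched on its own as a standard Hadoop Tool. Below is a minimal invocation sketch; the option names follow my reading of the Mahout 0.x sources, and the paths and column count are placeholders, so treat the details as assumptions rather than a recipe:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;

public class RunRowSimilarity {
  public static void main(String[] args) throws Exception {
    // paths are placeholders; --numberOfColumns must equal the number of users,
    // i.e. the column dimension of the preference matrix
    ToolRunner.run(new Configuration(), new RowSimilarityJob(), new String[] {
        "--input", "/tmp/preparePreferenceMatrix/ratingMatrix",
        "--output", "/tmp/similarityMatrix",
        "--numberOfColumns", "3",
        "--similarityClassname", "SIMILARITY_COOCCURRENCE",
        "--excludeSelfSimilarity", "true"
    });
  }
}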

Sub-step 1:

The mapper, VectorNormMapper, regroups the preference-matrix vectors. It takes the vectors keyed by itemId with userId entries, (itemId, VectorWritable<userId, pref>), and turns them into vectors keyed by userId with itemId entries, (userId, VectorWritable<itemId, pref>). Oddly, one of PreparePreferenceMatrixJob's sub-steps had already produced vectors in the (userId, VectorWritable<itemId, pref>) format, only to convert them into (itemId, VectorWritable<userId, pref>), and here they are converted back again. I honestly don't understand why this round trip is needed.
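To make the regrouping concrete, here is a minimal local sketch of the mapper's core emission loop. It assumes mahout-math is on the classpath and a Mahout version where iterateNonZero() still exists (as in the code quoted below); the IDs and preference values are made up:

import java.util.Iterator;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class TransposeSketch {
  public static void main(String[] args) {
    // one row of the preference matrix: itemId 7, rated by user 1 (5.0) and user 3 (3.0)
    int itemId = 7;
    Vector itemRow = new RandomAccessSparseVector(Integer.MAX_VALUE);
    itemRow.setQuick(1, 5.0);
    itemRow.setQuick(3, 3.0);

    // emit one partial column vector per non-zero entry, keyed by userId,
    // just as VectorNormMapper does: (userId, {itemId: pref})
    Iterator<Vector.Element> it = itemRow.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      Vector partialColumn = new RandomAccessSparseVector(Integer.MAX_VALUE);
      partialColumn.setQuick(itemId, e.get());
      System.out.println("key=" + e.index() + " value=" + partialColumn);
    }
  }
}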

Note that the mapper writes output not only in map() but also in its cleanup() method.

protected void map(IntWritable row, VectorWritable vectorWritable, Context ctx)
        throws IOException, InterruptedException {
      //(itemId, VectorWritable<userId, pref>)
      Vector rowVector = similarity.normalize(vectorWritable.get());

      int numNonZeroEntries = 0;
      double maxValue = Double.MIN_VALUE;

      Iterator<Vector.Element> nonZeroElements = rowVector.iterateNonZero();
      while (nonZeroElements.hasNext()) {
        Vector.Element element = nonZeroElements.next();
        RandomAccessSparseVector partialColumnVector = new RandomAccessSparseVector(Integer.MAX_VALUE);
        partialColumnVector.setQuick(row.get(), element.get());
        ctx.write(new IntWritable(element.index()), new VectorWritable(partialColumnVector));

        numNonZeroEntries++;
        if (maxValue < element.get()) {
          maxValue = element.get();
        }
      }

      if (threshold != NO_THRESHOLD) {
        nonZeroEntries.setQuick(row.get(), numNonZeroEntries);
        maxValues.setQuick(row.get(), maxValue);
      }
      norms.setQuick(row.get(), similarity.norm(rowVector));

      ctx.getCounter(Counters.ROWS).increment(1);
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      super.cleanup(ctx);
      // dirty trick: ship the accumulated norms / counts / max values as extra marker rows
      ctx.write(new IntWritable(NORM_VECTOR_MARKER), new VectorWritable(norms));
      ctx.write(new IntWritable(NUM_NON_ZERO_ENTRIES_VECTOR_MARKER), new VectorWritable(nonZeroEntries));
      ctx.write(new IntWritable(MAXVALUE_VECTOR_MARKER), new VectorWritable(maxValues));
    }

The reducer, MergeVectorsReducer, merges the mapper output (userId, VectorWritable<itemId, pref>): all partial vectors for the same userId are merged into a single vector, which is then written out. A combiner (which performs the same vector merge) also runs before the reducer. When writing, the reducer dispatches on the key: as the mapper code above shows, besides the regular rows it emits several special vectors under reserved marker keys (negative constants chosen so they cannot collide with real user IDs), and the reducer writes the norms, max-values, and non-zero-count vectors to their own side files instead of the main output.

protected void reduce(IntWritable row, Iterable<VectorWritable> partialVectors, Context ctx)
        throws IOException, InterruptedException {
      Vector partialVector = Vectors.merge(partialVectors);

      if (row.get() == NORM_VECTOR_MARKER) {
        Vectors.write(partialVector, normsPath, ctx.getConfiguration());
      } else if (row.get() == MAXVALUE_VECTOR_MARKER) {
        Vectors.write(partialVector, maxValuesPath, ctx.getConfiguration());
      } else if (row.get() == NUM_NON_ZERO_ENTRIES_VECTOR_MARKER) {
        Vectors.write(partialVector, numNonZeroEntriesPath, ctx.getConfiguration(), true);
      } else {
        ctx.write(row, new VectorWritable(partialVector));
      }
    }
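For intuition, the merge just overlays all partial vectors that arrived under one key. A hand-rolled sketch of the idea follows; the real logic lives in Mahout's Vectors.merge, and this toy version leans on the fact that, for a given user, each itemId shows up in exactly one partial vector, so a plain setQuick is enough:

import java.util.Iterator;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class MergeSketch {
  // overlay all partial vectors into a single vector
  static Vector merge(Iterable<VectorWritable> partials) {
    Vector merged = new RandomAccessSparseVector(Integer.MAX_VALUE);
    for (VectorWritable partial : partials) {
      Iterator<Vector.Element> it = partial.get().iterateNonZero();
      while (it.hasNext()) {
        Vector.Element e = it.next();
        merged.setQuick(e.index(), e.get());
      }
    }
    return merged;
  }
}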

Sub-step 2:

The mapper, CooccurrencesMapper, processes the previous reducer's output (userId, VectorWritable<itemId, pref>): it loops over each vector, pairs up every two elements in it, and emits (itemId_index, Vector<itemId_index, value>). How value is computed depends on the similarity measure chosen for the recommender; with a count-based strategy such as CountbasedMeasure, whenever two items have been seen by the same user, the emitted value is simply 1.

 protected void map(IntWritable column, VectorWritable occurrenceVector, Context ctx)
        throws IOException, InterruptedException {
      //(userId, VectorWritable<itemId,pref>)
      Vector.Element[] occurrences = Vectors.toArray(occurrenceVector);
      Arrays.sort(occurrences, BY_INDEX);

      int cooccurrences = 0;
      int prunedCooccurrences = 0;
      // The inner loop starts at m = n, so occurrenceA is also paired with itself;
      // that self pair really is emitted here. It is only removed later, in
      // SimilarityReducer, when excludeSelfSimilarity is set (see below).
      for (int n = 0; n < occurrences.length; n++) {
        Vector.Element occurrenceA = occurrences[n];
        Vector dots = new RandomAccessSparseVector(Integer.MAX_VALUE);
        for (int m = n; m < occurrences.length; m++) {
          Vector.Element occurrenceB = occurrences[m];
          if (threshold == NO_THRESHOLD || consider(occurrenceA, occurrenceB)) {
        	// with CountbasedMeasure, aggregate() always returns 1
            dots.setQuick(occurrenceB.index(), similarity.aggregate(occurrenceA.get(), occurrenceB.get()));
            cooccurrences++;
          } else {
            prunedCooccurrences++;
          }
        }
        // emit itemA's partial vector: every item this user pairs with itemA, plus the aggregated values
        ctx.write(new IntWritable(occurrenceA.index()), new VectorWritable(dots));
      }
      ctx.getCounter(Counters.COOCCURRENCES).increment(cooccurrences);
      ctx.getCounter(Counters.PRUNED_COOCCURRENCES).increment(prunedCooccurrences);
    }

The input looks like this:

column1: row1, row2, row3
column2: row1, row3
column3: row2

The output will be:

For column1:

(row1,row2)
(row1,row3)
(row2,row3)

For column2:

(row1,row3)

For column3 there is nothing to emit.
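Summing the partial vectors per leading row then gives the counts the next reducer starts from: (row1,row3) is produced by both column1 and column2, so row1's merged vector becomes {row2: 1, row3: 2}, and row2's becomes {row3: 1}.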

The reducer, SimilarityReducer, aggregates these partial results, computes the similarity between each pair of items, and produces the final co-occurrence (similarity) matrix.

protected void reduce(IntWritable row, Iterable<VectorWritable> partialDots, Context ctx)
        throws IOException, InterruptedException {
    	//(itemIdA, VectorWritable<itemIdB,1>)
      Iterator<VectorWritable> partialDotsIterator = partialDots.iterator();
      Vector dots = partialDotsIterator.next().get();
      while (partialDotsIterator.hasNext()) {
        Vector toAdd = partialDotsIterator.next().get();
        Iterator<Vector.Element> nonZeroElements = toAdd.iterateNonZero();
        while (nonZeroElements.hasNext()) {
          Vector.Element nonZeroElement = nonZeroElements.next();
          // accumulate the partial scores per co-occurring itemId for this row
          dots.setQuick(nonZeroElement.index(), dots.getQuick(nonZeroElement.index()) + nonZeroElement.get());
        }
      }

      Vector similarities = dots.like();
      double normA = norms.getQuick(row.get());
      Iterator<Vector.Element> dotsWith = dots.iterateNonZero();
      while (dotsWith.hasNext()) {
        Vector.Element b = dotsWith.next();
        double similarityValue = similarity.similarity(b.get(), normA, norms.getQuick(b.index()), numberOfColumns);
        if (similarityValue >= treshold) { // sic: the field is spelled "treshold" in the Mahout source
          similarities.set(b.index(), similarityValue);
        }
      }
      if (excludeSelfSimilarity) {
        similarities.setQuick(row.get(), 0);
      }
      ctx.write(row, new VectorWritable(similarities));
    }
  }
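To ground the similarity() call above in a concrete measure: with cosine similarity the accumulated dot product is divided by the product of the two norms precomputed in the first sub-step (this matches my reading of Mahout's CosineSimilarity, but treat the exact formula as an assumption; CountbasedMeasure, by contrast, essentially returns the summed co-occurrence count as-is). A minimal sketch:

public class CosineSketch {
  // what a cosine-style VectorSimilarityMeasure boils down to:
  // dot(A, B) / (||A|| * ||B||), where normA and normB are the values
  // the first sub-step wrote to the norms side file
  static double cosine(double dots, double normA, double normB) {
    return dots / (normA * normB);
  }

  public static void main(String[] args) {
    // e.g. A = (3, 5), B = (1, 1): dot = 8, ||A|| = sqrt(34), ||B|| = sqrt(2)
    System.out.println(cosine(8.0, Math.sqrt(34.0), Math.sqrt(2.0))); // ~0.9701
  }
}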
