Reposted from http://book.douban.com/annotation/17067489/
There are generally two ways to build custom data types in Hadoop: a simpler one that works only for values, and a more complete one that works for both keys and values. 1. Implement the Writable interface:
    /* DataInput and DataOutput are interfaces from java.io */
    public interface Writable {
        void readFields(DataInput in) throws IOException;
        void write(DataOutput out) throws IOException;
    }
Here is a small example:
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class Point3D implements Writable {
        public float x, y, z;

        public Point3D(float fx, float fy, float fz) {
            this.x = fx;
            this.y = fy;
            this.z = fz;
        }

        public Point3D() {
            this(0.0f, 0.0f, 0.0f);
        }

        public void readFields(DataInput in) throws IOException {
            x = in.readFloat();
            y = in.readFloat();
            z = in.readFloat();
        }

        public void write(DataOutput out) throws IOException {
            out.writeFloat(x);
            out.writeFloat(y);
            out.writeFloat(z);
        }

        public String toString() {
            return Float.toString(x) + ", "
                 + Float.toString(y) + ", "
                 + Float.toString(z);
        }
    }
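As a quick sanity check, the write()/readFields() pair can be exercised outside Hadoop with plain java.io streams. The sketch below is my own (the class name RoundTrip is a placeholder, and the Writable interface declaration is dropped so no Hadoop jars are needed); it round-trips a point through a byte buffer, which is essentially what Hadoop does when it ships a value between tasks:

```java
import java.io.*;

// A trimmed copy of Point3D: same fields and same two serialization
// methods, minus the Hadoop interface, so it compiles standalone.
class Point3D {
    public float x, y, z;
    public Point3D(float fx, float fy, float fz) { x = fx; y = fy; z = fz; }
    public Point3D() { this(0.0f, 0.0f, 0.0f); }
    public void readFields(DataInput in) throws IOException {
        x = in.readFloat(); y = in.readFloat(); z = in.readFloat();
    }
    public void write(DataOutput out) throws IOException {
        out.writeFloat(x); out.writeFloat(y); out.writeFloat(z);
    }
    public String toString() { return x + ", " + y + ", " + z; }
}

public class RoundTrip {
    public static void main(String[] args) throws IOException {
        Point3D p = new Point3D(1.0f, 2.0f, 3.0f);

        // Serialize into an in-memory buffer.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        p.write(new DataOutputStream(buf));

        // Deserialize into a fresh, empty instance.
        Point3D q = new Point3D();
        q.readFields(new DataInputStream(
            new ByteArrayInputStream(buf.toByteArray())));

        System.out.println(q);          // prints 1.0, 2.0, 3.0
        System.out.println(buf.size()); // 3 floats = 12 bytes
    }
}
```

Note that readFields() must consume exactly the bytes that write() produced, in the same order; any asymmetry corrupts every record that follows in the stream.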
2. Keys additionally need a sort order. Hadoop's answer in Java is the generic interface WritableComparable<T>, which, as the name suggests, is half Writable and half Comparable. A bit wordy, but self-explanatory (Java programmers are said to be fast typists anyway).
    /* In Hadoop, WritableComparable<T> extends Writable and Comparable<T>;
     * spelled out, it amounts to: */
    public interface WritableComparable<T> {
        public void readFields(DataInput in) throws IOException;
        public void write(DataOutput out) throws IOException;
        public int compareTo(T other);
    }
Here is a simple example first; explanation and extensions follow.
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class Point3D implements WritableComparable<Point3D> {
        public float x, y, z;

        public Point3D(float fx, float fy, float fz) {
            this.x = fx;
            this.y = fy;
            this.z = fz;
        }

        public Point3D() {
            this(0.0f, 0.0f, 0.0f);
        }

        public void readFields(DataInput in) throws IOException {
            x = in.readFloat();
            y = in.readFloat();
            z = in.readFloat();
        }

        public void write(DataOutput out) throws IOException {
            out.writeFloat(x);
            out.writeFloat(y);
            out.writeFloat(z);
        }

        public String toString() {
            return Float.toString(x) + ", "
                 + Float.toString(y) + ", "
                 + Float.toString(z);
        }

        public float distanceFromOrigin() {
            return (float) Math.sqrt(x*x + y*y + z*z);
        }

        public int compareTo(Point3D other) {
            return Float.compare(
                distanceFromOrigin(),
                other.distanceFromOrigin());
        }

        public boolean equals(Object o) {
            if (!(o instanceof Point3D)) {
                return false;
            }
            Point3D other = (Point3D) o;
            return this.x == other.x
                && this.y == other.y
                && this.z == other.z;
        }

        /* Implementing hashCode() matters:
         * Hadoop's Partitioners rely on it; more on that later. */
        public int hashCode() {
            return Float.floatToIntBits(x)
                 ^ Float.floatToIntBits(y)
                 ^ Float.floatToIntBits(z);
        }
    }
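To see what compareTo() and hashCode() buy you, the Hadoop-free sketch below (my own; the class keeps only those two methods, and the default HashPartitioner's arithmetic is reproduced by hand) sorts a few keys by distance from the origin and assigns each to a partition the way Hadoop's default partitioner would:

```java
import java.util.*;

// Trimmed Point3D: only the key-related methods survive.
class Point3D implements Comparable<Point3D> {
    public float x, y, z;
    Point3D(float fx, float fy, float fz) { x = fx; y = fy; z = fz; }
    public float distanceFromOrigin() {
        return (float) Math.sqrt(x*x + y*y + z*z);
    }
    public int compareTo(Point3D other) {
        return Float.compare(distanceFromOrigin(), other.distanceFromOrigin());
    }
    public int hashCode() {
        return Float.floatToIntBits(x)
             ^ Float.floatToIntBits(y)
             ^ Float.floatToIntBits(z);
    }
    public String toString() { return x + "," + y + "," + z; }
}

public class KeyDemo {
    public static void main(String[] args) {
        List<Point3D> keys = Arrays.asList(
            new Point3D(3, 0, 0), new Point3D(1, 0, 0), new Point3D(0, 2, 0));

        // Sorting is driven entirely by compareTo(): nearest point first.
        Collections.sort(keys);
        System.out.println(keys); // [1.0,0.0,0.0, 0.0,2.0,0.0, 3.0,0.0,0.0]

        // HashPartitioner logic: (hash & Integer.MAX_VALUE) % numPartitions.
        // The mask clears the sign bit so the result is never negative.
        int numPartitions = 4;
        for (Point3D k : keys) {
            int part = (k.hashCode() & Integer.MAX_VALUE) % numPartitions;
            System.out.println(k + " -> partition " + part);
        }
    }
}
```

Two points follow from this: equal keys must land in the same partition, which is why hashCode() must be consistent with equals(); and the reduce phase sees keys in compareTo() order, so the comparison defines the grouping order, not just a cosmetic sort.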
After defining a custom Hadoop data type, you must tell Hadoop to use it. That is JobConf's job (in the old mapred API): call setOutputKeyClass() / setOutputValueClass():
    void setOutputKeyClass(Class<?> theClass)
    void setOutputValueClass(Class<?> theClass)
By default, these set the output types for both the Map and Reduce phases. When the map output types differ from the final output, use the dedicated setMapOutputKeyClass() / setMapOutputValueClass() methods; there is no separate reduce-output setter, since the reduce output is exactly what setOutputKeyClass() / setOutputValueClass() configure.
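Put together, a driver using the old mapred API might wire the types in like this. This is a sketch only: MyDriver, MyMapper, and MyReducer are placeholder names not from the article, and it needs the Hadoop jars to compile.

```
// Point3D as the output value type, Text as the key type (an assumption
// for illustration; any WritableComparable would do for the key).
JobConf conf = new JobConf(MyDriver.class);
conf.setJobName("point3d-demo");

conf.setOutputKeyClass(Text.class);      // key type for map and reduce output
conf.setOutputValueClass(Point3D.class); // value type for map and reduce output

conf.setMapperClass(MyMapper.class);
conf.setReducerClass(MyReducer.class);
JobClient.runJob(conf);
```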