Replacing an RPC return value in Hadoop with protobuf

cloudtech


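protobuf is a framework developed by Google for serializing structured data, used for communication protocols, data storage, and similar purposes. It supports C++, Java, Python, and other languages. Hadoop 0.23 and later versions use protobuf to serialize and deserialize RPC. This post describes an experiment: on Hadoop 0.19, serialize and deserialize the return value of one RPC with protobuf.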

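protobuf has to be downloaded and installed before it can be used. Roughly, after downloading and unpacking the tarball, run the following steps in order: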

[protobuf-2.4.1]$ ./configure

[protobuf-2.4.1]$ make

[protobuf-2.4.1]$ make check

[protobuf-2.4.1]$ make install # this step requires root privileges (sudo)

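Since Hadoop is written in Java, the jar has to be built from the java directory, as follows:

[protobuf-2.4.1/java]$ mvn test # requires Maven to be installed locally

[protobuf-2.4.1/java]$ mvn install

[protobuf-2.4.1/java]$ mvn package # this step generates a jar in the target directory,

# named protobuf-java-2.4.1.jar; the jar must be placed in Hadoop's lib directory

# so that it is available at compile time and at run time.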

Once the local environment is installed, the next step is to write the proto file. For convenience, I chose to describe the ClusterStatus class in a proto file.

Once the proto file is written, compile it with the protoc tool to generate the corresponding Java file. The command line is:

protoc -I=$SRC_DIR --java_out=$DST_DIR $SRC_DIR/my.proto

The key option is --java_out, which specifies where the generated Java files are placed. Compiling the proto file for ClusterStatus produces org/apache/hadoop/mapred/ClusterStatusProtos.java.
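To give a feel for what the generated file offers, here is a minimal hand-written stand-in, not actual protoc output. It mimics only the builder-style surface that generated messages expose (newBuilder(), chained setters, build(), toBuilder()) for two of the fields. The class name ClusterStatusSketch is hypothetical, and the real generated class is far larger and also handles serialization and parsing.

```java
/**
 * Hand-written stand-in for a protoc-generated message (an assumption for
 * illustration only; the real ClusterStatusProtos.java is generated code).
 */
final class ClusterStatusSketch {
    private final int taskTrackers;
    private final int mapTasks;

    private ClusterStatusSketch(Builder b) {
        this.taskTrackers = b.taskTrackers;
        this.mapTasks = b.mapTasks;
    }

    /** Messages are immutable; all mutation goes through a Builder. */
    static Builder newBuilder() {
        return new Builder();
    }

    /** Copies this message's fields into a fresh, mutable Builder. */
    Builder toBuilder() {
        return new Builder().setTaskTrackers(taskTrackers).setMapTasks(mapTasks);
    }

    int getTaskTrackers() { return taskTrackers; }
    int getMapTasks() { return mapTasks; }

    static final class Builder {
        private int taskTrackers;
        private int mapTasks;

        // Setters return the builder itself so that calls can be chained.
        Builder setTaskTrackers(int value) { taskTrackers = value; return this; }
        Builder setMapTasks(int value) { mapTasks = value; return this; }

        int getTaskTrackers() { return taskTrackers; }
        int getMapTasks() { return mapTasks; }

        /** Snapshots the current builder state into an immutable message. */
        ClusterStatusSketch build() { return new ClusterStatusSketch(this); }
    }

    public static void main(String[] args) {
        ClusterStatusSketch s = newBuilder().setTaskTrackers(4).setMapTasks(10).build();
        // toBuilder() lets us derive a modified copy without touching the original.
        ClusterStatusSketch t = s.toBuilder().setMapTasks(11).build();
        System.out.println(s.getMapTasks() + " " + t.getMapTasks());
    }
}
```

Because built messages are immutable, a class that needs to keep changing values has to hold on to a Builder rather than a built message.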

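The next step is to modify ClusterStatus.java. To avoid changing how ObjectWritable.java serializes Hadoop objects, the ClusterStatus class wraps a ClusterStatusProtos.ClusterStatus.Builder object named status. All of the class's member variables are removed and every method is changed to operate on status instead. The corresponding diff is as follows: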

Index: src/mapred/org/apache/hadoop/mapred/ClusterStatus.java
===================================================================
--- src/mapred/org/apache/hadoop/mapred/ClusterStatus.java      (revision 106658)
+++ src/mapred/org/apache/hadoop/mapred/ClusterStatus.java      (working copy)
@@ -21,9 +21,12 @@
 import java.io.DataInput;
 import java.io.DataOutput;
 import java.io.IOException;
+import java.io.InputStream;
 import org.apache.hadoop.io.Writable;
-import org.apache.hadoop.io.WritableUtils;
+import org.apache.hadoop.io.DataOutputOutputStream;
+import org.apache.hadoop.mapred.ClusterStatusProtos;
+import org.apache.hadoop.mapred.ClusterStatusProtos.ClusterStatus.JTState;
 /**
  * Status information on the current state of the Map-Reduce cluster.
@@ -50,14 +53,10 @@
  * @see JobClient
  */
 public class ClusterStatus implements Writable {
+  ClusterStatusProtos.ClusterStatus.Builder status =
+    ClusterStatusProtos.ClusterStatus.newBuilder();
-  private int task_trackers;
-  private int map_tasks;
-  private int reduce_tasks;
-  private int max_map_tasks;
-  private int max_reduce_tasks;
-  private JobTracker.State state;
   ClusterStatus() {}
   /**
@@ -70,13 +69,13 @@
    * @param state the {@link JobTracker.State} of the <code>JobTracker</code>
    */
   ClusterStatus(int trackers, int maps, int reduces, int maxMaps,
-                int maxReduces, JobTracker.State state) {
-    task_trackers = trackers;
-    map_tasks = maps;
-    reduce_tasks = reduces;
-    max_map_tasks = maxMaps;
-    max_reduce_tasks = maxReduces;
-    this.state = state;
+                int maxReduces, JTState state) {
+    status.setTaskTrackers(trackers)
+    .setMapTasks(maps)
+    .setReduceTasks(reduces)
+    .setMaxMapTasks(maxMaps)
+    .setMaxReduceTasks(maxReduces)
+    .setState(state);
   }
@@ -86,7 +85,7 @@
    * @return the number of task trackers in the cluster.
    */
   public int getTaskTrackers() {
-    return task_trackers;
+    return status.getTaskTrackers();
   }
   /**
@@ -95,7 +94,7 @@
    * @return the number of currently running map tasks in the cluster.
    */
   public int getMapTasks() {
-    return map_tasks;
+    return status.getMapTasks();
   }
   /**
@@ -104,7 +103,7 @@
    * @return the number of currently running reduce tasks in the cluster.
    */
   public int getReduceTasks() {
-    return reduce_tasks;
+    return status.getReduceTasks();
   }
   /**
@@ -113,7 +112,7 @@
    * @return the maximum capacity for running map tasks in the cluster.
    */
   public int getMaxMapTasks() {
-    return max_map_tasks;
+    return status.getMaxMapTasks();
   }
   /**
@@ -122,7 +121,7 @@
    * @return the maximum capacity for running reduce tasks in the cluster.
    */
   public int getMaxReduceTasks() {
-    return max_reduce_tasks;
+    return status.getMaxReduceTasks();
   }
   /**
@@ -131,26 +130,17 @@
    * @return the current state of the <code>JobTracker</code>.
    */
-  public JobTracker.State getJobTrackerState() {
-    return state;
+  public JTState getJobTrackerState() {
+    return status.getState();
   }
   public void write(DataOutput out) throws IOException {
-    out.writeInt(task_trackers);
-    out.writeInt(map_tasks);
-    out.writeInt(reduce_tasks);
-    out.writeInt(max_map_tasks);
-    out.writeInt(max_reduce_tasks);
-    WritableUtils.writeEnum(out, state);
+    status.build().writeDelimitedTo(
+        DataOutputOutputStream.constructOutputStream(out));
   }
   public void readFields(DataInput in) throws IOException {
-    task_trackers = in.readInt();
-    map_tasks = in.readInt();
-    reduce_tasks = in.readInt();
-    max_map_tasks = in.readInt();
-    max_reduce_tasks = in.readInt();
-    state = WritableUtils.readEnum(in, JobTracker.State.class);
+    status = ClusterStatusProtos.ClusterStatus.parseDelimitedFrom((InputStream)in).toBuilder();
   }
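The write and readFields methods in the diff rely on protobuf's writeDelimitedTo and parseDelimitedFrom, which put the message size, encoded as a varint, in front of the message bytes so that the reader knows where the message ends. The following is a stdlib-only sketch of that framing idea, not protobuf's own code. The class and method names are made up for illustration, and the payload is an arbitrary byte array rather than an encoded message.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Illustrative varint length-delimited framing (hypothetical helper, not protobuf). */
class DelimitedFraming {

    /** Writes len as a base-128 varint: 7 bits per byte, high bit set on all but the last byte. */
    static void writeVarint(ByteArrayOutputStream out, int len) {
        while ((len & ~0x7F) != 0) {
            out.write((len & 0x7F) | 0x80);
            len >>>= 7;
        }
        out.write(len);
    }

    /** Reads a varint written by writeVarint; returns -1 on a clean end of stream. */
    static int readVarint(ByteArrayInputStream in) {
        int result = 0;
        for (int shift = 0; shift < 32; shift += 7) {
            int b = in.read();
            if (b < 0) {
                if (shift == 0) return -1;
                throw new IllegalStateException("truncated varint");
            }
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
        }
        throw new IllegalStateException("varint too long");
    }

    /** Writes the payload length as a varint, then the payload itself. */
    static void writeDelimited(ByteArrayOutputStream out, byte[] payload) {
        writeVarint(out, payload.length);
        out.write(payload, 0, payload.length);
    }

    /** Reads one length-prefixed payload; returns null when the stream is exhausted. */
    static byte[] readDelimited(ByteArrayInputStream in) {
        int len = readVarint(in);
        if (len < 0) return null;
        byte[] payload = new byte[len];
        int off = 0;
        while (off < len) {
            int n = in.read(payload, off, len - off);
            if (n < 0) throw new IllegalStateException("truncated payload");
            off += n;
        }
        return payload;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeDelimited(out, "first".getBytes(StandardCharsets.UTF_8));
        writeDelimited(out, "second".getBytes(StandardCharsets.UTF_8));
        ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray());
        // Each record is recovered separately even though both share one stream.
        for (byte[] p = readDelimited(in); p != null; p = readDelimited(in)) {
            System.out.println(new String(p, StandardCharsets.UTF_8));
        }
    }
}
```

The length prefix matters here because a Writable is read from a shared RPC stream: without it, a parser reading until end of stream could consume bytes that belong to whatever follows the message.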

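Note that protobuf only accepts InputStream/OutputStream objects for input and output, whereas Hadoop's serialization interfaces use DataOutput/DataInput, so a conversion is needed. For DataOutput to OutputStream, I used the constructOutputStream method of DataOutputOutputStream.java from HADOOP-7379, which is the call that appears in the diff. DataInput to InputStream is done with a plain cast, which only works when the object passed in really is an InputStream subclass such as DataInputStream; otherwise it fails with a ClassCastException. After adjusting JobTracker.java, LocalJobRunner.java, and a few other files accordingly, the code compiles and runs.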
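As a rough illustration of the two conversions, the sketch below adapts a DataOutput to an OutputStream and a DataInput to an InputStream using only the JDK. It is not the HADOOP-7379 class. StreamAdapters, asOutputStream, asInputStream, and roundTrip are hypothetical names, and the input-side adapter is an alternative to the plain cast for the case where the DataInput is not an InputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.EOFException;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

/** Hypothetical adapters between the Data* interfaces and java.io streams. */
class StreamAdapters {

    /** Returns out itself if it already is a stream, otherwise a forwarding wrapper. */
    static OutputStream asOutputStream(final DataOutput out) {
        if (out instanceof OutputStream) return (OutputStream) out;
        return new OutputStream() {
            @Override public void write(int b) throws IOException { out.write(b); }
            @Override public void write(byte[] b, int off, int len) throws IOException {
                out.write(b, off, len);
            }
        };
    }

    /** Returns in itself if it already is a stream, otherwise a byte-at-a-time wrapper. */
    static InputStream asInputStream(final DataInput in) {
        if (in instanceof InputStream) return (InputStream) in;
        return new InputStream() {
            @Override public int read() throws IOException {
                try {
                    return in.readUnsignedByte();
                } catch (EOFException e) {
                    return -1; // InputStream signals end of data with -1, not an exception
                }
            }
        };
    }

    /** Demo: RandomAccessFile implements DataInput/DataOutput but is not a stream. */
    static byte[] roundTrip(byte[] data) {
        try {
            File tmp = File.createTempFile("adapter", ".bin");
            tmp.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
                asOutputStream(raf).write(data);
                raf.seek(0);
                InputStream is = asInputStream(raf);
                ByteArrayOutputStream copy = new ByteArrayOutputStream();
                for (int b = is.read(); b >= 0; b = is.read()) copy.write(b);
                return copy.toByteArray();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new byte[] {1, 2, (byte) 0xFF}).length);
    }
}
```

The instanceof check keeps the common case cheap: when the caller hands over a DataOutputStream or DataInputStream, no wrapper object is created at all.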

Unfortunately, I did not get to measure whether the change brings any performance benefit.

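Problems you may run into:

Compile errors or run-time errors reporting that a class or method cannot be found.

Solutions:

1. Make sure protobuf-java-2.4.1.jar has been placed in Hadoop's lib directory;

2. Do a full rebuild: ant clean; ant.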
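When chasing a "class not found" error, it can help to check directly whether a class is visible and which jar it came from. The snippet below is a small sketch of such a check using only the JDK. ClasspathCheck and locate are hypothetical names, and com.google.protobuf.Message is used as the default probe on the assumption that the protobuf jar is the one in question.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic: reports where a class is loaded from, if at all. */
class ClasspathCheck {

    /** Returns the location the class was loaded from, or null if it cannot be found. */
    static String locate(String className) {
        try {
            // initialize=false: only look the class up, do not run its static initializers
            Class<?> c = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null ? "(JDK runtime)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "com.google.protobuf.Message";
        String where = locate(name);
        System.out.println(where == null
                ? name + " is NOT on the classpath"
                : name + " loaded from " + where);
    }
}
```

Run it with the same classpath that Hadoop uses; if the class is reported as missing, the jar is not where the build or the daemons look for it.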

References:

1、http://code.google.com/apis/protocolbuffers/docs/javatutorial.html

2、https://issues.apache.org/jira/browse/HADOOP-7379


2011-10-06 16:15