How Mahout calls Hadoop, and ideas for building a unified platform
Here is the execution part of the mahout script in Mahout's bin directory:
    if [ "$MAHOUT_JOB" = "" ] ; then
      echo "ERROR: Could not find mahout-examples-*.job in $MAHOUT_HOME or $MAHOUT_HOME/examples/target, please run 'mvn install' to create the .job file"
      exit 1
    else
      case "$1" in
      (hadoop)
        shift
        export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}:$CLASSPATH
        exec "$HADOOP_BINARY" "$@"
        ;;
      (classpath)
        echo $CLASSPATH
        ;;
      (*)
        echo "MAHOUT-JOB: $MAHOUT_JOB"
        export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
        exec "$HADOOP_BINARY" jar $MAHOUT_JOB $CLASS "$@"
      esac
    fi
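Clearly, mahout is calling hadoop to run the job: the default branch ends in exec "$HADOOP_BINARY" jar $MAHOUT_JOB $CLASS "$@".
Now look at the execution part of the hadoop script in Hadoop's bin directory: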
      exec "$JSVC" -Dproc_$COMMAND -outfile "$JSVC_OUTFILE" \
                   -errfile "$JSVC_ERRFILE" \
                   -pidfile "$HADOOP_SECURE_DN_PID" \
                   -nodetach \
                   -user "$HADOOP_SECURE_DN_USER" \
                   -cp "$CLASSPATH" \
                   $JAVA_HEAP_MAX $HADOOP_OPTS \
                   org.apache.hadoop.hdfs.server.datanode.SecureDataNodeStarter "$@"
    else
      # run it
      exec "$JAVA" -Dproc_$COMMAND $JAVA_HEAP_MAX $HADOOP_OPTS -classpath "$CLASSPATH" $CLASS "$@"
So hadoop either calls JSVC or uses java to load the Java class.
There is one catch, though: some Mahout algorithms do not depend on Hadoop.
    if [ ! -x "$HADOOP_BINARY" ] || [ "$MAHOUT_LOCAL" != "" ] ; then
      if [ ! -x "$HADOOP_BINARY" ] ; then
        echo "hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally"
      elif [ "$MAHOUT_LOCAL" != "" ] ; then
        echo "MAHOUT_LOCAL is set, running locally"
      fi
      CLASSPATH="${CLASSPATH}:${MAHOUT_HOME}/lib/hadoop/*"
      case $1 in
      (classpath)
        echo $CLASSPATH
        ;;
      (*)
        exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"
      esac
    else
      echo "Running on hadoop, using $HADOOP_BINARY and HADOOP_CONF_DIR=$HADOOP_CONF_DIR"

When the hadoop binary cannot be found or MAHOUT_LOCAL is set, the script skips Hadoop and uses java directly to load the class and run the algorithm.
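The local-versus-Hadoop decision in that excerpt comes down to two conditions. Here is a minimal sketch of just that check, pulled out into a function so it can be tried on its own. The function name pick_run_mode is made up for illustration and is not part of bin/mahout.

```shell
# Hypothetical helper, not part of bin/mahout: reports which mode the
# launcher would pick for a given hadoop binary path and MAHOUT_LOCAL value.
pick_run_mode() {
  hadoop_binary="$1"   # path the launcher resolved for the hadoop executable
  mahout_local="$2"    # value of MAHOUT_LOCAL (empty string means unset)
  # Same test as the excerpt: no executable hadoop, or MAHOUT_LOCAL non-empty.
  if [ ! -x "$hadoop_binary" ] || [ "$mahout_local" != "" ]; then
    echo "local"
  else
    echo "hadoop"
  fi
}
```

Calling pick_run_mode "$(command -v hadoop)" "$MAHOUT_LOCAL" on a server should print which branch the mahout script would take there, assuming hadoop is looked up through PATH.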
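So how can Mahout and Hadoop be used in a unified way without logging in to the server?
A quick, low-effort approach is to build a web site that takes the shell command to run as web parameters and then executes it on the server.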
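The server-side step of the web approach could look roughly like the sketch below. It only shows how two already-decoded request parameters might be turned into a command line. The function name build_command, the parameter names tool and args, and the allowed character set are all assumptions for illustration. MAHOUT_HOME and HADOOP_HOME are assumed to be set on the server.

```shell
# Hypothetical sketch: map two web parameters (tool, args) to a command line.
# Only mahout and hadoop are accepted, and args may contain only a small set
# of characters, so the request cannot smuggle in other shell commands.
build_command() {
  tool="$1"
  args="$2"
  case "$tool" in
    mahout) bin="$MAHOUT_HOME/bin/mahout" ;;
    hadoop) bin="$HADOOP_HOME/bin/hadoop" ;;
    *) echo "unsupported tool: $tool" >&2; return 1 ;;
  esac
  # Delete every allowed character; anything left over is rejected.
  rest=$(printf '%s' "$args" | tr -d 'A-Za-z0-9 ._/=:-')
  if [ -n "$rest" ]; then
    echo "illegal characters in args" >&2
    return 1
  fi
  echo "$bin $args"
}
```

A CGI script or any web framework handler could pass its decoded parameters to this function, run the resulting command, and return the output to the browser. The whitelist is there because executing raw web input as a shell command would let any visitor run arbitrary programs on the server.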
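A more laborious approach is to write a Java program that replaces the existing shell scripts in Mahout and Hadoop and manages the jar packages directly. This requires some understanding of the internals of both projects, or at least a close reading of their launch scripts.
I lean entirely toward the first approach, especially since I am not fluent in Java and would rather not read through the shell scripts.
That said, the second approach is more extensible.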