Installing Mahout and running the 20newsgroup example

Installation

1. Download and unpack the Mahout archive

2. Configure the environment variables

3. Test the installation
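
The post only names the three steps, so here is a minimal sketch of step 2. It assumes Mahout was unpacked to $HOME/mahout, which is a hypothetical location (the post does not give one); adjust it to wherever the archive actually went.

```shell
# Assumption: Mahout was unpacked to $HOME/mahout; change this to the real location.
export MAHOUT_HOME="$HOME/mahout"
# Put the launcher script (bin/mahout) on the PATH.
export PATH="$PATH:$MAHOUT_HOME/bin"
echo "MAHOUT_HOME=$MAHOUT_HOME"
```

For step 3, running mahout with no arguments should print the list of available programs if the variables are set correctly.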

Running the 20newsgroup example

Preparation

Download the dataset from http://qwone.com/~jason/20Newsgroups/

http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz

Unpack and upload to Hadoop

Unpack the archive into the 20news directory
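
The download and unpack steps could be scripted roughly as follows. This is a sketch, not the author's exact procedure: the helper name fetch_20news is hypothetical, and curl is assumed to be installed (wget would work just as well).

```shell
URL=http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
ARCHIVE=${URL##*/}   # strips everything up to the last slash

# Hypothetical helper: download (if needed) and unpack into ./20news.
fetch_20news() {
  # Skip the download when the archive is already present.
  [ -f "$ARCHIVE" ] || curl -fLO "$URL"
  mkdir -p 20news
  # The archive contains 20news-bydate-train and 20news-bydate-test.
  tar -xzf "$ARCHIVE" -C 20news
}

echo "call fetch_20news to download and unpack $ARCHIVE into ./20news"
```

The function is only defined here; nothing is downloaded until fetch_20news is called.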

mkdir 20news-all
cp -R 20news/*/* 20news-all
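
The cp command merges the training and test folders of each newsgroup into one directory per newsgroup. A quick sanity check is to count the files afterwards. The sketch below uses a hypothetical helper, count_docs; the dataset page lists 18846 documents for the bydate archive, so a count close to that suggests the copy worked.

```shell
# Count regular files under a directory (one file per newsgroup post).
count_docs() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Only report when the directory created above exists.
if [ -d 20news-all ]; then
  echo "documents in 20news-all: $(count_docs 20news-all)"
fi
```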

Create a directory in HDFS and upload the unpacked files to it

hadoop fs -mkdir ./20news

hadoop fs -put /home/hduser/20news-all ./20news
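
A relative HDFS path such as ./20news resolves against the user's HDFS home directory, so for the user hduser the upload ends up under /user/hduser/20news, which is the path the Mahout commands use. A small check, assuming the hadoop client is on the PATH (the variable name HDFS_INPUT is only for illustration):

```shell
# HDFS path that the seqdirectory step reads from.
HDFS_INPUT=/user/hduser/20news/20news-all

if command -v hadoop >/dev/null 2>&1; then
  # Lists the uploaded newsgroup directories; fails if the path is missing.
  hadoop fs -ls "$HDFS_INPUT"
  # Prints directory count, file count, content size and path.
  hadoop fs -count "$HDFS_INPUT"
else
  echo "hadoop not found on PATH; skipping HDFS check"
fi
```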

1. Creating sequence files from 20newsgroups data

mahout seqdirectory -i /user/hduser/20news/20news-all -o /user/hduser/20news/20news-seq -ow

2. Converting sequence files to vectors

mahout seq2sparse -i /user/hduser/20news/20news-seq -o /user/hduser/20news/20news-vectors -lnorm -nv -wt tfidf
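
In this command, -lnorm log-normalizes the output vectors, -nv writes them as NamedVectors, and -wt tfidf selects TF-IDF weighting. To see what the job produced, the output directory can be listed. This is a sketch assuming the hadoop client is on the PATH; the variable names are illustrative, and the exact set of entries may vary by Mahout version.

```shell
VEC=/user/hduser/20news/20news-vectors
TFIDF="$VEC/tfidf-vectors"   # the directory the split step takes as input

if command -v hadoop >/dev/null 2>&1; then
  # Expected entries include tf-vectors, tfidf-vectors, dictionary.file-0,
  # df-count and tokenized-documents.
  hadoop fs -ls "$VEC"
else
  echo "hadoop not found on PATH; skipping listing of $VEC"
fi
```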

3. Creating training and holdout sets with a random 60-40 split of the generated vector dataset (--randomSelectionPct 40 sends about 40% of the vectors to the holdout set)

mahout split -i /user/hduser/20news/20news-vectors/tfidf-vectors --trainingOutput /user/hduser/20news/20news-train-vectors --testOutput /user/hduser/20news/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
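
The command passes --randomSelectionPct 40, the percentage of items randomly selected as test data. As a rough worked example, assuming the 18846 documents listed on the dataset page all made it into the vector set, the expected sizes are:

```shell
total=18846   # document count listed on the dataset page for the bydate archive
pct=40        # value passed to --randomSelectionPct

test_docs=$(( total * pct / 100 ))   # integer division
train_docs=$(( total - test_docs ))

echo "expected holdout size:  about $test_docs"
echo "expected training size: about $train_docs"
```

Because the selection is random, the actual counts will differ somewhat from these figures.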

4. Training Naive Bayes model

mahout trainnb -i /user/hduser/20news/20news-train-vectors -el -o /user/hduser/20news/model -li /user/hduser/20news/labelindex -ow

5. Self testing on training set

mahout testnb -i /user/hduser/20news/20news-train-vectors -m /user/hduser/20news/model -l /user/hduser/20news/labelindex -ow -o /user/hduser/20news/20news-testing

6. Testing on holdout set

mahout testnb -i /user/hduser/20news/20news-test-vectors -m /user/hduser/20news/model -l /user/hduser/20news/labelindex -ow -o /user/hduser/20news/20news-testing
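
The trainnb and testnb jobs also accept a -c flag, which switches from standard to complementary naive Bayes. A wrapper script might pick the flag from an algorithm name, as in the sketch below. The names nb_flag, alg and the value cnaivebayes are hypothetical and only for illustration.

```shell
# Return the extra flag for the chosen algorithm (names are illustrative).
nb_flag() {
  case "$1" in
    cnaivebayes) echo "-c" ;;   # complementary naive Bayes
    *)           echo ""   ;;   # standard naive Bayes: no extra flag
  esac
}

alg=naivebayes
c=$(nb_flag "$alg")
echo "extra flag for $alg: '${c}'"
```

Whatever value is chosen should be appended to both the training and the testing commands, so the model is tested the same way it was trained. With an empty value the commands run standard naive Bayes.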

posted on 2013-12-13 13:04  ukouryou