Installing Mahout and running the 20newsgroup example
Installation
1. Download and extract
2. Configure environment variables
3. Test
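The notes do not say which variables to set, so the following is only a minimal sketch of steps 2 and 3. It assumes Mahout was extracted to $HOME/mahout (a placeholder path, adjust it to the real location) and that the launcher script lives in its bin directory.

```shell
# Assumed install location; change this to wherever the archive was extracted.
export MAHOUT_HOME="${MAHOUT_HOME:-$HOME/mahout}"
export PATH="$MAHOUT_HOME/bin:$PATH"

# Quick test: report whether the mahout launcher can now be found on the PATH.
if command -v mahout >/dev/null 2>&1; then
    echo "mahout found at $(command -v mahout)"
else
    echo "mahout not found; check MAHOUT_HOME=$MAHOUT_HOME"
fi
```

To make the setting permanent, the two export lines would typically go into the shell profile of the user that runs the jobs (hduser in the commands of these notes).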
Running the 20newsgroup example
Preparation
Download the dataset from http://qwone.com/~jason/20Newsgroups/
http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
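Extract and upload to Hadoop
Extract the archive into the 20news directory, then merge the category folders into 20news-all: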
mkdir 20news-all
cp -R 20news/*/* 20news-all
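To see what the cp -R 20news/*/* merge does, here is a small sketch on a toy directory. It assumes the archive extracts into a training folder and a test folder that each contain one subfolder per newsgroup; the file names 1001, 2001 and 2002 are made up for the demonstration.

```shell
# Toy stand-in for the extracted archive: two splits, each with category folders.
demo=$(mktemp -d)
mkdir -p "$demo/20news/20news-bydate-train/sci.space" \
         "$demo/20news/20news-bydate-test/sci.space" \
         "$demo/20news/20news-bydate-test/rec.autos"
echo "train post" > "$demo/20news/20news-bydate-train/sci.space/1001"
echo "test post"  > "$demo/20news/20news-bydate-test/sci.space/2001"
echo "test post"  > "$demo/20news/20news-bydate-test/rec.autos/2002"

# Same merge as in the notes, run inside the toy directory.
cd "$demo" || exit 1
mkdir 20news-all
cp -R 20news/*/* 20news-all

# List the result: each category folder holds the posts from both splits.
find 20news-all -type f | sort
```

The train/test separation of the archive is dropped on purpose here, because the later split step draws its own random holdout set.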
Create a directory in Hadoop and upload the extracted files to it:
hadoop fs -mkdir ./20news
hadoop fs -put /home/hduser/20news-all ./20news
1. Creating sequence files from 20newsgroups data
mahout seqdirectory -i /user/hduser/20news/20news-all -o /user/hduser/20news/20news-seq -ow
2. Converting sequence files to vectors
mahout seq2sparse -i /user/hduser/20news/20news-seq -o /user/hduser/20news/20news-vectors -lnorm -nv -wt tfidf
3. Creating training and holdout sets with a random 60-40 split of the generated vector dataset (--randomSelectionPct 40 holds out 40% of the vectors for testing)
mahout split -i /user/hduser/20news/20news-vectors/tfidf-vectors --trainingOutput /user/hduser/20news/20news-train-vectors --testOutput /user/hduser/20news/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
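4. Training Naive Bayes model ($c is a shell variable: leave it empty for standard Naive Bayes, or set it to -c for Complementary Naive Bayes)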
mahout trainnb -i /user/hduser/20news/20news-train-vectors -el -o /user/hduser/20news/model -li /user/hduser/20news/labelindex -ow $c
5. Self-testing on training set
mahout testnb -i /user/hduser/20news/20news-train-vectors -m /user/hduser/20news/model -l /user/hduser/20news/labelindex -ow -o /user/hduser/20news/20news-testing $c
6. Testing on holdout set (this writes to the same output directory as step 5, so -ow replaces the self-test results)
mahout testnb -i /user/hduser/20news/20news-test-vectors -m /user/hduser/20news/model -l /user/hduser/20news/labelindex -ow -o /user/hduser/20news/20news-testing $c
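The six commands above repeat the same HDFS prefix and depend on $c being set in the shell. The following is a minimal sketch of a wrapper that factors both out. The names WORK_DIR, DRY_RUN, run and nb_pipeline are illustrative and not part of Mahout; the mahout arguments are copied unchanged from the steps above. With DRY_RUN left at its default of 1 the script only prints the commands.

```shell
# HDFS working directory used throughout the steps above.
WORK_DIR="${WORK_DIR:-/user/hduser/20news}"
# Empty = standard Naive Bayes; "-c" = Complementary Naive Bayes.
c="${c:-}"
# Hypothetical switch: 1 prints the commands only, 0 also executes them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

nb_pipeline() {
    run mahout seqdirectory -i "$WORK_DIR/20news-all" -o "$WORK_DIR/20news-seq" -ow &&
    run mahout seq2sparse -i "$WORK_DIR/20news-seq" -o "$WORK_DIR/20news-vectors" \
        -lnorm -nv -wt tfidf &&
    run mahout split -i "$WORK_DIR/20news-vectors/tfidf-vectors" \
        --trainingOutput "$WORK_DIR/20news-train-vectors" \
        --testOutput "$WORK_DIR/20news-test-vectors" \
        --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential &&
    # $c is left unquoted on purpose so that an empty value adds no argument.
    run mahout trainnb -i "$WORK_DIR/20news-train-vectors" -el \
        -o "$WORK_DIR/model" -li "$WORK_DIR/labelindex" -ow $c &&
    run mahout testnb -i "$WORK_DIR/20news-train-vectors" -m "$WORK_DIR/model" \
        -l "$WORK_DIR/labelindex" -ow -o "$WORK_DIR/20news-testing" $c &&
    run mahout testnb -i "$WORK_DIR/20news-test-vectors" -m "$WORK_DIR/model" \
        -l "$WORK_DIR/labelindex" -ow -o "$WORK_DIR/20news-testing" $c
}

nb_pipeline
```

Setting DRY_RUN=0 runs the steps for real, stopping at the first one that fails; setting c=-c beforehand switches both training and testing to the complementary variant.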