《机器学习》周志华 习题答案7.3
运用贝叶斯方法对西瓜数据集进行分类,同理代码如下:
file1 = open('c:\quant\watermelon.csv','r')
data = [line.strip('\n').split(',') for line in file1]
data = np.array(data)
X = [[float(raw[-7]),float(raw[-6]),float(raw[-5]),float(raw[-4]),float(raw[-3]), float(raw[-2])] for raw in data[1:,1:-1]]
#X = [[float(raw[-3]), float(raw[-2])] for raw in data[1:]]
y = [1 if raw[-1]=='1' else 0 for raw in data[1:]]
X = np.array(X)
y = np.array(y)
from sklearn import datasets
iris = datasets.load_iris()
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(X, y).predict(X)
print("Number of mislabeled points out of a total %d points : %d"
% (X.shape[0],(y != y_pred).sum()))
print y
print y_pred
结果如下:Number of mislabeled points out of a total 17 points : 2
[1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0]
[1 1 1 1 1 1 0 1 0 0 0 0 0 0 1 0 0]
如果选取的属性过小,则分类的错误率会增加。