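Concepts:
Bayes' theorem: Bayes' theorem is named after Thomas Bayes, an 18th-century theologian. In general, the probability of event A given that event B has occurred differs from the probability of event B given that event A has occurred. The two are nevertheless related in a definite way, and Bayes' theorem is the statement of that relationship.
Naive Bayes: Naive Bayes is a classification method based on Bayes' theorem and the assumption that the features are conditionally independent given the class. For a given training data set, it first learns the joint probability distribution of inputs and outputs under this independence assumption; then, for a given input x, it uses Bayes' theorem to find the output y with the highest posterior probability (the maximum a posteriori, or MAP, decision).
Put simply: given a data set and a new, unclassified sample, look at the samples in the data set that share the new sample's feature values, use them to compute the probability of each class, and assign the new sample to the class with the highest probability.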
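The decision rule described above can be written compactly as follows. This is the standard textbook form rather than anything specific to this post; the symbols x_1, ..., x_n (the feature values of the input x) and \hat{y} (the predicted class) are notation introduced here for illustration.

```latex
% Conditional independence lets the likelihood factor into per-feature terms
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)

% MAP decision: pick the class that maximizes the (unnormalized) posterior
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The denominator of Bayes' theorem is the same for every class, so it can be dropped when only the arg max is needed.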
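Formula:
P(h | d) = P(d | h) * P(h) / P(d)
where:
P(h | d): the probability of hypothesis h given the data d, called the posterior probability
P(d | h): the probability of the data d given that hypothesis h is true
P(h): the probability that hypothesis h is true regardless of the data, called the prior probability of h
P(d): the probability of the data d regardless of the hypothesis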
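To make the formula concrete, here is a minimal numeric sketch. The probabilities are hypothetical values chosen only for illustration, and the helper name posterior is not part of the original code. P(d) is obtained from the law of total probability over h and not-h.

```python
def posterior(prior_h, likelihood_d_given_h, evidence_d):
    """Return P(h | d) = P(d | h) * P(h) / P(d)."""
    return likelihood_d_given_h * prior_h / evidence_d


# Hypothetical numbers, for illustration only
p_h = 0.01               # P(h): prior probability of the hypothesis
p_d_given_h = 0.90       # P(d | h): probability of the data if h is true
p_d_given_not_h = 0.05   # P(d | not h): probability of the data if h is false

# P(d) by the law of total probability
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

p_h_given_d = posterior(p_h, p_d_given_h, p_d)
print(round(p_h_given_d, 4))  # 0.1538
```

Even with a likelihood of 0.90, the posterior stays low here because the prior P(h) is small.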
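Implementation steps:
1 Handle data: load the data and split it into a training set and a test set
2 Summarize data: compute per-class statistics of the training data so that probabilities can be calculated and predictions made later
3 Predict: make predictions for the test data using the summaries of the training data
4 Evaluate accuracy: compare the predictions against the test data's known labels to measure accuracy
Code: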
# Example of Naive Bayes implemented from scratch (Python 3)
import csv
import math
import random


def loadCsv(filename):
    # Read the CSV file and convert every field to a float
    with open(filename, newline='') as f:
        return [[float(x) for x in row] for row in csv.reader(f) if row]


def splitDataset(dataset, splitRatio):
    # Randomly move rows into the training set until it reaches the requested size
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]


def separateByClass(dataset):
    # Group rows by class label, which is stored in the last column
    separated = {}
    for vector in dataset:
        if vector[-1] not in separated:
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated


def mean(numbers):
    return sum(numbers) / float(len(numbers))


def stdev(numbers):
    # Sample standard deviation (divides by n - 1)
    avg = mean(numbers)
    variance = sum((x - avg) ** 2 for x in numbers) / float(len(numbers) - 1)
    return math.sqrt(variance)


def summarize(dataset):
    # (mean, stdev) for each column; the last entry belongs to the class label and is dropped
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries


def summarizeByClass(dataset):
    # Map each class label to the per-attribute summaries of its rows
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries


def calculateProbability(x, mean, stdev):
    # Gaussian probability density of x for the given mean and standard deviation
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent


def calculateClassProbabilities(summaries, inputVector):
    # Multiply the per-attribute densities for each class (the independence assumption)
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mu, sigma = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculateProbability(x, mu, sigma)
    return probabilities


def predict(summaries, inputVector):
    # Return the class label with the largest score
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel


def getPredictions(summaries, testSet):
    # Predict a label for every row of the test set
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions


def getAccuracy(testSet, predictions):
    # Percentage of test rows whose predicted label matches the true label
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0


def main():
    filename = 'pima-indians-diabetes.data.csv'
    splitRatio = 0.67
    dataset = loadCsv(filename)
    trainingSet, testSet = splitDataset(dataset, splitRatio)
    print('Split {0} rows into train={1} and test={2} rows'.format(len(dataset), len(trainingSet), len(testSet)))
    # prepare model
    summaries = summarizeByClass(trainingSet)
    # test model
    predictions = getPredictions(summaries, testSet)
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: {0}%'.format(accuracy))


if __name__ == '__main__':
    main()
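The prediction step can be checked by hand on a tiny case that needs no CSV file. The sketch below is self-contained and mirrors the Gaussian density and the per-class product used by calculateProbability and calculateClassProbabilities; the function name gaussian_pdf, the summaries dictionary, and the input value are hypothetical and chosen only for illustration.

```python
import math


def gaussian_pdf(x, mean, stdev):
    """Density of a normal distribution with the given mean and stdev, evaluated at x."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)


# Hypothetical summaries: {class label: [(mean, stdev) for each feature]}
summaries = {0: [(1.0, 0.5)], 1: [(20.0, 5.0)]}
input_vector = [1.1]

# Score each class as the product of its per-feature densities
scores = {}
for label, feature_stats in summaries.items():
    score = 1.0
    for value, (mean, stdev) in zip(input_vector, feature_stats):
        score *= gaussian_pdf(value, mean, stdev)
    scores[label] = score

best_label = max(scores, key=scores.get)
print(best_label)  # 0
```

The scores are products of densities, not normalized probabilities, so they need not sum to 1; only their ranking matters for picking the class. Note also that this product, like the one in the listing, does not multiply in the class prior P(h).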
Download address for pima-indians-diabetes.data.csv:
https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv
References:
1 https://en.wikipedia.org/wiki/Naive_Bayes_classifier
2 https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/
3 https://machinelearningmastery.com/naive-bayes-for-machine-learning/
Author: 虚生 Source: https://www.cnblogs.com/dylancao/