FisherLDA

 一  问题描述

• 在 UCI 数据集商的 iris 和 sonar 数据上验证算法的有效性;

• 训练和测试样本有三种方式进行划分:

1)将数据随机分训练样本和测试,多次平均求结果

2)k 折交叉验证

3)留 1 法

二  解决问题的方法
※ 总体概览和声明:

用 fisher 判别分析法训练数据并进行测试。

所用编程语言:Python 所用样本划分方法:K 折交叉验证

所画结果分为两类:iris 数据集的测试样本分布图。

Sonar 数据集经过 3,10,20,30, 40,50,60 维训练后准确度折线图。

1. Fisher 判别分析原理和流程 原理:Fisher linear discriminant analysis(简称 Fisher LDA)是由 FSHER 在 1936 年提出来 的方法。主要思想是将高维样本通过数学变换投影到一维空间中再对其进行分类。如图:

我们通过计算类内离散度矩阵和类间离散度矩阵来判别样本。

假设有N个样本为分别为X=[X1,X2 ……… Xn];用来区分二分类的直线(投影函数)为:

类均值向量为 :

定义各类类内离散度矩阵以及总类内离散度矩阵为:

类间离散度矩阵为:

投影后,均值变为:

类内离散度矩阵变为一个值(对于两类问题):

 

 

类间离散度矩阵变为两类均值差的平方:

此处m上应该有波浪号

通过化简可以求得Fisher判别准则变为:式1.1

 

 

我们的目的就是求使得式1.1最大的投影方向W。因为W幅值的调节不会影响其方向,因此该问题变为一个优化问题:

Max   WT*Sb*W(WT是W的转置)

s.t.    WT*Sw*W=c(c不等于0)

此时,可引入拉格朗日乘子使该问题转化为一个无约束的优化问题:

L(w,d)= WT*Sb*W -d(WT*Sw*W-c)

通过化简可以得到:W*=Sw求逆*(m1-m2)

代码流程:

1, 取数据,并将其划分为训练集和测试集,此处使用K折交叉验证,通过调参,最终发现7折或8折交叉准确率最大。此处取7折交叉结果作为示例。划分数据的代码存于(util)文件中。对样本标签进行编码。

2, 计算d维的各类的类均值向量Mi

3, 计算类内离散度矩阵和类间离散度矩阵向量

4, 求解矩阵广义特征值问题

5, 为新特征子空间选择线性判别法

6, 将样本转化为新的子空间

7, 计算错误率和准确率

8, 画出训练集的分布关系(只针对iris数据集)

9, 改变样本维度,循环。最后计算所有准确率的平均数(只针对sonar数据集)

10,  画出准确率随着样本维度增加的折线图(只针对sonar数据集)

Iris数据集代码文件:fisherLDA主要程序, util1将数据划分为数据集和训练集的程序, load数据输入转换成矩阵的程序

iris数据集文件:fisherLDA1, util2, load1功能与上面一一对应

这些代码多数来源于github,如有需要请前去搜索,很容易搜到。笔者只进行了一些改动来满足不同的数据。代码环境,pycharm编辑器。win10,64位anaconda,记得将util和load类型的文件保存在Lib当中,python3.6

  1 #!/usr/bin/python3
  2 from load import loader as loader    #作者自己写的程序
  3 from sklearn.preprocessing import LabelEncoder
  4 import numpy as np
  5 from log import log                  #作者自己写的程序
  6 from matplotlib import pyplot as plt
  7 import util1 as util1                #作者自己写的程序
  8 
  9 #第一个函数,计算均值向量
 10 def computeMeanVec(X, y, uniqueClass):
 11     """
 12     Step 1: Computing the d-dimensional mean vectors for different class 计算均值向量
 13     """
 14     np.set_printoptions(precision=4)#保留四位小数
 15 
 16     mean_vectors = []
 17     for cl in range(1,len(uniqueClass)+1):
 18         mean_vectors.append(np.mean(X[y==cl], axis=0)) #axis = 0:压缩行,对各列求均值,返回 1* n 矩阵
 19         log('Mean Vector class %s: %s\n' %(cl, mean_vectors[cl-1]))#
 20     return mean_vectors
 21 
 22 #第二个函数,计算类内离散度矩阵
 23 def computeWithinScatterMatrices(X, y, feature_no, uniqueClass, mean_vectors):
 24     # 2.1 Within-class scatter matrix
 25     S_W = np.zeros((feature_no, feature_no))#初始化矩阵吧
 26     for cl,mv in zip(range(1,len(uniqueClass)+1), mean_vectors):
 27         class_sc_mat = np.zeros((feature_no, feature_no))  # scatter matrix for every class
 28         for row in X[y == cl]:
 29             row, mv = row.reshape(feature_no,1), mv.reshape(feature_no,1)   # make column vectors
 30             class_sc_mat += (row-mv).dot((row-mv).T)
 31         S_W += class_sc_mat                                # sum class scatter matrices
 32     log('within-class Scatter Matrix: {}\n'.format(S_W))
 33 
 34     return S_W
 35 
 36 #计算类间离散度矩阵
 37 def computeBetweenClassScatterMatrices(X, y, feature_no, mean_vectors):
 38     # 2.2 Between-class scatter matrix
 39     overall_mean = np.mean(X, axis=0)
 40 
 41     S_B = np.zeros((feature_no, feature_no))
 42     for i,mean_vec in enumerate(mean_vectors):
 43         n = X[y==i+1,:].shape[0]
 44         mean_vec = mean_vec.reshape(feature_no, 1) # make column vector
 45         overall_mean = overall_mean.reshape(feature_no, 1) # make column vector
 46         S_B += n * (mean_vec - overall_mean).dot((mean_vec - overall_mean).T)
 47     log('between-class Scatter Matrix: {}\n'.format(S_B))
 48 
 49     return S_B
 50 
 51 def computeEigenDecom(S_W, S_B, feature_no):
 52     """
 53     Step 3: Solving the generalized eigenvalue problem for the matrix S_W^-1 * S_B
 54     求解矩阵 S_W^-1 * S_B 的广义特征值问题 
 55     """
 56     m = 10^-6 # add a very small value to the diagonal of your matrix before inversion
 57     #在反演之前向矩阵的对角线增加一个非常小的值
 58     # noinspection PyTypeChecker
 59     eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W+np.eye(S_W.shape[1])*m).dot(S_B))
 60 
 61     for i in range(len(eig_vals)):
 62         eigvec_sc = eig_vecs[:,i].reshape(feature_no, 1)
 63         log('\nEigenvector {}: \n{}'.format(i+1, eigvec_sc.real))
 64         log('Eigenvalue {:}: {:.2e}'.format(i+1, eig_vals[i].real))
 65 
 66     for i in range(len(eig_vals)):
 67         eigv = eig_vecs[:,i].reshape(feature_no, 1)
 68         # noinspection PyTypeChecker
 69         np.testing.assert_array_almost_equal(np.linalg.inv(S_W+np.eye(S_W.shape[1])*m).dot(S_B).dot(eigv),
 70                                              eig_vals[i] * eigv,
 71                                              decimal=6, err_msg='', verbose=True)
 72     log('Eigenvalue Decomposition OK')
 73 
 74     return eig_vals, eig_vecs
 75 
 76 def selectFeature(eig_vals, eig_vecs, feature_no):
 77     """
 78     Step 4: Selecting linear discriminants for the new feature subspace
 79     """
 80     # 4.1. Sorting the eigenvectors by decreasing eigenvalues
 81     # Make a list of (eigenvalue, eigenvector) tuples
 82     eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
 83 
 84     # Sort the (eigenvalue, eigenvector) tuples from high to low by the value of eigenvalue
 85     eig_pairs = sorted(eig_pairs, key=lambda k: k[0], reverse=True)
 86 
 87     log('Eigenvalues in decreasing order:\n')
 88     for i in eig_pairs:
 89         log(i[0])
 90 
 91     log('Variance explained:\n')
 92     eigv_sum = sum(eig_vals)
 93     for i,j in enumerate(eig_pairs):
 94         log('eigenvalue {0:}: {1:.2%}'.format(i+1, (j[0]/eigv_sum).real))
 95 
 96     # 4.2. Choosing k eigenvectors with the largest eigenvalues - here I choose the first two eigenvalues
 97     W = np.hstack((eig_pairs[0][1].reshape(feature_no, 1), eig_pairs[1][1].reshape(feature_no, 1)))
 98     log('Matrix W: \n{}'.format(W.real))
 99 
100     return W
101 
102 def transformToNewSpace(X, W, sample_no, mean_vectors, uniqueClass):
103     """
104     Step 5: Transforming the samples onto the new subspace
105     """
106     X_trans = X.dot(W)
107     mean_vecs_trans = []
108     for i in range(len(uniqueClass)):
109         mean_vecs_trans.append(mean_vectors[i].dot(W))
110 
111     #assert X_trans.shape == (sample_no,2), "The matrix is not size of (sample number, 2) dimensional."
112 
113     return X_trans, mean_vecs_trans
114 
115 def computeErrorRate(X_trans, mean_vecs_trans, y):
116     """
117     Compute the error rate
118     """
119 
120     """
121     Project to the second largest eigenvalue
122     """
123     uniqueClass = np.unique(y)
124     threshold = 0
125     for i in range(len(uniqueClass)):
126         threshold += mean_vecs_trans[i][1]
127     threshold /= len(uniqueClass)
128     log("threshold: {}".format(threshold))
129 
130     errors = 0
131     for (i,cl) in enumerate(uniqueClass):
132         label = cl
133         tmp = X_trans[y==label, 1]
134         # compute the error numbers for class i
135         num = len(tmp[tmp<threshold]) if mean_vecs_trans[i][1] > threshold else len(tmp[tmp>=threshold])
136         log("error rate in class {} = {}".format(i, num*1.0/len(tmp)))
137         errors += num
138 
139     errorRate = errors*1.0/X_trans.shape[0]
140     log("Error rate for the second largest eigenvalue = {}".format(errorRate))
141     log("Accuracy for the second largest eigenvalue = {}".format(1-errorRate))
142 
143 
144     """
145     Project to the largest eigenvalue - and return
146     """
147     uniqueClass = np.unique(y)
148     threshold = 0
149     for i in range(len(uniqueClass)):
150         threshold += mean_vecs_trans[i][0]
151     threshold /= len(uniqueClass)
152     log("threshold: {}".format(threshold))
153 
154     errors = 0
155     for (i,cl) in enumerate(uniqueClass):
156         label = cl
157         tmp = X_trans[y==label, 0]
158         # compute the error numbers for class i
159         num = len(tmp[tmp<threshold]) if mean_vecs_trans[i][0] > threshold else len(tmp[tmp>=threshold])
160         log("error rate in class {} = {}".format(i, num*1.0/len(tmp)))
161         errors += num
162 
163     errorRate = errors*1.0/X_trans.shape[0]
164     log("Error rate = {}".format(errorRate))
165     log("Accuracy = {}".format(1-errorRate))
166 
167     return 1-errorRate, threshold
168 
169 
170 
171 
172 def plot_step_lda(X_trans, y, label_dict, uniqueClass, dataset, threshold):
173 
174     ax = plt.subplot(111)
175     for label,marker,color in zip(range(1, len(uniqueClass)+1),('^', 's','d'),('blue', 'red','green')):
176         plt.scatter(x=X_trans[:,0].real[y == label],
177                     y=X_trans[:,1].real[y == label],
178                     marker=marker,
179                     color=color,
180                     alpha=0.5,
181                     label=label_dict[label]
182                     )
183 
184     plt.xlabel('LDA1')
185     plt.ylabel('LDA2')
186 
187     leg = plt.legend(loc='upper right', fancybox=True)
188     #loc 图例所有figure位置,右上。 fancybox控制是否应在构成图例背景的FancyBboxPatch周围启用圆边
189     leg.get_frame().set_alpha(0.5)
190     plt.title('Fisher LDA: {} data projection onto the first 2 linear discriminants'.format(dataset))
191 
192     # plot the the threshold line  绘出分界线
193     [bottom, up] = ax.get_ylim()
194     #plt.axvline(x=threshold.real, ymin=bottom, ymax=0.3, linewidth=2, color='k', linestyle='--')
195     plt.axvline(threshold.real, linewidth=2, color='g')
196 
197     # hide axis ticks
198     plt.tick_params(axis="both", which="both", bottom="off", top="off",
199             labelbottom="on", left="off", right="off", labelleft="on")
200 
201     # remove axis spines
202     ax.spines["top"].set_visible(False)
203     ax.spines["right"].set_visible(False)
204     ax.spines["bottom"].set_visible(False)
205     ax.spines["left"].set_visible(False)
206 
207     plt.grid()
208     #plt.tight_layout
209     plt.show()
210 
211 def mainFisherLDAtest(dataset='iris', alpha=0.5):  #def mainFisherLDAtest(dataset='sonar', alpha=0.5):
212     # load data
213     path = dataset + '/' + dataset + '.data'
214     load = loader(path)#取数据
215     [X, y] = load.load()
216     # noinspection PyUnresolvedReferences
217     [X, y, testX, testY] = util1.divide(X, y, alpha)#训练集和训练集标签,测试集和测试集标签
218     X = np.array(X) #数组
219     testX = np.array(testX)#测试集
220 
221     feature_no = X.shape[1] # define the dimension读取矩阵的维度,返回列数也就是样本维度
222     print(feature_no)
223     sample_no = X.shape[0] # define the sample number返回行数也就是样本个数
224 
225     # preprocessing
226     enc = LabelEncoder()
227     label_encoder = enc.fit(y)#对Y进行编码
228 
229 
230     y = label_encoder.transform(y) + 1#将Y变成正数
231     print(y)
232     testY = label_encoder.transform(testY) + 1#与上文一样
233     # noinspection PyTypeChecker
234     uniqueClass = np.unique(y) # define how many class in the outputs确定有几类输出
235     label_dict = {}   # define the label name
236     for i in range(1, len(uniqueClass)+1):
237         label_dict[i] = "Class"+str(i)
238     log(label_dict)       #打印类别
239 
240     # Step 1: Computing the d-dimensional mean vectors for different class
241     mean_vectors = computeMeanVec(X, y, uniqueClass)
242 
243 
244     # Step 2: Computing the Scatter Matrices
245     S_W = computeWithinScatterMatrices(X, y, feature_no, uniqueClass, mean_vectors)
246     S_B = computeBetweenClassScatterMatrices(X, y, feature_no, mean_vectors)
247 
248     # Step 3: Solving the generalized eigenvalue problem for the matrix S_W^-1 * S_B
249     eig_vals, eig_vecs = computeEigenDecom(S_W, S_B, feature_no)
250 
251     # Step 4: Selecting linear discriminants for the new feature subspace为新特征子空间选择线性判别法
252     W = selectFeature(eig_vals, eig_vecs, feature_no)
253 
254     # Step 5: Transforming the samples onto the new subspace将样本转化为新的子空间
255     X_trans, mean_vecs_trans = transformToNewSpace(testX, W, sample_no, mean_vectors, uniqueClass)
256 
257 
258     # Step 6: compute error rate
259     accuracy, threshold = computeErrorRate(X_trans, mean_vecs_trans, testY)
260 
261 
262     # plot
263     plot_step_lda(X_trans, testY, label_dict, uniqueClass, dataset, threshold)
264     #原文这句话被注释了plot_step_lda(X_trans, testY, label_dict, uniqueClass, dataset, threshold)
265 
266     return accuracy
267 
268 
269 
270 
271 if __name__ == "__main__":
272     dataset = ['ionosphere', 'iris']  # choose the dataset
273     alpha = 0.5  # choose the train data percentage
274     accuracy = mainFisherLDAtest(dataset[1], alpha)
275     print(accuracy)

iris数据导入代码和分割为训练集和测试集代码

 1 import pandas as pd
 2 
 3 
 4 
 5 class loader(object):
 6 
 7 
 8 
 9     def __init__(self,file_name, split_token = ','):#def __init__(self,file_name , split_token = ','):
10 
11         self.file_name = 'iris/iris.data'
12 
13         self.split_token = split_token
14 
15 
16 
17     def load(self):
18 
19         df = pd.io.parsers.read_csv(
20 
21             filepath_or_buffer='请在此处写上自己数据的本地路径', #  原文 filepath_or_buffer = self.file_name,笔者用的windows,在路径中用了反斜杠
22 
23                             header = None,
24 
25                             sep = self.split_token)     #原文 sep = self.split_token)
26 
27 
28         print(df)
29         print(df.shape[1])
30         self.dimension = df.shape[1]-1
31         print(self.dimension)
32 
33         df.columns = list(range(self.dimension)) + ['label']     #原文df.columns = range(self.dimension) + ['label']
34         print(df.columns)
35         df.dropna(how = 'all', inplace = True) # to drop the empty line at file-end 滤除缺失数据
36 
37         df.tail() #查看低端数据
38 
39 
40 
41         self.X = df[list(range(self.dimension))].values           # 原文self.X = df[range(self.dimension)].values
42 
43         self.Y = df['label'].values
44 
45         distinct_label = list(set(self.Y))        #创建分类标签列表
46 
47 
48 
49         if len(distinct_label) != 3:           #原文 if len(distinct_label) != 2:
50 
51             raise Exception('Three Labels Required\n','Label Sets:\n',distinct_label)
52             #原文 raise Exception('Two Labels Required\n','Label Sets:\n',distinct_label)
53             #raise Exception是一个函数,用于线程出现异常时抛出错误
54 
55 
56         self.Y = list(map(lambda x: {distinct_label[0]:1, distinct_label[1]:2,distinct_label[2]:3}[x], self.Y))
57         #原文 self.Y = map(lambda x: {distinct_label[0]: 1, distinct_label[1]: -1}[x], self.Y)
58         print(self.Y)
59 
60         return [self.X, self.Y]
61 
62 
63 
64 if __name__ == "__main__":
65 
66     loader = loader('iris/iris.data')        #原文 loader = loader('ionosphere/ionosphere.data')
67 
68     [X, y] = loader.load()
69 
70     print (X.shape)#X是整个数据的大小,Y是标签,共有三个标签
  1 import random
  2 
  3 import math
  4 
  5 import numpy as np
  6 
  7 #将数据分为训练集和数据集
  8 
  9 def divide(dataX, dataY, alpha):#x是数据,y是标签
 10 
 11     [positiveX, negativeX,X2] = split(dataX, dataY)#分出三类
 12 
 13     posNum1 = int(len(positiveX) * alpha)  # alpha=0.5
 14 
 15     negNum1 = int(len(negativeX) * alpha)
 16 
 17     X2Num1 = int(len(X2) * alpha)
 18 
 19     posNum2 = len(positiveX) - posNum1  # 将第一类,第二类,第三类都平均分为两份,一份测试集,一份训练集,这里采用九折交叉
 20 
 21     negNum2 = len(negativeX) - negNum1
 22 
 23     X2Num2 = len(X2) - X2Num1
 24 
 25 
 26 
 27     posOrder = np.random.permutation(len(positiveX))#分别给两类标上序号
 28 
 29     negOrder = np.random.permutation(len(negativeX))
 30 
 31     X2Order = np.random.permutation(len(X2))
 32 
 33     dataX1 = []#训练集数据
 34 
 35     dataY1 = []#训练集标签
 36 
 37     dataX2 = []#测试集数据
 38 
 39     dataY2 = []#测试集标签
 40 
 41 
 42 
 43 
 44     for i in range(posNum1):                            #原文是for i in xrange(posNum1):但是xrange只能用于python2.x,以下全部替换了
 45 
 46         dataX1.append(positiveX[posOrder[i]])
 47 
 48         dataY1.append(1)
 49 
 50     for i in range(posNum2):
 51 
 52         dataX2.append(positiveX[posOrder[i + posNum1]])
 53 
 54         dataY2.append(1)
 55 
 56     for i in range(negNum1):
 57 
 58         dataX1.append(negativeX[negOrder[i]])
 59 
 60         dataY1.append(-1)
 61 
 62     for i in range(negNum2):
 63 
 64         dataX2.append(negativeX[negOrder[i + negNum1]])
 65 
 66         dataY2.append(-1)
 67     for i in range(X2Num1):
 68 
 69         dataX1.append(X2[X2Order[i]])
 70 
 71         dataY1.append(2)
 72 
 73     for i in range(X2Num2):
 74 
 75         dataX2.append(X2[X2Order[i + negNum1]])
 76 
 77         dataY2.append(2)
 78 
 79 
 80 
 81     return [dataX1, dataY1, dataX2, dataY2]#返回的是平分的两类数据,x 为数据,y是标签
 82 
 83 
 84 
 85 def resample(dataX, dataY):
 86 
 87     '''
 88 
 89     sample an equivalent size dataset uniformly from the original one从原始样本中均匀地采样等效大小的数据集
 90 
 91     '''
 92 
 93     instances = len(dataX)
 94 
 95 
 96 
 97     sampleX = []
 98 
 99     sampleY = []
100 
101 
102 
103     for i in range(instances):
104 
105         chosen = random.randint(0, instances-1)
106 
107         sampleX.append(dataX[chosen])
108 
109         sampleY.append(dataY[chosen])
110 
111 
112 
113     return [sampleX, sampleY]
114 
115 
116 
117 def split(dataX, dataY):
118 
119     '''
120 
121     divide the whole dataset into positive and negative ones 把整个数据集分为正数和负数 
122 
123     '''
124 
125     instances = len(dataX)
126 
127     positiveX = []
128 
129     negativeX = []
130 
131     X2=[]
132 
133     for i in range(instances):
134 
135         if dataY[i] == 1:
136 
137             positiveX.append(dataX[i])
138 
139         elif dataY[i] == 2 :
140 
141             negativeX.append(dataX[i])
142 
143         else :
144 
145             X2.append(dataX[i])
146 
147 
148 
149     return [positiveX, negativeX,X2]
150 
151 
152 
153 def F_norm(matrix):
154 
155     '''
156 
157     calculate the Frobenius norm of a matrix i.e |vec(A)|_2   计算矩阵的Frobenius范数,即VEC(A)2
158 
159     '''
160 
161     squared = map(lambda x: x**2, matrix)
162 
163     squared_sum = np.sum(squared)
164 
165     return math.sqrt(squared_sum)
166 
167 
168 
169 def M_norm(matrix, vector):
170 
171     '''
172 
173     calculate the M-norm of a matrix i.e \sqrt{v.T * M * v}   计算矩阵的m范数,即\sqrt{v.t*m *v}。 
174 
175     '''
176 
177     squared = np.dot(vector.T, np.dot(matrix, vector))
178 
179     return math.sqrt(squared[0][0])

sonar代码。功能同上顺序

  1 #!/usr/bin/python3
  2 from load1 import loader as loader    #作者自己写的程序
  3 from sklearn.preprocessing import LabelEncoder
  4 import numpy as np
  5 from log import log                  #作者自己写的程序
  6 from matplotlib import pyplot as plt
  7 import util2 as util2                #作者自己写的程序
  8 
  9 #第一个函数,计算均值向量
 10 def computeMeanVec(X, y, uniqueClass):
 11     """
 12     Step 1: Computing the d-dimensional mean vectors for different class 计算均值向量
 13     """
 14     np.set_printoptions(precision=4)#保留四位小数
 15 
 16     mean_vectors = []
 17     for cl in range(1,len(uniqueClass)+1):
 18         mean_vectors.append(np.mean(X[y==cl], axis=0)) #axis = 0:压缩行,对各列求均值,返回 1* n 矩阵
 19         log('Mean Vector class %s: %s\n' %(cl, mean_vectors[cl-1]))#
 20     return mean_vectors
 21 
 22 #第二个函数,计算类内离散度矩阵
 23 def computeWithinScatterMatrices(X, y, feature_no, uniqueClass, mean_vectors):
 24     # 2.1 Within-class scatter matrix
 25     S_W = np.zeros((feature_no, feature_no))#初始化矩阵吧
 26     for cl,mv in zip(range(1,len(uniqueClass)+1), mean_vectors):
 27         class_sc_mat = np.zeros((feature_no, feature_no))  # scatter matrix for every class
 28         for row in X[y == cl]:
 29             row, mv = row.reshape(feature_no,1), mv.reshape(feature_no,1)   # make column vectors
 30             class_sc_mat += (row-mv).dot((row-mv).T)
 31         S_W += class_sc_mat                                # sum class scatter matrices
 32     log('within-class Scatter Matrix: {}\n'.format(S_W))
 33 
 34     return S_W
 35 
 36 #计算类间离散度矩阵
 37 def computeBetweenClassScatterMatrices(X, y, feature_no, mean_vectors):
 38     # 2.2 Between-class scatter matrix
 39     overall_mean = np.mean(X, axis=0)
 40 
 41     S_B = np.zeros((feature_no, feature_no))
 42     for i,mean_vec in enumerate(mean_vectors):
 43         n = X[y==i+1,:].shape[0]
 44         mean_vec = mean_vec.reshape(feature_no, 1) # make column vector
 45         overall_mean = overall_mean.reshape(feature_no, 1) # make column vector
 46         S_B += n * (mean_vec - overall_mean).dot((mean_vec - overall_mean).T)
 47     log('between-class Scatter Matrix: {}\n'.format(S_B))
 48 
 49     return S_B
 50 
 51 def computeEigenDecom(S_W, S_B, feature_no):
 52     """
 53     Step 3: Solving the generalized eigenvalue problem for the matrix S_W^-1 * S_B
 54     求解矩阵 S_W^-1 * S_B 的广义特征值问题 
 55     """
 56     m = 10^-6 # add a very small value to the diagonal of your matrix before inversion
 57     #在反演之前向矩阵的对角线增加一个非常小的值
 58     # noinspection PyTypeChecker
 59     eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W+np.eye(S_W.shape[1])*m).dot(S_B))
 60 
 61     for i in range(len(eig_vals)):
 62         eigvec_sc = eig_vecs[:,i].reshape(feature_no, 1)
 63         log('\nEigenvector {}: \n{}'.format(i+1, eigvec_sc.real))
 64         log('Eigenvalue {:}: {:.2e}'.format(i+1, eig_vals[i].real))
 65 
 66     for i in range(len(eig_vals)):
 67         eigv = eig_vecs[:,i].reshape(feature_no, 1)
 68         # noinspection PyTypeChecker
 69         np.testing.assert_array_almost_equal(np.linalg.inv(S_W+np.eye(S_W.shape[1])*m).dot(S_B).dot(eigv),
 70                                              eig_vals[i] * eigv,
 71                                              decimal=6, err_msg='', verbose=True)
 72     log('Eigenvalue Decomposition OK')
 73 
 74     return eig_vals, eig_vecs
 75 
 76 def selectFeature(eig_vals, eig_vecs, feature_no):
 77     """
 78     Step 4: Selecting linear discriminants for the new feature subspace
 79     """
 80     # 4.1. Sorting the eigenvectors by decreasing eigenvalues
 81     # Make a list of (eigenvalue, eigenvector) tuples
 82     eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
 83 
 84     # Sort the (eigenvalue, eigenvector) tuples from high to low by the value of eigenvalue
 85     eig_pairs = sorted(eig_pairs, key=lambda k: k[0], reverse=True)
 86 
 87     log('Eigenvalues in decreasing order:\n')
 88     for i in eig_pairs:
 89         log(i[0])
 90 
 91     log('Variance explained:\n')
 92     eigv_sum = sum(eig_vals)
 93     for i,j in enumerate(eig_pairs):
 94         log('eigenvalue {0:}: {1:.2%}'.format(i+1, (j[0]/eigv_sum).real))
 95 
 96     # 4.2. Choosing k eigenvectors with the largest eigenvalues - here I choose the first two eigenvalues
 97     W = np.hstack((eig_pairs[0][1].reshape(feature_no, 1), eig_pairs[1][1].reshape(feature_no, 1)))
 98     log('Matrix W: \n{}'.format(W.real))
 99 
100     return W
101 
102 def transformToNewSpace(X, W, sample_no, mean_vectors, uniqueClass):
103     """
104     Step 5: Transforming the samples onto the new subspace
105     """
106     X_trans = X.dot(W)
107     mean_vecs_trans = []
108     for i in range(len(uniqueClass)):
109         mean_vecs_trans.append(mean_vectors[i].dot(W))
110 
111     #assert X_trans.shape == (sample_no,2), "The matrix is not size of (sample number, 2) dimensional."
112 
113     return X_trans, mean_vecs_trans
114 
115 def computeErrorRate(X_trans, mean_vecs_trans, y):
116     """
117     Compute the error rate
118     """
119 
120     """
121     Project to the second largest eigenvalue
122     """
123     uniqueClass = np.unique(y)
124     threshold = 0
125     for i in range(len(uniqueClass)):
126         threshold += mean_vecs_trans[i][1]
127     threshold /= len(uniqueClass)
128     log("threshold: {}".format(threshold))
129 
130     errors = 0
131     for (i,cl) in enumerate(uniqueClass):
132         label = cl
133         tmp = X_trans[y==label, 1]
134         # compute the error numbers for class i
135         num = len(tmp[tmp<threshold]) if mean_vecs_trans[i][1] > threshold else len(tmp[tmp>=threshold])
136         log("error rate in class {} = {}".format(i, num*1.0/len(tmp)))
137         errors += num
138 
139     errorRate = errors*1.0/X_trans.shape[0]
140     log("Error rate for the second largest eigenvalue = {}".format(errorRate))
141     log("Accuracy for the second largest eigenvalue = {}".format(1-errorRate))
142 
143 
144     """
145     Project to the largest eigenvalue - and return
146     """
147     uniqueClass = np.unique(y)
148     threshold = 0
149     for i in range(len(uniqueClass)):
150         threshold += mean_vecs_trans[i][0]
151     threshold /= len(uniqueClass)
152     log("threshold: {}".format(threshold))
153 
154     errors = 0
155     for (i,cl) in enumerate(uniqueClass):
156         label = cl
157         tmp = X_trans[y==label, 0]
158         # compute the error numbers for class i
159         num = len(tmp[tmp<threshold]) if mean_vecs_trans[i][0] > threshold else len(tmp[tmp>=threshold])
160         log("error rate in class {} = {}".format(i, num*1.0/len(tmp)))
161         errors += num
162 
163     errorRate = errors*1.0/X_trans.shape[0]
164     log("Error rate = {}".format(errorRate))
165     log("Accuracy = {}".format(1-errorRate))
166 
167     return 1-errorRate, threshold
168 
169 
170 
171 
172 def plot_step_lda(X_trans, y, label_dict, uniqueClass, dataset, threshold):
173 
174     ax = plt.subplot(111)
175     for label,marker,color in zip(range(1, len(uniqueClass)+1),('^', 's','d'),('blue', 'red','green')):
176         plt.scatter(x=X_trans[:,0].real[y == label], #X[:,0]就是取所有行的第0个数据, X[:,1] 就是取所有行的第1个数据
177                     y=X_trans[:,1].real[y == label],
178                     marker=marker,
179                     color=color,
180                     alpha=0.5,
181                     label=label_dict[label]
182                     )
183 
184     plt.xlabel('LDA1')
185     plt.ylabel('LDA2')
186 
187     leg = plt.legend(loc='upper right', fancybox=True)
188     #loc 图例所有figure位置,右上。 fancybox控制是否应在构成图例背景的FancyBboxPatch周围启用圆边
189     leg.get_frame().set_alpha(0.5)
190     plt.title('Fisher LDA: {} data projection onto the first 2 linear discriminants'.format(dataset))
191 
192     # plot the the threshold line  绘出分界线
193     [bottom, up] = ax.get_ylim()
194     #plt.axvline(x=threshold.real, ymin=bottom, ymax=0.3, linewidth=2, color='k', linestyle='--')
195     plt.axvline(threshold.real, linewidth=2, color='g')
196 
197     # hide axis ticks
198     plt.tick_params(axis="both", which="both", bottom="off", top="off",
199             labelbottom="on", left="off", right="off", labelleft="on")
200 
201     # remove axis spines
202     ax.spines["top"].set_visible(False)
203     ax.spines["right"].set_visible(False)
204     ax.spines["bottom"].set_visible(False)
205     ax.spines["left"].set_visible(False)
206 
207     plt.grid()
208     #plt.tight_layout
209     plt.show()
210     #第二个图
211 def plot_step_lda2(accuracys):
212     ax = plt.subplot(111)
213     plt.xlabel('Demention')
214     plt.ylabel('Accuracy')
215 
216     leg = plt.legend(loc='upper right', fancybox=True)
217     # loc 图例所有figure位置,右上。 fancybox控制是否应在构成图例背景的FancyBboxPatch周围启用圆边
218     leg.get_frame().set_alpha(0.5)
219     plt.title('Fisher LDA: {} data projection onto the first 2 linear discriminants'.format(dataset))
220     plt.axis([0, 60, 0, 1])
221     x=[3, 10, 20, 30, 40, 50, 60]
222     y=accuracys
223     plt.plot(x,y,linewidth=1,color='b')
224     plt.show()
225 
226 def mainFisherLDAtest(dataset='sonar', alpha=0.5):  #def mainFisherLDAtest(dataset='sonar', alpha=0.5):
227     # load data
228     path = dataset + '/' + dataset + '.data'
229     load = loader(path)#取数据
230     [X, y] = load.load()
231     [x1,y1]=[X,y]
232     # noinspection PyUnresolvedReferences
233     #改动,进行交叉验证
234     dimension=[3,10,20,30,40,50,60]
235     accuracys=[0,0,0,0,0,0,0]
236     accnumber=0
237     for i in range(7):
238         [X, y]=[x1,y1]
239         j=dimension[i]
240         [X, y, testX, testY] = util2.divide(X, y, alpha,j)#训练集和训练集标签,测试集和测试集标签
241         X = np.array(X) #数组
242         testX = np.array(testX)#测试集
243 
244         feature_no = X.shape[1] # define the dimension读取矩阵的维度,返回列数也就是样本维度
245         print(feature_no)
246         sample_no = X.shape[0] # define the sample number返回行数也就是样本个数
247 
248     # preprocessing
249         enc = LabelEncoder()
250         label_encoder = enc.fit(y)#对Y进行编码
251 
252 
253         y = label_encoder.transform(y) + 1#将Y变成正数
254         print(y)
255         testY = label_encoder.transform(testY) + 1#与上文一样
256     # noinspection PyTypeChecker
257         uniqueClass = np.unique(y) # define how many class in the outputs确定有几类输出
258         label_dict = {}   # define the label name
259         for i in range(1, len(uniqueClass)+1):
260             label_dict[i] = "Class"+str(i)
261         log(label_dict)       #打印类别
262 
263     # Step 1: Computing the d-dimensional mean vectors for different class
264         mean_vectors = computeMeanVec(X, y, uniqueClass)
265 
266 
267     # Step 2: Computing the Scatter Matrices
268         S_W = computeWithinScatterMatrices(X, y, feature_no, uniqueClass, mean_vectors)
269         S_B = computeBetweenClassScatterMatrices(X, y, feature_no, mean_vectors)
270 
271     # Step 3: Solving the generalized eigenvalue problem for the matrix S_W^-1 * S_B
272         eig_vals, eig_vecs = computeEigenDecom(S_W, S_B, feature_no)
273 
274     # Step 4: Selecting linear discriminants for the new feature subspace为新特征子空间选择线性判别法
275         W = selectFeature(eig_vals, eig_vecs, feature_no)
276 
277     # Step 5: Transforming the samples onto the new subspace将样本转化为新的子空间
278         X_trans, mean_vecs_trans = transformToNewSpace(testX, W, sample_no, mean_vectors, uniqueClass)
279 
280 
281     # Step 6: compute error rate
282         accuracy, threshold = computeErrorRate(X_trans, mean_vecs_trans, testY)
283 
284 
285     # plot
286         plot_step_lda(X_trans, testY, label_dict, uniqueClass, dataset, threshold)
287     #plot_step_lda(X_trans, testY, label_dict, uniqueClass, dataset, threshold)
288         accuracys[accnumber] = accuracy
289         accnumber = accnumber + 1
290     plot_step_lda2(accuracys)
291     accuracy = ( accuracys[0] + accuracys[1] +accuracys[2] + accuracys[3] + accuracys[4] + accuracys[5]+ accuracys[6])/6
292     return accuracy
293 
294 
295 
296 
297 if __name__ == "__main__":
298     dataset = ['ionosphere', 'sonar']  # choose the dataset
299     alpha = 0.7 # choose the train data percentage
300     accuracy = mainFisherLDAtest(dataset[1], alpha)
301     print(accuracy)
 1 import pandas as pd
 2 
 3 
 4 
 5 class loader(object):
 6 
 7 
 8 
 9     def __init__(self,file_name, split_token = ','):#def __init__(self,file_name , split_token = ','):
10 
11         self.file_name = 'sonar/sonar.data'
12 
13         self.split_token = split_token
14 
15 
16 
17     def load(self):
18 
19         df = pd.io.parsers.read_csv(
20 
21             filepath_or_buffer='请在此处写上自己数据的本地路径', #  原文 filepath_or_buffer = self.file_name,
22 
23                             header = None,
24 
25                             sep = self.split_token)     #原文 sep = self.split_token)
26 
27 
28 
29 
30         self.dimension = df.shape[1]-1
31 
32 
33         df.columns = list(range(self.dimension)) + ['label']     #原文df.columns = range(self.dimension) + ['label']
34 
35         df.dropna(how = 'all', inplace = True) # to drop the empty line at file-end 滤除缺失数据
36 
37         df.tail() #查看低端数据
38 
39 
40 
41         self.X = df[list(range(self.dimension))].values           # 原文self.X = df[range(self.dimension)].values
42 
43         self.Y = df['label'].values
44 
45         distinct_label = list(set(self.Y))        #创建分类标签列表
46 
47 
48 
49         if len(distinct_label) != 2:           #原文 if len(distinct_label) != 2:
50 
51             raise Exception('Two Labels Required\n','Label Sets:\n',distinct_label)
52             #原文 raise Exception('Two Labels Required\n','Label Sets:\n',distinct_label)
53             #raise Exception是一个函数,用于线程出现异常时抛出错误
54 
55 
56         self.Y = list(map(lambda x: {distinct_label[0]:1, distinct_label[1]:-1}[x], self.Y))
57         #原文 self.Y = map(lambda x: {distinct_label[0]: 1, distinct_label[1]: -1}[x], self.Y)
58 
59         return [self.X, self.Y]
60 
61 
62 
63 if __name__ == "__main__":
64 
65     loader = loader('ionosphere/ionosphere.data')        #原文 loader = loader('ionosphere/ionosphere.data')
66 
67     [X, y] = loader.load()
68 
69     print (X.shape)#X是整个数据的大小,Y是标签,共有三个标签
import random

import math

import numpy as np

#将数据分为训练集和数据集
#x是数据,y是标签
#divide a dataset into two parts, usually training and testing set
def divide(dataX, dataY, alpha,i):
    print(i)
    d=[]
    for k in range(i):
        d.append(k)

    print(d)
    dataX=dataX[:,d]
    print(dataX)

    [positiveX, negativeX] = split(dataX, dataY)#分出两类
    # alpha=0.5
    posNum1 = int (len(positiveX) * alpha)

    negNum1 = int (len(negativeX) * alpha)

    posNum2 = len(positiveX) - posNum1

    negNum2 = len(negativeX) - negNum1
    # 将第一类,第二类
    posOrder = np.random.permutation(len(positiveX))#分别给两类标上序号

    negOrder = np.random.permutation(len(negativeX))

    dataX1 = []

    dataY1 = []

    dataX2 = []

    dataY2 = []

    for i in range(posNum1):

        dataX1.append(positiveX[posOrder[i]])

        dataY1.append(1)

    for i in range(posNum2):

        dataX2.append(positiveX[posOrder[i + posNum1]])

        dataY2.append(1)

    for i in range(negNum1):

        dataX1.append(negativeX[negOrder[i]])

        dataY1.append(-1)

    for i in range(negNum2):

        dataX2.append(negativeX[negOrder[i + negNum1]])

        dataY2.append(-1)


    return [dataX1, dataY1, dataX2, dataY2]#返回的是平分的两类数据



def resample(dataX, dataY):

    '''

    sample an equivalent size dataset uniformly from the original one从原始样本中均匀地采样等效大小的数据集

    '''

    instances = len(dataX)



    sampleX = []

    sampleY = []



    for i in range(instances):

        chosen = random.randint(0, instances-1)

        sampleX.append(dataX[chosen])

        sampleY.append(dataY[chosen])



    return [sampleX, sampleY]



def split(dataX, dataY):

    '''

    divide the whole dataset into positive and negative ones 把整个数据集分为正数和负数 

    '''

    instances = len(dataX)

    positiveX = []

    negativeX = []



    for i in range(instances):

        if dataY[i] == 1:

            positiveX.append(dataX[i])

        else:

            negativeX.append(dataX[i])



    return [positiveX, negativeX]



def F_norm(matrix):

    '''

    calculate the Frobenius norm of a matrix i.e |vec(A)|_2   计算矩阵的Frobenius范数,即VEC(A)2

    '''

    squared = map(lambda x: x**2, matrix)

    squared_sum = np.sum(squared)

    return math.sqrt(squared_sum)



def M_norm(matrix, vector):

    '''

    calculate the M-norm of a matrix i.e \sqrt{v.T * M * v}   计算矩阵的m范数,即\sqrt{v.t*m *v}。 

    '''

    squared = np.dot(vector.T, np.dot(matrix, vector))

    return math.sqrt(squared[0][0])

就这些了。

 

 

 

posted @ 2018-10-28 21:09  al7ice  阅读(783)  评论(0编辑  收藏  举报