2022 NUS summer workshop, visual computing, phase 2, tutorial 3: Digit Recognition notes

The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.
 
I. Tasks:
1. Familiarize yourself with the MNIST dataset: http://yann.lecun.com/exdb/mnist/
2. Familiarize yourself with the sklearn package: https://scikit-learn.org/stable/
3. Study k-Nearest Neighbours classifiers: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
4. Study RandomForest classifiers: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
5. Study Naïve Bayes classifiers: https://scikit-learn.org/stable/modules/naive_bayes.html
 
II. Programming exercises:
Q1.
Use the fetch_openml function found in sklearn.datasets to load the mnist_784 dataset into Python. This will load the X and y variables for you.
• Print the dimensions of the variables returned by the function.
• Write a Python script to find how many distinct values are present in y.
• Select one sample from X for each distinct y value.
• Resize each sample to represent the 28x28 pixel image.
• Display all the selected images in one diagram using subplots in matplotlib. The following code gives you an example of how to do this:

 

1. First import the sklearn package and use fetch_openml to load the dataset. The code is as follows:

For the parameters of the fetch_openml function, see: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html?highlight=fetch_openml#sklearn.datasets.fetch_openml

from sklearn.datasets import fetch_openml

X,y=fetch_openml('mnist_784',version=1,as_frame=False,return_X_y=True)

Print the dimensions of the variables returned by the function:

print(X.shape)
print(y.shape)

Output:

(70000, 784)
(70000,)

 

2. Write a Python script to find how many distinct values are present in y.

Use numpy.unique:

import numpy as np
distinct_y = np.unique(y)
print(len(distinct_y))

Output:

10

 

3. Select one sample from X for each distinct y value.

Use np.where(). The code is as follows:


# First find the index of the sample to select for each distinct label
distinct_idx = []
for i in distinct_y:
    distinct_idx.append(np.where(y == i)[0][0])
print(distinct_idx)

Output:

[1, 3, 5, 7, 2, 0, 13, 15, 17, 4]
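As an alternative to the loop over np.where(), np.unique can return the index of the first occurrence of each distinct value directly through its return_index argument. The sketch below is only an illustration: y_demo is a made-up ten-element label array (strings, since mnist_784 returns string labels), not the real dataset.

```python
import numpy as np

# Hypothetical toy labels standing in for the real y (strings, like mnist_784).
y_demo = np.array(['5', '0', '4', '1', '9', '2', '1', '3', '1', '4'])

# return_index=True also returns the index of the first occurrence of each value.
values, first_idx = np.unique(y_demo, return_index=True)

# The same indices, found with the np.where() loop used in these notes.
loop_idx = [int(np.where(y_demo == v)[0][0]) for v in values]

print(values)      # ['0' '1' '2' '3' '4' '5' '9']
print(first_idx)   # [1 3 5 7 2 0 4]
print(loop_idx)    # [1, 3, 5, 7, 2, 0, 4]
```

Both approaches give the same indices; the np.unique version scans the array once instead of once per label.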

 

4. Resize each sample to represent the 28x28 pixel image.

5. Display all the selected images in one diagram using subplots in matplotlib.

Steps 4 and 5 are done together. Displaying the images requires the matplotlib library. The code is as follows:

import matplotlib.pyplot as plt
fig = plt.figure()

for i,j in enumerate(distinct_idx):
    image=X[j].reshape((28,28))
    
    fig.add_subplot(2,5,i+1)
    plt.imshow(image,cmap='gray')
plt.show()

Output: a figure with the ten selected digit images arranged in a 2x5 grid.
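A slightly different way to build the same grid is plt.subplots, which returns all ten axes at once so each image can be drawn on its own axis and given a title. This is a minimal sketch under stated assumptions: the samples array holds random placeholder pixels rather than real MNIST rows, the Agg backend and the output file name digits_grid.png are choices made only so the snippet runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs headless
import matplotlib.pyplot as plt

# Placeholder data: ten random 784-pixel rows standing in for X[distinct_idx].
rng = np.random.default_rng(0)
samples = rng.integers(0, 256, size=(10, 784))
labels = [str(d) for d in range(10)]

fig, axes = plt.subplots(2, 5, figsize=(8, 4))
for ax, row, label in zip(axes.ravel(), samples, labels):
    ax.imshow(row.reshape(28, 28), cmap="gray")  # 784 values -> 28x28 image
    ax.set_title(label)
    ax.axis("off")  # hide ticks, which carry no meaning for pixel grids
fig.tight_layout()
fig.savefig("digits_grid.png")  # hypothetical output file name
```

With the real data, samples would be replaced by X[distinct_idx] and labels by distinct_y.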

That completes Q1. Next is Q2.

 

Q2: Use sklearn to train a digit classifier.

• Split the X and y into a training set and testing set with an 80-20 split.
• Train a Support Vector Machine (SVM) for classification of the digits using the training set.
• Test the model using the test set.
• Experiment with different parameter values for the SVM and see how it performs. Try changing the gamma value to be [0.0001, 0.0005, 0.001, 0.005, 0.01].
• Plot the accuracy value with respect to the change in gamma above.
Let's work through these steps.

1. Split the X and y into a training set and testing set with an 80-20 split.

Use sklearn.model_selection.train_test_split() to do the split. The code is as follows:

from sklearn.model_selection import train_test_split

# We want an 80-20 split between training and test sets, so set train_size=0.8 and test_size=0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)
print(X_train.shape,X_test.shape)

Output:

(56000, 784) (14000, 784) # of the 70000 images, 56000 form the training set and 14000 the test set
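The split above is random, so the accuracy reported later will vary slightly from run to run. One common option, shown here only as a sketch, is to pass random_state for a repeatable split and stratify so that every digit keeps the same proportion in both sets. X_demo and y_demo are made-up toy arrays, not the MNIST data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 10 classes with 10 samples each.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.repeat(np.arange(10), 10).astype(str)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # 80-20 split; train_size is inferred as the complement
    random_state=0,     # fixed seed so the split is reproducible
    stratify=y_demo,    # keep class proportions equal in both sets
)

print(X_tr.shape, X_te.shape)
# Per-class counts in the test set
print(np.unique(y_te, return_counts=True)[1])
```

For the real data the same call would take X and y in place of the toy arrays.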

2. Train a Support Vector Machine (SVM) for classification of the digits using the training set.

Use sklearn.svm.SVC(). The code is as follows:

from sklearn.svm import SVC
clf_svc = SVC() #could change the value of gamma in SVC()
clf_svc.fit(X_train,y_train)
y_pred = clf_svc.predict(X_test)

3. Test the model using the test set.

The model's accuracy is used to judge how good it is. The accuracy is computed with sklearn.metrics.accuracy_score():

from sklearn import metrics
accuracy = metrics.accuracy_score(y_test,y_pred)
print(accuracy)

Output:

0.9797857142857143 # the accuracy is quite good
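Accuracy here is simply the fraction of test samples whose predicted label equals the true label. The sketch below checks this against accuracy_score on a made-up five-element example; y_true and y_hat are hypothetical arrays used only for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions (strings, like the mnist_784 labels).
y_true = np.array(['3', '1', '4', '1', '5'])
y_hat = np.array(['3', '1', '4', '7', '5'])

# Fraction of positions where prediction and truth agree: 4 of 5.
manual = float(np.mean(y_true == y_hat))
library = accuracy_score(y_true, y_hat)

print(manual, library)
```

Both values are 0.8 for this toy example, since four of the five predictions match.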

4. Experiment with different parameter values for the SVM and see how it performs. Try changing the gamma value to be [0.0001, 0.0005, 0.001, 0.005, 0.01].

 

gm=[0.0001,0.0005,0.001,0.005,0.01]
accuracy = []
for i in gm:
    clf_svc = SVC(gamma=i) 
    clf_svc.fit(X_train,y_train)
    y_pred = clf_svc.predict(X_test)
    accuracy.append(metrics.accuracy_score(y_test,y_pred))
    print(accuracy)

 

5. Plot the accuracy value with respect to the change in gamma above.
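The notes do not include code for this step, so the following is one possible sketch rather than the original solution. To stay quick and self-contained it uses sklearn's small bundled load_digits dataset (8x8 images) as a stand-in for mnist_784, which is an assumption; the Agg backend, the log-scaled x axis, and the file name accuracy_vs_gamma.png are also choices made here. With the gm and accuracy lists from step 4, the same plotting calls apply unchanged.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data: the small 8x8 digits set bundled with sklearn, not mnist_784.
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.2, random_state=0
)

# Train one SVC per gamma value and record its test accuracy.
gammas = [0.0001, 0.0005, 0.001, 0.005, 0.01]
accuracies = []
for g in gammas:
    clf = SVC(gamma=g).fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, clf.predict(X_te)))

# Plot accuracy against gamma; a log x axis spreads the values evenly.
fig, ax = plt.subplots()
ax.plot(gammas, accuracies, marker="o")
ax.set_xscale("log")
ax.set_xlabel("gamma")
ax.set_ylabel("accuracy")
ax.set_title("SVC accuracy vs. gamma")
fig.savefig("accuracy_vs_gamma.png")  # hypothetical output file name
```

For the MNIST run, replacing gammas and accuracies with gm and accuracy produces the requested plot; calling plt.show() instead of savefig displays it interactively when a display is available.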

 

posted @ 2022-07-13 15:09  liyuSCU