SVM实践
在Ubuntu上使用libsvm(附上官网链接以及安装方法)进行SVM的实践:
1、代码演示:(来自一段文本分类的代码)
# encoding=utf8 __author__ = 'wang' # set the encoding of input file utf-8 import sys reload (sys) sys.setdefaultencoding( 'utf-8' ) import os from svmutil import * import subprocess # get the data in svm format featureStringCorpusDir = './svmFormatFeatureStringCorpus/' fileList = os.listdir(featureStringCorpusDir) # get the number of the files numFile = len (fileList) trainX = [] trainY = [] testX = [] testY = [] # enumrate the file for i in xrange (numFile): # print i fileName = fileList[i] if (fileName.endswith( ".libsvm" )): # check the data format<br> # checkdata_py = r "./tool/checkdata.py" # the file name svm_file = featureStringCorpusDir + fileName cmd = 'python {0} {1} ' . format (checkdata_py, svm_file) # print ("excute '{}'".format(cmd)) # print('Check Data...') check_data = subprocess.Popen(cmd, shell = True , stdout = subprocess.PIPE) (stdout, stderr) = check_data.communicate() returncode = check_data.returncode if returncode = = 1 : print stdout exit( 1 ) y, x = svm_read_problem(svm_file) trainX + = x[: 100 ] trainY + = y[: 100 ] testX + = x[ - 1 : - 6 : - 1 ] testY + = y[ - 1 : - 6 : - 1 ] print testX # train the c-svm model m = svm_train(trainY, trainX, '-s 0 -c 4 -t 2 -g 0.1 -e 0.1' ) # predict the test data p_label, p_acc, p_val = svm_predict(testY, testX, m) # output the result to the file result_file = open ( "./result.txt" , "w" ) result_file.write( "testY p_label p_val\n" ) for c in xrange ( len (p_label)): result_file.write( str (testY[c]) + " " + str (p_label[c]) + " " + str (p_val[c]) + "\n" ) result_file.write( "accuracy: " + str (p_acc) + "\n" ) result_file.close() # print p_val |
(1)关于训练数据的格式:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | The format of training and testing data file is : <label> <index1>:<value1> <index2>:<value2> .. . . . Each line contains an instance and is ended by a '\n' character. For <br>classification, <label> is an integer indicating the class label (multi - class is supported)(如何label是个非数值的,将其转化为数值型). For regression, <label> is <br>the target value which can be any real number. For one - class SVM, it's not used so can be any number. The pair <index>:<value> gives a feature (attribute) value:<br><index> is an integer starting from 1 and <value> is a real number. The only exception <br> is the precomputed kernel, where <index> starts from 0 ; see the section of precomputed kernels. <br>Indices must be in ASCENDING order(要升序排序). Labels in the testing file are only used to calculate accuracy <br> or errors. If they are unknown, just fill the first column with any numbers. A sample classification data included in this package is `heart_scale '. To check if your data is <br>in a correct form, use `tools/checkdata.py' (details in `tools / README'). Type `svm - train heart_scale', and the program will read the training data and output the model file `heart_scale.model'. If you have a test set called heart_scale.t, then type `svm - predict heart_scale.t heart_scale.model output ' to see the prediction accuracy. The `output' file contains the predicted class labels. For classification, if training data are in only one class (i.e., all labels are the same), then `svm - train' issues a warning message: `Warning: training data in only one class . See README for details,' which means the training data is very unbalanced. The label in the training data is directly returned when testing. |
(2)svm训练的参数解释来自官方的github:https://github.com/cjlin1/libsvm
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 | `svm - train' Usage = = = = = = = = = = = = = = = = = Usage: svm - train [options] training_set_file [model_file] options: - s svm_type : set type of SVM (default 0 ) 0 - - C - SVC (multi - class classification) 1 - - nu - SVC (multi - class classification) 2 - - one - class SVM 3 - - epsilon - SVR (regression) 4 - - nu - SVR (regression) - t kernel_type : set type of kernel function (default 2 ) 0 - - linear: u' * v 1 - - polynomial: (gamma * u' * v + coef0)^degree 2 - - radial basis function: exp( - gamma * |u - v|^ 2 ) 3 - - sigmoid: tanh(gamma * u' * v + coef0) 4 - - precomputed kernel (kernel values in training_set_file) - d degree : set degree in kernel function (default 3 ) - g gamma : set gamma in kernel function (default 1 / num_features) - r coef0 : set coef0 in kernel function (default 0 ) - c cost : set the parameter C of C - SVC, epsilon - SVR, and nu - SVR (default 1 ) - n nu : set the parameter nu of nu - SVC, one - class SVM, and nu - SVR (default 0.5 ) - p epsilon : set the epsilon in loss function of epsilon - SVR (default 0.1 ) - m cachesize : set cache memory size in MB (default 100 ) - e epsilon : set tolerance of termination criterion (default 0.001 ) - h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1 ) - b probability_estimates : whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0 ) - wi weight : set the parameter C of class i to weight * C, for C - SVC (default 1 ) (当数据不平衡时,可给不同的类设置不用的惩罚) - v n: n - fold cross validation mode - q : quiet mode (no outputs) The k in the - g option means the number of attributes in the input data. option - v randomly splits the data into n parts and calculates cross validation accuracy / mean squared error on them. See libsvm FAQ for the meaning of outputs. `svm - predict' Usage = = = = = = = = = = = = = = = = = = = Usage: svm - predict [options] test_file model_file output_file options: - b probability_estimates: whether to predict probability estimates, 0 or 1 (default 0 ); for one - class SVM only 0 is supported model_file is the model file generated by svm - train. test_file is the test data you want to predict. svm - predict will produce output in the output_file. |
注:
nu-SVR: nu回归机。引入能够自动计算epsilon的参数nu。若记错误样本的个数为q ,则nu大于等于q/l,即nu是错误样本的个数所占总样本数的份额的上界;若记支持向量的个数为p,则nu小于等于p/l,即nu是支持向量的个数所占总样本数的份额的下界。首先选择参数nu和C,然后求解最优化问题。
Shrinking: 即参数-h, 优化求解过程中是否采用shrinking. 边界支持向量BSVs(ai=C的SV)在迭代过程中ai不会变化,如果找到这些点,并把它们固定为C,可以减少QP的规模。
2、运行结果展示并分析部分参数的意思:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | * optimization finished, #iter = 119 nu = 0.272950 obj = - 122.758206 , rho = - 1.384603 nSV = 127 , nBSV = 21 * ..............(此处省略部分) * optimization finished, #iter = 129 nu = 0.214275 obj = - 89.691907 , rho = - 1.105131 nSV = 103 , nBSV = 8 * optimization finished, #iter = 77 nu = 0.147922 obj = - 67.825431 , rho = 0.984237 nSV = 80 , nBSV = 10 Total nSV = 1246 Accuracy = 51.4286 % ( 36 / 70 ) (classification) |
官方解释:obj is the optimal objective value of the dual SVM problem(obj表示SVM最优化问题对偶式的最优目标值,公式如下图的公式(2)). rho is the bias term in the decision function sgn(w^Tx - rho). nSV and nBSV are number of support vectors and bounded support vectors (i.e., alpha_i = C). nu-svm is a somewhat equivalent form of C-SVM where C is replaced by nu. nu simply shows the corresponding parameter. More details are in libsvm document.
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· go语言实现终端里的倒计时
· 如何编写易于单元测试的代码
· 10年+ .NET Coder 心语,封装的思维:从隐藏、稳定开始理解其本质意义
· .NET Core 中如何实现缓存的预热?
· 从 HTTP 原因短语缺失研究 HTTP/2 和 HTTP/3 的设计差异
· 分享一个免费、快速、无限量使用的满血 DeepSeek R1 模型,支持深度思考和联网搜索!
· 基于 Docker 搭建 FRP 内网穿透开源项目(很简单哒)
· ollama系列1:轻松3步本地部署deepseek,普通电脑可用
· 按钮权限的设计及实现
· 【杂谈】分布式事务——高大上的无用知识?