PySpark data preparation
The Iris dataset
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
Code to convert the data to libsvm format
import sys

# Map each Iris species name to an integer class label.
LABELS = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

def main(path):
    with open(path, "r") as f:
        for line in f:
            ss = line.strip().split(",")
            if len(ss) != 5:
                continue  # skip blank or malformed lines
            label = LABELS[ss[4]]
            # libsvm line: <label> <index>:<value> ..., with 1-based indices
            print("%d 1:%.1f 2:%.1f 3:%.1f 4:%.1f"
                  % (label, float(ss[0]), float(ss[1]), float(ss[2]), float(ss[3])))

if __name__ == "__main__":
    main(sys.argv[1])
The Iris dataset in libsvm format
0 1:5.1 2:3.5 3:1.4 4:0.2
0 1:4.9 2:3.0 3:1.4 4:0.2
0 1:4.7 2:3.2 3:1.3 4:0.2
0 1:4.6 2:3.1 3:1.5 4:0.2
0 1:5.0 2:3.6 3:1.4 4:0.2
0 1:5.4 2:3.9 3:1.7 4:0.4
0 1:4.6 2:3.4 3:1.4 4:0.3
0 1:5.0 2:3.4 3:1.5 4:0.2
0 1:4.4 2:2.9 3:1.4 4:0.2
0 1:4.9 2:3.1 3:1.5 4:0.1
0 1:5.4 2:3.7 3:1.5 4:0.2
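Each line in the listing above has the form "label index:value ...", with feature indices that start at 1 and increase from left to right. As a quick sanity check on a converted file, the lines can be parsed back in plain Python. The following is only a minimal sketch: parse_libsvm_line is a hypothetical helper introduced here for illustration, not part of PySpark or of the conversion script.

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value})."""
    tokens = line.split()
    label = int(float(tokens[0]))
    features = {}
    previous = 0  # indices must be >= 1 and strictly increasing
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        if index <= previous:
            raise ValueError("indices must be 1-based and ascending: %r" % line)
        features[index] = float(value_text)
        previous = index
    return label, features


if __name__ == "__main__":
    # Hypothetical usage on the first line of the converted dataset.
    label, features = parse_libsvm_line("0 1:5.1 2:3.5 3:1.4 4:0.2")
    print(label, features)
```

The dictionary keeps the 1-based indices exactly as written in the file, so the parsed values can be compared directly against the columns of the original CSV line.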
Reading and converting the libsvm-format data with PySpark
>>> from pyspark.mllib.util import MLUtils
>>> examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
>>> examples.take(2)
[LabeledPoint(0.0, (4,[0,1,2,3],[5.1,3.5,1.4,0.2])), LabeledPoint(0.0, (4,[0,1,2,3],[4.9,3.0,1.4,0.2]))]
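In the output above, each LabeledPoint prints its features as a sparse vector (size, [indices], [values]). The indices are zero-based (0 to 3), while the libsvm file uses one-based indices (1 to 4); the loader shifts them by one when reading. The sketch below mimics that shift in plain Python so it can be tried without a Spark session. to_sparse_triplet is a hypothetical helper written for this illustration and is not a PySpark function.

```python
def to_sparse_triplet(pairs, size):
    """Turn 1-based (index, value) pairs into (size, indices, values), 0-based."""
    indices = []
    values = []
    for index, value in pairs:
        if not 1 <= index <= size:
            raise ValueError("index %d out of range 1..%d" % (index, size))
        indices.append(index - 1)  # libsvm is 1-based, the vector is 0-based
        values.append(value)
    return size, indices, values


if __name__ == "__main__":
    # Feature pairs as they appear in the first libsvm line of the Iris data.
    pairs = [(1, 5.1), (2, 3.5), (3, 1.4), (4, 0.2)]
    print(to_sparse_triplet(pairs, 4))
```

For the first Iris record this yields a size of 4, indices 0 through 3 and the four measurements, which matches the layout of the first LabeledPoint shown in the session.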