Spark ML package: the Affairs sample dataset for data mining

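1. Field descriptions

affairs: frequency of extramarital affairs over the past year  
gender: gender  
age: age  
yearsmarried: number of years married  
children: whether the respondent has children  
religiousness: degree of religiousness (5-point scale; 1 means anti-religious, 5 means very religious)  
education: level of education  
occupation: occupation (reverse-numbered, 7-category Hollingshead classification)  
rating: self-rating of the marriage (5-point scale; 1 means very unhappy, 5 means very happy)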

 

2. Data list
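
The data list itself is not reproduced here. As a minimal sketch of the shape the later code expects, the snippet below builds a `dataList` from the first three rows printed by `data.show(10)` in section 3. The local `SparkSession`, the application name, and the choice of a `List` of tuples are assumptions, not taken from the original; the full dataset has 601 rows. It is written in script style, as for spark-shell.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; the original does not show how Spark was started.
// In spark-shell a session named `spark` already exists.
val spark = SparkSession.builder()
  .appName("AffairsExample") // hypothetical application name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._ // brings toDF into scope for local collections

// Each tuple follows the field order of section 1:
// (affairs, gender, age, yearsmarried, children,
//  religiousness, education, occupation, rating)
val dataList: List[(Double, String, Double, Double, String, Double, Double, Double, Double)] = List(
  (0.0, "male",   37.0, 10.0, "no",  3.0, 18.0, 7.0, 4.0),
  (0.0, "female", 27.0,  4.0, "no",  4.0, 14.0, 6.0, 4.0),
  (0.0, "female", 32.0, 15.0, "yes", 1.0, 12.0, 1.0, 4.0)
)
```

Using `Double` and `String` tuple fields is what yields the schema printed in section 3: primitive doubles become non-nullable columns, strings become nullable ones.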

 

3. Define column names

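// Column names, in the same order as the fields of each record in dataList
val colArray: Array[String] = Array("affairs", "gender", "age", "yearsmarried", "children", "religiousness", "education", "occupation", "rating")
 
// toDF on a local collection requires `import spark.implicits._` to be in scope
val data = dataList.toDF(colArray:_*)
   
data.printSchema()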
root
 |-- affairs: double (nullable = false)
 |-- gender: string (nullable = true)
 |-- age: double (nullable = false)
 |-- yearsmarried: double (nullable = false)
 |-- children: string (nullable = true)
 |-- religiousness: double (nullable = false)
 |-- education: double (nullable = false)
 |-- occupation: double (nullable = false)
 |-- rating: double (nullable = false)
  
data.show(10)
+-------+------+----+------------+--------+-------------+---------+----------+------+
|affairs|gender| age|yearsmarried|children|religiousness|education|occupation|rating|
+-------+------+----+------------+--------+-------------+---------+----------+------+
|    0.0|  male|37.0|        10.0|      no|          3.0|     18.0|       7.0|   4.0|
|    0.0|female|27.0|         4.0|      no|          4.0|     14.0|       6.0|   4.0|
|    0.0|female|32.0|        15.0|     yes|          1.0|     12.0|       1.0|   4.0|
|    0.0|  male|57.0|        15.0|     yes|          5.0|     18.0|       6.0|   5.0|
|    0.0|  male|22.0|        0.75|      no|          2.0|     17.0|       6.0|   3.0|
|    0.0|female|32.0|         1.5|      no|          2.0|     17.0|       5.0|   5.0|
|    0.0|female|22.0|        0.75|      no|          2.0|     12.0|       1.0|   3.0|
|    0.0|  male|57.0|        15.0|     yes|          2.0|     14.0|       4.0|   4.0|
|    0.0|female|32.0|        15.0|     yes|          4.0|     16.0|       1.0|   2.0|
|    0.0|  male|22.0|         1.5|      no|          4.0|     14.0|       4.0|   5.0|
+-------+------+----+------------+--------+-------------+---------+----------+------+
only showing top 10 rows
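
As an alternative to passing a separate array of column names, the names can come from a case class. The sketch below is not the approach used above; the `Affair` class, the session setup, and the application name are hypothetical, and the two rows are copied from the output above. It is written in script style, as for spark-shell; in a compiled application the case class would need to be defined at the top level, outside the method that calls `toDF`.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; field names and types mirror the printed schema.
case class Affair(affairs: Double, gender: String, age: Double, yearsmarried: Double,
                  children: String, religiousness: Double, education: Double,
                  occupation: Double, rating: Double)

// Assumption: a local session with a made-up application name.
val spark = SparkSession.builder()
  .appName("AffairsCaseClass")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Column names are derived from the case class fields, so no colArray is needed.
val typed = Seq(
  Affair(0.0, "male",   37.0, 10.0, "no", 3.0, 18.0, 7.0, 4.0),
  Affair(0.0, "female", 27.0,  4.0, "no", 4.0, 14.0, 6.0, 4.0)
).toDF()

typed.printSchema()
```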

 

 

4. Inspect the statistical distribution of the data

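// describe() computes count, mean, stddev, min and max for the listed columns;
// every statistic is returned as a string, as the schema below shows
val descrDF = data.describe(colArray:_*)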
 
descrDF.printSchema()
root
 |-- summary: string (nullable = true)
 |-- affairs: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- age: string (nullable = true)
 |-- yearsmarried: string (nullable = true)
 |-- children: string (nullable = true)
 |-- religiousness: string (nullable = true)
 |-- education: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- rating: string (nullable = true)
 
 
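// round() implicitly casts the string statistics to double;
// children is non-numeric and is selected unchanged, gender is left out
descrDF.selectExpr("summary",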
            "round(affairs,2) as affairs",
            "round(age,2) as age",
            "round(yearsmarried,2) as yearsmarried",
            "children",
            "round(religiousness,2) as religiousness",
            "round(education,2) as education",
            "round(occupation,2) as occupation",
            "round(rating,2) as rating").show(10, truncate = false)
+-------+-------+-----+------------+--------+-------------+---------+----------+------+
|summary|affairs|age  |yearsmarried|children|religiousness|education|occupation|rating|
+-------+-------+-----+------------+--------+-------------+---------+----------+------+
|count  |601.0  |601.0|601.0       |601     |601.0        |601.0    |601.0     |601.0 |
|mean   |1.46   |32.49|8.18        |null    |3.12         |16.17    |4.19      |3.93  |
|stddev |3.3    |9.29 |5.57        |null    |1.17         |2.4      |1.82      |1.1   |
|min    |0.0    |17.5 |0.13        |no      |1.0          |9.0      |1.0       |1.0   |
|max    |12.0   |57.0 |15.0        |yes     |5.0          |20.0     |7.0       |5.0   |
+-------+-------+-----+------------+--------+-------------+---------+----------+------+
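
For the string columns, `describe` only yields a count and the lexicographic min and max, which says little about how the values are distributed. One possible complement is a frequency table per category. The sketch below assumes a local `SparkSession` and uses a four-row stand-in copied from the `show(10)` output, restricted to three columns; on the real data the same `groupBy` would be applied to `data` directly.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count, round}

// Assumption: a local session with a made-up application name.
val spark = SparkSession.builder()
  .appName("AffairsCategorical")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Stand-in sample: the first four rows of the show(10) output, three columns only.
val sample = Seq(
  (0.0, "male",   "no"),
  (0.0, "female", "no"),
  (0.0, "female", "yes"),
  (0.0, "male",   "yes")
).toDF("affairs", "gender", "children")

// Row count and mean number of affairs for each gender/children combination.
val byGroup = sample
  .groupBy("gender", "children")
  .agg(count("*").as("n"), round(avg("affairs"), 2).as("mean_affairs"))
  .orderBy("gender", "children")

byGroup.show()
```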

 

Posted by 智能先行者