Spark: Feature Processing with Common Tools in spark.ml.feature
VectorAssembler
Combines several feature columns into a single feature-vector column, which is then used as the input column for training a learner.
import org.apache.spark.ml.feature.VectorAssembler

val df = spark.createDataset(List(
  (1, "a", 3),
  (2, "", 4))).toDF("f1", "f2", "f3")
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("f1", "f3"))
  .setOutputCol("features")
df.show
+---+---+---+
| f1| f2| f3|
+---+---+---+
| 1| a| 3|
| 2| | 4|
+---+---+---+
vectorAssembler.transform(df).show
+---+---+---+---------+
| f1| f2| f3| features|
+---+---+---+---------+
| 1| a| 3|[1.0,3.0]|
| 2| | 4|[2.0,4.0]|
+---+---+---+---------+
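The assembled features column can then be handed straight to a learner. A minimal sketch of that hand-off, assuming a separate toy DataFrame with a binary label column (not the df above), chained through a Pipeline:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
//Hypothetical training data: the same two numeric features plus a binary label column
val train = spark.createDataFrame(Seq(
  (1.0, 3.0, 0.0),
  (2.0, 4.0, 1.0))).toDF("f1", "f3", "label")
//LogisticRegression reads the "features" and "label" columns by default
val lr = new LogisticRegression()
val model = new Pipeline().setStages(Array(vectorAssembler, lr)).fit(train)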
Tokenizer
Converts a text column to all lowercase, splits it on whitespace, and outputs the resulting tokens.
import org.apache.spark.ml.feature.Tokenizer

val df = spark.createDataset(List(
  "Hello world",
  "hello world HELLO")).toDF("text")
val t = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
df.show
+-----------------+
| text|
+-----------------+
|      Hello world|
|hello world HELLO|
+-----------------+
t.transform(df).show(false)
+-----------------+---------------------+
|text |words |
+-----------------+---------------------+
|Hello world |[hello, world] |
|hello world HELLO|[hello, world, hello]|
+-----------------+---------------------+
//Note: the words column is of type Seq[String]
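Tokenizer always lowercases and always splits on whitespace. If a different delimiter is needed, spark.ml.feature also provides RegexTokenizer; a minimal sketch (the pattern below is only an illustration):
import org.apache.spark.ml.feature.RegexTokenizer
val rt = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setPattern("[\\s,]+") //split on runs of whitespace or commas; by default the pattern is treated as the gap between tokens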
HashingTF
Before discussing HashingTF, a note on the Vector type in the spark.ml module: its indices are of type Int and its values are of type Double. It comes in two forms:
//Import the factory object for org.apache.spark.ml.linalg.Vector (it is an object)
//The factory is not named Vector because Scala imports scala.collection.immutable.Vector by default; the separate name keeps it distinct from that class and its companion object
import org.apache.spark.ml.linalg.Vectors
//Dense vector
Vectors.dense(Array(1.2, 2, 3)) // [1.2,2.0,3.0]
Vectors.dense(1.2, 2, 3) // [1.2,2.0,3.0]
//Sparse vector
Vectors.sparse(10, Seq(0->1.0, 3->3.1)) // (10,[0,3],[1.0,3.1])
Vectors.sparse(10, Array(0, 3), Array(1, 3.1)) // (10,[0,3],[1.0,3.1])
//Signature: Vectors.sparse(size, indices, values)
//Meaning: the vector has 10 elements in total; element 0 has value 1.0 and element 3 has value 3.1
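Either way the result is an org.apache.spark.ml.linalg.Vector with Int indices and Double values; a quick sketch of inspecting one:
val v = Vectors.sparse(10, Seq(0->1.0, 3->3.1))
v.size    // 10
v(3)      // 3.1 -- apply takes an Int index and returns a Double
v.toArray // a dense Array[Double] copy: Array(1.0, 0.0, 0.0, 3.1, 0.0, ...)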
Now to the org.apache.spark.ml.feature.HashingTF class itself.
It is a subclass of Transformer, and its transform() method performs the following conversion: it feeds one column of the input DataFrame (the column must hold arrays) through a hash function and produces a sparse vector per row:
- each row of the inputCol column is mapped to one sparse vector;
- the vector's size is set by numFeatures; within each row of inputCol, an element's index in the vector is the hash value computed from that element, and the vector's values are the element frequencies or 1 (depending on the binary parameter).
Its parameters are:
- binary: defaults to false, in which case the values of the resulting sparse vector are the term frequencies; if set to true, the values (at the positions of terms that appear) are 1.
- numFeatures: the number of features to map into, 2^18 (262144) by default.
- inputCol: which column of the input DataFrame to hash.
- outputCol: the name of the output column.
import org.apache.spark.ml.feature.HashingTF
val training = spark.createDataFrame(Seq(
  (0L, Array("spark", "hadoop"), 1.0),
  (1L, Array("spark", "hadoop"), 0.0),
  (2L, Array("hadoop", "hadoop"), 1.0),
  (3L, Array("spark", "hadoop", "spark"), 0.0)))
  .toDF("id", "words", "label")
//Note: an Array in a DataFrame is handled as a Seq, not as an Array
//That is, the words column of training is of type Seq[_], not Array[_]; treating it as an Array raises an error:
//scala.MatchError: [0,WrappedArray(spark, hadoop),1.0] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
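//A hedged sketch of reading the value back out of a Row: ask for Seq[String], not Array[String]
//e.g. training.select("words").head.getAs[Seq[String]]("words") returns a WrappedArray(spark, hadoop)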
training.show(false)
+---+----------------------+-----+
|id |words |label|
+---+----------------------+-----+
|0 |[spark, hadoop] |1.0 |
|1 |[spark, hadoop] |0.0 |
|2 |[hadoop, hadoop] |1.0 |
|3 |[spark, hadoop, spark]|0.0 |
+---+----------------------+-----+
new HashingTF()
  .setInputCol("words")
  .setOutputCol("feature")
  .setNumFeatures(1024)
  .transform(training)
  .show(false)
+---+----------------------+-----+--------------------------+
|id |words |label|feature |
+---+----------------------+-----+--------------------------+
|0 |[spark, hadoop] |1.0 |(1024,[161,493],[1.0,1.0])|
|1 |[spark, hadoop] |0.0 |(1024,[161,493],[1.0,1.0])|
|2 |[hadoop, hadoop] |1.0 |(1024,[493],[2.0]) |
|3 |[spark, hadoop, spark]|0.0 |(1024,[161,493],[2.0,1.0])|
+---+----------------------+-----+--------------------------+
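To see the binary parameter in action, the same transform with setBinary(true) turns every non-zero value into 1.0; a sketch (assuming the same numFeatures, so the hash indices stay as above):
new HashingTF()
  .setInputCol("words")
  .setOutputCol("feature")
  .setNumFeatures(1024)
  .setBinary(true) //values become 1.0 for any term that appears, regardless of its frequency
  .transform(training)
  .show(false)
//e.g. id 3 would then read (1024,[161,493],[1.0,1.0]) instead of (1024,[161,493],[2.0,1.0])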
StringIndexer and IndexToString
StringIndexer indexes a String label column by frequency of occurrence: the most frequent label gets index 0. (If the input column is numeric, it is first cast to String and then indexed.)
IndexToString maps the index values back to their string labels.
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

val df = spark.createDataset(List("High", "High", "Medium", "Low", "Low", "Low"))
  .toDF("level")
val labelIndexer = new StringIndexer()
  .setInputCol("level")
  .setOutputCol("label")
  .fit(df) //fit is needed here because counting label frequencies requires a pass over df
val df_label = labelIndexer.transform(df)
df.show
+------+
| level|
+------+
| High|
| High|
|Medium|
| Low|
| Low|
| Low|
+------+
df_label.show
+------+-----+
| level|label|
+------+-----+
| High| 1.0|
| High| 1.0|
|Medium| 2.0|
| Low| 0.0|
| Low| 0.0|
| Low| 0.0|
+------+-----+
val labels = labelIndexer.labels
// labels:Array[String] = Array(Low, High, Medium)
val labelConverter = new IndexToString()
  .setInputCol("label")
  .setOutputCol("cov_level")
  .setLabels(labels)
labelConverter.transform(df_label).show
+------+-----+---------+
| level|label|cov_level|
+------+-----+---------+
| High| 1.0| High|
| High| 1.0| High|
|Medium| 2.0| Medium|
| Low| 0.0| Low|
| Low| 0.0| Low|
| Low| 0.0| Low|
+------+-----+---------+
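In practice IndexToString is most often applied to a model's prediction column, so that predicted indices are reported as the original string labels. A minimal sketch, assuming predictions is a DataFrame produced by some classifier trained on the label column above:
val predictionConverter = new IndexToString()
  .setInputCol("prediction")       //the Double index column produced by the classifier (hypothetical here)
  .setOutputCol("predicted_level")
  .setLabels(labelIndexer.labels)
//predictionConverter.transform(predictions) would append the string label for each predicted index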