Transforming the prediction target of sklearn

concept

https://scikit-learn.org/stable/modules/preprocessing_targets.html#preprocessing-targets

对于监督性学习，其目标值需要进行转化，才能作为模型的目标，或者更加有效地适应模型。

These are transformers that are not intended to be used on features, only on supervised learning targets.

See also Transforming target in regression if you want to transform the prediction target for learning, but evaluate the model in the original (untransformed) space.

模型自适应

https://scikit-learn.org/stable/tutorial/basic/tutorial.html#model-persistence

有的模型，其目标支持，原始类型（字符串，或者数值类型）。如下所示。

对于这种模型，转换并不是必要的，但是对目标的转换时一种更加通用的做法。

>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> iris = datasets.load_iris()
>>> clf = SVC()
>>> clf.fit(iris.data, iris.target)
SVC()

>>> list(clf.predict(iris.data[:3]))
[0, 0, 0]

>>> clf.fit(iris.data, iris.target_names[iris.target])
SVC()

>>> list(clf.predict(iris.data[:3]))
['setosa', 'setosa', 'setosa']

LabelBinarizer

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html#sklearn.preprocessing.LabelBinarizer

一些回归和二值分类算法，需要使用此工具，将目标转换，进而支持multiclass分类。

Binarize labels in a one-vs-all fashion.

Several regression and binary classification algorithms are available in scikit-learn. A simple way to extend these algorithms to the multi-class classification case is to use the so-called one-vs-all scheme.

At learning time, this simply consists in learning one regressor or binary classifier per class. In doing so, one needs to convert multi-class labels to binary labels (belong or does not belong to the class). LabelBinarizer makes this process easy with the transform method.

At prediction time, one assigns the class for which the corresponding model gave the greatest confidence. LabelBinarizer makes this easy with the inverse_transform method.

code:

from sklearn import preprocessing
import numpy as np

lb = preprocessing.LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])

print(lb.classes_)

print(lb.transform([1, 6]))

transformed_label =  np.array([[1, 0, 0, 0],[0, 0, 0, 1]]) 

print(lb.inverse_transform(transformed_label))

output

[1 2 4 6]
[[1 0 0 0]
 [0 0 0 1]]
[1 6]

MultiLabelBinarizer

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer

将多标记的目标转换为二值型目标。

Transform between iterable of iterables and a multilabel format.

Although a list of sets or tuples is a very intuitive format for multilabel data, it is unwieldy to process. This transformer converts between this intuitive format and the supported multilabel format: a (samples x classes) binary matrix indicating the presence of a class label.

In multilabel learning, the joint set of binary classification tasks is expressed with a label binary indicator array: each sample is one row of a 2d array of shape (n_samples, n_classes) with binary values where the one, i.e. the non zero elements, corresponds to the subset of labels for that sample. An array such as np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]]) represents label 0 in the first sample, labels 1 and 2 in the second sample, and no labels in the third sample.

Producing multilabel data as a list of sets of labels may be more intuitive. The MultiLabelBinarizer transformer can be used to convert between a collection of collections of labels and the indicator format:

>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> y = [[2, 3, 4], [2], [0, 1, 3], [0, 1, 2, 3, 4], [0, 1, 2]]
>>> MultiLabelBinarizer().fit_transform(y)
array([[0, 0, 1, 1, 1],
       [0, 0, 1, 0, 0],
       [1, 1, 0, 1, 0],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 0, 0]])

LabelEncoder

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder

将标签（字符或者数字），转化为紧致型的 multiclass目标。

Encode target labels with value between 0 and n_classes-1.

This transformer should be used to encode target values, i.e. y, and not the input X.

Read more in the User Guide.

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2])
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']