Theoretical comparison between the Gini Index and Information Gain criteria
Knowledge Discovery in Databases (KDD) is an active and important research area with the promise of high payoff in many business and scientific applications. One of the main tasks in KDD is classification, and a particularly efficient method for classification is decision tree induction. The selection of the attribute used at each node of the tree to split the data (the split criterion) is crucial for correctly classifying objects. Different split criteria have been proposed in the literature (Information Gain, Gini Index, etc.), and it is not obvious which of them will produce the best decision tree for a given data set. A large number of empirical tests have been conducted in order to answer this question, with no conclusive results. In this paper we introduce a formal methodology that allows us to compare multiple split criteria. This permits us to present fundamental insights into the decision process. Furthermore, we are able to give a formal description of how to select between split criteria for a given data set. As an illustration we apply the methodology to two widely used split criteria: Gini Index and Information Gain.
Keywords: decision trees, classification, Gini Index, Information Gain, theoretical comparison
-
1. Introduction
Early work in the field of decision tree construction focused mainly on the definition and realization of classification systems. All of these systems use different measures of impurity/entropy/goodness to select the split attribute when constructing the decision tree.
Once a certain number of algorithms had been defined, a lot of research was dedicated to comparing them. This is a relatively difficult task, as the systems evolved from different backgrounds: information theory, discriminant analysis, encoding techniques, etc. These comparisons have been predominantly empirical. Baker and Jain reported experiments comparing eleven feature evaluation criteria and concluded that the feature rankings induced by the various rules are very similar. Ben-Bassat compared several feature evaluation criteria on simulated data for a sequential, multi-class classification problem; the conclusions are that no feature selection rule is consistently superior to the others, and that no specific strategy for alternating different rules is significantly more effective. Mingers compared several attribute selection criteria and concluded that the tree quality does not seem to depend on the specific criterion used. Babic compared ID3 and CART on two clinical diagnosis problems. Miyakawa compared three activity-based measures, both analytically and empirically. Several researchers pointed out that Information Gain is biased towards attributes with a large number of possible values. Mingers compared Information Gain and the $\chi^2$ statistic for growing the tree as well as for deciding when to stop splitting; he concluded that the $\chi^2$ correction reduces Information Gain's bias towards multi-valued attributes. Quinlan suggested Gain Ratio as a remedy for the bias of Information Gain. Mantaras argued that Gain Ratio has its own set of problems, and suggested an information-theoretic distance between partitions for tree construction. White and Liu presented experiments concluding that Information Gain, Gain Ratio, and Mantaras's measure are worse than a $\chi^2$-based statistical measure in terms of their bias towards multiple-valued attributes. Gama tried to predict the error rate of a particular classification algorithm, and indicated that no single method can be considered better than the others.
About twenty different algorithms were evaluated on more than twenty different data sets. Kononenko pointed out that Minimum Description Length based feature evaluation criteria have the least bias towards multi-valued attributes. Elsewhere, twenty-two decision tree and two neural network algorithms were compared in terms of classification accuracy, training time, and number of leaves; the Gini Index, Information Gain, and a new family of split functions were tested on data sets of different sizes; and a measure for the distance between the bias of two evaluation metrics was proposed, together with numerical approximations of it.
However, a thorough understanding of the behavior of the split functions demands an analytical and direct comparison between them, without using any other external measure. Our contribution in this paper is to introduce a formal methodology, which allows us to analytically compare multiple split criteria. This permits us to present fundamental insights into the decision process. Furthermore, we are able to present a formal description of how to select between split criteria for a given dataset. As an illustration we apply the methodology to two widely used split criteria: Gini Index and Information Gain.
-
3. The Gini Index and Information Gain criteria
The objects are classified by decision trees, which sort them down from the root to some leaf node that provides the classification (the class) of each object. A decision tree contains zero or more internal nodes and one or more leaf nodes; each internal node has two or more child nodes. Each internal node contains a split, which specifies a test based on a single attribute, and each branch descending from that node corresponds to one of the possible values of this attribute. Each leaf node carries a class label. A leaf node is said to be pure if all of its training examples belong to the same class.
Thus, an example is classified by starting at the root node, testing the attribute associated with that node, and then moving down the tree branch corresponding to the value of that attribute in the given example. This process is repeated until the example reaches a leaf node.
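The classification procedure above can be sketched in a few lines of Python. The node structure and the attribute names are hypothetical, chosen only to illustrate the root-to-leaf walk; they are not part of the paper's formalism.

```python
# Minimal sketch of classification by a decision tree: walk from the root,
# following the branch that matches the example's attribute value, until a leaf.

class Node:
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute = attribute      # attribute tested at this internal node
        self.children = children or {}  # attribute value -> child Node
        self.label = label              # class label, set only on leaf nodes

def classify(node, example):
    """Descend until a leaf node is reached; return its class label."""
    while node.label is None:
        node = node.children[example[node.attribute]]
    return node.label

# Example: a one-split tree testing the (hypothetical) attribute "outlook".
tree = Node(attribute="outlook", children={
    "sunny": Node(label="play"),
    "rainy": Node(label="stay"),
})
print(classify(tree, {"outlook": "sunny"}))  # -> play
```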
Binary tree classifiers are constructed by repeatedly splitting subsets of the training set $X$ into two descendant subsets, beginning with $X$ itself. To split $X$ into smaller and smaller subsets we have to select the splits in such a way that the descendant subsets are always “purer” than their parents. This motivated the introduction of the “goodness of split” criterion, which is derived from the notion of an impurity function.
An impurity function is a function $\phi$ defined on the set of all $k$-tuples of numbers $(p_1, \ldots, p_k)$ satisfying $p_j \geq 0$, $j = 1, \ldots, k$, and $\sum_{j=1}^{k} p_j = 1$, with the following properties:
(a) $\phi$ achieves its maximum at the point $(1/k, 1/k, \ldots, 1/k)$;
(b) $\phi$ achieves its minimum at the points $(1, 0, \ldots, 0)$, $(0, 1, \ldots, 0)$, $\ldots$, $(0, 0, \ldots, 1)$;
(c) $\phi$ is a symmetric function of $p_1, \ldots, p_k$.
Given an impurity function $\phi$, the impurity measure $i(t)$ of any node $t$ is defined by
$$i(t) = \phi\big(p(1|t), p(2|t), \ldots, p(k|t)\big),$$
where $p(j|t)$ is the proportion of training examples in node $t$ belonging to class $j$.
If a split $s$ in a node $t$ divides all examples into two subsets $t_L$ and $t_R$ of proportions $p_L$ and $p_R$, the decrease of impurity is defined as
$$\Delta i(s, t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R).$$
The goodness of split $s$ in node $t$, $\varphi(s, t)$, is defined as $\varphi(s, t) = \Delta i(s, t)$.
If a test $T$ is used in a node $t$ and this test is based on an attribute having $n$ possible values, the expressions defined before are generalized as follows:
$$\Delta i(T, t) = i(t) - \sum_{i=1}^{n} p_i\, i(t_i),$$
$$\varphi(T, t) = \Delta i(T, t),$$
where $t_1, \ldots, t_n$ are the nodes produced by the $n$ outcomes of $T$, and $p_i$ is the proportion of examples of $t$ sent to $t_i$.
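As a sketch, the generalized decrease of impurity can be computed for any admissible impurity function $\phi$, given the class-probability vectors of a node and its children. The function names and the example probability vectors below are illustrative assumptions, not the paper's notation.

```python
# Sketch of the generalized decrease of impurity:
#   delta_i(T, t) = i(t) - sum_i p_i * i(t_i)
# for an arbitrary impurity function phi over class-probability vectors.

def impurity_decrease(phi, parent_probs, children):
    """children: list of (proportion p_i, class-probability vector of t_i)."""
    return phi(parent_probs) - sum(p * phi(probs) for p, probs in children)

def gini(probs):  # one admissible impurity function (used again below)
    return 1.0 - sum(p * p for p in probs)

# A node split 50/50 between two classes, divided into two pure, equal-sized
# children: the split removes all impurity.
delta = impurity_decrease(gini, [0.5, 0.5],
                          [(0.5, [1.0, 0.0]), (0.5, [0.0, 1.0])])
print(delta)  # -> 0.5
```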
Breiman adopts in his work the Gini diversity index, which has the following form:
$$\mathrm{Gini}(t) = \sum_{j \neq i} p(i|t)\, p(j|t) = 1 - \sum_{j=1}^{k} p^2(j|t). \tag{1}$$
In a node $t$, an assignment rule based on the Gini Index criterion assigns a training example to class $i$ with probability $p(i|t)$. The estimated probability that the item actually belongs to class $j$ is $p(j|t)$. Therefore, the estimated probability of misclassification under this rule is the Gini Index:
$$\sum_{j \neq i} p(i|t)\, p(j|t) = 1 - \sum_{j=1}^{k} p^2(j|t).$$
This function can also be interpreted in terms of variance. In a node $t$ we assign to all examples belonging to class $j$ the value $1$, and to all other examples the value $0$. The sample variance of these values is $p(j|t)\big(1 - p(j|t)\big)$. There are $k$ classes, thus the corresponding variances are summed together:
$$\sum_{j=1}^{k} p(j|t)\big(1 - p(j|t)\big) = 1 - \sum_{j=1}^{k} p^2(j|t).$$
Having a test $T$ with $n$ outcomes, the goodness of the split is expressed using the Gini Index as follows:
$$\mathrm{Gini}_{\mathrm{split}}(T, t) = \mathrm{Gini}(t) - \sum_{i=1}^{n} p_i\, \mathrm{Gini}(t_i). \tag{2}$$
The Gini Index criterion selects a test that maximizes this function.
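Formula (2) and the maximization step can be sketched directly from class counts. The dataset layout (a list of attribute-dictionary/label pairs) and the attribute names are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Sketch of formula (2): compute the Gini goodness of split for an attribute
# and select the attribute that maximizes it.

def gini_index(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_goodness(examples, attr):
    labels = [y for _, y in examples]
    total = gini_index(labels)
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        total -= len(subset) / len(examples) * gini_index(subset)
    return total

data = [({"a": 0, "b": 0}, "+"), ({"a": 0, "b": 1}, "+"),
        ({"a": 1, "b": 0}, "-"), ({"a": 1, "b": 1}, "-")]
best = max(["a", "b"], key=lambda attr: gini_goodness(data, attr))
print(best)  # -> a  (splitting on "a" yields two pure children)
```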
The Information Gain function has its origin in information theory. It is based on the notion of entropy, which characterizes the impurity of an arbitrary set of examples. If we randomly select an example from a set $S$ and announce that it belongs to class $C_j$, then the probability of this message is equal to $p_j$, the proportion of examples of $S$ belonging to $C_j$, and the amount of information it conveys is $-\log_2 p_j$. The expected information provided by a message with respect to class membership can be expressed as
$$\mathrm{Info}(S) = -\sum_{j=1}^{k} p_j \log_2 p_j. \tag{3}$$
The quantity $\mathrm{Info}(S)$ measures the average amount of information needed to identify the class of an example in $S$. This quantity is also known as the entropy of the set $S$ relative to the $k$-wise classification. The logarithm is in base $2$ because the entropy is a measure of the expected encoding length measured in bits. We now consider a similar measurement after $S$ has been partitioned in accordance with the $n$ outcomes of a test $T$. The expected information requirement is the weighted sum over the subsets $S_1, \ldots, S_n$:
$$\mathrm{Info}_T(S) = \sum_{i=1}^{n} \frac{|S_i|}{|S|}\, \mathrm{Info}(S_i).$$
The information gained by partitioning $S$ in accordance with the test $T$ is measured by the quantity $\mathrm{Info}(S) - \mathrm{Info}_T(S)$. We can thus write the Information Gain as
$$\mathrm{Gain}(T, S) = \mathrm{Info}(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|}\, \mathrm{Info}(S_i). \tag{4}$$
The Information Gain criterion selects a test that maximizes the Information Gain function.
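Formulas (3) and (4) admit an analogous sketch. As before, the dataset layout, attribute name, and labels are hypothetical illustrations, not the paper's notation.

```python
import math
from collections import Counter

# Sketch of formulas (3) and (4): entropy of a set of labels, and the
# Information Gain of a test that partitions the set by attribute value.

def info(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr):
    labels = [y for _, y in examples]
    gain = info(labels)
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        gain -= len(subset) / len(examples) * info(subset)
    return gain

data = [({"wind": "weak"}, "yes"), ({"wind": "weak"}, "yes"),
        ({"wind": "strong"}, "no"), ({"wind": "strong"}, "yes")]
print(round(information_gain(data, "wind"), 3))  # -> 0.311
```

Note that, like the Gini goodness of split in formula (2), Information Gain is a decrease-of-impurity measure: only the choice of impurity function (entropy versus Gini Index) differs.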