Gini Index and Entropy
Thank you to everyone who participated in the decision trees meeting on Feb 27th. The speaker (me) is grateful for the members who pointed out two things the speaker did not fully handle at the meeting:
1) improving clarity of entropy concept and the Gini Calculations used in determining the purity of subgroups splits. (please see the wisdom>meetings>decision trees to refresh).
2) facilitating the discussion on what types of attributes can be used to split decision trees. The question was whether we can combine two attributes (eg. expected return and risk) to create just one attribute to split over (eg. Sharpe ratio [Mimetex cannot convert this formula]). The speaker said that it was OK, but a member pointed out that this is no longer called a decision tree. At the time of this post, I am still unsure. I hope that our member will send his contributions to [Math Email] so that we are able to resolve the issue.
In this post, I will be focusing on the only on Entropy and Gini Calculations. I will try to answer these questions: why do we need a measure of "entropy?"; why does this formula for entropy make sense?; what is the relationship between entropy and the Gini index?
Entropy is the measure of "disorder" in a system (for those of you who know physics and information theory, please excuse my lack of total rigor. The actual definition of entropy in information theory differs slightly from the definition in physics). In order to categorize items, you will want to put similiar items together. Consider washing clothes and let's say your father is helping you out. He wants to split clothes over the "dirtiness" output (Dirty - Yes / No). That is, he smells each item he picks up to throw clothes into two piles: clean and dirty. Of course, he will not split them perfectly. After splitting the piles, your father gets tired and leaves the room. You come back to find these two piles and want to test whether your father's rudimentary way of splitting clothes has been successful. Now lets put some numbers to this example.
The mixed starting pile started with 15 pieces of clothing. The father split the clothes into two piles of 10 dirty and 5 clean (b/c you are smelly). After, you looked at the pile of "dirty" clothes and notice a shirt that is definitely clean. You then looked at the clean pile and notice a pair socks that have been dirty so long they stopped smelling. (I emphasize that the father was using only his sense of smell). Keep in mind: We are judging whether the father did a good job splitting the clothes. We started with a pile of 15, 10 dirty and 5 clean. We ended with two piles (each pile would be called a leaf in decision tree talk). In Pile-smells-dirty: 9 actually dirty and 1 actually clean. Pile-smells-clean: 4 actually clean, 1 actually dirty.
Our formula for entropy should tell us how much impurity we have in each pile. Let's suppose that you had only dirty (case 1) clothes in a pile - 15 pieces, 15 dirty, 0 Clean. Would you need to split the pile? .... Obviously the pile is perfectly "pure" (in the decision tree sense, not the smelly sense) -- purely dirty. What about the other way around? 15 pieces, 0 dirty, and 15 clean. Here, too, you would have a perfectly "pure" group. Let's say we had a pile of (case 2) 7 dirty and 8 clean -- this would have the same purity as 8 dirty and 7 clean. Also, this is the maximum impurity this pile could have! The Gini Index is going to capture all of that in one simple measure.
[Mimetex cannot convert this formula]
Case 1: Probability of dirty = 1, probability of clean = 0; [Mimetex cannot convert this formula] (This is great!)
Case 2: Probability of dirty = [Mimetex cannot convert this formula], Probability of clean = [Mimetex cannot convert this formula]; [Mimetex cannot convert this formula]
Thought question: what is the maximum possible Gini index? (note that the higher the Gini value the more entropy / disorder you have). Hint: consider maxmimum disorder of 8 dirty and 8 clean...
FAQ: What's with all the logs in the presentation? In the two class case (also known as binary), the Gini Index will have a 1-to-1 correspondence (and preserve order!) with the way impurity is defined in the slides (the measure that uses log base 2). This point is not critical to consider because it is just a different measure that ranges in impurity from 0 to 1. However, this is a very important known result from Shannon's information theory.
Further information (these will be harder to read than my example but worth it):
