
Gini Impurity A-Z (Decision Tree)

  • Jun 09, 2021
  • 5 Minutes Read
  • By Navoneel Chakrabarty

In Tree-based models, there is a criterion for selecting the best split-feature, based on which the root of, say, a Decision Tree is split into child nodes (sub-samples of the total data in the root, and so on), and hence a decision is made. So, in a Decision Tree, the split-feature is the judge and the child nodes represent the judgements. The basic intuition behind finding the best split of the root, or of any internal node, of a Decision Tree is that each of the child nodes to be created should be as homogeneous as possible. In other words, in each of the child nodes to be created, most of the instances should have target labels belonging to the same class. To achieve this, there are 2 criteria that are most popular among Machine Learning practitioners:

1. Gini Impurity

2. Entropy and Information Gain

In this article, the criterion Gini Impurity and its application in Tree-based Models are discussed.

All you need to know about Gini Impurity

Gini Index

Gini Index is a popular measure of data homogeneity. Data homogeneity refers to how strongly the data is polarized towards a particular class or category. Let us consider an example of exploratory analyzed data of people winning or losing a tournament, given their Age and Gender:

[Figure: four blocks of analyzed data showing win (P) and loss (N) counts, split by Gender and by Age]

So, there are 4 blocks of analyzed data. The labels 'P' and 'N' indicate the number of wins and losses, respectively.

Gini Index (GI) is defined as,

$$GI = \sum_{i=1}^{C} p_i^{2}$$

where $p_i$ is the proportion of instances in the data block belonging to class $i$, and $C$ is the number of classes.

From the definition, it is evident that for a perfectly homogeneous data block, the Gini Index is equal to 1.
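
To make the definition concrete, here is a minimal Python sketch (the function name gini_index and the example counts are ours, for illustration only) that computes the Gini Index of a data block from its per-class counts:

```python
def gini_index(counts):
    """Gini Index of a data block: the sum of squared class proportions."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

# A perfectly homogeneous block (all instances in one class) gives GI = 1.
print(gini_index([10, 0]))  # 1.0
# A maximally mixed two-class block gives the minimum GI of 0.5.
print(gini_index([5, 5]))   # 0.5
```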

Now, in this example, there are 2 features, Gender and Age, and the target label is win/loss, i.e., the outcome of the tournament. GI is calculated for each and every feature, and the feature with the highest value is selected as the best split-feature.

To calculate the Gini Index for Gender, the Gini Indices of the Male (M) and Female (F) categories need to be calculated first; the feature's overall Gini Index is then the average of these, weighted by the number of instances in each category:

[Figure: Gini Index calculation for the Male (M) and Female (F) categories of the Gender feature]

Similarly, to calculate the Gini Index of Age, the Gini Indices of the labels '<50' (age less than 50) and '>=50' (age greater than or equal to 50) need to be calculated:

[Figure: Gini Index calculation for the '<50' and '>=50' categories of the Age feature]

As Gini Index(Gender) is greater than Gini Index(Age), Gender is the best split-feature: it produces more homogeneous child nodes.
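
Since the figures with the original counts are not reproduced here, the sketch below uses made-up win/loss counts purely for illustration; what matters is the procedure: compute the Gini Index per category, then combine the results as an average weighted by category size:

```python
def gini_index(counts):
    """Sum of squared class proportions for one data block."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def feature_gini_index(categories):
    """Per-category Gini Indices, averaged with weights = category sizes."""
    total = sum(sum(counts) for counts in categories.values())
    return sum(
        (sum(counts) / total) * gini_index(counts)
        for counts in categories.values()
    )

# Hypothetical [wins, losses] counts, NOT the article's actual data.
gender = {"M": [8, 2], "F": [1, 9]}
age = {"<50": [6, 4], ">=50": [3, 7]}

print(feature_gini_index(gender))  # 0.75 -- more homogeneous children
print(feature_gini_index(age))     # 0.55
```

With these made-up numbers, Gender scores higher, matching the conclusion above.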

 

Gini Impurity

Now, Gini Impurity is simply the complement of the Gini Index and is defined as,

$$\text{Gini Impurity} = 1 - GI = 1 - \sum_{i=1}^{C} p_i^{2}$$

For a perfectly homogeneous data block, the Gini Impurity is therefore 0.

So, it is a measure of heterogeneity (anti-homogeneity), and hence the feature with the least Gini Impurity is selected as the best split-feature.
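
In code, this is a one-line change to the earlier gini_index sketch (again, an illustrative helper of our own, not a library function):

```python
def gini_impurity(counts):
    """Gini Impurity of a data block: 1 minus its Gini Index."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini_impurity([10, 0]))  # 0.0 -- perfectly homogeneous block
print(gini_impurity([5, 5]))   # 0.5 -- maximally mixed two-class block
```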

Now, following the above example, Gini Impurity can be directly calculated for each and every feature.

To calculate the Gini Impurity for Gender, the Gini Impurities of the Male (M) and Female (F) categories need to be calculated:

[Figure: Gini Impurity calculation for the Male (M) and Female (F) categories of the Gender feature]

Similarly, to calculate the Gini Impurity of Age, the Gini Impurities of the labels '<50' (age less than 50) and '>=50' (age greater than or equal to 50) need to be calculated:

[Figure: Gini Impurity calculation for the '<50' and '>=50' categories of the Age feature]

As Gini Impurity(Gender) is less than Gini Impurity(Age), Gender is again the best split-feature.
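
Putting the pieces together, a node could pick its split-feature by minimizing the weighted Gini Impurity; this sketch reuses the same hypothetical counts as before:

```python
def gini_impurity(counts):
    """Gini Impurity of one data block."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def feature_gini_impurity(categories):
    """Per-category Gini Impurities, averaged with weights = category sizes."""
    total = sum(sum(counts) for counts in categories.values())
    return sum(
        (sum(counts) / total) * gini_impurity(counts)
        for counts in categories.values()
    )

# Same hypothetical [wins, losses] counts as in the Gini Index sketch.
features = {
    "Gender": {"M": [8, 2], "F": [1, 9]},
    "Age": {"<50": [6, 4], ">=50": [3, 7]},
}
best = min(features, key=lambda f: feature_gini_impurity(features[f]))
print(best)  # 'Gender' -- the feature with the least Gini Impurity
```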

Conclusion

In this way, Gini Impurity is used to find the best split-feature for the root or any internal node (i.e., for splitting at any level), not only in Decision Trees but in any Tree-based Model.
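
In practice, you rarely compute this by hand: for example, scikit-learn's DecisionTreeClassifier uses Gini Impurity as its default split criterion (criterion='gini'), shown here on the built-in Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# criterion="gini" is the default; written out here for clarity.
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(clf.score(X, y))  # training accuracy
```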


About The Author
Navoneel Chakrabarty
I'm Navoneel Chakrabarty, a Data Scientist, Machine Learning & AI Enthusiast, and a Regular Python Coder. Apart from that, I am also a Natural Language Processing (NLP), Deep Learning, and Computer Vision Enthusiast.