Ensembling Techniques in Machine Learning
We use various techniques to achieve greater accuracy in machine learning models. One such method is ensemble learning. In this article, we will discuss the basics of ensemble learning. We will discuss the various ensembling techniques and the differences between them.

What is Ensemble Learning?

Ensemble learning is a technique to build machine learning applications using multiple ML models instead of a single model. Here, an ensemble consists of various machine-learning models that participate in deciding the output of the machine-learning application. Each model in the ensemble is calibrated to reduce the bias and variance in the machine-learning applications while predicting the outputs.

When we train a machine learning model, it might run into overfitting or underfitting. In both cases, the predictions of the ML model aren’t very accurate.  When we create an ensemble of multiple models and assign them weights to predict the final output, the combination can reduce the bias and variance leading to better performance of the machine learning application.

Some examples of ensemble learning algorithms include Random forests, AdaBoost, Gradient Boosting, and XGBoost.

What Are The Different Ensembling Techniques in Machine Learning?

Based on the implementation details, we can divide the Ensemble Learning techniques into three categories.

  1. Bagging
  2. Boosting
  3. Stacking

Let us discuss each technique separately.

Bagging in Machine Learning

Bagging is short for Bootstrap Aggregation. It is primarily used in supervised machine learning applications like classification and regression. In bagging, a machine learning model consists of several small models. The entire model is termed the primary model, and the smaller machine learning models are termed base models. All the base models are trained on different samples of the training data. In bagging, each base model works independently and is not affected by other base models.

While predicting output for new data points, the predictions of all the base models are aggregated, and the final output of the primary model is decided by assigning weights to the outputs of the base models.

As the name suggests, the Bagging technique consists of two steps.

  1. Bootstrapping: In bootstrapping, we create random samples from the data with replacements. The data samples are then fed to the machine learning model. Different base models inside the primary machine learning model are then trained on the data samples. Any new data point is processed by all the base models to predict the output.
  2. Aggregation: In aggregation, predictions from all the base models are aggregated. Then, the final output is generated by calculations on the weights of the base models and their outputs.

To understand Bagging in Machine learning, let us take the example of the Random forest algorithm.

In the random forest algorithm, we use multiple decision trees to generate the final output. While training,  different decision trees are trained using samples of the input data.

  • For regression tasks, we can predict the output of each decision tree. Then, we can take the mean, median, or weighted average of the outputs of the decision trees to generate the final output for the random forest regression.
  • Similarly, if we are using the random forest algorithm for classification, we can use a majority vote or a weighted average of the class predictions of the decision trees to classify any new data point fed to the random forest classifier.

Bagging helps us reduce variance for our machine learning model.
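To make this concrete, here is a minimal sketch of bagging with scikit-learn, assuming the library is installed and using a synthetic dataset purely for illustration. BaggingClassifier uses decision trees as its default base models, and RandomForestClassifier adds random feature selection on top of bagging.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data used only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Generic bagging: decision trees (the default base model) trained on bootstrap samples
bagging = BaggingClassifier(n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
print("Bagging accuracy:", bagging.score(X_test, y_test))

# Random forest: bagging of decision trees plus random feature selection at each split
forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_train, y_train)
print("Random forest accuracy:", forest.score(X_test, y_test))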

Boosting in Machine Learning

Boosting is another ensembling technique that we use to train machine learning models for better performance. In boosting, we combine a set of weak learners into strong learners. Here, the base models are dependent on each other for predictions. Boosting optimizes the loss function of weak learners. By iteratively improving the weak learners, boosting helps us reduce the bias in our machine-learning model.

In boosting, we use the following steps.

  • First, a base model is trained on the input data by assigning equal weights to each data point. 
  • Then, the incorrect predictions made by the base model are identified. After identifying the data points for which the predictions are wrong, we assign higher weights to the data points. 
  • Next, the weighted data is fed to the next base model. Again, the predictions are analyzed, and the data points with incorrect predictions are given higher weightage and given as input to this next base model. 

This process of sequentially passing outputs of one base model to another base model creates an ensemble that performs better than all the base models. Hence, the weak learners are combined to create the final machine learning model with better performance. 

What Are The Different Types of Boosting Techniques?

Although all the machine learning algorithms using boosting combine weak learners to create a strong learner for building high-performance classification and regression models, they can differ in how they create and aggregate weak learners during the sequential learning process. Based on the differences, we use the following boosting techniques.

  1. Adaptive Boosting (AdaBoost)
  2. Gradient Boosting
  3. Extreme Gradient Boosting (XGBoost)

Adaptive Boosting (AdaBoost)

AdaBoost is primarily used for training classification models. In adaptive boosting, each weak learner considers a single feature and creates a single-split decision tree called a decision stump. While creating the decision stump, each observation in the input data is weighted equally. Once we create the decision stumps in the first iteration, we analyze the prediction results. If we get any observations with incorrect outputs, we assign them higher weights. Then, new decision stumps are created, treating the observations with higher weights as more significant.

The above process is executed iteratively, identifying incorrect predictions and adjusting the weights of the data points, until the model predicts the training data points correctly to the maximum possible extent.
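As a rough sketch, we can try the same idea with scikit-learn’s AdaBoostClassifier, whose default weak learners are single-split decision stumps. The dataset below is synthetic and only illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new stump focuses on the samples the previous stumps misclassified
model = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=0)
model.fit(X_train, y_train)
print("AdaBoost test accuracy:", model.score(X_test, y_test))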

Gradient Boosting

The gradient boosting algorithm is also based on sequential learning and can be used for classification and regression tasks. However, it differs from adaptive boosting. In the gradient boosting algorithm, the focus is on minimizing the error of the base model instead of assigning weights to input data points with incorrect predictions. For this, we use the following steps; a short code sketch follows the list.

  1. First, a weak learner is trained on a data sample and the predictions are obtained.
  2. Then, we add a new weak learner sequentially after the previous base model. The new base model tries to optimize the loss function. In gradient boosting, we don’t add weights to the incorrectly predicted data points. Instead, we try to minimize the loss function for the weak learners that we are using.
  3. After each iteration, we add a new base model with an optimized loss function until we get satisfactory results.
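Here is the promised sketch using scikit-learn’s GradientBoostingRegressor on synthetic regression data; the parameter values are illustrative defaults, not tuned settings.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new tree is fitted to reduce the remaining loss of the current ensemble
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0)
model.fit(X_train, y_train)
print("R^2 on test data:", model.score(X_test, y_test))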

Extreme Gradient Boosting (XGBoost)

As the name suggests, XGBoost is an advanced version of the gradient boosting algorithm. The XGBoost algorithm was designed to increase the speed and accuracy of the gradient boosting algorithm. It uses parallel processing, cross-validation, regularization, and cache optimization to increase the computational speed and model efficiency. 
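As a rough sketch, and assuming the separate xgboost package is installed, its scikit-learn-style wrapper can be used as shown below; the parameter values are only illustrative.

from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# XGBoost builds boosted trees with regularization and efficient parallel processing
model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print("XGBoost test accuracy:", model.score(X_test, y_test))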

Other ensembling algorithms that use various forms of boosting techniques include LightGBM (Light Gradient Boosting Machine), CatBoost, etc.

Disadvantages of Boosting

Apart from its advantages, using boosting techniques to train machine learning algorithms also poses some challenges as discussed below.

  • Boosting sometimes can lead to overfitting.
  • We sequentially train the weak learners in all the boosting ensembling techniques. Since each learner is built on its predecessor, boosting can be computationally expensive and is hard to scale up. However, algorithms like XGBoost tackle the issue of scalability to a great extent.
  • Boosting algorithms are sensitive to outliers. As each model attempts to predict the target values correctly for the data points in the training set, outliers can skew the loss functions of the base learners significantly.

Stacking in Machine Learning

Stacking uses different levels of machine learning models to create classification or regression models. In stacking, we first train multiple weak learners in a parallel manner. The predictions of the weak learners are then fed to another machine-learning model for training and predictions.
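A minimal stacking sketch with scikit-learn’s StackingClassifier is shown below, using synthetic data for illustration. A random forest and an SVM act as first-level learners, and a logistic regression model is trained on their predictions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base learners are trained in parallel; the final estimator learns from their outputs
base_learners = [("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                 ("svm", SVC(probability=True, random_state=0))]
model = StackingClassifier(estimators=base_learners, final_estimator=LogisticRegression())
model.fit(X_train, y_train)
print("Stacking test accuracy:", model.score(X_test, y_test))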

Bagging vs Boosting: Differences Between The Ensembling Techniques

Out of the three ensembling techniques, bagging and boosting receive the most attention due to their wider adoption for creating classification and regression models. Hence, it is important to discuss the similarities and differences between the bagging and boosting ensembling techniques. The following table summarizes the differences between bagging and boosting.

| Bagging | Boosting |
| --- | --- |
| Bagging focuses on minimizing the variance. | Boosting focuses on minimizing the bias. |
| The base models in bagging are independent of each other. | In boosting, the base models are dependent on each other as they are trained sequentially. |
| In bagging, the base models usually contribute equally (or with simple weights) to the aggregated output. | In boosting, the base models are weighted according to their performance. |
| The base models work in parallel in bagging. | The base models work sequentially in boosting. |
| In bagging, the data points in each training sample are selected using row sampling with replacement. | In boosting, each new sample gives more weight to the data points misclassified by the previous models. |
| Bagging trains faster. | Boosting is slower to train compared to bagging. |

Bagging vs Boosting in Machine Learning

Conclusion

In this article, we discussed various ensembling techniques in machine learning that you can use to build classification and regression models. To learn more about machine learning, you can read this article on overfitting and underfitting in machine learning. You might also like this article on Naive Bayes classification numerical example.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!

Naive Bayes Classification Numerical Example
We use different classification algorithms to build classifiers in machine learning. The naive Bayes classification algorithm is one of the easiest classification algorithms to understand and implement. In this article, we will discuss the Bayes algorithm and the intuition of Naive Bayes classification. We will also discuss a numerical example of Naive Bayes classification to understand it in a better manner.

The Bayes’ Theorem

Before discussing the Naive Bayes classification algorithm, we need to understand the Bayes theorem. We can state the formula for the Bayes theorem as shown below.

P(A/B) = P(B/A)*P(A)/P(B)

Here,

  • A is called the hypothesis.
  • B is the evidence.
  • P(A) is termed as prior probability. It is the probability of occurrence of the hypothesis.
  • P(B) is termed marginal probability. It is the probability of occurrence of the evidence.
  • P(B/A) is called likelihood. It is the probability of occurrence of B given that A has already occurred.
  • P(A/B) is called posterior probability. It is the probability of occurrence of A given that B has already occurred.

From Where Do We Get the Bayes Theorem?

The Bayes theorem is directly derived from the formulas of conditional probability. For instance, you might have studied the conditional probability formula given below.

P(A/B)=P(A∩B)/P(B)

Here, 

  • P(B) is the probability of occurrence of event B.
  • P(A∩B) is the probability of occurrence of events A and B together.
  • P(A/B) is the probability of occurrence of event A given that B has already occurred.

In a similar manner, we can write the conditional probability of B given A as shown below.

P(B/A)=P(A∩B)/P(A)

Here, 

  • P(A) is the probability of occurrence of event A.
  • P(A∩ B) is the probability of occurrence of events A and B together.
  • P(B/A) is the probability of occurrence of event B given that A has already occurred.

Now, if we extract the probability P(A∩B) from both formulas, we get the following.

P(A∩B)=P(B/A)*P(A)
P(A∩B)=P(A/B)*P(B)

 When we equate both formulas, we get the following equation.

P(B/A)*P(A)=P(A/B)*P(B)

From the above equation, we can get the posterior probability P(A/B) as shown below.

P(A/B)=P(B/A)*P(A)/P(B)

Similarly, we can get the posterior probability P(B/A) as shown below.

P(B/A)=P(A/B)*P(B)/P(A)

The above two formulas represent the Bayes theorem in alternate forms. 

Bayes Theorem Numerical Example

To understand the Bayes theorem, consider the following problem. 

You are given a deck of cards. You have to find the probability of a card being king if you know that it is a face card. 

We will approach this problem as follows.

  • Let A be the event of a given card being a face card.
  • Let B be the event of a card being a King.
  • Now, if we need to find the probability of a card being king if you know that it is a face card, we need to find the probability P(B/A). 

Using Bayes theorem,

P(B/A)=P(A/B)*P(B)/P(A)

To find P(B/A), we need to find the following probabilities.

  • P(A) i.e. the probability of a card being a face card. As there are 12 face cards out of 52, P(A)=12/52.
  •  P(B) i.e. the probability of a card being a King. As there are 4 Kings, P(B)=4/52.
  • P(A/B) i.e. the probability of a King being a face card. As all the kings are face cards, P(A/B)=1.

Now, using Bayes theorem, we can easily find the probability of a card being a King if it is a face card.

P(B/A)=P(A/B)*P(B)/P(A)
      =1*(4/52)/(12/52)
      =4/12
      =1/3

Hence, the probability of a card being a King, if it is a face card, is 1/3. I hope that you have understood the Bayes theorem at this point. Now, let us discuss the Naive Bayes classification algorithm. 
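We can also check the same calculation with a few lines of plain Python.

# Bayes theorem applied to the card example above
p_face = 12 / 52          # P(A): probability of a face card
p_king = 4 / 52           # P(B): probability of a King
p_face_given_king = 1.0   # P(A/B): every King is a face card

p_king_given_face = p_face_given_king * p_king / p_face
print(p_king_given_face)  # 0.3333... = 1/3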

What is The Naive Bayes Classification Algorithm?

The naive Bayes classification algorithm is a supervised machine learning algorithm based on the Bayes theorem. It is one of the simplest and most effective classification algorithms that help us build efficient classifiers with minimum training and computation costs. 

In the Naive Bayes algorithm, we assume that the features in the input dataset are independent of each other. In other words, each feature in the input dataset independently decides the target variable or class label and is not affected by other features. While this assumption doesn’t hold for most real-world classification problems, Naive Bayes classification is still one of the go-to algorithms for classification due to its simplicity.

Naive Bayes Classification Numerical example

To implement a Naive Bayes classifier, we perform three steps. 

  1. First, we calculate the probability of each class label in the training dataset.
  2. Next, we calculate the conditional probability of each attribute of the training data for each class label given in the training data.
  3. Finally, we use the Bayes theorem and the calculated probabilities to predict class labels for new data points. For this, we will calculate the probability of the new data point belonging to each class. The class with which we get the maximum probability is assigned to the new data point.

To understand the above steps using a naive Bayes classification numerical example, we will use the following dataset.

| Sl. No. | Color | Legs | Height | Smelly | Species |
| --- | --- | --- | --- | --- | --- |
| 1 | White | 3 | Short | Yes | M |
| 2 | Green | 2 | Tall | No | M |
| 3 | Green | 3 | Short | Yes | M |
| 4 | White | 3 | Short | Yes | M |
| 5 | Green | 2 | Short | No | H |
| 6 | White | 2 | Tall | No | H |
| 7 | White | 2 | Tall | No | H |
| 8 | White | 2 | Short | Yes | H |

Dataset For Naive Bayes Classification

Using the above data, we have to identify the species of an entity with the following attributes.

X={Color=Green, Legs=2, Height=Tall, Smelly=No}

To predict the class label for the above attribute set, we will first calculate the probability of the species being M or H in total.

P(Species=M)=4/8=0.5
P(Species=H)=4/8=0.5

Next, we will calculate the conditional probability of each attribute value for each class label.

P(Color=White/Species=M)=2/4=0.5
P(Color=White/Species=H)=3/4=0.75
P(Color=Green/Species=M)=2/4=0.5
P(Color=Green/Species=H)=1/4=0.25
P(Legs=2/Species=M)=1/4=0.25
P(Legs=2/Species=H)=4/4=1
P(Legs=3/Species=M)=3/4=0.75
P(Legs=3/Species=H)=0/4=0
P(Height=Tall/Species=M)=1/4=0.25
P(Height=Tall/Species=H)=2/4=0.5
P(Height=Short/Species=M)=3/4=0.75
P(Height=Short/Species=H)=2/4=0.5
P(Smelly=Yes/Species=M)=3/4=0.75
P(Smelly=Yes/Species=H)=1/4=0.25
P(Smelly=No/Species=M)=1/4=0.25
P(Smelly=No/Species=H)=3/4=0.75

We can tabulate the above calculations in the tables for better visualization. 

The conditional probability table for the Color attribute is as follows.

| Color | M | H |
| --- | --- | --- |
| White | 0.5 | 0.75 |
| Green | 0.5 | 0.25 |

Conditional Probabilities for Color Attribute

The conditional probability table for the Legs attribute is as follows.

| Legs | M | H |
| --- | --- | --- |
| 2 | 0.25 | 1 |
| 3 | 0.75 | 0 |

Conditional Probabilities for Legs Attribute


The conditional probability table for the Height attribute is as follows.

| Height | M | H |
| --- | --- | --- |
| Tall | 0.25 | 0.5 |
| Short | 0.75 | 0.5 |

Conditional Probabilities for Height Attribute

The conditional probability table for the Smelly attribute is as follows.

| Smelly | M | H |
| --- | --- | --- |
| Yes | 0.75 | 0.25 |
| No | 0.25 | 0.75 |

Conditional Probabilities for Smelly Attribute

Now that we have calculated the conditional probabilities, we will use them to calculate the probability of the new attribute set belonging to a single class.

Let us consider X= {Color=Green, Legs=2, Height=Tall, Smelly=No}.

Then, the probability of X belonging to Species M will be as follows.

P(M/X)=P(Species=M)*P(Color=Green/Species=M)*P(Legs=2/Species=M)*P(Height=Tall/Species=M)*P(Smelly=No/Species=M)
      =0.5*0.5*0.25*0.25*0.25
      =0.0039

Similarly, the probability of X belonging to Species H will be calculated as follows.

P(H/X)=P(Species=H)*P(Color=Green/Species=H)*P(Legs=2/Species=H)*P(Height=Tall/Species=H)*P(Smelly=No/Species=H)
      =0.5*0.25*1*0.5*0.75
      =0.0468

So, the probability of X belonging to Species M is 0.0039 and that of X belonging to Species H is 0.0468. Hence, we will assign the entity X with attributes {Color=Green, Legs=2, Height=Tall, Smelly=No} to species H.

In this way, we can predict the class label for any number of new data points.
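The hand calculation above can also be reproduced with a short pandas sketch. The dataframe below simply re-enters the dataset from the table, and the loop multiplies the prior probability of each species by the conditional probability of each attribute value.

import pandas as pd

data = pd.DataFrame({
    "Color":   ["White", "Green", "Green", "White", "Green", "White", "White", "White"],
    "Legs":    [3, 2, 3, 3, 2, 2, 2, 2],
    "Height":  ["Short", "Tall", "Short", "Short", "Short", "Tall", "Tall", "Short"],
    "Smelly":  ["Yes", "No", "Yes", "Yes", "No", "No", "No", "Yes"],
    "Species": ["M", "M", "M", "M", "H", "H", "H", "H"],
})
new_point = {"Color": "Green", "Legs": 2, "Height": "Tall", "Smelly": "No"}

scores = {}
for species, group in data.groupby("Species"):
    score = len(group) / len(data)                 # prior P(Species)
    for feature, value in new_point.items():
        score *= (group[feature] == value).mean()  # conditional P(feature=value/Species)
    scores[species] = score

print(scores)  # Species H gets the higher score, so X is assigned to species H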

What are the Different Types of Naive Bayes Models?

Based on the use cases and the features of the input data, naive Bayes classifiers can be classified into the following types. A brief usage sketch follows the list.

  • Gaussian Classifiers: The Gaussian Naive Bayes classifier assumes that the attributes of a dataset have a normal distribution. Here, if the attributes have continuous values, the classification model assumes that the values are sampled from a Gaussian distribution.
  • Multinomial Naive Bayes Classifier: When the input data is multinomially distributed, we use the multinomial naive Bayes classifier.  This algorithm is primarily used for document classification problems like sentiment analysis.
  • Bernoulli Classifiers: The Bernoulli Naive Bayes classification works in a similar manner to the multinomial classification. The difference is that the attributes of the dataset contain boolean values representing the presence or absence of a particular attribute in a data point.
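Here is the brief usage sketch referred to above, showing the three scikit-learn Naive Bayes variants on tiny toy datasets shaped to suit each model; the numbers are illustrative only.

import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous features -> Gaussian Naive Bayes
X_continuous = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [6.7, 3.1]])
print(GaussianNB().fit(X_continuous, y).predict([[6.0, 3.0]]))

# Count features (e.g. word counts) -> Multinomial Naive Bayes
X_counts = np.array([[2, 0, 1], [3, 1, 0], [0, 2, 3], [1, 3, 2]])
print(MultinomialNB().fit(X_counts, y).predict([[0, 2, 2]]))

# Boolean presence/absence features -> Bernoulli Naive Bayes
X_boolean = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 1, 1]])
print(BernoulliNB().fit(X_boolean, y).predict([[0, 1, 1]]))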

Advantages and Disadvantages of Naive Bayes Classification Algorithm

Due to its simple implementation, the naive Bayes classifier has the following advantages.

  • The naive Bayes classification algorithm is one of the fastest and easiest machine learning algorithms for classification. 
  • We can use the Naive Bayes classification algorithm for building binary as well as multi-class classification models.
  • The Naive Bayes algorithm performs better than many classification algorithms while implementing multi-class classification models. 

Apart from its advantages, the naive Bayes classification algorithm also has some drawbacks. The algorithm assumes that the attributes of the training dataset are independent of each other. This assumption is not always true. Hence, when there is a correlation between two attributes in a given training set, the naive Bayes algorithm will not perform well.

Suggested Reading: Bias and Variance in Machine Learning 

Applications of The Naive Bayes Classification Algorithm

The Naive Bayes classification algorithm is used in many real-world applications.

  • The most popular use of the Naive Bayes classification algorithm is in text classification. We often build spam filtering and sentiment analysis models using the naive Bayes algorithm.
  • We can use the Naive Bayes classification algorithm to build applications to predict the credit score and loan worthiness of customers in a bank.
  • The Naive Bayes classifier is an eager learner. Hence, we can use it for real-time predictions too.
  • We can also use the Naive Bayes classification algorithm to implement models for detecting diseases based on the medical results of the patients.

Conclusion

In this article, we discussed the Bayes theorem and the Naive Bayes classification algorithm with a numerical example. To learn more about machine learning algorithms, you can read this article on KNN classification numerical example. You might also like this article on overfitting and underfitting in machine learning

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!

Overfitting and Underfitting in Machine Learning
Overfitting and Underfitting are two of the common issues that we face while training machine learning models. In this article, we will discuss overfitting and underfitting in machine learning. We will also discuss how to avoid overfitting and underfitting using various techniques. 

Before we start with a discussion on overfitting and underfitting, I suggest you read this article on bias and variance in machine learning.

What is Overfitting in Machine Learning?

Overfitting is a phenomenon in which a machine-learning model accurately predicts the target value for all the data points in the training data but fails to give reliable output for unseen data points. 

To understand overfitting, suppose that we have the following data points.

| x | y |
| --- | --- |
| 30 | 110 |
| 40 | 105 |
| 50 | 120 |
| 60 | 110 |
| 70 | 221 |

Dataset for explaining overfitting and underfitting

In this dataset, we have five values of x with five associated values of y. We need to train a machine learning model to predict the value of y for any given x. If we try to train a machine learning model for this dataset, the regression line for a trained but overfitted machine learning model will look as follows.

Regression Line after Overfitting of ML Model

In the above image, you can observe that the regression line depicting the machine learning model passes through all the input data points. Due to this, the model will have very low bias and very high variance. It will not be able to predict accurate values of y for unseen x values.

You might say that the regression model that generates the above regression line will be efficient as it accurately predicts y for all the x values in training data. However, this isn’t our goal. Our goal for training a machine learning model is to create a generalized regression model that can predict the y values for training and new x values to a certain degree of accuracy. If we don’t get a generalized model, this purpose isn’t served.

From Overfitting to Underfitting in Machine Learning

For the dataset given in the previous example, if we start decreasing the number of parameters i.e. the degree of the regression function, we will start getting generalized models. For instance, the regression line for a regression function with degree value 3 will look like this.

Regression line with degree 3

In the above image, you can observe that the regression line is capturing the trends in the data by passing as close as possible to the training data points. Due to this, the regression model can generalize well and it can also accurately predict y values for unseen x. Hence, we can say that we have removed overfitting from the model.

Now, if we decrease the degree of the regression function to 2 in the model, the regression line will look as follows.

In the above image, we have set the degree of the regression function to 2. Due to this, the number of model parameters decreases further. In the image, you can observe that the regression line doesn’t pass through any of the input data points. However, it still captures the general trend in the data. Thus, it can possibly predict y values for unseen x values with a certain accuracy.

Finally, if we decrease the degree of the regression function to 1, i.e. if we map the data points with a linear regression function, the regression line will look as follows.

In the above image, you can observe that the regression line doesn’t even capture the trends of the input data points correctly. Due to this, the regression model will not perform well for training as well as unseen data points. This phenomenon is called underfitting. Thus, if we keep decreasing the parameters for the machine learning model while training, we move from overfitting to underfitting.
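The progression described above can be reproduced with a short numpy sketch that fits polynomials of increasing degree to the five sample points and reports the training error; the code only looks at the training data and is purely illustrative.

import numpy as np

x = np.array([30, 40, 50, 60, 70])
y = np.array([110, 105, 120, 110, 221])

for degree in [1, 2, 3]:
    coefficients = np.polyfit(x, y, degree)    # fit a polynomial of the given degree
    predictions = np.polyval(coefficients, x)  # predictions on the training points
    training_mse = np.mean((y - predictions) ** 2)
    print(f"degree={degree}, training MSE={training_mse:.2f}")
# The training error keeps shrinking as the degree grows, and a degree-4 polynomial
# would pass through all five points exactly, but such a curve generalizes poorly.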

When underfitting occurs, the machine learning model has high bias and low variance. It will not be able to learn the trends in the training data. Consequently, it will reduce the accuracy of the predictions. 

How to Avoid Overfitting and Underfitting?

Both overfitting and underfitting are bad for the machine learning models. Hence, we need to reduce overfitting and underfitting.

To reduce overfitting in a model, you can use the following techniques.

  • Train with a large dataset: When we train machine learning models on smaller datasets, it is possible that the model will not generalize and will lead to overfitting. Hence, using a sufficient amount of training data can also help us avoid overfitting.
  • Reduce the number of parameters: In the example images, you might have observed that when we decrease the degree of the regression function, the regression line becomes more generalized. This is because the number of parameters in the model decreases. Hence, if you decrease the number of parameters in the machine learning model, it will generalize more and you can avoid overfitting.
  • Use ensemble learning methods: In ensemble learning, we use multiple machine learning models instead of training a single model. After training multiple models, we assign them weights to generate the final output. For ensemble learning, we use techniques like bagging, boosting, and random forests. 
  • Use regularization: Regularization is used to calibrate machine learning models in order to minimize the adjusted loss function and prevent overfitting or underfitting. We use regularization techniques like Lasso and Ridge regularization for this purpose.
  • Use cross-validation: In cross-validation, we use different samples of data for training and testing in different iterations. Here, we first divide the dataset into multiple samples. Then, we take one of the samples of the data as a validation set and train the machine learning model with the rest of the data. After training, we evaluate the performance of the model using the validation data set. We repeat this process multiple times, each time using a different sample as the validation set. Finally, the results from each validation step are averaged to produce a more robust estimate of the model’s performance. This helps us avoid model overfitting and provides a realistic estimate of the model’s generalization.

To avoid underfitting in the machine learning model, you can increase the number of features in the training data or the number of parameters in the model. In the examples shown above, you might have observed that when we move from linear functions to functions with higher degrees, the underfitting decreases.

In real-world applications, we need to select a model between the overfitted and underfitted extremes. For instance, if we have to train a regression model with the data given in this article, the linear model causes underfitting, whereas if we increase the degree of the regression function to 5, we get an overfitted model. However, for regression functions with degrees 2 and 3, we get fairly generalized regression lines. Hence, we can choose between these models by evaluating them on held-out data, as sketched below.
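Here is the cross-validation sketch referred to above. Since five data points are too few to split, it assumes a larger synthetic dataset generated only for illustration and compares polynomial degrees using scikit-learn pipelines.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a roughly quadratic trend plus noise
rng = np.random.default_rng(0)
X = rng.uniform(30, 70, size=(60, 1))
y = 0.05 * (X[:, 0] - 50) ** 2 + 100 + rng.normal(0, 5, size=60)

for degree in [1, 2, 3, 5]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"degree={degree}, mean cross-validated R^2={scores.mean():.3f}")
# The degree with the best cross-validated score is the one that generalizes best.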

Conclusion

In this article, we discussed overfitting and underfitting in machine learning. To learn more about machine learning topics, you can read this article on KNN regression. You might also like this article on categorical data encoding techniques.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!

Bias and Variance in Machine Learning
When we train machine learning models, we might not always get very good results. We often run into situations where overfitting or underfitting occurs due to which the performance of the machine learning model deteriorates. This happens due to bias and variance errors. In this article, we will discuss these errors in machine learning and how to identify them.

What is Bias in Machine Learning?

When we train a machine learning model on a dataset, it captures the patterns in the data and uses them for predicting results for new data points or test data. The predictions by the machine learning model might not be equal to the expected or actual target values for the test data. The difference between the actual and predicted target values is an error and is termed an error due to bias.

We can define bias as “the inability of a machine learning model to capture the true relationship between the data points and the target values.”

Each ML algorithm works on some assumptions. Hence, each machine learning model has an inherent bias due to these assumptions. For example, a linear regression model assumes that the relationship between the data points and the target values is linear. Similarly, the K-Means clustering algorithm assumes that the clusters in the input data are elliptical/circular in shape. On the other hand, the Naive Bayes classification algorithm assumes that the features in the input data are independent of each other.

These assumptions aren’t always valid for real-world data. Due to this, we encounter the following biases.

  • Low Bias: A low-bias model makes fewer assumptions about the data or the target function on which we train the model. Examples of low-bias models include decision trees and the K-Nearest Neighbors algorithm.
  • High Bias: High bias comes in a machine learning model if it makes many assumptions and doesn’t actually capture the features of the training data. Due to this, the predictions of a high-bias model are often inaccurate. Examples of machine learning algorithms with high bias are linear regression, Naive Bayes algorithm, logistic regression, etc. These models are easy to train due to bias but they perform well on only those datasets that conform to the assumptions.

What is Variance in Machine Learning?

As the name suggests, variance measures the amount of variation in the predictions of a machine-learning model. It specifies the extent to which a model can adjust depending on the given data set. Mathematically, variance is the measure of how much a predicted variable differs from its expected value. Again, variance can be of two types i.e. low variance and high variance.

  • Low variance: Low variance for a machine learning algorithm specifies that there is a very small change in predictions when we change the input dataset. Ideally, a machine learning model should not vary the results much when the input data is changed. In such a situation, it is considered that the model has a good understanding of the relationships between different attributes in the dataset.
  • High variance: A model with high variance shows a large variation in the predictions when we change the input data. When we train a machine learning algorithm that has a high variance, the model learns a lot and performs well with the training data. However, when we pass an unseen dataset as input, it shows large variations in the output predictions. This isn’t a desired situation.

How to Identify Bias and Variance Errors?

To identify the bias and variance errors, you can use the following tricks.

  • If a machine learning model performs well on training data but has a high error rate for test data or any unseen data, it will have high variance.
  • If a model shows large errors in seen as well as unseen data, it will have a high bias.

Thus, small training and large test errors denote high variance. Large training and large test errors show high bias for a given machine learning algorithm.
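As a rough sketch of this diagnostic, the snippet below trains an unconstrained decision tree (which tends to overfit) on synthetic data and compares the training and test errors.

from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=0)  # an unconstrained tree memorizes the training data
model.fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))
print("training MSE:", train_error, "test MSE:", test_error)
# A near-zero training error with a much larger test error points to high variance;
# large errors on both sets would instead point to high bias.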

Bias-Variance Tradeoff: What is The Optimal Solution?

Based on the situation, we can have one of the following situations for bias and variance errors in a machine learning model.

  • Low variance- low bias: If a machine learning model has low variance and low bias, it will perform best and is an ideal situation for us. However, it is not possible to achieve low variance and low bias for a model in practical situations as real-world data doesn’t conform to any theoretical assumption. Hence, no algorithm is perfect for a given dataset.
  • Low variance- high bias: When a model has low variance and high bias, the predictions are consistent. However, they are inaccurate. In this situation, the machine learning algorithm is trained with very few parameters and the model doesn’t learn well from the training data.
  • High variance- low bias: When a machine learning model has high variance and low bias, the predictions will be inconsistent. However, they are accurate. In this situation, the machine learning model leads to overfitting and doesn’t generalize well. This is due to the reason that the machine learning algorithm is trained on a large number of parameters.
  • High variance- high bias: In case of high variance and high bias, the predictions of a machine learning model are inconsistent as well as inaccurate. This is not a desired situation at all.

When we train a machine learning model, we need to make adjustments between bias and variance to achieve optimum performance. Bias and variance are inversely correlated. Due to this, when we increase bias, variance will decrease. Similarly, when we decrease the bias, variance increases. 

For a model to perform best, bias and variance both should be low. However, we cannot achieve this situation. Hence, we try to achieve the parameter values of a given algorithm for which bias and variance are optimal. In this case, our machine learning model will generalize well. At the same time, it will also have an acceptable bias.

Conclusion

In this article, we discussed bias and variance in machine learning. To learn more about machine learning algorithms, you can read this article on entity embedding in Python. You might also like this article on label encoding

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!

Entity Embedding in Python
We often use categorical data encoding techniques such as label encoding and one hot encoding during data preprocessing. While these techniques offer an easy solution to convert categorical data to a numeric format, the representations are often inaccurate. In this article, we will discuss how to perform entity embedding to convert categorical data into a numeric format while preserving the characteristics of the original data. We will also implement entity embedding in Python using the TensorFlow and Keras modules.

What is Entity Embedding?

Entity embedding is a technique in which we use neural networks to convert categorical data to a numerical format. In entity embedding, we represent categorical values in a tabular dataset using continuous numeric values in multiple dimensions.

For example, consider that we have the following data. 

| Name | City |
| --- | --- |
| John Smith | New York |
| Aditya Raj | Mumbai |
| Will Smith | London |
| Harsh Aryan | London |
| Joel Harrison | Mumbai |
| Bill Warner | Paris |
| Chris Kite | New York |
| Sam Altman | London |
| Joe | London |

Data For Entity Embedding

If we convert the City column into a numerical format using entity embedding, we will get an output as follows.

| Name | City | City_1 | City_2 |
| --- | --- | --- | --- |
| John Smith | New York | 0.721373 | 0.392310 |
| Aditya Raj | Mumbai | -1.045558 | -0.285206 |
| Will Smith | London | -1.100385 | 0.259384 |
| Harsh Aryan | London | -1.100385 | 0.259384 |
| Joel Harrison | Mumbai | -1.045558 | -0.285206 |
| Bill Warner | Paris | -0.260839 | 1.056758 |
| Chris Kite | New York | 0.721373 | 0.392310 |
| Sam Altman | London | -1.100385 | 0.259384 |
| Joe | London | -1.100385 | 0.259384 |

Data Encoded Using Entity Embedding

In the above table, you can observe that we have created two new columns City_1 and City_2. These columns represent the categorical values in the City column. But, how did we get these values? Let’s see.

What Embeddings Really Are?

Embeddings are continuous vector representations assigned to categorical variables. When we train a neural network using categorical values, the embedding vectors are created during the training process of a neural network. These vectors capture the underlying relationships and similarities between different categorical values. 

By representing categorical variables as continuous embedding vectors, we can effectively capture complex relationships and similarities between the values in a column. After creating the embeddings, we can use them as input to machine learning models to perform tasks like classification, regression, or recommendation.

How Many Dimensions Should We Create For a Column During Entity Embedding?

The vectors created using entity embedding are typically low-dimensional and have dense representations. This is in contrast to the high-dimensional and sparse representations used in traditional methods like one-hot encoding. In entity embedding, each categorical value is mapped to a fixed-size vector, where each element of the vector represents a feature or attribute of the category. Here, you need to keep in mind that higher-dimensional embeddings can more accurately represent the relationships between the values in a column.

However, increasing the dimensions in the embedding vectors increases the chance of overfitting. It also leads to slower training of the model. Hence, we use an empirical rule-of-thumb to define the number of dimensions in the embedding vector to be equal to ∜(Unique values in a column).
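A quick illustration of this rule of thumb is shown below; it is only a starting point, and this article itself uses 2 dimensions for a column with 4 unique values.

# Fourth-root rule of thumb for the embedding size
for n_unique in [4, 100, 10000]:
    embedding_dim = round(n_unique ** 0.25)
    print(f"{n_unique} unique values -> about {embedding_dim} embedding dimension(s)")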

Why Should We Use Entity Embedding to Convert Categorical Data into Numerical Format?

We already have simpler techniques like label encoding and one hot encoding to convert categorical data into numerical format. Then, why should we use entity embeddings?

Following are some of the reasons why we should use entity embedding instead of one hot encoding while converting categorical data to a numerical format.

  • Entity embedding produces a compact numerical representation compared to one hot encoding. If there are N unique values in a column, one hot encoding will generate N new columns while converting data into the numerical format. On the other hand, entity embedding can represent the same data in only about ∜N features without losing much information. Hence, entity embedding reduces sparsity in the data to a large extent.
  • In one hot encoding, the numeric values are in the form of 0s and 1s represented in a sparse manner. On the other hand, entity embedding produces continuous values. Due to this, entity embedding performs better and represents the true relationships between the data points.
  • One hot encoding ignores the relations between different values in a column. On the contrary, entity embeddings can map related values closer together in the embedding space. Thus, it preserves the inherent continuity of the data.

Looking at the above benefits, you can see that entity embeddings are a better option than one hot encoding. Hence, we should prefer entity embeddings when converting categorical data to a numerical format during data preprocessing.

How to Perform Entity Embedding in Python?

To perform entity embedding in Python, we will use the TensorFlow and Keras modules. For this, we will use different functions as discussed below.

The categorical_column_with_vocabulary_list() Function

We use the categorical_column_with_vocabulary_list() function to create a VocabularyListCategoricalColumn object. It takes the column name as its first input argument and a list of unique values in the particular column as its second input argument. After execution, it returns a VocabularyListCategoricalColumn object. You can observe this in the following example.

import pandas as pd
import tensorflow as tf
from tensorflow.keras import callbacks, layers
vocab_list=tf.feature_column.categorical_column_with_vocabulary_list("Cities", ["New York", "Mumbai", "London", "Paris"])
print(vocab_list)

Output:

VocabularyListCategoricalColumn(key='Cities', vocabulary_list=('New York', 'Mumbai', 'London', 'Paris'), dtype=tf.string, default_value=-1, num_oov_buckets=0)

In the above code, we have passed “Cities” as the feature name and four values in the vocabulary list. Here, you need to make sure that the values in the list are unique. Otherwise, the program will run into error.

After creating the VocabularyListCategoricalColumn, we can use the embedding_column() function to train a neural network object for generating embeddings.

The embedding_column() Function

The embedding_column() function takes the VocabularyListCategoricalColumn object as its first input argument and the desired number of features in the embedded data as its second input argument. After execution, it creates a trained EmbeddingColumn object as shown below.

import pandas as pd
import tensorflow as tf
from tensorflow.keras import callbacks, layers
vocab_list=tf.feature_column.categorical_column_with_vocabulary_list("Cities", ["New York", "Mumbai", "London", "Paris"])
embedding_column=tf.feature_column.embedding_column(vocab_list,dimension=2)
print(embedding_column)

Output:

EmbeddingColumn(categorical_column=VocabularyListCategoricalColumn(key='Cities', vocabulary_list=('New York', 'Mumbai', 'London', 'Paris'), dtype=tf.string, default_value=-1, num_oov_buckets=0), dimension=2, combiner='mean', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal object at 0x7f3c77a21b70>, ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True, use_safe_embedding_lookup=True)

In the above code, we have created an EmbeddingColumn by specifying the dimensions of the embeddings as 2. We can use this EmbeddingColumn object to generate entity embeddings using the DenseFeatures() function.

The DenseFeatures() Function

The DenseFeatures() function takes the trained EmbeddingColumn object as its input argument and returns a DenseFeatures layer object as shown below.

import pandas as pd
import tensorflow as tf
from tensorflow.keras import callbacks, layers
vocab_list=tf.feature_column.categorical_column_with_vocabulary_list("Cities", ["New York", "Mumbai", "London", "Paris"])
embedding_column=tf.feature_column.embedding_column(vocab_list,dimension=2)
feature_layer=layers.DenseFeatures(embedding_column)
print(feature_layer)

Output:

<keras.feature_column.dense_features_v2.DenseFeatures object at 0x7f3c34723820>

Create Entity Embeddings Using The DenseFeatures() Function

We can use the DenseFeatures() function to generate entity embeddings for a column in our data. For this, we can pass a dictionary containing the column name that we passed to the categorical_column_with_vocabulary_list() function as its key and a list of values from that column as its associated value. After execution, the DenseFeatures() function returns a Tensor object with embeddings as shown below.

import pandas as pd
import tensorflow as tf
from tensorflow.keras import callbacks, layers
vocab_list=tf.feature_column.categorical_column_with_vocabulary_list("Cities", ["New York", "Mumbai", "London", "Paris"])
embedding_column=tf.feature_column.embedding_column(vocab_list,dimension=2)
feature_layer=layers.DenseFeatures(embedding_column)
value_dict={"Cities": ["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]}
tensor_obj=feature_layer(value_dict)
print(tensor_obj)

Output:

tf.Tensor(
[[-1.001625   -0.76165915]
 [ 0.25127193 -0.481     ]
 [ 0.5141091   0.18663265]
 [ 0.5141091   0.18663265]
 [ 0.25127193 -0.481     ]
 [-0.13489066 -0.5079209 ]
 [-1.001625   -0.76165915]
 [ 0.5141091   0.18663265]
 [ 0.5141091   0.18663265]], shape=(9, 2), dtype=float32)

In this code, we have passed a dictionary containing “Cities” as its key and a list containing different city names as its associated value to the object containing the DenseFeatures() function. After execution, we get a Tensor object containing 2-D vectors. Here, each vector represents a categorical value passed in the list given in the dictionary. You can observe that the same values get the same embedding vector as the output.

You can convert the above embeddings into a numpy array by invoking the numpy() method on the Tensor object. 

import pandas as pd
import tensorflow as tf
from tensorflow.keras import callbacks, layers
vocab_list=tf.feature_column.categorical_column_with_vocabulary_list("Cities", ["New York", "Mumbai", "London", "Paris"])
embedding_column=tf.feature_column.embedding_column(vocab_list,dimension=2)
feature_layer=layers.DenseFeatures(embedding_column)
value_dict={"Cities": ["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]}
tensor_obj=feature_layer(value_dict)
feature_matrix=tensor_obj.numpy()
print(feature_matrix)

Output:

[[-0.23178566  1.0528516 ]
 [-1.3448706   0.08130983]
 [ 0.6036284   0.01220271]
 [ 0.6036284   0.01220271]
 [-1.3448706   0.08130983]
 [ 0.00780506  0.10220684]
 [-0.23178566  1.0528516 ]
 [ 0.6036284   0.01220271]
 [ 0.6036284   0.01220271]]

Finally, you can convert the numpy array to dataframe columns for representing the values given in the input as shown below. 

import pandas as pd
import tensorflow as tf
from tensorflow.keras import callbacks, layers
vocab_list=tf.feature_column.categorical_column_with_vocabulary_list("Cities", ["New York", "Mumbai", "London", "Paris"])
embedding_column=tf.feature_column.embedding_column(vocab_list,dimension=2)
feature_layer=layers.DenseFeatures(embedding_column)
value_dict={"Cities": ["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]}
tensor_obj=feature_layer(value_dict)
feature_matrix=tensor_obj.numpy()
df=pd.DataFrame(feature_matrix,columns=["City_1","City_2"])
print("The dataframe with embeddings is:")
print(df)

Output:

The dataframe with embeddings is:
     City_1    City_2
0  0.758967 -0.290070
1 -0.756442 -0.193602
2  1.143431  0.574248
3  1.143431  0.574248
4 -0.756442 -0.193602
5  0.210023 -0.441719
6  0.758967 -0.290070
7  1.143431  0.574248
8  1.143431  0.574248

Entity Embedding on a Pandas DataFrame in Python

To perform entity embedding on a column in a pandas dataframe, we will first obtain the unique values in the given column as a list. Then, we will create embeddings for the values as discussed in the previous sections. Finally, we will merge the columns containing the embedding values into the original dataframe as shown in the following example.

import pandas as pd
import tensorflow as tf
from tensorflow.keras import callbacks, layers
df=pd.read_csv("sample_file.csv")
print("The input dataframe is:")
print(df)
column_values=df["City"].unique()
vocab_list=tf.feature_column.categorical_column_with_vocabulary_list("City",column_values )
embedding_column=tf.feature_column.embedding_column(vocab_list,dimension=2)
feature_layer=layers.DenseFeatures(embedding_column)
value_dict={"City": df["City"].values}
tensor_obj=feature_layer(value_dict)
feature_matrix=tensor_obj.numpy()
df_columns=pd.DataFrame(feature_matrix,columns=["City_1","City_2"])
df[["City_1","City_2"]]=df_columns
print("The dataframe with embeddings is:")
print(df)

Output:

The input dataframe is:
            Name      City
0     John Smith  New York
1     Aditya Raj    Mumbai
2     Will Smith    London
3    Harsh Aryan    London
4  Joel Harrison    Mumbai
5    Bill Warner     Paris
6     Chris Kite  New York
7     Sam Altman    London
8            Joe    London
The dataframe with embeddings is:
            Name      City    City_1    City_2
0     John Smith  New York  0.969665 -0.645429
1     Aditya Raj    Mumbai -0.320367  0.248256
2     Will Smith    London  0.710551 -0.027302
3    Harsh Aryan    London  0.710551 -0.027302
4  Joel Harrison    Mumbai -0.320367  0.248256
5    Bill Warner     Paris  0.177772 -0.151322
6     Chris Kite  New York  0.969665 -0.645429
7     Sam Altman    London  0.710551 -0.027302
8            Joe    London  0.710551 -0.027302

In the above example, we first loaded the data given in the previous table into a pandas dataframe. Then, we extracted the unique values in the "City" column using the unique() method. After this, we created embeddings of the data using the functions discussed in the previous sections. Finally, we created a new dataframe using the embedding columns and appended it to the original dataframe.

In the outputs, you can observe that we get different embeddings for the values every time we perform entity embedding. Hence, it is important to store the embeddings, or at least the trained DenseFeatures layer, so that you can reproduce the results during data pre-processing.

Conclusion

In this article, we discussed the basics of entity embedding. We also discussed how to implement entity embedding in Python. To learn more about machine learning topics, you can read this article on fp growth algorithm numerical example. You might also like this article on linear regression vs logistic regression.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!

One Hot Encoding in Python
We use different categorical data encoding techniques during data analysis and machine learning tasks. In this article, we will discuss the basics of one hot encoding. We will also discuss implementing one hot encoding in Python.

What is One Hot Encoding?

One hot encoding is an encoding technique in which we represent categorical values with numeric arrays of 0s and 1s. In one hot encoding, we use the following steps to encode categorical variables.

  • First, we find the number of unique values for a given categorical variable. The length of the array containing one-hot encoded values is equal to the total number of unique values for a given categorical variable.
  • Next, we assign an index in the array to each unique value.
  • For the one-hot array to represent a categorical value, we set the value in the array to 1 at the index associated with the categorical value. The rest of the values in the array remain 0.
  • We create a one-hot encoded array for each value in the categorical variable and assign them to the values.

One Hot Encoding Numerical Example

To understand how the above algorithm works, let us discuss a numerical example of one hot encoding. For this, we will use the following dataset.

| Name | City |
| --- | --- |
| John Smith | New York |
| Aditya Raj | Mumbai |
| Will Smith | London |
| Harsh Aryan | London |
| Joel Harrison | Mumbai |
| Bill Warner | Paris |
| Chris Kite | New York |
| Sam Altman | London |
| Joe | London |

Dataset for one hot encoding

In the above table, suppose that we want to perform one-hot encoding on the City column. For this, we will use the following steps.

  • First, we will find the number of unique values in the given column. As there are four unique values, London, Mumbai, New York, and Paris, the length of the one-hot encoded array will be 4.
  • Next, we will create an array of length 4 with all 0s for each unique categorical value. So, the one-hot encoded arrays right now are as follows.
    • London=[0,0,0,0]
    • Mumbai=[0,0,0,0]
    • New York=[0,0,0,0]
    • Paris=[0,0,0,0]
  • After this, we will decide on the index associated with each categorical value in the array. Let us assign index 0 to London, 1 to Mumbai, 2 to New York, and 3 to Paris
  • Next, we will set the element at the associated index of each categorical value to 1 in the one-hot encoded array. Hence, the one-hot encoded arrays will look as follows.
    • London=[1, 0, 0, 0]
    • Mumbai=[0, 1, 0, 0]
    • New York=[0, 0, 1, 0]
    • Paris=[0, 0, 0, 1]

The above one-hot encoded arrays represent the associated categorical value. For example, the array [1, 0, 0, 0] represents the value London, [0, 1, 0, 0] represents the value Mumbai, and so on.

In most cases, these one-hot encoded arrays are split into different columns in the dataset. Here, each column represents a unique categorical value as shown below.

Name            City       City_London   City_Mumbai   City_New York   City_Paris
John Smith      New York   0             0             1               0
Aditya Raj      Mumbai     0             1             0               0
Will Smith      London     1             0             0               0
Harsh Aryan     London     1             0             0               0
Joel Harrison   Mumbai     0             1             0               0
Bill Warner     Paris      0             0             0               1
Chris Kite      New York   0             0             1               0
Sam Altman      London     1             0             0               0
Joe             London     1             0             0               0
One Hot encoded data

In the above table, you can observe that we have split the one-hot encoded arrays into columns. In the new columns, the value is set to 1 if the row represents a particular value. Otherwise, it is set to 0. For instance, the City_London column is set to 1 for the rows in which City is London, while the other city columns in those rows are set to 0.
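
Before moving on to sklearn, here is a minimal pure-Python sketch of the steps described above. The one_hot_encode() helper and the variable names are purely illustrative and not part of any library.

# A minimal sketch of manual one-hot encoding (illustrative only).
def one_hot_encode(values):
    # Step 1: find the unique values and assign each one an index (alphabetical order here).
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    # Steps 2-4: build an all-zero array per value and set the matching index to 1.
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded

cities = ["New York", "Mumbai", "London", "London", "Mumbai", "Paris"]
categories, encoded = one_hot_encode(cities)
print(categories)  # ['London', 'Mumbai', 'New York', 'Paris']
for city, row in zip(cities, encoded):
    print(city, row)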

One Hot Encoding in Python Using The sklearn Module

Now that we have discussed how to perform one hot encoding, we will implement it in Python. For this, we will use the OneHotEncoder() function defined in the sklearn.preprocessing module.

The OneHotEncoder() Function

The OneHotEncoder() function has the following syntax.

OneHotEncoder(*, categories='auto', drop=None, sparse='deprecated', sparse_output=True, dtype=<class 'numpy.float64'>, handle_unknown='error', min_frequency=None, max_categories=None)

Here, 

  • The categories parameter is used to specify the unique values in the input data. By default, it is set to “auto”. Hence, the encoder finds all the unique values itself. If you want to specify the unique values manually, you can pass a list of all the unique values in the categorical data to the categories parameter as its input. The passed values should not mix strings and numeric values within a single feature and should be sorted in the case of numeric values.
  • We use the drop parameter to reduce the length of one-hot encoded vectors. From the input data, we can represent one unique value with a vector containing all 0s. By this, we can reduce the size of the one-hot encoded vector by one. By default, the drop parameter is set to None. Hence, all the values are retained.
    • You can set the drop parameter to ‘first’ to drop the first categorical value. The first categorical value will then be represented by a vector containing all zeros. If a feature contains only one category, that feature will be dropped entirely.
    •  If we set the drop parameter to ‘if_binary’, the encoder drops the first value in the case of binary variables. Features with 1 or more than 2 categories are left intact.
  • The sparse parameter has been deprecated and will be removed in a future version of sklearn. When we set the sparse parameter to True, the one-hot encoded values are generated in the form of a sparse matrix. Otherwise, we get a dense array. The sparse_output parameter is the new name for the sparse parameter.
  • The dtype parameter is used to specify the desired data type of the output. By default, it is set to float64. You can change it to any number type such as int32, int64, etc.
  • The min_frequency parameter is used to specify the minimum frequency below which a category will be considered infrequent. You can pass an integer to specify an absolute count, or a float between 0 and 1 to specify a minimum fraction of the total number of samples, below which a category is treated as infrequent.
  • The max_categories parameter specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, the max_categories parameter includes the category representing the infrequent categories along with the frequent categories. If we set the max_categories parameter to None, there is no limit to the number of output features.
  • The handle_unknown parameter is used to handle unknown values while generating one-hot encoding using the transform() method.
    • By default, the handle_unknown parameter is set to error. Hence, if the data given to the transform() method contains new values compared to the data given to the fit() method, the program runs into an error.
    • You can set the handle_unknown parameter to “ignore”. After this, if an unknown value is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
    • You can also set the one-hot encoded values of new values to existing infrequent values.  For this, you can set the handle_unknown parameter to ‘infrequent_if_exist’. After this, if an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will map to the infrequent category if it exists. The infrequent category will be mapped to the last position in the encoding. During inverse transform, an unknown category will be mapped to the category denoted ‘infrequent‘ if it exists.
    • If the ‘infrequent’ category does not exist, then the transform() and inverse_transform() methods will handle an unknown category as with handle_unknown='ignore'. Infrequent categories exist based on the min_frequency and max_categories parameters.

After execution, the OneHotEncoder() function returns an untrained one-hot encoder created using the sklearn module in Python. We can then train the encoder using the fit() method. If we want to encode values from a single attribute, the fit() method takes a 2-D numpy array with a single column, which you can obtain by reshaping a 1-D array with reshape(-1, 1). After execution, it returns a trained OneHotEncoder object. 

We can use the transform() method to generate one-hot encoded values using the trained OneHotEncoder object. The transform() method takes the array containing the values that we need to encode and returns a sparse matrix. You can convert the sparse matrix to a dense one-hot encoded array using the toarray() method as shown below.

from sklearn.preprocessing import OneHotEncoder
import numpy as np
untrained_encoder = OneHotEncoder(handle_unknown='ignore')
cities=np.array(["New York", "Mumbai", "London", "Paris"]).reshape(-1, 1)
print("The training set is:")
print(cities)
trained_encoder=untrained_encoder.fit(cities)
input_values=np.array(["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]).reshape(-1, 1)
output=trained_encoder.transform(input_values).toarray()
print("The input values are:")
print(input_values)
print("The output is:")
print(output)

Output:

The training set is:
[['New York']
 ['Mumbai']
 ['London']
 ['Paris']]
The input values are:
[['New York']
 ['Mumbai']
 ['London']
 ['London']
 ['Mumbai']
 ['Paris']
 ['New York']
 ['London']
 ['London']]
The output is:
[[0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]]

In the above example, we have trained a one-hot encoder using four values. Then, we passed a list of values to generate their one-hot encoded arrays. In the output, you can observe that the arrays are in the same format we discussed in the numerical example.
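
As a quick illustration of the drop and handle_unknown parameters discussed earlier, here is a minimal sketch. It assumes a recent scikit-learn version that accepts sparse_output (older versions use sparse=False instead), and “Tokyo” is simply a made-up value that was not seen during training.

from sklearn.preprocessing import OneHotEncoder
import numpy as np

cities = np.array(["London", "Mumbai", "New York", "Paris"]).reshape(-1, 1)

# drop='first': London (the first category) is represented by an all-zero vector,
# so only three columns (Mumbai, New York, Paris) are produced.
drop_encoder = OneHotEncoder(drop="first", sparse_output=False).fit(cities)
print(drop_encoder.transform(np.array([["London"], ["Paris"]])))
# [[0. 0. 0.]
#  [0. 0. 1.]]

# handle_unknown='ignore': a value not seen during fit() is encoded as all zeros
# instead of raising an error.
ignore_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False).fit(cities)
print(ignore_encoder.transform(np.array([["Tokyo"]])))
# [[0. 0. 0. 0.]]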

You can also perform one hot encoding on multiple features using a single OneHotEncoder object. For this, you can simply pass the 2-D list containing all the rows and columns as input to the fit() method as shown below.

from sklearn.preprocessing import OneHotEncoder
import numpy as np
untrained_encoder = OneHotEncoder(handle_unknown='ignore')
cities=[[0,"Mumbai"],[1,"London"],[2,"Paris"],[3,"New York"]]
print("The training set is:")
print(cities)
trained_encoder=untrained_encoder.fit(cities)
print(trained_encoder.categories_)
input_values=[[1,"Mumbai"],[2,"New York"]]
output=trained_encoder.transform(input_values).toarray()
print("The input values are:")
print(input_values)
print("The output is:")
print(output)

Output:

The training set is:
[[0, 'Mumbai'], [1, 'London'], [2, 'Paris'], [3, 'New York']]
[array([0, 1, 2, 3], dtype=object), array(['London', 'Mumbai', 'New York', 'Paris'], dtype=object)]
The input values are:
[[1, 'Mumbai'], [2, 'New York']]
The output is:
[[0. 1. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 1. 0.]]

In this example, we passed a two-dimensional array to the fit() method to perform one hot encoding in Python. Here, the first element of each internal array is considered to belong to a single feature and the second element of each internal array belongs to another feature.

The number of elements in the output array depends on the number of unique values in both features. As there are four unique values in the first feature and four unique values in the second feature, the one-hot encoded arrays contain eight elements.
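
To see which output column corresponds to which feature and value, you can ask the trained encoder for its output feature names using the get_feature_names_out() method (available in recent scikit-learn versions). A minimal sketch, reusing the same two-feature data:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit([[0, "Mumbai"], [1, "London"], [2, "Paris"], [3, "New York"]])
# The default input feature names are x0, x1, ..., so each output name shows
# which feature and which value the column stands for.
print(encoder.get_feature_names_out())
# ['x0_0' 'x0_1' 'x0_2' 'x0_3' 'x1_London' 'x1_Mumbai' 'x1_New York' 'x1_Paris']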

One Hot Encoding on a Pandas DataFrame in Python

In the previous examples, we discussed how to perform one hot encoding on 1-d and 2-d arrays containing standalone values. This is of limited use on its own, as we handle most data using pandas dataframes while creating machine learning applications. Hence, let us discuss how to perform one hot encoding on a pandas dataframe in Python. 

The process to train the one hot encoder is the same as discussed in the previous examples. We can extract a column from the dataframe and train the one hot encoder using the fit() method. After creating the encoder, we need to create a column transformer to generate one-hot encoded columns in the output dataframe. For this, we will use the make_column_transformer() function.

The make_column_transformer() Function

The make_column_transformer() function has the following syntax.

make_column_transformer(*transformers, remainder='drop', sparse_threshold=0.3, n_jobs=None, verbose=False, verbose_feature_names_out=True)

Here, 

  • The transformers parameter takes one or more tuples, each containing a transformer (here, the trained OneHotEncoder object) and a list of column names on which we want to perform one hot encoding.
  • By default, the remainder parameter is set to ‘drop’. Hence, only the columns specified in the transformers parameter are encoded and produced in the output. Columns that are not specified in the transformers parameter are dropped from the output. To avoid this, we can set the remainder parameter to ‘passthrough’. After this, all remaining columns that are not specified in the transformers parameter will be automatically passed through and included in the output. This subset of columns is concatenated with the output of the encoders. 
  • You can also pass an untrained OneHotEncoder to the remainder parameter. By setting the remainder parameter to be an encoder, the columns that are not specified in the transformers parameter are encoded using the remainder estimator. Here, the encoder that we pass to the remainder parameter must support fit() and transform() methods.
  • The sparse_threshold parameter controls the output format. If the transformed output consists of a mix of sparse and dense data, it will be stacked as a sparse matrix if the overall density is lower than this threshold. We can set the sparse_threshold parameter to 0 to always return dense data. When the transformed output consists of all sparse or all dense data, the stacked result will be sparse or dense, respectively, and the sparse_threshold parameter is ignored.
  • The n_jobs parameter is used to run the one hot encoder in parallel. By default, it is set to None. It means that only one job will run. You can set it to -1 to run jobs as many as the number of processors in your machine.
  • The verbose parameter is used to print the time elapsed while fitting each encoder. By default, it is set to False.
  • The verbose_feature_names_out parameter is used to prefix all feature names with the name of the transformer that generated that feature in the one-hot encoded dataframe. By default, it is set to True. If we set it to False, get_feature_names_out() will not prefix any feature names, and the program will run into an error if the resulting feature names are not unique.

To perform one-hot encoding on the columns of a pandas dataframe, we can create a transformer using the make_column_transformer() function. Then, we will invoke the fit() method on the transformer and pass the input dataframe to the fit() method. After this, we can use the transform() method to generate the array containing the one hot encoded values.

To convert the array into a dataframe, we will use the DataFrame() function defined in the pandas module.  We will also use the get_feature_names_out() method on the trained transformer to get the column names for the one-hot encoded data. You can observe this in the following example.

import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
df=pd.read_csv("sample_file .csv")
print("The dataframe is:")
print(df)
values=np.array(df["City"]).reshape(-1,1)
untrained_encoder_object = OneHotEncoder()
trained_encoder=untrained_encoder_object.fit(values)
untrained_transformer = make_column_transformer((trained_encoder, ["City"]), remainder='passthrough')
trained_transformer=untrained_transformer.fit(df)
transformed_data=trained_transformer.transform(df)
output=pd.DataFrame(transformed_data, columns=trained_transformer.get_feature_names_out())
print("The output dataframe is:")
print(output)

Output:

The dataframe is:
            Name      City
0     John Smith  New York
1     Aditya Raj    Mumbai
2     Will Smith    London
3    Harsh Aryan    London
4  Joel Harrison    Mumbai
5    Bill Warner     Paris
6     Chris Kite  New York
7     Sam Altman    London
8            Joe    London
The output dataframe is:
  onehotencoder__City_London onehotencoder__City_Mumbai  \
0                        0.0                        0.0   
1                        0.0                        1.0   
2                        1.0                        0.0   
3                        1.0                        0.0   
4                        0.0                        1.0   
5                        0.0                        0.0   
6                        0.0                        0.0   
7                        1.0                        0.0   
8                        1.0                        0.0   

  onehotencoder__City_New York onehotencoder__City_Paris remainder__Name  
0                          1.0                       0.0      John Smith  
1                          0.0                       0.0      Aditya Raj  
2                          0.0                       0.0      Will Smith  
3                          0.0                       0.0     Harsh Aryan  
4                          0.0                       0.0   Joel Harrison  
5                          0.0                       1.0     Bill Warner  
6                          1.0                       0.0      Chris Kite  
7                          0.0                       0.0      Sam Altman  
8                          0.0                       0.0             Joe  

In the above example, we have encoded the City column of the input dataframe using one-hot encoding in Python. For this, we first trained the OneHotEncoder using the column data and then we transformed the input dataframe using the make_column_transformer() function. Here, we passed a tuple containing the trained OneHotEncoder object and a list of column names as the first input argument to the make_column_transformer() function. In the above output, you can observe that the column names in the output dataframe look dirty as they all contain the transformer names.

You can set the verbose_feature_names_out parameter to False to generate clean output column names as shown below.

import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
df=pd.read_csv("sample_file .csv")
print("The dataframe is:")
print(df)
values=np.array(df["City"]).reshape(-1,1)
untrained_encoder_object = OneHotEncoder()
trained_encoder=untrained_encoder_object.fit(values)
untrained_transformer = make_column_transformer((trained_encoder, ["City"]), remainder='passthrough',verbose_feature_names_out=False)
trained_transformer=untrained_transformer.fit(df)
transformed_data=trained_transformer.transform(df)
output=pd.DataFrame(transformed_data, columns=trained_transformer.get_feature_names_out())
print("The output dataframe is:")
print(output)

Output:

The dataframe is:
            Name      City
0     John Smith  New York
1     Aditya Raj    Mumbai
2     Will Smith    London
3    Harsh Aryan    London
4  Joel Harrison    Mumbai
5    Bill Warner     Paris
6     Chris Kite  New York
7     Sam Altman    London
8            Joe    London
The output dataframe is:
  City_London City_Mumbai City_New York City_Paris           Name
0         0.0         0.0           1.0        0.0     John Smith
1         0.0         1.0           0.0        0.0     Aditya Raj
2         1.0         0.0           0.0        0.0     Will Smith
3         1.0         0.0           0.0        0.0    Harsh Aryan
4         0.0         1.0           0.0        0.0  Joel Harrison
5         0.0         0.0           0.0        1.0    Bill Warner
6         0.0         0.0           1.0        0.0     Chris Kite
7         1.0         0.0           0.0        0.0     Sam Altman
8         1.0         0.0           0.0        0.0            Joe

In this example, we have set the verbose_feature_names_out parameter to False in the make_column_transformer() function. Hence, we get the output dataframe with the desired column names.
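
As a side note, pandas itself also offers the get_dummies() function, which produces a similar one-hot encoded dataframe in a single call. A minimal sketch, using a small made-up dataframe with the same column names as above:

import pandas as pd

df = pd.DataFrame({
    "Name": ["John Smith", "Aditya Raj", "Will Smith"],
    "City": ["New York", "Mumbai", "London"],
})
# get_dummies() creates one column per unique City value, prefixed with the column name.
encoded_df = pd.get_dummies(df, columns=["City"])
print(encoded_df)

Unlike the sklearn encoder, get_dummies() has no notion of a fitted vocabulary, so the sklearn approach is usually preferable when the same encoding must later be applied to new data.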

One Hot Encoding With Multiple Columns of the Pandas Dataframe

To perform one hot encoding on multiple columns in the pandas dataframe at once, we will first obtain values from all the columns and train the one hot encoder. Then, we will pass multiple column names in the list of column names passed to the transformers parameter in the make_column_transformer() function. After this, we can train the column transformer and perform one hot encoding on multiple columns in the pandas dataframe as shown below.

import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
df=pd.read_csv("sample_file .csv")
print("The dataframe is:")
df["Grades"]=["A","C", "B", "A", "A","B","B","C","D"]
print(df)
values=df[["City", "Grades"]]
untrained_encoder_object = OneHotEncoder()
trained_encoder=untrained_encoder_object.fit(values)
untrained_transformer = make_column_transformer((trained_encoder, ["City", "Grades"]), remainder='passthrough',verbose_feature_names_out=False)
trained_transformer=untrained_transformer.fit(df)
transformed_data=trained_transformer.transform(df)
output=pd.DataFrame(transformed_data, columns=trained_transformer.get_feature_names_out())
print("The output dataframe is:")
print(output)

Output:

The dataframe is:
            Name      City Grades
0     John Smith  New York      A
1     Aditya Raj    Mumbai      C
2     Will Smith    London      B
3    Harsh Aryan    London      A
4  Joel Harrison    Mumbai      A
5    Bill Warner     Paris      B
6     Chris Kite  New York      B
7     Sam Altman    London      C
8            Joe    London      D
The output dataframe is:
  City_London City_Mumbai City_New York City_Paris Grades_A Grades_B Grades_C  \
0         0.0         0.0           1.0        0.0      1.0      0.0      0.0   
1         0.0         1.0           0.0        0.0      0.0      0.0      1.0   
2         1.0         0.0           0.0        0.0      0.0      1.0      0.0   
3         1.0         0.0           0.0        0.0      1.0      0.0      0.0   
4         0.0         1.0           0.0        0.0      1.0      0.0      0.0   
5         0.0         0.0           0.0        1.0      0.0      1.0      0.0   
6         0.0         0.0           1.0        0.0      0.0      1.0      0.0   
7         1.0         0.0           0.0        0.0      0.0      0.0      1.0   
8         1.0         0.0           0.0        0.0      0.0      0.0      0.0   

  Grades_D           Name  
0      0.0     John Smith  
1      0.0     Aditya Raj  
2      0.0     Will Smith  
3      0.0    Harsh Aryan  
4      0.0  Joel Harrison  
5      0.0    Bill Warner  
6      0.0     Chris Kite  
7      0.0     Sam Altman  
8      1.0            Joe  

In the above output, you can observe that the City and Grades columns are encoded using one-hot encoding in Python in a single execution.

Conclusion

In this article, we discussed one hot encoding in Python. We also discussed different implementations of the one hot encoding process using the sklearn module. To learn more about encoding techniques, you can read this article on label encoding in Python. You might also like this article on k-means clustering in Python.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!

The post One Hot Encoding in Python appeared first on Coding Infinite.

]]>
Implement Label Encoding in Python and PySpark https://codinginfinite.com/implement-label-encoding-in-python-and-pyspark/ Sun, 18 Jun 2023 12:56:09 +0000 https://codinginfinite.com/?p=4867 To analyze categorical data, we often need to convert them into numerical values. Label encoding is one of the most straightforward data preprocessing techniques for encoding categorical data into numeric values. This article will discuss different ways to perform label encoding in Python and pyspark. How to Perform Label Encoding? To perform label encoding, we...

The post Implement Label Encoding in Python and PySpark appeared first on Coding Infinite.

]]>
To analyze categorical data, we often need to convert them into numerical values. Label encoding is one of the most straightforward data preprocessing techniques for encoding categorical data into numeric values. This article will discuss different ways to perform label encoding in Python and pyspark.

How to Perform Label Encoding?

To perform label encoding, we just need to assign a unique numeric value to each distinct categorical value in the dataset. For instance, consider the following dataset.

Name            City
John Smith      New York
Aditya Raj      Mumbai
Will Smith      London
Harsh Aryan     London
Joel Harrison   Mumbai
Bill Warner     Paris
Chris Kite      New York
Sam Altman      London
Joe             London
Input Data for label encoding

Now, if we have to perform label encoding on the City column in the above table, we will assign a unique numeric value to each city name. For example, we can assign the value 0 to New York, 1 to Mumbai, 2 to London, and 3 to Paris. After this, we will replace the City names with the numeric values as shown below.

Name            City
John Smith      0
Aditya Raj      1
Will Smith      2
Harsh Aryan     2
Joel Harrison   1
Bill Warner     3
Chris Kite      0
Sam Altman      2
Joe             2
Label Encoded Data

Thus, we have assigned numeric labels to each city name using label encoding. Now, let us discuss different ways to perform label encoding in Python.
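
Before moving on to the library-based approaches, here is a minimal pure-Python sketch of the manual assignment described above. The dictionary and variable names are purely illustrative.

cities = ["New York", "Mumbai", "London", "London", "Mumbai",
          "Paris", "New York", "London", "London"]
# Assign a numeric label to each unique city in order of first appearance.
label_map = {}
for city in cities:
    if city not in label_map:
        label_map[city] = len(label_map)
encoded = [label_map[city] for city in cities]
print(label_map)  # {'New York': 0, 'Mumbai': 1, 'London': 2, 'Paris': 3}
print(encoded)    # [0, 1, 2, 2, 1, 3, 0, 2, 2]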

Label Encoding in Python Using the Sklearn Module

The sklearn module provides us with the LabelEncoder() function to perform label encoding in Python. To perform label encoding using the sklearn module in Python, we will use the following steps.

  • First, we will create an empty LabelEncoder object by executing the LabelEncoder() function. 
  • Then, we will train the LabelEncoder object using the fit() method. The fit() method takes the list containing categorical values and learns all the unique values. After execution, it returns a trained LabelEncoder object. 
  • Next, we can perform label encoding by invoking the transform() method on the trained LabelEncoder object. The transform() method takes the input column of categorical values as its input argument and returns a numpy array containing a numeric label for each value in the input. 

You can observe this in the following example.

from sklearn import preprocessing
cities=["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]
print("The input list of categorical values is:")
print(cities)
untrained_encoder_object = preprocessing.LabelEncoder()
trained_encoder_object=untrained_encoder_object.fit(cities)
encoded_values=trained_encoder_object.transform(cities)
print("The label encoded values are:")
print(encoded_values)

Output:

The input list of categorical values is:
['New York', 'Mumbai', 'London', 'London', 'Mumbai', 'Paris', 'New York', 'London', 'London']
The label encoded values are:
[2 1 0 0 1 3 2 0 0]

In the output, you can observe that the categorical values have been assigned numerical labels in alphabetical order. Hence, London is assigned the value 0, Mumbai is assigned the value 1, New York is assigned the value 2, and Paris is assigned the value 3.

Instead of using fit() and transform() methods separately, you can also use the fit_transform() method on the untrained LabelEncoder object to perform label encoding as shown below.

from sklearn import preprocessing
cities=["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]
print("The input list of categorical values is:")
print(cities)
untrained_encoder_object = preprocessing.LabelEncoder()
encoded_values=untrained_encoder_object.fit_transform(cities)
print("The label encoded values are:")
print(encoded_values)

Output:

The input list of categorical values is:
['New York', 'Mumbai', 'London', 'London', 'Mumbai', 'Paris', 'New York', 'London', 'London']
The label encoded values are:
[2 1 0 0 1 3 2 0 0]

In this example, we have directly generated label encoding using the fit_transform() method.

Generate Categorical Values From Label Encoded Data

You can also extract the original categorical data from the label-encoded values. For this, you can use the inverse_transform() method. The inverse_transform() method, when invoked on a trained LabelEncoder object, takes a list of numeric values as its input. After execution, it returns the original categorical values corresponding to the numeric values. You can observe this in the following example.

from sklearn import preprocessing
cities=["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]
print("The input list of categorical values is:")
print(cities)
untrained_encoder_object = preprocessing.LabelEncoder()
trained_encoder_object=untrained_encoder_object.fit(cities)
encoded_values=trained_encoder_object.transform(cities)
print("The label encoded values are:")
print(encoded_values)
new_codes=[1,1,1,1,2,0,1]
print("The input coded values are:")
print(new_codes)
original_values=trained_encoder_object.inverse_transform(new_codes)
print("The original values corresponding to the codes are:")
print(original_values)

Output:

The input list of categorical values is:
['New York', 'Mumbai', 'London', 'London', 'Mumbai', 'Paris', 'New York', 'London', 'London']
The label encoded values are:
[2 1 0 0 1 3 2 0 0]
The input coded values are:
[1, 1, 1, 1, 2, 0, 1]
The original values corresponding to the codes are:
['Mumbai' 'Mumbai' 'Mumbai' 'Mumbai' 'New York' 'London' 'Mumbai']

In this example, we first trained the LabelEncoder object. After this, when we pass numeric values to the inverse_transform() method, it returns a list of original values that we used while training the encoder.

You can also find all the unique categorical values in the input data using the classes_ attribute of the trained LabelEncoder object as shown in the following example.

from sklearn import preprocessing
cities=["New York", "Mumbai", "London", "London","Mumbai","Paris","New York","London", "London"]
print("The input list of categorical values is:")
print(cities)
untrained_encoder_object = preprocessing.LabelEncoder()
trained_encoder_object=untrained_encoder_object.fit(cities)
encoded_values=trained_encoder_object.transform(cities)
print("The label encoded values are:")
print(encoded_values)
print("The unique categorical values in the input are:")
print(trained_encoder_object.classes_)

Output:

The input list of categorical values is:
['New York', 'Mumbai', 'London', 'London', 'Mumbai', 'Paris', 'New York', 'London', 'London']
The label encoded values are:
[2 1 0 0 1 3 2 0 0]
The unique categorical values in the input are:
['London' 'Mumbai' 'New York' 'Paris']

Normally, we use label encoding on the column of a dataframe in Python. To perform label encoding on a dataframe column, we will first generate label-encoded values by passing the column as input to the fit_transform() method. Then, we will assign the encoded values to the column in the dataframe as shown below.

import pandas as pd
df=pd.read_csv("sample_file .csv")
print("The dataframe is:")
print(df)
from sklearn import preprocessing
untrained_encoder_object = preprocessing.LabelEncoder()
encoded_values=untrained_encoder_object.fit_transform(df["City"])
df["City"]=encoded_values
print("The output dataframe is:")
print(df)

Output:

The dataframe is:
            Name      City
0     John Smith  New York
1     Aditya Raj    Mumbai
2     Will Smith    London
3    Harsh Aryan    London
4  Joel Harrison    Mumbai
5    Bill Warner     Paris
6     Chris Kite  New York
7     Sam Altman    London
8            Joe    London
The output dataframe is:
            Name  City
0     John Smith     2
1     Aditya Raj     1
2     Will Smith     0
3    Harsh Aryan     0
4  Joel Harrison     1
5    Bill Warner     3
6     Chris Kite     2
7     Sam Altman     0
8            Joe     0

In the above example, the fit_transform() method returns a numpy array of numeric labels. When we assign the array to the dataframe column, the categorical values are replaced with numeric values.

Implement Label Encoding in PySpark

We don’t have a dedicated function to implement label encoding in pyspark. However, we can use the StringIndexer() function to perform label encoding using the following steps.

  • First, we will create a StringIndexer object using the StringIndexer() function. The StringIndexer() function takes the name of the column that we want to encode as its input argument for the inputCol parameter. It also takes the name of the new column to be created using the encoded values in its outputCol parameter. Here, we will pass “City” as input to the inputCol parameter and “City_label” as input to the outputCol parameter. 
  • Next, we will train the StringIndexer object using the fit() method. The fit() method takes the dataframe as its input and returns a trained StringIndexer object. 
  • Next, we will use the transform() method to perform label encoding. For this, we will invoke the transform() method on the StringIndexer object and pass the dataframe as its input. After this, we will get label-encoded values in our new column.
  • Finally, we will drop the original City column using the drop() method and rename the City_label column to City using the withColumnRenamed() method.

After executing the above steps, we will get the output data frame with label-encoded values as shown below.

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
spark = SparkSession.builder \
      .master("local[1]") \
      .appName("label_encoding_example") \
      .getOrCreate() 
dfs=spark.read.csv("sample_file .csv",header=True)
print("The input dataframe is:")
dfs.show()
indexer = StringIndexer(inputCol="City", outputCol="City_label") 
indexed_df = indexer.fit(dfs).transform(dfs)
indexed_df=indexed_df.drop("City").withColumnRenamed("City_label","City")
print("The output dataframe is:")
indexed_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------------+--------+
|         Name|    City|
+-------------+--------+
|   John Smith|New York|
|   Aditya Raj|  Mumbai|
|   Will Smith|  London|
|  Harsh Aryan|  London|
|Joel Harrison|  Mumbai|
|  Bill Warner|   Paris|
|   Chris Kite|New York|
|   Sam Altman|  London|
|          Joe|  London|
+-------------+--------+

The output dataframe is:
+-------------+----+
|         Name|City|
+-------------+----+
|   John Smith| 2.0|
|   Aditya Raj| 1.0|
|   Will Smith| 0.0|
|  Harsh Aryan| 0.0|
|Joel Harrison| 1.0|
|  Bill Warner| 3.0|
|   Chris Kite| 2.0|
|   Sam Altman| 0.0|
|          Joe| 0.0|
+-------------+----+

Conclusion

In this article, we have discussed how to implement label encoding in Python using the sklearn module. We also discussed how to implement label encoding in PySpark. To learn more about machine learning topics, you can read this article on how to implement the fp-growth algorithm in Python. You might also like this article on the ECLAT algorithm numerical example.

I hope you enjoyed reading this article. Stay tuned for more informative articles. 

Happy learning!

The post Implement Label Encoding in Python and PySpark appeared first on Coding Infinite.

]]>
Implement FP Growth Algorithm in Python https://codinginfinite.com/implement-fp-growth-algorithm-in-python/ Thu, 08 Jun 2023 18:24:12 +0000 https://codinginfinite.com/?p=4858 Like the apriori algorithm, we also use the fp-growth algorithm to generate frequent itemsets from a transaction dataset in market basket analysis. This article will discuss how to implement the fp growth algorithm in Python. How to Implement The FP Growth Algorithm in Python? We will use the mlxtend module in Python to implement the...

The post Implement FP Growth Algorithm in Python appeared first on Coding Infinite.

]]>
Like the apriori algorithm, we also use the fp-growth algorithm to generate frequent itemsets from a transaction dataset in market basket analysis. This article will discuss how to implement the fp growth algorithm in Python.

How to Implement The FP Growth Algorithm in Python?

We will use the mlxtend module in Python to implement the fp growth algorithm. It provides us with the fpgrowth() function to calculate the frequent itemsets and the association_rules() function for association rule mining.

Before implementing the fp growth algorithm, I suggest you read this article on the fp growth algorithm numerical example. This will help you understand how the algorithm actually works. 

Now, let us proceed with the implementation of the fp growth algorithm in Python. For this, we will use the following steps. 

  • First, we will obtain a list containing the lists of items in each transaction from the transaction dataset. 
  • Next, we will use the TransactionEncoder() function to create a transaction array. 
  • Once we get the transaction array, we will use the fpgrowth() function to generate frequent itemsets. 
  • Finally, we will use the association_rules() function to generate association rules.

Create a List of Lists of Items From The Transaction Dataset

For implementing the fp-growth algorithm in Python, we will use the following dataset. 

Transaction ID   Items
T1               I1, I3, I4
T2               I2, I3, I5, I6
T3               I1, I2, I3, I5
T4               I2, I5
T5               I1, I3, I5
FP-Growth Algorithm Dataset

The above transaction dataset contains five transactions with six unique items. We will convert the above table into a list of lists of items as shown below.

transactions=[["I1", "I3", "I4"],
            	 ["I2", "I3", "I5", "I6"],
             	["I1", "I2", "I3", "I5"],
             	["I2", "I5"],
             	["I1", "I3", "I5"]]

In the above list, the items in each transaction constitute an inner list. We will use this list of lists to create the transaction array.

Create Transaction Array Using TransactionEncoder() Function

The fpgrowth() function takes a transaction array as its input. Hence, we will convert the list of items in the transactions to a transaction array. The transaction array has the following features. 

  • Each row in the transaction array represents a transaction, and each column represents an item.
  • If an item is present in a transaction, the element at the corresponding row and column will be set to True.
  • If an item isn’t present in a transaction, the element corresponding to the particular row and column is set to False. 

We will use the TransactionEncoder() function defined in the mlxtend module to generate the transaction array. The TransactionEncoder() function returns a TransactionEncoder object. After creating the TransactionEncoder object, we will use the fit() and transform() methods to create the transaction array.  

The fit() method takes the transaction data in the form of a list of lists. Then, the TransactionEncoder object learns all the unique labels in the dataset. Next, we use the transform() method to transform the input dataset into a one-hot encoded boolean array as shown in the following example.

from mlxtend.preprocessing import TransactionEncoder
transactions=[["I1", "I3", "I4"],
             ["I2", "I3", "I5", "I6"],
             ["I1", "I2", "I3", "I5"],
             ["I2", "I5"],
             ["I1", "I3", "I5"]]
print("The list of transactions is:")
print(transactions)
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
print("The transaction array is:")
print(transaction_array)

Output:

The list of transactions is:
[['I1', 'I3', 'I4'], ['I2', 'I3', 'I5', 'I6'], ['I1', 'I2', 'I3', 'I5'], ['I2', 'I5'], ['I1', 'I3', 'I5']]
The transaction array is:
[[ True False  True  True False False]
 [False  True  True False  True  True]
 [ True  True  True False  True False]
 [False  True False False  True False]
 [ True False  True False  True False]]

We will convert the transaction array into a dataframe using the DataFrame() function defined in the pandas module in Python. Here, we will set the item names as the column names in the dataframe. The transaction IDs will constitute the index of the dataframe. You can obtain all the item names using the columns_ attribute of the TransactionEncoder object and create the dataframe as shown below.

from mlxtend.preprocessing import TransactionEncoder
import pandas as pd
transactions=[["I1", "I3", "I4"],
             ["I2", "I3", "I5", "I6"],
             ["I1", "I2", "I3", "I5"],
             ["I2", "I5"],
             ["I1", "I3", "I5"]]
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
transaction_dataframe = pd.DataFrame(transaction_array, columns=transaction_encoder.columns_,index=["T1","T2","T3","T4","T5"])
print("The transaction dataframe is:")
print(transaction_dataframe)

Output:

The transaction dataframe is:
       I1     I2     I3     I4     I5     I6
T1   True  False   True   True  False  False
T2  False   True   True  False   True   True
T3   True   True   True  False   True  False
T4  False   True  False  False   True  False
T5   True  False   True  False   True  False

Once we get the transaction array in the form of a dataframe, we will use it to implement the fp growth algorithm in Python.

Generate Frequent Itemsets Using the fpgrowth() Function

After generating the transaction array using the transaction encoder, we will use the fpgrowth() function to implement the fp growth algorithm in Python. The fpgrowth() function has the following syntax.

fpgrowth(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0)

Here, 

  • The df parameter takes the dataframe containing the transaction matrix as its input. 
  • The min_support parameter takes the minimum support that we want to specify for each item set. It should be a floating point number between 0 and 1. By default, it has a value of 0.5.
  • The use_colnames parameter is used to specify whether we want to use the column names of the input df as the item names. By default, use_colnames is set to False. Due to this, the fpgrowth() function uses the index of the columns instead of the column names as item names. To use the column names of the input df as item names, we will set the use_colnames parameter to True.
  • We use the max_len parameter to define the maximum number of items in an itemset. By default, it is set to None denoting that all possible itemsets lengths are evaluated.
  • We can use the verbose parameter to show the progress of the fp growth algorithm. Setting it to a value greater than 0 shows the stages of conditional tree generation during execution.

After execution, the fpgrowth() function returns a dataframe with the columns 'support' and 'itemsets', containing all the itemsets whose support is greater than or equal to min_support and whose length is at most max_len (when max_len is not None). 

To calculate the frequent itemsets, we will use a minimum support of 0.4 and set the use_colnames parameter to True to use the column names of the input dataframe as item names, as shown below. 

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth
import pandas as pd
transactions=[["I1", "I3", "I4"],
             ["I2", "I3", "I5", "I6"],
             ["I1", "I2", "I3", "I5"],
             ["I2", "I5"],
             ["I1", "I3", "I5"]]
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
transaction_dataframe = pd.DataFrame(transaction_array, columns=transaction_encoder.columns_,index=["T1","T2","T3","T4","T5"])
frequent_itemsets=fpgrowth(transaction_dataframe,min_support=0.4, use_colnames=True)
print("The frequent itemsets are:")
print(frequent_itemsets)

Output:

The frequent itemsets are:
    support      itemsets
0       0.8          (I3)
1       0.6          (I1)
2       0.8          (I5)
3       0.6          (I2)
4       0.6      (I5, I3)
5       0.6      (I1, I3)
6       0.4      (I1, I5)
7       0.4  (I1, I5, I3)
8       0.6      (I5, I2)
9       0.4      (I3, I2)
10      0.4  (I5, I3, I2)

In the above output, you can observe that the fpgrowth() function returns a dataframe containing the frequent itemsets and their support.

Generate Association Rules Using The association_rules() Function

After generating the frequent itemsets using the fpgrowth() function, we can use the association_rules() function to find association rules in the dataset. The association_rules() function has the following syntax. 

association_rules(df, metric='confidence', min_threshold=0.8, support_only=False)

Here, 

  • The df parameter takes the dataframe returned from the fpgrowth() function as its input. The dataframe must contain the columns 'support‘ and ‘itemsets‘ containing frequent itemsets and their support.
  • The metric parameter defines the metric used to select the association rules. We can specify any of the following metrics. A small numeric sketch of these formulas is given after this list.
    • “support”: The support of an association rule A → C is the support of the antecedent and the consequent occurring together, i.e. support(A ∪ C). It has a range of [0,1].
    • “confidence”: The confidence of an association rule is calculated as the support of the antecedent and consequent combined divided by the support of the antecedent.  It has a range of [0,1].
    • “lift”: The lift for an association rule is defined as the confidence of the association rule divided by the support of the consequent. It has a range of [0, infinity].
    • “leverage”: The leverage of an association rule is defined as the support of the association rule minus the product of the supports of the antecedent and the consequent. It has a range of [-1,1].
    • “conviction”: The conviction of an association rule is defined as (1-support of consequent) divided by (1- confidence of the association rule). It has a range of [0, infinity].
    • “zhangs_metric”: It is calculated as leverage of the association rule/max (support of the association rule*(1-support of the antecedent), support of the antecedent*(support of the consequent-support of the association rule)). It has a range of [-1,1].
  • We use the min_threshold parameter to specify the minimum value of the metric defined in the metric parameter to filter the useful association rules. By default, it has a value of 0.8.
  • We use the  support_only parameter to specify if we only want to compute the support of the association rules and fill the other metric columns with NaNs. You can use this parameter if the input dataframe is incomplete and does not contain support values for all rule antecedents and consequents. By setting the support_only parameter to True, you can also speed up the computation because you don’t calculate the other metrics for the association rules.
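
To make the formulas above concrete, here is a small sketch that computes the main metrics for a single rule A → C from hypothetical support values. The numbers are made up for illustration and are not taken from any dataset in this article.

# Hypothetical support values for a rule A -> C.
support_a = 0.6    # support of the antecedent A
support_c = 0.8    # support of the consequent C
support_ac = 0.5   # support of the rule, i.e. support of A and C occurring together

confidence = support_ac / support_a              # 0.833...
lift = confidence / support_c                    # 1.0416...
leverage = support_ac - support_a * support_c    # 0.02
conviction = (1 - support_c) / (1 - confidence)  # 1.2
print(confidence, lift, leverage, conviction)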

After execution, the association_rules() function returns a dataframe containing the ‘antecedents’, ‘consequents’, ‘antecedent support’, ‘consequent support’, ‘support’, ‘confidence’, ‘lift’, ‘leverage’, and ‘conviction’ for all the generated association rules.  You can observe this in the following example.

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth,association_rules
import pandas as pd
transactions=[["I1", "I3", "I4"],
             ["I2", "I3", "I5", "I6"],
             ["I1", "I2", "I3", "I5"],
             ["I2", "I5"],
             ["I1", "I3", "I5"]]
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
transaction_dataframe = pd.DataFrame(transaction_array, columns=transaction_encoder.columns_,index=["T1","T2","T3","T4","T5"])
frequent_itemsets=fpgrowth(transaction_dataframe,min_support=0.4, use_colnames=True)
association_rules_df=association_rules(frequent_itemsets, metric="confidence", min_threshold=.7)
print("The association rules are:")
print(association_rules_df)

Output:

The association rules are:
  antecedents consequents  antecedent support  consequent support  support  \
0        (I5)        (I3)                 0.8                 0.8      0.6   
1        (I3)        (I5)                 0.8                 0.8      0.6   
2        (I1)        (I3)                 0.6                 0.8      0.6   
3        (I3)        (I1)                 0.8                 0.6      0.6   
4    (I1, I5)        (I3)                 0.4                 0.8      0.4   
5        (I5)        (I2)                 0.8                 0.6      0.6   
6        (I2)        (I5)                 0.6                 0.8      0.6   
7    (I3, I2)        (I5)                 0.4                 0.8      0.4   

   confidence    lift  leverage  conviction  
0        0.75  0.9375     -0.04         0.8  
1        0.75  0.9375     -0.04         0.8  
2        1.00  1.2500      0.12         inf  
3        0.75  1.2500      0.12         1.6  
4        1.00  1.2500      0.08         inf  
5        0.75  1.2500      0.12         1.6  
6        1.00  1.2500      0.12         inf  
7        1.00  1.2500      0.08         inf  

Now that we have discussed each step for implementing the fp growth algorithm in Python, let us use a real dataset for the implementation. You can download the dataset using this link. For this article, I have downloaded and renamed the dataset to ecommerce_transaction_dataset.csv, which is the filename used in the code below.

Implement FP Growth Algorithm in Python on Real Data

To implement the fp growth algorithm in Python on a real-world dataset, we will first load the dataset into our program using the read_csv() function defined in the pandas module. The read_csv() function takes the filename as its input argument and returns a dataframe containing the data in the file as shown below.

import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
print("The dataset is:")
print(dataset.head())

Output:

The dataset is:
  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

      InvoiceDate  UnitPrice  CustomerID         Country  
0  12/1/2010 8:26       2.55     17850.0  United Kingdom  
1  12/1/2010 8:26       3.39     17850.0  United Kingdom  
2  12/1/2010 8:26       2.75     17850.0  United Kingdom  
3  12/1/2010 8:26       3.39     17850.0  United Kingdom  
4  12/1/2010 8:26       3.39     17850.0  United Kingdom  

You can observe that the input dataset contains eight columns.

Preprocess data to get the appropriate dataset

We can’t directly apply the fp-growth algorithm to the above dataset. Hence, we need to perform data preprocessing to get a list of lists of items in the transactions.

From the above dataset, we only need the transaction ID, i.e. the “InvoiceNo” column, and the item ID, i.e. the “StockCode” column. Hence, we will drop the other columns in the dataframe using the drop() method. We will also drop the rows in which InvoiceNo or StockCode contain null values.

import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
dataset=dataset.drop(["Description","Quantity","InvoiceDate","UnitPrice","CustomerID","Country"],axis=1)
dataset=dataset.dropna()
print("The dataset is:")
print(dataset.head())

Output:

The dataset is:
  InvoiceNo StockCode
0    536365    85123A
1    536365     71053
2    536365    84406B
3    536365    84029G
4    536365    84029E

In the above output, you can observe that each row contains only one item. Hence, we will group all the items of a particular transaction in a single row. For this, we will use the groupby() method, the apply() method, and the list() function.

  • The groupby() method, when invoked on a dataframe, takes the column name i.e. “InvoiceNo” as its input argument. After execution, it groups the rows for a particular InvoiceNo into small dataframes. 
  • Next, we will make a list of all items in the transaction by applying the list() function on the  “StockCode” column of each grouped dataframe using the apply() method.

After executing the above methods, we will get a dataframe containing the transaction id and the corresponding items as shown below.

import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
dataset=dataset.drop(["Description","Quantity","InvoiceDate","UnitPrice","CustomerID","Country"],axis=1)
dataset=dataset.dropna()
transaction_data=dataset.groupby("InvoiceNo")["StockCode"].apply(list).reset_index(name='Items')
print("The transaction data is:")
print(transaction_data.head())

Output:

The transaction data is:
  InvoiceNo                                              Items
0    536365  [85123A, 71053, 84406B, 84029G, 84029E, 22752,...
1    536366                                     [22633, 22632]
2    536367  [84879, 22745, 22748, 22749, 22310, 84969, 226...
3    536368                       [22960, 22913, 22912, 22914]
4    536369                                            [21756]

Finally, we will select the Items column from the dataframe to create a list of lists of items in each transaction using the tolist() method as shown below.

import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
dataset=dataset.drop(["Description","Quantity","InvoiceDate","UnitPrice","CustomerID","Country"],axis=1)
dataset=dataset.dropna()
transaction_data=dataset.groupby("InvoiceNo")["StockCode"].apply(list).reset_index(name='Items')
transactions=transaction_data["Items"].tolist()
print("The transactions are:")
print(transactions[0:10])

Output:

The transactions are:
[['85123A', '71053', '84406B', '84029G', '84029E', '22752', '21730'], ['22633', '22632'], ['84879', '22745', '22748', '22749', '22310', '84969', '22623', '22622', '21754', '21755', '21777', '48187'], ['22960', '22913', '22912', '22914'], ['21756'], ['22728', '22727', '22726', '21724', '21883', '10002', '21791', '21035', '22326', '22629', '22659', '22631', '22661', '21731', '22900', '21913', '22540', '22544', '22492', 'POST'], ['22086'], ['22632', '22633'], ['85123A', '71053', '84406B', '20679', '37370', '21871', '21071', '21068', '82483', '82486', '82482', '82494L', '84029G', '84029E', '22752', '21730'], ['21258']]

At this step, we obtained the list of lists we can use to implement the fp growth algorithm in Python.

Generate Frequent Itemsets and Association Rules

To generate the frequent itemsets from the list of transactions, we will use the following steps.

  • First, we will use the TransactionEncoder() function to generate the transaction array.
  • Next, we will use the fpgrowth() function to obtain the frequent itemsets. Here, we will use the minimum support of 0.02 and set the use_colnames parameter to True to use the column names of the input dataframe.

After execution of the fpgrowth() function, we will get the frequent itemsets with their support as shown below.

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth
import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
dataset=dataset.drop(["Description","Quantity","InvoiceDate","UnitPrice","CustomerID","Country"],axis=1)
dataset=dataset.dropna()
transaction_data=dataset.groupby("InvoiceNo")["StockCode"].apply(list).reset_index(name='Items')
transactions=transaction_data["Items"].tolist()
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
transaction_dataframe = pd.DataFrame(transaction_array, columns=transaction_encoder.columns_)
frequent_itemsets=fpgrowth(transaction_dataframe,min_support=0.02, use_colnames=True)
print("The frequent itemsets are:")
print(frequent_itemsets)

Output:

The frequent itemsets are:
      support               itemsets
0    0.086718               (85123A)
1    0.056680                (84879)
2    0.030386                (21754)
3    0.024363                (21755)
4    0.023243                (48187)
..        ...                    ...
215  0.021197  (22697, 22699, 22698)
216  0.021429        (23199, 85099B)
217  0.020000         (23203, 23202)
218  0.022471        (23203, 85099B)
219  0.021197         (23300, 23301)

[220 rows x 2 columns]

You can also find all the association rules using the association_rules() function as shown below.

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth,association_rules
import pandas as pd
dataset=pd.read_csv("ecommerce_transaction_dataset.csv")
dataset=dataset.drop(["Description","Quantity","InvoiceDate","UnitPrice","CustomerID","Country"],axis=1)
dataset=dataset.dropna()
transaction_data=dataset.groupby("InvoiceNo")["StockCode"].apply(list).reset_index(name='Items')
transactions=transaction_data["Items"].tolist()
transaction_encoder = TransactionEncoder()
transaction_array = transaction_encoder.fit(transactions).transform(transactions)
transaction_dataframe = pd.DataFrame(transaction_array, columns=transaction_encoder.columns_)
frequent_itemsets=fpgrowth(transaction_dataframe,min_support=0.02, use_colnames=True)
association_rules_df=association_rules(frequent_itemsets, metric="confidence", min_threshold=.50)
print("The association rules are:")
print(association_rules_df.head())

Output:

The association rules are:
  antecedents consequents  antecedent support  consequent support   support  \
0     (22727)     (22726)            0.041737            0.038726  0.024942   
1     (22726)     (22727)            0.038726            0.041737  0.024942   
2     (22386)    (85099B)            0.047529            0.082432  0.032162   
3     (21931)    (85099B)            0.046371            0.082432  0.028301   
4    (85099C)    (85099B)            0.036564            0.082432  0.022896   

   confidence       lift  leverage  conviction  
0    0.597595  15.431412  0.023326    2.388821  
1    0.644068  15.431412  0.023326    2.692261  
2    0.676686   8.208973  0.028244    2.838004  
3    0.610325   7.403939  0.024479    2.354698  
4    0.626188   7.596379  0.019882    2.454623  

Conclusion

In this article, we have discussed how to implement the fp growth algorithm in Python. To know more about data mining, you can read this article on the apriori algorithm in Python. You might also like this article on categorical data encoding techniques

I hope you enjoyed this article. Stay tuned for more informative articles. 

Happy Learning!

The post Implement FP Growth Algorithm in Python appeared first on Coding Infinite.

]]>
Categorical Data Encoding Techniques Explained https://codinginfinite.com/categorical-data-encoding-techniques-explained/ Wed, 31 May 2023 13:00:00 +0000 https://codinginfinite.com/?p=4854 To analyze categorical data, we need to convert them into numerical format. In this article, we will discuss different encoding techniques for converting categorical data into numeric format. How to Convert Categorical Data into Numerical Data? You can use the following encoding techniques to convert categorical into numeric data.  Let us discuss all the categorical...

The post Categorical Data Encoding Techniques Explained appeared first on Coding Infinite.

]]>
To analyze categorical data, we need to convert them into numerical format. In this article, we will discuss different encoding techniques for converting categorical data into numeric format.

How to Convert Categorical Data into Numerical Data?

You can use the following encoding techniques to convert categorical data into numeric data.

  1. Label encoding
  2. One-hot encoding
  3. Target encoding
  4. Entity Embedding

Let us discuss all the categorical data encoding techniques one by one.

Label Encoding

Label encoding is one of the easiest techniques for converting categorical data into numeric format. In label encoding, we just assign a unique integer to each categorical value. For example, consider that we have the following data containing city names. 

Sl. No. | City
1       | London
2       | Paris
3       | New York
4       | Mumbai
5       | New Delhi
Data For Label Encoding

Now, we will assign a unique integer starting from 0 to each city name as shown below.

Sl. No. | City      | Numeric Value
1       | London    | 0
2       | Paris     | 1
3       | New York  | 2
4       | Mumbai    | 3
5       | New Delhi | 4
Encoded Data

In the above table, we have used label encoding to convert the categorical data to a numeric format. Here, you can observe that the numeric labels have no inherent meaning. For instance, we have assigned the value 0 to London and 4 to New Delhi arbitrarily. Even if we assign 0 to New Delhi and 4 to London, the meaning of the data won't change. However, statistical and machine-learning algorithms might misinterpret these values as a ranking, treating New Delhi (4) as greater than London (0). Hence, label encoding is not suitable for nominal data types.

We can use label encoding for ordinal data types. Ordinal categorical data has an intrinsic order. Due to this, if we assign numeric labels in order of the rankings of the categorical labels, we will get meaningful numeric labels.

For example, consider that we have the following categories of customer reviews.

Sl. No. | Review Label
1       | Very Poor
2       | Poor
3       | Average
4       | Good
5       | Very Good
Ordinal Data for label encoding

Now, let us use label encoding to convert the categorical labels to numeric format as shown below.

Sl. No. | Review Label | Numeric Value
1       | Very Poor    | 0
2       | Poor         | 1
3       | Average      | 2
4       | Good         | 3
5       | Very Good    | 4
Encoded data

In the above table, we have labeled the categorical values in increasing order of the review label. Here, the numeric values have a specific meaning as the worst review has been assigned the value 0 and the best review has been assigned the value 4.

Hence, we can say that review with a value of 3 is better than a review with a value of 1. Thus, we can use label encoding to convert ordinal categorical data into numeric form without losing the meaning of the data labels.
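
The following snippet is a minimal sketch of label encoding for ordinal data using pandas. The review_data DataFrame and the explicit ordering of the labels are assumptions made only for illustration.

import pandas as pd

# Hypothetical review data used only for illustration
review_data = pd.DataFrame({"Review": ["Very Poor", "Poor", "Average", "Good", "Very Good"]})

# Define the intrinsic order of the ordinal labels explicitly, then map each
# label to an integer code (Very Poor -> 0, ..., Very Good -> 4)
review_order = ["Very Poor", "Poor", "Average", "Good", "Very Good"]
review_data["Review_Encoded"] = pd.Categorical(
    review_data["Review"], categories=review_order, ordered=True
).codes
print(review_data)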

One Hot Encoding

For encoding nominal data, one hot encoding is a better technique than label encoding. In one hot encoding, we transform the categories into an array of 0s and 1s. 

  • In the array, the number of columns is equal to the number of unique values in the categorical data. 
  • Each column in the array corresponds to a unique categorical variable and acts as a new variable. 
  • Each row in the array corresponds to a data point. 
  • To populate the cell at a particular row and column, we check whether the categorical value corresponding to the current column is present in the current row. If yes, we set the cell to 1. Otherwise, we set it to 0.

To understand this, consider that we have the following data containing 10 rows and 5 unique values. 

Sl. No. | City
1       | London
2       | Mumbai
3       | New York
4       | New Delhi
5       | Mumbai
6       | Paris
7       | New York
8       | Mumbai
9       | New Delhi
10      | London
Data For One-Hot Encoding

As there are 5 unique values in the City column, we will add 5 new columns to the dataset. Here, each column will correspond to a particular categorical variable as shown below.

Sl. No. | City      | City_London | City_Mumbai | City_NewYork | City_Paris | City_NewDelhi
1       | London    |             |             |              |            |
2       | Mumbai    |             |             |              |            |
3       | New York  |             |             |              |            |
4       | New Delhi |             |             |              |            |
5       | Mumbai    |             |             |              |            |
6       | Paris     |             |             |              |            |
7       | New York  |             |             |              |            |
8       | Mumbai    |             |             |              |            |
9       | New Delhi |             |             |              |            |
10      | London    |             |             |              |            |
Intermediate Table in One-Hot Encoding

Next, we fill each cell of the above table. We set a cell to 1 if the city corresponding to its column is the same as the city given in its row; otherwise, we set it to 0. After this, we get the following table.

Sl. No. | City      | City_London | City_Mumbai | City_NewYork | City_Paris | City_NewDelhi
1       | London    | 1           | 0           | 0            | 0          | 0
2       | Mumbai    | 0           | 1           | 0            | 0          | 0
3       | New York  | 0           | 0           | 1            | 0          | 0
4       | New Delhi | 0           | 0           | 0            | 0          | 1
5       | Mumbai    | 0           | 1           | 0            | 0          | 0
6       | Paris     | 0           | 0           | 0            | 1          | 0
7       | New York  | 0           | 0           | 1            | 0          | 0
8       | Mumbai    | 0           | 1           | 0            | 0          | 0
9       | New Delhi | 0           | 0           | 0            | 0          | 1
10      | London    | 1           | 0           | 0            | 0          | 0
One-Hot encoded data

In the above table, we have converted the categorical variables into 5 columns with numerical values. Hence, each city or categorical variable is represented using a vector of 0s and 1s. For example, the categorical value London is represented using the vector [1,0,0,0,0]. 

Again, the new columns can have any order, and the vector corresponding to a particular categorical value can be different too. However, one-hot encoding doesn’t misrepresent the data by introducing any order in the numeric values, unlike the label encoding approach. 

Although one hot encoding solves the problem of misrepresentation of the values, it runs into another major problem. If there are a lot of unique values for a particular categorical attribute, the dataset will become sparse as we need to add as many columns as the number of unique categorical values.

For example, if a categorical attribute has 30 unique values, we will have to add 30 columns to the dataset. Also, if there are 5 categorical attributes with 30 unique values each, we need to add 30*5 i.e. 150 new columns in the dataset. Due to this, the dataset will become very sparse. Thus, one hot encoding introduces sparsity in the dataset, which is its major drawback. 
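
Here is a minimal sketch of one-hot encoding with pandas. The city_data DataFrame below is a hypothetical stand-in for the table above.

import pandas as pd

# Hypothetical city data matching the one-hot encoding example above
city_data = pd.DataFrame({"City": ["London", "Mumbai", "New York", "New Delhi", "Mumbai",
                                   "Paris", "New York", "Mumbai", "New Delhi", "London"]})

# pd.get_dummies() creates one 0/1 column per unique city
one_hot_encoded = pd.get_dummies(city_data, columns=["City"], prefix="City", dtype=int)
print(one_hot_encoded)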

Suggested Reading: data visualization best practices

Target Encoding

As the name suggests, target encoding replaces a categorical variable with the mean or median of a target numeric variable. To understand this, consider the following dataset.

Grade | Marks
A     | 86
B     | 75
A     | 91
C     | 65
A     | 90
B     | 71
A     | 89
Data For target encoding

If we want to encode the categorical attribute Grade using Target Encoding, we will take the mean of Marks where the grade is A, B, and C separately.

  • The mean of the rows in the Marks column with grade A is 89.
  • The mean of the rows in the Marks column with grade B is 73.
  • The mean of the rows in the Marks column with grade C is 65.

Hence, we will impute the mean values in the place of grades as shown below.

Grade | Marks
89    | 86
73    | 75
89    | 91
65    | 65
89    | 90
73    | 71
89    | 89
target encoded data

Target encoding enables us to perform categorical data encoding easily if there is a numeric target attribute. However, if we don’t have a numeric target attribute, we can’t perform target encoding. 
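
Below is a minimal sketch of target encoding using a pandas groupby. The grades DataFrame mirrors the hypothetical Grade/Marks example above.

import pandas as pd

# Hypothetical Grade/Marks data matching the example above
grades = pd.DataFrame({"Grade": ["A", "B", "A", "C", "A", "B", "A"],
                       "Marks": [86, 75, 91, 65, 90, 71, 89]})

# Replace each grade with the mean of Marks for that grade
grade_means = grades.groupby("Grade")["Marks"].mean()
grades["Grade_Encoded"] = grades["Grade"].map(grade_means)
print(grades)

Note that, in practice, the group means should be computed on the training data only and then applied to the test data. Otherwise, the target values leak into the encoded feature.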

Entity Embedding

Entity embedding is one of the most recent and advanced techniques for encoding categorical data. In entity embedding, we use neural networks to create numerical embeddings for categorical values. Here, we create a unique numerical embedding, consisting of one or more columns, for each unique value in the categorical column. The number of embedding columns that replace the categorical column is usually chosen based on the number of unique values present in that column.

For example, consider that we have the following dataset. 

Grade | Marks
A     | 86
B     | 75
A     | 91
C     | 65
A     | 90
B     | 71
A     | 89
Data for entity embedding

Now, there are 3 unique values in the Grade column. So, we can replace it with a numerical column using entity embedding as shown below.

Grade      | Marks
1.3000624  | 86
-0.5184147 | 75
1.3000624  | 91
-0.5756403 | 65
1.3000624  | 90
-0.5184147 | 71
1.3000624  | 89
entity embedded data

You might be wondering how we obtained these numeric values. To understand this, you can read this article on entity embedding in Python.
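
As a rough illustration of the idea, the sketch below maps each grade to an integer index and passes it through an untrained embedding layer. It assumes PyTorch is installed, and the values it prints are randomly initialized rather than the ones shown in the table above, because in a real application the embedding is learned while training a neural network.

import torch
import torch.nn as nn

grades = ["A", "B", "A", "C", "A", "B", "A"]

# Map each unique grade to an integer index (A -> 0, B -> 1, C -> 2)
grade_to_index = {grade: index for index, grade in enumerate(sorted(set(grades)))}
indices = torch.tensor([grade_to_index[grade] for grade in grades])

# A one-dimensional embedding for the three categories; the weights here are
# randomly initialized and would normally be learned as part of a larger network
embedding_layer = nn.Embedding(num_embeddings=len(grade_to_index), embedding_dim=1)
grade_embeddings = embedding_layer(indices)
print(grade_embeddings.detach().numpy())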

Conclusion

In this article, we discussed different categorical data encoding techniques. To learn more about categorical data processing, you can read this article on KModes clustering in Python. You might also like this article on the apriori algorithm numerical example.

I hope you enjoyed reading this article. Stay tuned for more informative articles. 

Happy learning!

The post Categorical Data Encoding Techniques Explained appeared first on Coding Infinite.

Categorical Data Explained With Examples https://codinginfinite.com/categorical-data-explained-with-examples/ Thu, 25 May 2023 07:17:00 +0000

In data science, we work with data to produce insights that can help businesses solve problems. In real-world applications, much of the data is produced in a categorical format. Attributes like gender, day of the week, and names can only be represented in a textual or categorical format. In contrast, most machine learning algorithms and statistical methods work only with numerical data. To process categorical data, we need a way to convert it into a numerical format. In this article, we will discuss the different types of categorical data with examples and how to convert them into a numerical format.

What is Categorical Data?

Categorical data contains data points that represent distinct categories or groups rather than quantities measured on a numeric scale. Based on their nature, we divide categorical data into the following types.

  1. Nominal data
  2. Ordinal data
  3. Binary or Dichotomous Data

In the following sections, we will discuss these categorical data types with examples. However, let us first discuss the features of categorical data.

Features of Categorical Data

A dataset containing categorical data has only strings and labels. Due to this, categorical data shows the following properties.

  • Categorical data often represents qualitative attributes. Examples may include gender, education, customer satisfaction level, proficiency level, etc.
  • We often need to convert categorical data into a numerical format using different encoding methods. However, we can still analyze categorical data directly for the probability of occurrence, frequency, etc. 
  • We can also visualize categorical data using bar charts and pie charts. We use a bar chart to analyze the frequency of values in categorical data. On the other hand, we can use a pie chart to analyze the probability or percentage of a categorical value in the data (a short plotting sketch follows this list).
  • We can also represent categorical data using numeric values. However, they impart no real meaning to the data and work only as a label. We cannot perform arithmetic operations on such data. In the case of ordinal data, values can represent the level of the data point.
  • Categorical data must have discrete and finite values. Also, each data point should contain only one value for a single attribute. It will be very hard to analyze the data if it contains an infinite number of values or if a data point contains two or more categorical values for a single attribute. 
  • Categorical data does not have a consistent unit of measurement or a fixed scale. The differences between categories are qualitative rather than quantitative. For example, the difference between “male” and “female” in a gender variable is not measurable in a numeric sense.
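
The following is a minimal sketch of the bar chart and pie chart idea from the list above, using pandas and matplotlib. The cities Series is hypothetical data used only for illustration.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical categorical data
cities = pd.Series(["London", "Mumbai", "New York", "Mumbai", "London", "Mumbai"])

# Bar chart for frequencies and pie chart for percentages
counts = cities.value_counts()
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
counts.plot(kind="bar", ax=axes[0], title="Frequency of each city")
counts.plot(kind="pie", ax=axes[1], autopct="%1.1f%%", title="Percentage of each city")
plt.tight_layout()
plt.show()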

Different Types of Categorical Data

As discussed above, we can divide categorical data broadly into three categories. Let us discuss each of these one by one. 

Nominal Data 

Nominal data is used to represent names. We use nominal data to represent data containing brand names, colors, places, etc. The nominal values in a dataset have no particular order.

Ordinal data

As the name suggests, ordinal data represent categorical data that has some inherent order. Examples of ordinal data include the level of education, product ratings, customer satisfaction, etc.  We can represent ordinal data in the numeric format using the Likert scale.

Binary or Dichotomous Data

Binary or Dichotomous data includes data that can contain only two mutually exclusive values. Examples of binary data include values represented using Pass/Fail, Yes/No, True/False, etc.

Examples of Categorical Data

In real-world interactions, we use categorical data to represent data in various activities as shown in the following examples.

Brand Names

Brand names are represented using nominal data. For example, you can represent the brand names of mobile phones as shown in the following data.

Sl. No. | Brand Name
1       | Samsung
2       | Motorola
3       | Nokia
4       | Apple
5       | Sony
Example of Nominal Categorical Data

Level of Education

We can represent the level of education using ordinal data as they have an inherent order. The following table contains different levels of education with their designated level from lowest to highest.

Sl. No. | Level of Education
1       | Primary
2       | High School
3       | Under Graduate
4       | Post Graduate
5       | Doctorate
Ordinal Categorical Data Example

In the above table, the level of education from Primary to Doctorate can be represented in an order. Hence, it is an example of ordinal categorical data.

Interval Scales

In surveys, we often use interval scales to represent age, weight, marks, etc. The data represented using interval scales can be classified as ordinal data. For example, the following table contains different values for age in the interval scale.

Sl. No. | Interval
1       | 0 to 18 years
2       | 19 to 25 years
3       | 26 to 40 years
4       | 41 years and above
Ordinal Data Example

As you can observe, the above table classifies age into four categories using different intervals. Hence, we can say that these categories are examples of ordinal categorical data.

How to Convert Categorical Data to a Numeric Format for Analysis?

We cannot perform statistical analysis directly on the categorical data. Therefore, we need to convert the categorical data into numeric format. For this, we use different encoding techniques such as Label encoding, one-hot encoding, integer mapping, entity encoding, binary encoding, etc. All these methods to convert categorical data into numeric format have been discussed in this article on Encoding categorical data in Python.
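
As a quick, minimal sketch with hypothetical data, the snippet below applies two of these techniques with pandas: label encoding for an ordinal attribute and one-hot encoding for a nominal attribute.

import pandas as pd

# Hypothetical data with a nominal attribute (City) and an ordinal attribute (Review)
data = pd.DataFrame({"City": ["London", "Paris", "London", "Mumbai"],
                     "Review": ["Poor", "Good", "Average", "Good"]})

# Label encoding for the ordinal attribute, using an explicit order
review_order = ["Very Poor", "Poor", "Average", "Good", "Very Good"]
data["Review_Encoded"] = pd.Categorical(data["Review"], categories=review_order, ordered=True).codes

# One-hot encoding for the nominal attribute
data = pd.get_dummies(data, columns=["City"], prefix="City", dtype=int)
print(data)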

Conclusion

In this article, we discussed categorical data, its types, and examples. To learn more about data mining and machine learning concepts, you can read this article on how to implement the apriori algorithm in Python. You might also like this article on data cleaning.

I hope you enjoyed reading this article. Stay tuned for more informative articles. 

Happy Learning!

The post Categorical Data Explained With Examples appeared first on Coding Infinite.
