Handling Categorical Feature Variables in Machine Learning using Spark
Categorical feature variables, i.e. feature variables with a fixed set of unique values, appear in the training data set for many real world problems. However, categorical variables pose a serious problem for many Machine Learning algorithms. Some examples of such algorithms are Logistic Regression, Support Vector Machine (SVM) and any Regression algorithm.
In this post we will go over a Spark based solution to alleviate the problem. The solution implementation can be found in my open source projects chombo and avenir. We will be using CRM data as the use case.
Categorical Feature Variable Problem
The underlying data type of a categorical variable is string. However, the values are constrained as below.
- There is a finite set of unique values
- There is no ordering or any other relationship between the values
Some popular Machine Learning algorithms expect all the feature variables to be numeric. What do you do when you have categorical feature variables in your training data set and you want to use one of those algorithms? As alluded to earlier, examples of such algorithms are Logistic Regression and Support Vector Machine (SVM).
Most Decision Tree algorithm implementations also cannot handle categorical variables. My implementation of Decision Tree can handle categorical variables.
One popular solution is to have one numeric binary variable for each value of the categorical variable. For any particular value of the categorical variable, the binary variable in the corresponding position is set to 1 and the binary variables in all other positions are set to 0. In other words, each categorical variable value is replaced with a binary vector that has exactly one element set to 1. This is also known as One Hot Encoding.
Let’s consider the categorical variable color with value set (red, green, blue, yellow, brown, violet). Since the cardinality is 6, we need 6 numerical binary variables, one for each of the values in the set. For the color yellow, the binary value set will be (0, 0, 0, 1, 0, 0). Since yellow is the fourth value, only the fourth binary variable has been set to 1.
In this example, we have replaced one categorical variable with 6 binary variables. Effectively, we have added 5 additional feature variables to our training set.
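The color example can be sketched in a few lines of plain Python. This is only an illustration of the encoding itself, not the Spark implementation from chombo or avenir; the function name `one_hot` and the constant `COLORS` are made up for this sketch.

```python
# Illustrative sketch only; one_hot and COLORS are hypothetical names,
# not part of chombo or avenir.
COLORS = ["red", "green", "blue", "yellow", "brown", "violet"]


def one_hot(value, unique_values):
    """Return a list with 1 at the position of value and 0 everywhere else."""
    if value not in unique_values:
        raise ValueError(f"unknown categorical value: {value}")
    return [1 if v == value else 0 for v in unique_values]


print(one_hot("yellow", COLORS))  # [0, 0, 0, 1, 0, 0]
```

The length of the returned list always equals the cardinality of the variable, which is why one categorical column turns into as many binary columns as it has unique values.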
The solution involves two steps. In the first step, we find all the unique values for all the categorical variables in the data set. If this information is already available, the first step is not necessary. In the second step, we generate the dummy binary variables as outlined earlier.
Sales Lead Use Case
We will be using sales lead data as gleaned from a hypothetical CRM system. The context is that a Data Scientist wants to build a model that will predict whether a sales lead will convert or not. The Data Scientist wants to use SVM for building the model.
The data set contains 12 fields, including an ID field and the class variable. Among the feature variables, there are 4 categorical variables. Apart from the ID, the variables are as below.
- source of lead (categorical)
- lead contact type (categorical)
- lead company size (categorical)
- number of days in sales pipeline
- number of meetings with lead
- number of emails exchanged with the client
- number of web site visits by the lead
- number of demos shown to the client
- expected revenue from the deal
- proposal with price sent to the lead (categorical)
- converted (class label)
The first field, which is an ID, will obviously not be used for building the learning model. Excluding the first and the last field, there are 10 feature variables.
Here is some sample input data
Discovering Unique Values
The Spark job that finds all the unique values for categorical variables is implemented in the Scala object UniqueValueCounter. As mentioned before, if the unique values are already known, then running this job is not necessary.
The Spark job has 2 main steps. In the first step, a map operation generates paired records, with the column index as the key and a set containing the column value as the value. The second step performs a reduce by key operation whereby the sets of column values for each column are merged.
Here is the output from this Spark job. The first field is the column index. The remaining fields are the unique values for the column.
There is a case insensitivity configuration parameter available. If set to true, all categorical variable values are converted to lower case before processing.
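The map and reduce by key steps described above can be mimicked in plain Python. This is a minimal sketch of the logic only, assuming records arrive as lists of strings; it is not the code of UniqueValueCounter, and the function name, parameter names and sample rows are all invented for illustration.

```python
from collections import defaultdict


def find_unique_values(records, cat_columns, case_insensitive=False):
    """Collect the distinct values seen in each categorical column.

    records          -- iterable of records, each a list of string fields
    cat_columns      -- indexes of the categorical columns (hypothetical parameter)
    case_insensitive -- lower-case values before collecting them
    """
    # "map" step: emit one (column index, {value}) pair per categorical field
    pairs = []
    for rec in records:
        for idx in cat_columns:
            val = rec[idx].lower() if case_insensitive else rec[idx]
            pairs.append((idx, {val}))

    # "reduce by key" step: merge all the sets that share a column index
    merged = defaultdict(set)
    for idx, vals in pairs:
        merged[idx] |= vals

    # sort so the output order is stable from run to run
    return {idx: sorted(vals) for idx, vals in merged.items()}


# Made-up rows: an ID followed by two categorical fields
rows = [
    ["L001", "web", "manager"],
    ["L002", "referral", "Manager"],
    ["L003", "web", "engineer"],
]
print(find_unique_values(rows, [1, 2], case_insensitive=True))
```

With case insensitivity turned on, "manager" and "Manager" collapse into a single value, so the second column ends up with two unique values instead of three.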
Dummy Binary Value Generation
This Spark job is implemented in the Scala object BinaryDummyVariableGenerator. The unique value list for each categorical variable is provided through configuration. If they are not already known, they can be found by running the UniqueValueCounter job first.
The Spark job has a map function which, for each categorical variable, creates as many binary fields as the number of unique values for that categorical variable. As in the first job, there is a case insensitivity configuration parameter available for this Spark job also.
Here is some sample output. The 4 categorical fields have been replaced with 11 binary fields.
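The per-record expansion performed by the map function can be sketched as below in plain Python. This is an assumption-laden illustration rather than the BinaryDummyVariableGenerator code: the function name, the dictionary that stands in for the configuration, and the sample record are all hypothetical.

```python
def expand_record(record, unique_values_by_column, case_insensitive=False):
    """Replace each categorical field of a record with its binary dummy fields.

    unique_values_by_column maps a column index to the ordered list of unique
    values for that column (a stand-in for the job configuration).
    """
    out = []
    for idx, field in enumerate(record):
        values = unique_values_by_column.get(idx)
        if values is None:
            # not a categorical column: copy the field through unchanged
            out.append(field)
            continue
        key = field.lower() if case_insensitive else field
        if key not in values:
            raise ValueError(f"value {field!r} not configured for column {idx}")
        # one binary field per unique value, with exactly one of them set to 1
        out.extend("1" if v == key else "0" for v in values)
    return out


# Made-up configuration and record, for illustration only
config = {1: ["referral", "tradeshow", "web"], 3: ["no", "yes"]}
record = ["L001", "web", "12", "yes"]
print(",".join(expand_record(record, config)))  # L001,0,0,1,12,0,1
```

The two categorical fields in this made-up record become 3 + 2 = 5 binary fields, while the ID and the numeric field pass through untouched.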
High Cardinality Categorical Variables
What happens if there are categorical variables with high cardinality, i.e. too many unique values? With the binary dummy variable approach, or One Hot Encoding, too many new fields will be added and you will end up with an explosion of feature dimensions in your data set.
Too many feature dimensions are problematic for most Machine Learning algorithms. This is also known as the curse of dimensionality problem.
Binary encoding looks promising because it does not introduce as many new variables, but it’s faulty as we will find out soon. You choose the smallest n such that c ≤ 2^n, where c is the number of values in the categorical variable. Then you convert the position of each value into its binary representation. With this scheme the categorical variable will be replaced with n binary variables.
Going back to the example of color, n will be 3. The positions of the values will range from 0 through 5. Yellow is at position 3, so the binary encoding for the color yellow will be (0 1 1).
Although this scheme introduces only 3 variables, instead of 6 as with simple binary variables, essentially we have assigned a numerical value to each value of a categorical variable. The numerical value just happens to be represented with binary encoding.
We have essentially introduced a relationship, and to be more specific an ordering, between the values of a categorical variable. This goes against the definition of categorical variables.
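The binary encoding scheme for the color example can be sketched as follows in plain Python. The function name is made up, and the sketch assumes n is chosen as the smallest number of bits that can hold the largest position, c - 1.

```python
# Illustrative sketch; binary_encode is a hypothetical name.
COLORS = ["red", "green", "blue", "yellow", "brown", "violet"]


def binary_encode(value, unique_values):
    """Encode the position of value as a fixed-width list of bits."""
    c = len(unique_values)
    # smallest n with c <= 2**n, i.e. enough bits for positions 0 .. c - 1
    n = max(1, (c - 1).bit_length())
    pos = unique_values.index(value)
    # most significant bit first
    return [(pos >> shift) & 1 for shift in range(n - 1, -1, -1)]


for color in COLORS:
    print(color, binary_encode(color, COLORS))
# yellow is at position 3 and is encoded as [0, 1, 1]
```

The bit patterns are simply the positions 0 through 5 written in base 2, which makes the implied ordering easy to see: a learning algorithm has no way to tell these bits apart from a genuinely numeric quantity.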
In Label Encoding, there are no additional fields. Each categorical value is replaced with a number. However, it is as bad as Binary Encoding and for the same reason, i.e. it artificially introduces an order between the values.
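A minimal plain Python sketch of Label Encoding, with made-up names, shows where the artificial order comes from.

```python
# Illustrative sketch; label_encoder is a hypothetical name.
def label_encoder(unique_values):
    """Map each categorical value to its position in the unique value list."""
    return {value: position for position, value in enumerate(unique_values)}


codes = label_encoder(["red", "green", "blue", "yellow", "brown", "violet"])
print(codes["yellow"])                # 3
print(codes["red"] < codes["green"])  # True, although colors have no order
```

Any algorithm that compares or does arithmetic on these codes will treat violet as "greater" than red, even though nothing in the data supports that.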
If your data set has class labels, as in a training data set for supervised machine learning, the categorical variable values can be replaced with a numerical value using the Supervised Ratio or Weight of Evidence algorithms. In both algorithms, the numerical value depends on the correlation between the categorical variable value and the class label.
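Both encodings can be sketched in plain Python as below. The sketch assumes the commonly used definitions for a binary class label: for a value with p positive and n negative records, Supervised Ratio is p / (p + n), and Weight of Evidence is ln((p / P) / (n / N)), where P and N are the total positive and negative counts. The function name and the sample data are invented, and this is not the avenir implementation.

```python
import math
from collections import Counter


def supervised_ratio_and_woe(values, labels):
    """Return {value: (supervised ratio, weight of evidence)} for 0/1 labels."""
    pos = Counter(v for v, y in zip(values, labels) if y == 1)
    neg = Counter(v for v, y in zip(values, labels) if y == 0)
    total_pos = sum(pos.values())
    total_neg = sum(neg.values())

    result = {}
    for v in set(values):
        p, n = pos[v], neg[v]
        sr = p / (p + n)
        # Weight of Evidence is undefined when either count is zero;
        # this sketch returns None instead of applying any smoothing
        woe = math.log((p / total_pos) / (n / total_neg)) if p and n else None
        result[v] = (sr, woe)
    return result


# Made-up lead sources with converted (1) / not converted (0) labels
sources = ["web", "web", "web", "web", "referral", "referral", "referral", "referral"]
converted = [1, 0, 0, 0, 1, 1, 1, 0]
for value, (sr, woe) in sorted(supervised_ratio_and_woe(sources, converted).items()):
    print(value, sr, woe)
```

Unlike Label Encoding, the number assigned to each value here is derived from how that value relates to the class label, so the resulting order carries information instead of being arbitrary.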
You may face many preprocessing steps before a training data set is ready for building the machine learning model. The problem addressed in this article is one such example. The use case can be executed by following the steps in the tutorial document.
Originally posted here.