In my last post, we did some exploratory analytic for customer churn. We identified the parameters that have most influence on whether a customer account gets closed or not. We performed correlation analysis using Cramer index.
In this post, we will take the next step forward i.e., build a Bayesian prediction model for predicting customer churn. The Hadoop based Bayesian classifier implementation is part of my open source project avenir on github. We will use the same mobile service provider customer data as an example, as in the last post. New customer acquisition is expensive. So any predictive analytic solution, that can predict whether an existing customer will be closing his or her account or not is extremely valuable.
Bayesian classifier is a probabilistic prediction model founded on conditional probability. In our case, it predict the probability of a customer keeping his or her account open or closing it. The most well known application for Bayesian Classifier is spam filtering.
Every time you click on spam button in your email application, chances are that information gets used in building a Bayesian prediction model behind the scene. Bayesian classifiers are based on Bayes theorem, which uses conditional probability. It is defined as follows.
P(C|X) = P(X|C) * P (C) / P(X)
P(C|X) = Probability of class attribute value C given the feature value X , a.k.a class posterior probability
P(X|C) = Probability of feature attribute value X given the class attribute value C, a.k.a feature posterior probability
P(C) = Probability of the class attribute value C, a.k.a class prior probability
P(X) = Probability of of feature attribute value X, a.k.a feature prior probability
Feature attributes are attributes used to make a prediction. A class attribute is an attribute whose value is predicted based on all the feature attribute values. Since an entity has multiple features, the feature values form a vector of values.
P(X|C) is the joint probability of a vector of feature values conditioned by the class attribute value C. In the so called naive Bayes model, we assume the independence of the features. So the joint probability distribution turns into a product of individual probabilities, as below
P(X|C) = P(X(1)|C) P(X(2)|C) …P(X(n)|C)
P(X) = P(X(1)) P(X(2)) …P(X(n))
Referring to out example of mobile service provider customer, an example of P(X(i) | C) might be probability of data usage, given the account is closed. In our example P(C) is the probability of account being open or closed. P(X(j)) could be probability of minutes used. P(C|X) would be the probability of account status being open or closed given a set of account feature attribute values and this is what we are after.
Customer Account Data
As described in the earlier post, the account feature attributes are as follows. All the feature attributes are categorical.
- Minutes used
- Data used
- Customer Service Calls
- Payment history
- Account age
The class attribute is the account status. Our goal is to to predict the probability of account status being open or closed given a set of feature attribute values.
Any classification problem has two parts: training a model based on training data set and then the validation of the trained model. In our case the validation will consist of comparing the predicted account status with the actual actual account. The validation is done on a separate data set called the validation data set.
We will have one map reduce for training the model using the training data set and the second one to validate the model using the validation data set and the output of the first map reduce i.e., the Bayesian model.
Bayesian Model Map Reduce
This map reduce implemented by the class BayesianDistribution, takes the training data and essentially does counting for the the different probabilities needed as per Bayes theorem. These estimated probabilities will be close to the real probabilities when the sample size is large enough according to maximum likelihood principle. Here is some sample input for this map reduce.
VD6Z2289O2A1,med,med,med,average,1,open Q0GCO89ZJ1LV,med,high,low,good,1,closed 11Y0UP3KIB6V,med,low,low,average,2,closed 9N432K6QJGJO,overage,high,med,average,1,closed S40EG4GHTDGN,overage,low,low,average,3,closed 8FLU4G6KOHIE,high,med,low,average,2,open J3SHLF7AFEGN,med,med,low,good,4,open E0U2M1W5BMX0,med,low,med,poor,2,closed NMF0IWEIFBLL,overage,med,low,average,3,open
Counting is performed for the following
- Every possible set of class attribute and feature attribute value combination. It gets used for computing P(X|C)
- Every possible class attribute values. It gets used for computing P(C)
- Every possible set of feature attribute value combinations . It gets used for computing P (X)
Here is some sample output for the model. It contains various count values used for building the Bayesian model. Lot of insight can be gained by simple examination of these counts. For example, you will be able to detect the the unique combination of account feature attribute values that triggers account closing.
Each line has 4 fields and they are 1. class attribute value 2. feature attribute ordinal 3. feature attribute value 4. count. For numerical attribute, the attribute value is bin index as explained in a later section. The line with all values present, is used in computing P(X|C). A line with the first and fourth field only gets used in computing P(C). A line with the second, third and fourth field only gets used in computing P(X)
closed,5,4,1462 closed,,,1462 ,5,4,1462 open,1,high,1050 open,,,1050 ,1,high,1050 open,1,low,865 open,,,865
For categorical attribute, we count for each unique value of the attribute in the data. When the data is numerical, the traditional approach is to find the mean and standard deviation and fit a normal distribution. I have taken a different approach, instead force fitting an uni modal Gaussian distribution, which may not be most prudent choice in many cases. Using jargon, I am using a non parametric distribution model, rather than a parametric model.
I discretize the data, using the bin width as specified in the data schema. This way, I preserve, whatever the natural probability distribution for the attribute is. For example, minute usage distribution my not be uni modal, and there may three different peaks, corresponding low, medium and high usage.
The bin index goes into the output as the attribute value. The corresponding count is the count for that bin index.
Bayesian Prediction Map Reduce
This is another map reduce implementation, which has no reducer. It reads the output of the previous map reduce i.e., the different count values and build the Bayesian model. As a new record is processed in the mapper, it’s run through the model and the account status probabilities are generated. It’s implemented by the class BayesianPredictor.
Here is some sample output. The first part of the record is the original input from the validation data set. The field second from the last is the predicted account status and the last field is the associated probability, in a scale of 100. As you can see, in most cases the account status is correctly predicted. However in many cases the prediction is wrong.
7EEF3U8U26HI,high,med,low,average,2,open,open,51 MZFK7PA61W2K,med,low,low,good,4,closed,open,68 Q7A8A9C3A30C,high,high,low,good,1,closed,closed,56 C124V89X2M2F,high,med,low,poor,3,closed,closed,62 KA9T3KF0CW1B,med,med,low,good,1,open,open,72 WRQ1TJN96VWN,low,med,high,poor,1,open,closed,79
Here we have calculated probabilities of class attribute values (open and closed). The output includes only the class attribute value with the highest probability.
Using the Bayesian model, we calculate the probabilities of the different class attribute values e.g. open or closed. The value associated with the highest probability is chosen as the predicted value. The output consists of that value and the associated probability in the last two fields.
The prediction map reduce also generates additional metric as Hadoop counters. These metrics are based on the so called confusion matrix. If a class attribute has n possible values, then the confusion matrix is a n x n matrix where the columns are the predicted class values and the rows are the actual values. For a perfect classifier, only the diagonal elements of the confusion matrix will be non zero and the off diagonal elements will be zero. Here is a 2 x 2 confusion matrix.
|True Positive(TP)||False Negative(FN)|
|False Positive(FP)||True Negative(TN)|
For our example, account close is associated with positive class attribute. For any business use case, there are costs associated with inaccurate predictions as indicated by the off diagonal elements. However, the cost may not be symmetric.
For our example, false negative is costlier than false positive. In our case the false negative counts stands for the number of cases where the predicted value of account status is open, where it’s actual value is closed. it’s costly because, the prediction gives the business a false sense of security that the accounts are safe and will remain open.
There are various metrics based on the confusion matrix. They are as follows. Correctness of the overall prediction is expressed by accuracy.
accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP +FN)
precision = TP / (TP + FP)
Proportion of positive cases predicted correctly by the classifier to the actual number of positive cases is indicated by recall. The proportion cases that are actually positive to the cases predicted to be positive is indicated by precision. Here the different metric values as processed by Hadoop counters
13/02/18 23:01:56 INFO mapred.JobClient: Validation 13/02/18 23:01:56 INFO mapred.JobClient: Recall=60 13/02/18 23:01:56 INFO mapred.JobClient: Correct=632 13/02/18 23:01:56 INFO mapred.JobClient: Precision=67 13/02/18 23:01:56 INFO mapred.JobClient: TrueNagative=307 13/02/18 23:01:56 INFO mapred.JobClient: FalseNegative=214 13/02/18 23:01:56 INFO mapred.JobClient: Accuracy=63 13/02/18 23:01:56 INFO mapred.JobClient: Incorrect=368 13/02/18 23:01:56 INFO mapred.JobClient: TruePositive=325 13/02/18 23:01:56 INFO mapred.JobClient: FalsePositive=154
They are all processed under a counter group called Validation. As we can see, the quality of our model could be better. In data mining, the quality of the model depends heavily on the quantity and quality of the data.
Prediction output can be controlled in two ways. First, you may be interested in some specific outcome only e.g., account closed. You may only be interested in the probability of account closed, whether or not it’s the most probable outcome. The desired class attribute value is specified through the configuration parameter. Actually you can specify a subset of the possible values for the class attribute.
Second, you are interested the class attribute value with the highest probability i.e., the most probable outcome. This is the approach taken for our example
There is a caveat when multiple class attribute values are specified. In such cases, you may want to define a threshold, such that the class attribute value with the highest probability value will have to have a probability higher than the sum of the probability of the class attribute with the next highest probability and the defined threshold. Basically, we are making the class attribute value with highest probability more discriminative. This is defined through the configuration parameter.
For example, using a scale of 100 if this threshold value is set to 30 and the probability of close is found to be higher that the probability of open for a record, then it will be classified as closed only if P(close) >= P(open) + 30. When the threshold is set, the output will consist of an extra field as the last field. It’s value will be either classified, when the the threshold condition is met. Otherwise, it’s set to ambiguous. When ambiguous, the classifier is saying “I give up. I can’t figure this out”
Bayesian classifiers are simple and it’s not very computationally intensive to build models with large data sets. Another advantage of Bayesian classifiers over other classifiers is that it can easily handle incremental training data. It’s not necessary to re process the whole training data set when new data arrives incrementally.
In data mining, there is no one size fit all algorithm and solution. Whether Bayesian classifier works for you will depend on the problem and the data set.