Classification problems involve predicting a response variable based on a set of feature variables for some entity. But there is another problem whose solution is a prerequisite for solving a classification problem. We may want to know which of the feature variables are most strongly correlated with the response variable. Once we have identified those, we may want to use only that subset of the feature variables to build the prediction model.
To put this in context, we will use the customer churn prediction problem, specifically for customers of a mobile telecom service provider. A customer may have attributes like minutes used, data used, number of customer service calls etc. We are interested in identifying the feature attributes that are critical to effectively predicting whether a customer will close his or her account. Reducing the number of features used to build a prediction model is generally beneficial. It’s known as feature subset selection.
For numerical variables, the correlation between two variables is simply the covariance normalized by the product of the two standard deviations. For categorical variables, i.e. variables having a finite set of unordered values, it’s more complex. There are several approaches for handling categorical data. The correlation statistic we will be using is called the Cramer Index.
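As a quick illustration of the numerical case, here is a minimal Python sketch of covariance divided by the product of the two standard deviations. The function name and the sample values are hypothetical and are not taken from the data set discussed in this post.

```python
import math

def pearson_correlation(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

# Hypothetical values: minutes used and data used for five customers.
minutes = [120.0, 340.0, 560.0, 410.0, 90.0]
data_mb = [200.0, 500.0, 900.0, 650.0, 150.0]
r = pearson_correlation(minutes, data_mb)
print(r)
```

The result always lies between -1 and 1. The sketch assumes neither variable is constant, since a zero standard deviation would cause a division by zero.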
The basic building block for many correlation statistics between categorical variables is the Contingency Matrix. If a variable a has n possible values and a variable b has m possible values, then the contingency matrix will be an n x m matrix. Each cell of the matrix contains a count of the number of samples that have the corresponding attribute value pair.
If we consider the attribute pair minutes used (MU) and account status (AS), we have a 3 x 2 Contingency Matrix as shown below
|MU(low) : AS(open)||MU(low) : AS(closed)|
|MU(med) : AS(open)||MU(med) : AS(closed)|
|MU(high) : AS(open)||MU(high) : AS(closed)|
Just inspecting the matrix may provide valuable insight. For example, if we see a high value in the bottom right cell, i.e. minutes used high and account status closed, we know that customers with high usage are closing their accounts, possibly because there is no suitable plan for high minute usage.
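To make the counting concrete, here is a small Python sketch that builds such a matrix from (minutes used, account status) value pairs. The helper name and the sample pairs are made up for illustration. This is not the avenir code.

```python
from collections import Counter

def contingency_matrix(pairs, row_values, col_values):
    """Count how many samples fall into each (row value, column value) cell."""
    counts = Counter(pairs)
    return [[counts[(r, c)] for c in col_values] for r in row_values]

# Hypothetical (minutes used, account status) samples, for illustration only.
samples = [
    ("low", "open"), ("low", "open"), ("med", "open"), ("med", "closed"),
    ("high", "closed"), ("high", "closed"), ("high", "open"),
]
mu_values = ["low", "med", "high"]
as_values = ["open", "closed"]
matrix = contingency_matrix(samples, mu_values, as_values)

# Rows follow mu_values, columns follow as_values.
for label, row in zip(mu_values, matrix):
    print(label, row)
```

Each row corresponds to a minutes used value and each column to an account status, matching the layout of the 3 x 2 matrix above.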
This is how the Cramer Index is defined in terms of the Contingency Matrix. The index depends on how concentrated the values are across the cells.
CramerIndex = (sum(n(i,j) * n(i,j) / (nr(i) * nc(j))) - 1) / (min(numRow, numCol) - 1)
n(i,j) = value of the (i,j) cell of the contingency matrix
nr(i) = sum of values over all columns for the i-th row
nc(j) = sum of values over all rows for the j-th column
sum = sum over all i and j
The Cramer Index always lies between 0 and 1, with 0 indicating the weakest correlation and 1 the strongest.
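The formula can be sketched in a few lines of Python. This is only an illustration of the definition above, not the Hadoop implementation. The function name is hypothetical, and the sketch assumes the matrix is a list of rows with at least two rows and two columns.

```python
def cramer_index(matrix):
    """Cramer Index of a contingency matrix given as a list of rows of counts."""
    num_rows, num_cols = len(matrix), len(matrix[0])
    if min(num_rows, num_cols) < 2:
        raise ValueError("need at least two rows and two columns")
    row_totals = [sum(row) for row in matrix]        # nr(i)
    col_totals = [sum(col) for col in zip(*matrix)]  # nc(j)
    total = 0.0
    for i, row in enumerate(matrix):
        for j, n_ij in enumerate(row):
            # Assumption: cells in an empty row or column contribute nothing.
            if row_totals[i] > 0 and col_totals[j] > 0:
                total += n_ij * n_ij / (row_totals[i] * col_totals[j])
    return (total - 1.0) / (min(num_rows, num_cols) - 1)

# Counts concentrated on the diagonal: strongest correlation.
print(cramer_index([[10, 0], [0, 10]]))  # 1.0
# Counts spread evenly: weakest correlation.
print(cramer_index([[5, 5], [5, 5]]))    # 0.0
```

The two calls show the extremes of the range: fully concentrated counts give 1, and evenly spread counts give 0.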
The attributes to be correlated are specified through the configuration parameters source.attributes and dest.attributes. In the setup() method of the mapper, an instance of a contingency matrix is created for every possible attribute pair from the two sets.
The Cramer Index implementation is part of my open source project avenir, which contains a collection of classification and prediction algorithms implemented on Hadoop. This specific map reduce implementation is available here.
As the mapper processes each record, the values for each possible attribute pair from the two sets are extracted from the record. The value pair is used to locate a cell in the corresponding contingency matrix, and its value is incremented.
In the cleanup() method of the mapper, the contingency matrix for each attribute pair is emitted. The key is the attribute pair and the value is the serialized contingency matrix.
On the reducer side, the contingency matrices for a given key, i.e. attribute pair, are aggregated, and the Cramer Index is calculated from the final aggregated contingency matrix. The reducer emits the attribute pair followed by the Cramer Index.
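The mapper and reducer flow described above can be imitated in plain Python, without Hadoop. This is a rough sketch under stated assumptions, not the avenir code. The function names are hypothetical, attributes are identified by field position in a comma separated record, and the matrix dimensions are taken from the values actually observed.

```python
from collections import Counter, defaultdict

def map_partition(records, source_attrs, dest_attrs):
    """Mimic one mapper: build a local contingency matrix per attribute pair."""
    matrices = defaultdict(Counter)
    for record in records:
        fields = record.split(",")
        for s in source_attrs:
            for d in dest_attrs:
                matrices[(s, d)][(fields[s], fields[d])] += 1
    return matrices  # emitted once per mapper, as in cleanup()

def reduce_pair(partial_matrices):
    """Mimic the reducer: sum the partial matrices, then compute the Cramer Index."""
    cells = Counter()
    for partial in partial_matrices:
        cells.update(partial)  # Counter.update adds counts cell by cell
    row_totals, col_totals = Counter(), Counter()
    for (r, c), n in cells.items():
        row_totals[r] += n
        col_totals[c] += n
    total = sum(n * n / (row_totals[r] * col_totals[c])
                for (r, c), n in cells.items())
    # Assumption: dimensions come from observed values, each with at least two.
    return (total - 1.0) / (min(len(row_totals), len(col_totals)) - 1)

# Made-up records in the layout: id, feature, feature, status.
records = [
    "C1,low,good,open", "C2,high,poor,closed", "C3,med,good,open",
    "C4,high,average,closed", "C5,low,poor,open", "C6,med,average,closed",
]
partitions = [records[:3], records[3:]]  # two simulated mappers
mapped = [map_partition(p, [1, 2], [3]) for p in partitions]
indexes = {pair: reduce_pair([m[pair] for m in mapped]) for pair in mapped[0]}
for pair, index in indexes.items():
    print(pair, index)
```

Because the cell counts are simply added, the way records are split across mappers does not affect the aggregated matrix. That is why each mapper can emit its matrices only once, at the end.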
Customer Churn Analysis
The mobile service provider customer data I am using has the following feature attributes. The data covers a period going into the past.
- minutes used (low, med, high, overage)
- data used (low, med, high)
- customer service calls (low, med, high)
- payment history (poor, average, good)
- account age (low, med, high)
Here is some sample input. The first field is the customer ID and the last field is the response attribute, i.e. the account status. The remaining fields are the feature attributes as listed above.
KX9LBZ3ZVLII,med,med,med,poor,4,open
94PMT4ZQU47W,overage,high,low,average,1,closed
DIINUH7HZUUX,low,high,med,good,4,open
H6W0HROO0H2X,high,low,high,average,4,open
31P1TG4RTGQI,overage,med,low,average,1,closed
GTL7W53933LU,high,med,low,good,4,closed
JU39F4BSB70Z,overage,low,low,poor,2,open
2A4RURJLJ5EZ,high,high,low,good,3,open
FS2DZZ2VK063,low,low,low,good,2,open
B3U4OECQ628K,med,med,med,poor,3,closed
5OWQFS2EGIKV,med,low,low,good,2,open
JU2JVU0WL1Y1,overage,med,low,average,2,open
Most of the attributes above are numerical. They have been discretized into categorical values. The response variable is account status which is either open or closed. Each feature variable is paired with the response variable and the corresponding Cramer Index is emitted as output.
Here is the output. From the output, we can see that minutes used has the strongest correlation with account status.
minUsed,status,0.022663449872222907
dataUsed,status,0.0038947124486503615
CSCalls,status,0.010164336836900434
payment,status,0.00905448707197265
acctAge,status,0.0030426459155057373
Not only does the correlation calculation help you identify the feature attributes that are critical to the final prediction, it also provides valuable insight. Sometimes that insight is all you need, even if you don’t go all the way to solving the full-blown prediction problem.
For example, if high minutes used is found to have the strongest correlation with account closing, a mobile service provider could proactively seek out such customers and offer them alternative calling plans before they leave. Here is the tutorial for the example.