Sales leads are generally managed and nurtured in CRM systems. It would be nice if we could predict the likelihood of any lead converting to an actual deal. This could be beneficial in many ways, e.g. proactively providing special care for weak leads and projecting future revenue.
In this post we will go over a predictive modeling solution built on the Python ScikitLearn Machine Learning library. We will be using Gradient Boosted Trees (GBT), a powerful and popular supervised learning algorithm.
Along the way, we will also see how the abstraction layer I have built around ScikitLearn supervised learning algorithms works. The abstraction layer, along with property file based configuration management, makes model building and model life cycle management significantly easier. The implementation can be found in my OSS project avenir.
Gradient Boosted Trees
Boosting is an ensemble technique in Machine Learning. An ensemble combines multiple simple or weak models to create a more powerful model. Boosting algorithms differ in how the simple models are combined. Here are the main steps in boosting.
- Build an initial simple model
- Build a next model based on the prediction accuracy of the models built so far, taken together. This step addresses the shortcomings of the simple models built so far
- Repeat step 2 until the stopping condition is reached
With Gradient Boosted Trees (GBT), a numerical optimization is performed where the objective is to minimize the loss of the ensemble by adding simple or base learners using a gradient descent procedure. The base learners for GBT are regression trees, which are added to the ensemble in an additive way. Since the gradient is taken with respect to the function being learned, not a fixed set of parameters, the procedure is also known as Functional Gradient Descent (FGD).
The shortcomings of the existing ensemble, when adding a new simple model, are measured by the gradient of the loss. More specifically, the parameters of the new base model are chosen such that the loss is reduced by moving in the direction of the negative gradient of the loss function.
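The boosting steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the ScikitLearn implementation: it assumes a single numeric feature, squared loss (for which the negative gradient is simply the residual) and one-split regression trees (stumps) as base learners. The function names are made up for this example.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals by least squares.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Return (predict, losses) for an additive ensemble of stumps."""
    base = sum(ys) / len(ys)  # initial simple model: the mean of the targets
    stumps = []
    preds = [base] * len(xs)
    losses = []
    for _ in range(n_estimators):
        # For the loss 0.5 * (y - F)^2, the negative gradient w.r.t. F is y - F.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Move the ensemble a small step along the negative gradient.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, losses
```

Each new stump is fitted to what the current ensemble still gets wrong, so the mean squared training error in `losses` never increases from one iteration to the next.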
Sales Lead Data from CRM
We will be using sales lead data from a fictitious CRM system with the following attributes as our use case.
- source of lead e.g. trade show, web download etc.
- lead contact type e.g. has recommendation authority
- lead company size
- number of days so far in CRM pipeline
- number of meetings so far with lead and others in lead’s company
- number of emails exchanged so far with lead and others in lead’s company
- number of web site visits so far by the lead and others in lead’s company
- number of demos so far
- expected revenue from the deal
- whether proposal with price quote sent
- whether the lead converted
Since most of the feature attributes are time dependent, i.e. they accumulate with the time spent in the sales pipeline, the prediction also depends on the time spent in the pipeline.
The first attribute is an ID which is ignored. The last attribute is the target or class label. The rest are feature attributes. Here is some sample data.
7E561S3X62,referral,canReccommend,large,50,7,14,4,3,49679,N,1
RH19V26CX5,tradeShow,canDecide,large,58,6,10,8,4,46127,N,0
3GXMOW46MA,referral,canReccommend,large,84,6,8,3,3,30000,N,1
3WYD4CY31A,advertisement,canDecide,medium,42,9,5,2,2,44659,N,0
GUXRCLV835,webDownload,canReccommend,large,43,4,8,1,3,46172,N,0
9XXGCOBGWR,webDownload,canDecide,small,23,5,12,1,4,45789,N,0
SGZS58ESSU,webDownload,canReccommend,medium,67,6,12,6,3,38449,N,0
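To make the record layout concrete, here is a small sketch that splits one sample record into the ID, the feature attributes and the class label. The field names are my own labels for the attributes listed above, not names used by avenir.

```python
# Hypothetical names for the 10 feature attributes, in record order.
FEATURE_NAMES = [
    "source", "contact_type", "company_size", "days_in_pipeline",
    "num_meetings", "num_emails", "num_visits", "num_demos",
    "expected_revenue", "proposal_sent",
]
NUMERIC = {
    "days_in_pipeline", "num_meetings", "num_emails",
    "num_visits", "num_demos", "expected_revenue",
}


def parse_lead(line):
    """Split a CSV record into (lead_id, features, label)."""
    fields = line.strip().split(",")
    if len(fields) != len(FEATURE_NAMES) + 2:
        raise ValueError(f"expected {len(FEATURE_NAMES) + 2} fields, got {len(fields)}")
    features = {}
    for name, raw in zip(FEATURE_NAMES, fields[1:-1]):
        features[name] = int(raw) if name in NUMERIC else raw
    # First field is the ID (ignored for modeling), last field is the class label.
    return fields[0], features, int(fields[-1])


lead_id, features, label = parse_lead(
    "7E561S3X62,referral,canReccommend,large,50,7,14,4,3,49679,N,1")
print(lead_id, features["source"], features["expected_revenue"], label)
```

The categorical attributes are left as strings here; they would still need to be encoded numerically before being fed to a tree based learner.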
Predictive Modeling Framework
To make it easier and to keep it in line with the predictive model development life cycle, I have created an abstraction wrapper class on top of ScikitLearn supervised learning algorithms. The API of the abstraction class consists of these essential methods, each of them corresponding to a phase in the predictive model development life cycle.
- train() : Builds predictive model and reports the training error. Used to decide the trade off between model complexity and training data size, by keeping training error within acceptable limit.
- trainValidate(): Builds predictive model, cross validates and reports test or generalization error. Uses K Fold cross validation or comparable techniques. You can do parameter tuning to minimize generalization error by searching the parameter space.
- predict(): Makes prediction. This is the method that gets called when the predictive model is deployed for use.
- validate(): Makes prediction using existing predictive model and newer data and reports error. Used to detect predictive model drift, using newer data and an existing model.
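The K Fold cross validation mentioned for trainValidate() can be sketched as follows. This is a generic illustration in plain Python, not the code inside the wrapper class. The `fit` and `error` callables are placeholders for whatever learner and error metric you plug in, and the data is assumed to be shuffled already.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validated_error(fit, error, X, y, n_folds=5):
    """Average the held-out error over all folds."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(X), n_folds):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        errors.append(error(model,
                            [X[i] for i in test_idx],
                            [y[i] for i in test_idx]))
    return sum(errors) / len(errors)
```

Because each model is scored only on records it has not seen, the averaged value is an estimate of the generalization error rather than the training error.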
Parameter tuning during training and validation is an optimization problem, where our goal is to find the combination of parameter values that gives us the lowest generalization error.
Depending on the number of parameters and the number of candidate values for each, you may be up against a combinatorial explosion problem, running into millions of possible combinations of parameter values.
Exhaustively searching the parameter space is not practical in such a scenario. Machine Learning libraries generally offer grid search or random search for parameter tuning.
I am working on various stochastic optimization algorithms for parameter tuning. The user will be able to choose the parameter optimization technique desired with appropriate configuration.
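To get a feel for the combinatorial explosion, the sketch below counts the combinations in a parameter grid and draws a handful of random candidates instead of enumerating all of them. The parameter names and candidate values are placeholders chosen only for the arithmetic.

```python
import math
import random


def grid_size(space):
    """Number of combinations an exhaustive search would have to evaluate."""
    return math.prod(len(values) for values in space.values())


def random_search(space, n_trials, seed=0):
    """Sample n_trials parameter combinations uniformly at random."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_trials)]


# Hypothetical search space: 8 parameters with 6 candidate values each.
space = {f"param_{i}": list(range(6)) for i in range(8)}
print(grid_size(space))  # 6 ** 8 = 1679616 combinations
for candidate in random_search(space, n_trials=3):
    print(candidate)
```

With 5 fold cross validation, every one of those combinations would cost five model fits, which is why sampling or a smarter optimizer becomes attractive.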
With the framework and the provided driver code in avenir, you can use ScikitLearn predictive modeling algorithms without writing any Python code. A comprehensive property file based configuration makes this possible.
The configuration parameters are divided into multiple groups as below. Except for common, each group has direct correspondence to the framework methods listed above.
- common : These configuration parameters are algorithm agnostic and are required for all of the framework's methods
- train: Contains configuration parameters for the train() and trainValidate() methods. Not all parameters under this group get used by train() or trainValidate()
- predict: Contains configuration parameters for predict() when the model gets deployed in production
- validate: Contains configuration parameters for validate() used to detect model drift, after it’s been deployed and in use
Here is the complete list of configuration parameters with explanations. Each configuration parameter name is prefixed with one of the group names listed above. The values are to be treated as samples. You are free to change them. Gradient Boosting related parameters are indicated along with the corresponding ScikitLearn parameter names.
Default value is indicated by _. You can also use None, to indicate that no value is specified for a parameter. If a configuration parameter is mandatory, has no default and is not provided, an exception gets thrown.
|Name and Value||Comment|
|= trainValidate||mode of execution|
|= model||model save directory|
|= crm_gb_model||saved model file name|
|common.preprocessing = _||preprocessing steps|
|= leads_5000.txt||input data file name|
|= 0,1,2,3 etc.||comma separated list of column indexes|
|= 0,1,2 etc||comma separated list of feature column indexes|
|= 17||class field index|
|train.validation = kfold||cross validation method|
|= 5||number of folds|
|= 4||GBT specific (min_samples_split)|
|= 4||GBT specific (min_samples_leaf)|
|= 0.1||GBT specific (min_weight_fraction_leaf)|
|= 3||GBT specific (max_depth)|
|= None||GBT specific (max_leaf_nodes)|
|= _||GBT specific (max_features)|
|= 0.10||GBT specific (learning_rate)|
|= 100||GBT specific (n_estimators)|
|train.subsample = _||GBT specific (subsample)|
|= _||GBT specific (loss)|
|= _||GBT specific (init)|
|= 100||GBT specific (random_state)|
|train.verbose = _||GBT specific (verbose)|
|= _||GBT specific (warm_start)|
|train.presort = _||GBT specific (presort)|
|train.criterion = _||GBT specific|
|train.success.criterion = error||whether to output performance metric or its inverse|
|= False||whether to save model|
|= accuracy||GBT specific|
|parameter tuning optimization strategy|
|train.search.params =, etc||parameters to be used for parameter tuning|
|= leads_1000.txt||input file for prediction|
|= 1,2||comma separated list of column indexes|
|= 0,1, etc||comma separated list of feature column indexes|
|= True||whether saved trained model should be used|
|= leads_5000.txt||input file for validation|
|, etc||comma separated list of column indexes|
|, etc||comma separated list of feature column indexes|
|= 17||class field index|
|= False||whether saved trained model should be used|
|= confusionMatrix||performance metric|
This article provides good guidance and details on configuration parameters for Gradient Boosted Trees in ScikitLearn.
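The conventions for `_`, `None` and mandatory parameters can be sketched as below. This is not the avenir configuration code. It is a minimal illustration that assumes a missing entry is treated the same as `_`, and the parameter name `train.data.file` is hypothetical.

```python
def parse_properties(text):
    """Parse 'name = value' lines into a dict, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        props[name.strip()] = value.strip()
    return props


def get_param(props, name, default=None, mandatory=False):
    """Resolve a parameter using the '_' (default) and 'None' conventions."""
    value = props.get(name, "_")  # assumption: missing behaves like '_'
    if value == "_":
        if mandatory and default is None:
            raise ValueError(f"missing mandatory parameter {name}")
        return default
    if value == "None":
        return None  # explicitly no value
    return value


props = parse_properties("""
train.validation = kfold
train.subsample = _
train.verbose = None
""")
print(get_param(props, "train.validation"))              # kfold
print(get_param(props, "train.subsample", default=1.0))  # 1.0
print(get_param(props, "train.verbose", default=0))      # None
```

Values come back as strings in this sketch; converting them to int, float or bool would be the job of typed accessor functions.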
Parameter Space Search for Optimum Tuning
When the mode is trainValidate and the parameter tuning optimization strategy parameter is set, a search is performed through the parameter space to find the optimum combination of parameter values.
The parameters to be included in the search space need to be provided as a comma separated list through the parameter train.search.params. Each parameter specified in train.search.params should have a list of comma separated values, instead of one.
Currently only guided search is supported, where the user needs to provide all the values for each parameter to be included in the search. I am working on implementing and supporting a few other stochastic optimization algorithms.
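A guided search of this kind amounts to taking the cross product of the value lists. The sketch below shows the idea. Only train.search.params is a name from the configuration above. The two searched parameter names, their values, and the assumption that train.search.params holds fully prefixed names are all made up for illustration.

```python
import itertools


def expand_search_space(props):
    """Return one dict per combination of the searched parameter values."""
    names = [n.strip() for n in props["train.search.params"].split(",")]
    value_lists = [[v.strip() for v in props[name].split(",")] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


# Hypothetical parameter names and values.
props = {
    "train.search.params": "train.learning.rate,train.num.estimators",
    "train.learning.rate": "0.05,0.10,0.15",
    "train.num.estimators": "50,100,150",
}
combos = expand_search_space(props)
print(len(combos))  # 3 values x 3 values = 9 combinations
for combo in combos:
    print(combo)
```

Each resulting dictionary would then be used for one train and validate run, and the combination with the lowest cross validated error kept.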
Machine Learning Commandments
In building an optimal predictive model, we have the following two free parameters to play with.
- Training data size
- Model complexity
Sometimes you are limited by a maximum training data size. In that case you take the largest training data size and play around with the model complexity parameters.
The relationship between the training data size, model complexity and error rate is complex and is characterized as follows.
- For a given model complexity, training error increases with training data size, asymptotically approaching the true error.
- For a given model complexity, test or generalization error decreases with training data size, asymptotically approaching the true error.
- If the difference between the training and test error is large even with the largest training data set you have, you may need more training data for the two errors to converge.
- If the training error and test error have converged but at a high error value, you have a simple model without enough complexity. You need to increase the model complexity.
- For a given training data size, training error decreases with model complexity
- For a given training data size, test error decreases with model complexity up to a point of optimal complexity and then starts increasing.
- The optimal complexity of a model increases with training data size and then reaches a plateau, beyond which additional training data does not make any difference, because the model has achieved sufficient complexity
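The third and fourth rules above can be turned into a small decision helper. This is only a sketch: the thresholds for what counts as a large gap or an acceptable error are arbitrary placeholders that you would choose for your own problem.

```python
def diagnose(train_error, test_error, acceptable_error=0.10, max_relative_gap=0.25):
    """Suggest a next step from the training and test error rates."""
    if train_error > 0:
        gap = (test_error - train_error) / train_error
    else:
        gap = float("inf") if test_error > 0 else 0.0
    if gap > max_relative_gap:
        # The two errors have not converged yet.
        return "get more training data"
    if test_error > acceptable_error:
        # Converged, but at a high error value.
        return "increase model complexity"
    return "acceptable"


print(diagnose(0.05, 0.10))  # get more training data
print(diagnose(0.20, 0.22))  # increase model complexity
print(diagnose(0.05, 0.06))  # acceptable
```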
Predictive Model Training Workflow
Based on our knowledge of the interplay between training data size, model complexity and error rate, we can define the following workflow for building predictive models.
- For some model complexity, train models with increasing data size and find the data size where the error rate seems to plateau. In this step you may be limited by the maximum available data size.
- If the training error rate is unacceptable, increase model complexity and repeat step 1. Again you may be limited by the maximum amount of available training data.
- For the data size from the previous step, train and validate model using parameter search. Perturb some key parameters around the fixed set of values used in step 2. Find the optimal parameters.
- If there is a large gap between test and training error, go back to step 1 with the model complexity obtained from step 3 and repeat from there
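Step 1 of the workflow can be tried directly with ScikitLearn, outside the avenir wrapper. The sketch below uses synthetic data from make_classification rather than the CRM data, and smaller data sizes to keep it quick, so the numbers it prints are not comparable with the results in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier


def training_errors(sizes, n_estimators=100, learning_rate=0.1, seed=100):
    """Train on the first n rows for each n and return {n: training error}."""
    # Synthetic stand-in for the lead data; flip_y adds some label noise.
    X, y = make_classification(n_samples=max(sizes), n_features=10,
                               n_informative=5, flip_y=0.1, random_state=seed)
    errors = {}
    for n in sizes:
        model = GradientBoostingClassifier(n_estimators=n_estimators,
                                           learning_rate=learning_rate,
                                           max_depth=3, random_state=seed)
        model.fit(X[:n], y[:n])
        # score() returns accuracy, so the error rate is its complement.
        errors[n] = 1.0 - model.score(X[:n], y[:n])
    return errors


if __name__ == "__main__":
    for n, err in training_errors([500, 1000, 2000]).items():
        print(f"{n} training error {err:.3f}")
```

The data size to carry forward is the one beyond which the printed training error stops changing much.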
Results from Training a Model
For the training phase, for some initial model complexity parameters, I trained the model with training data size of 2500, 5000 and 10000. Here are the results with training error.
2500
running mode: train
...building model
...training model
error with training data 0.043

5000
running mode: train
...building model
...training model
error with training data 0.054

10000
running mode: train
...building model
...training model
error with training data 0.057
The training error rate seems to level off at a data size of 5000. This corresponds to step 1 of the workflow above. The two key parameters that we will use for training and validation with parameter search are the learning rate and the number of trees. For the training phase I used the ScikitLearn default values for both.
Next, we will train and validate with k fold cross validation using a training data size of 5000, which corresponds to step 3. We will consider 3 possible values for each of the 2 parameters, resulting in 9 combinations. I chose those two among many because they seem to be the most critical parameters.
Here are the results for the 9 possible combinations of the 2 parameters. It also shows the parameter value combination corresponding to the smallest error rate.
all parameter search results
0.126 0.114 0.098 0.114 0.096 0.076 0.093 0.078 0.063
best parameter search result
0.063
The generalization error of 0.063 is acceptable, and it’s 17% more than the training error of 0.054. The optimal values of the 2 parameters for the lowest generalization error are slightly different from what I used for training. The training phase values for the 2 parameters are ScikitLearn default values.
I ran it in the train mode again with the optimal values of the 2 parameters found from the train and validate run. Here is the result.
running mode: train ...building model ...training model error with training data 0.043
Interestingly, the gap between the training and generalization error increased. Now the test or generalization error is 46% more than the training error. According to commandment #3 above, we need more training data, e.g. 6000 or 7000, and need to start over. I haven’t done that. If it piques your curiosity, you could try.
My parameter search space consisted of only 2 parameters. By no means can I claim that I have the optimal parameter values for the lowest generalization error, because the search space is not exhaustive enough. If you are curious, you could include more parameters and see if you can find better parameter values.
In predictive modeling, there is a complex and nonlinear relationship between model complexity, training data size and the generalization error. We need a model complex enough to reflect the complexity of the underlying process that generates the data. For a model with a given complexity we need enough training data. Finding the optimal model is an iterative process.
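The two percentages quoted for the gap are the test error's excess over the training error, relative to the training error. As a quick check of the arithmetic:

```python
def relative_gap(train_error, test_error):
    """Excess of test error over training error, relative to training error."""
    return (test_error - train_error) / train_error


# Training with default parameter values vs. best cross validated error.
print(f"{100 * relative_gap(0.054, 0.063):.1f}%")  # 16.7%
# Training with the tuned parameter values vs. the same cross validated error.
print(f"{100 * relative_gap(0.043, 0.063):.1f}%")  # 46.5%
```

The generalization error is the same 0.063 in both cases. The gap widened only because the tuned parameters pushed the training error down from 0.054 to 0.043.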
In this post we have focused on training the predictive model. In a future post, I will discuss the other life cycle phases of model development, i.e. production deployment for prediction, model drift and retraining.
The tutorial document has the details on how to generate the data and execute the Python driver code to call the GBT wrapper class methods.