Auto Training and Parameter Tuning for a ScikitLearn based Model for Leads Conversion Prediction

This is a sequel to my last blog on CRM leads conversion prediction using Gradient Boosted Trees as implemented in ScikitLearn. The focus of this blog is automatic training and parameter tuning for the model. The implementation is available in my open source project avenir.

The auto training logic used here does not depend on any particular supervised learning algorithm, so it can be applied to any of them.

The framework around ScikitLearn used here facilitates building predictive models without having to write Python code. I will be adding other supervised learning algorithms to this framework in the future.

Machine Learning Pipeline Optimization

A machine learning pipeline for supervised learning typically has the following 4 stages. Except for the first stage, they all require optimization to find the best pipeline for the data set you have. In this post, our focus is on the last stage only.

  1. Data pre processing and clean up
  2. Feature Engineering with feature selection or feature reduction
  3. Supervised learning model selection
  4. Algorithm specific parameter tuning

Model Complexity, Error Rate and Training Data Size

For any predictive modeling problem you will be juggling these 3 parameters. There is a complex non-linear relationship between them, as alluded to in my earlier blog under the section Machine Learning Commandments.

Essentially, it’s a complex optimization problem, where you want to find the combination of parameters and data set size that yields the minimum test error. If there are practical limits on the available training data size, then it’s a constrained optimization problem.

Model complexity is dictated by various algorithm specific parameters and the number of features used in training the model. It is measured by the VC dimension.

There is a complex non-linear relationship between VC dimension and training data size. As the VC dimension increases, you need more training data for the test error to stay below a threshold with a given probability.
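To make the relationship concrete, here is a minimal sketch based on one commonly quoted form of Vapnik's bound on the gap between test and training error. The function name and the VC dimension values are hypothetical and chosen only for illustration. The bound is usually very loose in practice, so only the trend matters.

```python
import math


def vc_generalization_gap(vc_dim, n_samples, delta=0.05):
    """Upper bound on (test error - training error), holding with probability 1 - delta.

    Uses sqrt((h * (ln(2N / h) + 1) + ln(4 / delta)) / N), where h is the
    VC dimension and N is the training data size.
    """
    if not 0 < vc_dim < n_samples:
        raise ValueError("bound is only meaningful for 0 < vc_dim < n_samples")
    numerator = vc_dim * (math.log(2.0 * n_samples / vc_dim) + 1.0) + math.log(4.0 / delta)
    return math.sqrt(numerator / n_samples)


# Hypothetical VC dimensions: the bound shrinks with more data
# and grows with model complexity.
for vc_dim in (10, 50):
    for n_samples in (5000, 7000, 50000):
        gap = vc_generalization_gap(vc_dim, n_samples)
        print(f"vc_dim={vc_dim:3d}  n={n_samples:6d}  gap bound={gap:.3f}")
```

For a fixed VC dimension the bound falls as the training data grows, and for a fixed data size it rises with the VC dimension.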

There are two kinds of error rates, training error and test error. Test error is the error on a data set that is different from the training data set but was generated by the same underlying process as the training data.

Test error is the sum of training error and generalization error. In the ideal scenario, for a properly trained model, training error and test error converge, i.e., generalization error approaches zero.

A more complex model requires more training data. However, for various reasons, the training data size may be limited. You may not have access to training data beyond some size.

Auto Training and Parameter Tuning

As alluded to in my last blog, I have built an abstraction framework around ScikitLearn so that you can build predictive models through configuration, without writing any Python code. I have added a method called autoTrain() to the wrapper class for the predictive model. Auto training works as follows:

  1. Searches through the parameter space. For each combination of parameter values, does k fold cross validation and reports the test error.
  2. Reports the parameter combination that yields the smallest test error.
  3. For the best parameter combination found in step 2, trains a model and reports the training error.
  4. Reports the average of training and test error and the difference between test and training error.
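The four steps above can be sketched directly with scikit-learn. This is not the autoTrain() implementation in avenir, only an illustration under stated assumptions: the data is synthetic, the parameter values are hypothetical and kept small so that it runs quickly, and step 3 trains on the full data set rather than on a subsample.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the leads data set (assumption, not the real data).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hypothetical search space; every combination is tried (grid search).
search_space = {"learning_rate": [0.08, 0.12], "n_estimators": [20, 40]}

results = []
for values in product(*search_space.values()):
    params = dict(zip(search_space.keys(), values))
    model = GradientBoostingClassifier(random_state=42, **params)
    # Step 1: k fold cross validation; test error = 1 - mean accuracy.
    accuracy = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    results.append((params, 1.0 - accuracy))

# Step 2: parameter combination with the smallest test error.
best_params, test_error = min(results, key=lambda item: item[1])

# Step 3: train with the best parameters and measure the training error.
best_model = GradientBoostingClassifier(random_state=42, **best_params).fit(X, y)
train_error = 1.0 - best_model.score(X, y)

# Step 4: average error and error difference.
avg_error = (train_error + test_error) / 2.0
error_diff = test_error - train_error
print("best parameters:", best_params)
print(f"training error {train_error:.3f} test error {test_error:.3f}")
print(f"average error {avg_error:.3f} error difference {error_diff:.3f}")
```

scikit-learn's GridSearchCV covers steps 1 and 2 in a single call. The explicit loop is used here only to keep each step visible.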

To run in autoTrain mode, the following parameters need to be set in the properties configuration file.

  1. Set mode common.mode=autoTrain
  2. Set train.search.param.strategy=guided
  3. Define the parameters to be used for the parameters search space as train.search.params=train.search.learning.rate:float,train.search.num.estimators:int
  4. Set train.search.learning.rate=0.8,0.12
  5. Set train.search.num.estimators=100,120
  6. Set max test error train.auto.max.test.error=0.06
  7. Set max average error threshold  train.auto.max.error=0.04
  8. Set test and train error difference max threshold train.auto.max.error.diff=0.02

In step 2, the parameter space search strategy is set. Currently only grid search is supported, i.e., brute force search through all possible parameter value combinations. In the future, I will support more intelligent search and optimization algorithms, e.g., Simulated Annealing.

In step 3, I have used only 2 parameters to define the search space. You could add more parameters, e.g., train.search.max.depth. The search space parameters have the same names as the parameters in simple train mode, except that the word search is inserted in the parameter names.

In steps 4 and 5, I have provided a list of values for each parameter specified in step 3. You could add more values. I have used only 2 values for each parameter, resulting in 4 iterations through the search space.
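To see where the 4 iterations come from, here is a small sketch that turns the property values from steps 3 to 5 into a list of parameter combinations. The parsing code is hypothetical and is not how avenir reads its configuration. The property values are held in a plain dict purely for illustration.

```python
from itertools import product

# Property values as listed in steps 3 to 5 above.
config = {
    "train.search.params": "train.search.learning.rate:float,train.search.num.estimators:int",
    "train.search.learning.rate": "0.8,0.12",
    "train.search.num.estimators": "100,120",
}

converters = {"float": float, "int": int}

# Build {parameter name: list of typed values} from the declarations.
search_space = {}
for declaration in config["train.search.params"].split(","):
    name, type_name = declaration.split(":")
    convert = converters[type_name]
    search_space[name] = [convert(value) for value in config[name].split(",")]

# Grid search visits the Cartesian product of the value lists.
combinations = [
    dict(zip(search_space, values)) for values in product(*search_space.values())
]
for combination in combinations:
    print(combination)
print(len(combinations), "combinations")
```

With 2 values for each of the 2 parameters, the product contains 2 x 2 = 4 combinations.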

The stopping criteria are specified through steps 6, 7 and 8. There are two ways for training to complete successfully. Based on step 6, if the test error is below a threshold, training is considered to have completed successfully.

Through steps 7 and 8, constraints are placed on bias error and generalization error. Training is considered to have completed successfully when both the bias and generalization errors are below the specified thresholds. The average of training and test error is an approximation of the error due to bias.

Four Outcomes of Auto Training

These 4 outcomes refer to the second training completion criterion discussed above, based on bias error and generalization error. Depending on the average error and the error difference you get, there are 4 possible scenarios as below. For each outcome there is a remedial action that may be taken to improve your model.

Average error is the average of training error and test error. For a given complexity and enough training data, this is an estimate of the true converged error. It reflects the bias in the model.

average error = (training error + test error) / 2 
error difference = test error – training error

Error difference is the difference between the test and training error i.e., generalization error.

|                                  | Average error below threshold | Average error above threshold |
| Error difference below threshold | Case 1: Best scenario. You are done. Model has converged with acceptable error | Case 2: Model has converged but with high error level. Should use more complex model |
| Error difference above threshold | Case 3: Model has not converged, although error level is acceptable. Need more training data | Case 4: Worst scenario. Increase model complexity and use more training data |
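The four cases can be expressed as a small decision function. This is a minimal sketch, not the framework's code: the function name is hypothetical, the error values are illustrative, and the use of strict comparisons is an assumption.

```python
def classify_outcome(train_error, test_error, max_avg_error, max_error_diff):
    """Return the case number (1 to 4) and the suggested remedial action."""
    avg_error = (train_error + test_error) / 2.0
    error_diff = test_error - train_error
    avg_ok = avg_error < max_avg_error      # bias is acceptable
    diff_ok = error_diff < max_error_diff   # generalization error is acceptable
    if avg_ok and diff_ok:
        return 1, "done: converged with acceptable error"
    if diff_ok:
        return 2, "converged with high error: use a more complex model"
    if avg_ok:
        return 3, "not converged: need more training data"
    return 4, "increase model complexity and use more training data"


# Illustrative error values and thresholds only.
print(classify_outcome(0.030, 0.040, max_avg_error=0.04, max_error_diff=0.02))
print(classify_outcome(0.027, 0.048, max_avg_error=0.04, max_error_diff=0.02))
```

The first call falls into case 1 and the second into case 3, because its error difference of 0.021 exceeds the 0.02 threshold while its average error stays below 0.04.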

A low test error should also be used as a convergence criterion. As we will see later, sometimes a large drop in error due to bias will pull the test error lower while the generalization error remains large. In other words, the smallest generalization error may not correspond to the smallest test error.

Auto Training First Run

You may want to keep these lecture notes open, if you want to verify and interpret the results from auto train. This is what happened when I ran autoTrain. My results corresponded to case 3 as shown below. As is evident, the model needed more training data to reduce the generalization error further.

all parameter search results
train.learning.rate=0.8  train.num.estimators=100  	0.074
train.learning.rate=0.8  train.num.estimators=120  	0.073
train.learning.rate=0.12  train.num.estimators=100  	0.067
train.learning.rate=0.12  train.num.estimators=120  	0.063
best parameter search result
train.learning.rate=0.12  train.num.estimators=120  	0.063
train.learning.rate  0.12
train.num.estimators  120
...building model
subsample size  4000
...training model
error with training data 0.040
Auto training  completed: training error 0.040 test error: 0.063
Average of test and training error: 0.052 test and training error diff: 0.023
High generalization error. Need larger training data set

I ran autoTrain with a training data size of 5000. With 5 fold cross validation, 4000 records were used for training and 1000 for validation.
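The split sizes follow directly from how k fold cross validation partitions the data. A quick check with scikit-learn's KFold, assuming plain, unstratified folds (the framework's exact splitting strategy is not stated here):

```python
import numpy as np
from sklearn.model_selection import KFold

records = np.arange(5000)  # stand-in for the 5000 training records

# Each of the 5 folds holds out one fifth of the records for validation.
fold_sizes = [
    (len(train_idx), len(valid_idx))
    for train_idx, valid_idx in KFold(n_splits=5).split(records)
]
print(fold_sizes)
```

Every fold trains on 4000 records and validates on the remaining 1000. With 7000 records the same split gives 5600 and 1400.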

Auto Training with Increased Training Data

Since the test and training error didn’t converge, i.e., the generalization error was high, I was asked to retrain with a larger training data set. I increased the training data size to 7000, ran autoTrain again and voila, we have success.

It’s possible that you don’t have a bigger training data set. In that case you have to settle for the suboptimal solution.

all parameter search results
train.learning.rate=0.8  train.num.estimators=100  	0.065
train.learning.rate=0.8  train.num.estimators=120  	0.065
train.learning.rate=0.12  train.num.estimators=100  	0.060
train.learning.rate=0.12  train.num.estimators=120  	0.055
best parameter search result
train.learning.rate=0.12  train.num.estimators=120  	0.055
train.learning.rate  0.12
train.num.estimators  120
...building model
subsample size  5600
...training model
error with training data 0.039
Auto training  completed: training error 0.039 test error: 0.055
Average of test and training error: 0.047 test and training error diff: 0.016
Successfullt trained. Low generalization error and low error level
...training the final model
...building model
...training model
error with training data 0.041
...saving model
training error in final model 0.041

The generalization error, i.e., the difference between test and training error, dropped from 0.023 to 0.016.

When training is successful, as it is now, the final model is trained and saved. The model may be ready to be used in production deployment.

Auto Training with Increased Model Complexity

Although the test error dropped from 0.063 to 0.055 in the second run, I noticed that I could get an even better result by shifting to a different region of the parameter search space, one that reflects more complex models.

I am exploring a small region of the parameter search space with the 2 parameters, and unless I happen to be lucky, it’s natural that the best parameter value combination is yet to be found.

Based on the trend of the test error across the various parameter combinations, I decided to shift to a different region of the parameter search space as defined below and give it another try.

train.search.learning.rate=0.14,0.16
train.search.num.estimators=140,160

Here are the results. There is a further drop in test error from 0.055 to 0.048, as we can see below. What’s interesting is that the generalization error has gone up from 0.016 to 0.021. This can be attributed to the fact that we have increased the error due to variance by increasing the model complexity.

Training error has dropped from 0.039 to 0.027, which reflects the fact that we have reduced the error due to bias by adopting more complex models.

The net effect can be summarized as follows. The reduction in error due to reduced bias has more than offset the increase in error due to higher variance, resulting in a net decrease in test error.
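The arithmetic behind this can be checked with the training and test errors reported in the outputs of the second and third runs. As elsewhere in this post, the training error is used as a rough proxy for the error due to bias.

```python
# Training and test errors reported by the second and third runs.
train_before, test_before = 0.039, 0.055
train_after, test_after = 0.027, 0.048

# Test error = training error + generalization error,
# so the change in test error splits the same way.
bias_change = train_after - train_before
gen_change = (test_after - train_after) - (test_before - train_before)
test_change = test_after - test_before

print(f"change in training error:       {bias_change:+.3f}")
print(f"change in generalization error: {gen_change:+.3f}")
print(f"change in test error:           {test_change:+.3f}")
```

The training error falls by 0.012 while the generalization error rises by 0.005, which leaves a net drop of 0.007 in test error.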

all parameter search results
train.learning.rate=0.14  train.num.estimators=140  	0.050
train.learning.rate=0.14  train.num.estimators=160  	0.049
train.learning.rate=0.16  train.num.estimators=140  	0.049
train.learning.rate=0.16  train.num.estimators=160  	0.048
best parameter search result
train.learning.rate=0.16  train.num.estimators=160  	0.048
train.learning.rate  0.16
train.num.estimators  160
...building model
subsample size  5600
...training model
error with training data 0.027
Auto training  completed: training error 0.027 test error: 0.048
Average of test and training error: 0.038 test and training error diff: 0.021
High generalization error. Need larger training data set

In this case, because the generalization error has exceeded our defined threshold, the final model was not built and saved.

Summing Up

You may be wondering why I had to make so many runs if I am doing auto training. The answer is that I chose a very small region of the parameter search space for instant gratification. You could cast a wider net by choosing a wider parameter search space.

You can widen the parameter search space by including more relevant parameters and using a bigger range of values for each parameter. You might be able to accomplish everything in one run. However, with a wider parameter search space, processing time will increase linearly with the number of parameter value combinations being tried.
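The number of combinations is the product of the number of values per parameter, so it grows quickly as the search space widens. A quick calculation, where the wider search space and the helper name are hypothetical:

```python
from math import prod


def grid_size(values_per_parameter):
    """Number of parameter value combinations a grid search has to evaluate."""
    return prod(values_per_parameter)


small = grid_size([2, 2])     # 2 parameters with 2 values each, as used in this post
wide = grid_size([5, 5, 5])   # hypothetical: 3 parameters with 5 values each
print(small, wide)

# With 5 fold cross validation, each combination means 5 model fits.
print(small * 5, wide * 5)
```

Going from 4 to 125 combinations raises the number of cross validation model fits from 20 to 625.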

My goal was to build a framework around ScikitLearn for training predictive models with configuration only and without any Python code.

The tutorial has been updated with content for auto training. Please follow the steps there if you want to try running in auto train mode.
