This is a sequel to my last blog on CRM leads conversion prediction using Gradient Boosted Trees as implemented in ScikitLearn. The focus of this blog is automatic training and parameter tuning for the model. The implementation is available in my open source project avenir.
The auto training logic used here does not depend on any particular supervised learning algorithm and can be applied to any of them.
The framework around ScikitLearn used here facilitates building predictive models without having to write Python code. I will be adding other supervised learning algorithms to this framework in the future.
Machine Learning Pipeline Optimization
A machine learning pipeline for supervised learning typically has the following 4 stages. Except for the first stage, they all require optimization to find the best pipeline for your data set. In this post, our focus is on the last stage only.
- Data pre processing and clean up
- Feature Engineering with feature selection or feature reduction
- Supervised learning model selection
- Algorithm specific parameter tuning
Model Complexity, Error Rate and Training Data Size
For any predictive modeling problem you will be juggling these 3 parameters. There is a complex nonlinear relationship between them, as alluded to in my earlier blog under the section Machine Learning Commandments.
Essentially, it’s a complex optimization problem, where you want to find the combination of parameters and data set size that yields the minimum test error. If there are practical limits on the available training data size, then it’s a constrained optimization problem.
Model complexity is dictated by various algorithm specific parameters and the number of features used in training the model. It is measured by a quantity called the VC dimension.
There is a complex nonlinear relationship between VC dimension and training data size. As the VC dimension increases, you need more training data for the test error to stay below a threshold with a given probability.
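One commonly quoted form of this relationship is Vapnik's bound for binary classification. It is shown here only as a reference sketch; the symbols $h$ (VC dimension), $N$ (training data size) and $\delta$ (allowed failure probability) are introduced for illustration and are not used elsewhere in this post. With probability at least $1 - \delta$:

```latex
E_{\text{test}} \;\le\; E_{\text{train}} + \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) - \ln\frac{\delta}{4}}{N}}
```

The square root term grows with $h$ and shrinks with $N$, which is why a more complex model needs more training data to keep the gap between test and training error small.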
There are two kinds of error rates, training error and test error. Test error is the error on a data set that is different from the training data set but was generated by the same underlying process.
Test error is the sum of training error and generalization error. In the ideal scenario, for a properly trained model, training error and test error converge, i.e., generalization error approaches zero.
A more complex model requires more training data. However, for various reasons, the training data size may be limited. You may not have access to training data beyond some size.
Auto Training and Parameter Tuning
As alluded to in my last blog, I have built an abstraction framework around ScikitLearn so that you can build predictive models through configuration, without writing any Python code. I have added a method called autoTrain() to the wrapper class for the predictive model. Auto training works as follows:
- Searches through the parameter space. For each combination of parameter values, does k fold cross validation and reports the test error.
- Reports the parameter combination that yields the smallest test error.
- For the best parameter combination as found in step 2, trains a model and reports the training error.
- Reports the average of training and test error and the difference between test and training error
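The search in the first two steps can be sketched in a few lines of plain Python. This is a minimal illustration of grid search, not the avenir implementation: the parameter names `learning_rate` and `num_estimators`, the value lists, and the `fake_fold_errors` stand-in for real k fold cross validation are all assumptions made for the example.

```python
from itertools import product
from statistics import mean


def grid_search(search_space, fold_errors):
    """Try every parameter combination; return (best_params, best_error, results).

    search_space: dict mapping a parameter name to the list of values to try.
    fold_errors:  hypothetical callable that takes a params dict and returns the
                  per-fold validation errors from k fold cross validation.
    """
    names = list(search_space)
    results = []
    for values in product(*(search_space[n] for n in names)):
        params = dict(zip(names, values))
        # Test error for this combination is the mean error over the folds.
        results.append((params, mean(fold_errors(params))))
    best_params, best_error = min(results, key=lambda r: r[1])
    return best_params, best_error, results


def fake_fold_errors(params):
    # Stand-in for real cross validation, purely for illustration:
    # the error shrinks as both parameters grow.
    base = 1.0 / (params["learning_rate"] * params["num_estimators"])
    return [base, base * 1.1, base * 0.9]


# Hypothetical search space with 2 values per parameter, i.e. 4 combinations.
space = {"learning_rate": [0.10, 0.12], "num_estimators": [100, 120]}
best_params, best_error, results = grid_search(space, fake_fold_errors)
print(len(results), best_params)
```

In a real run, `fake_fold_errors` would be replaced by a function that trains the model on k-1 folds and measures the error on the held out fold.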
To run autoTrain(), the following parameters need to be set in the properties configuration file.
- Set mode
- Define the parameters to be used for the parameters search space as train.search.params=
- Set max test error
- Set max average error threshold
- Set test and train error difference max threshold
In step 2, the parameter space search strategy is set. Currently only grid search is supported, i.e., a brute force search through all possible parameter value combinations. In the future, I will support more intelligent search and optimization algorithms, e.g., Simulated Annealing.
In step 3, I have used only 2 parameters to define the search space. You could add more parameters. The search space parameters have the same names as the parameters in simple train mode, except that the word search is inserted in the name.
In steps 4 and 5, I have provided a list of values for each parameter specified in step 3. You could add additional values. I have used only 2 values for each parameter, resulting in 4 iterations through the search space.
The stopping criteria are specified through steps 5, 6 and 7. There are two ways for successfully completing the training. Based on 5, if the test error is below a threshold, training is considered to have successfully completed.
Through 6 and 7, constraints are placed on bias error and generalization error. Training is considered to have successfully completed when both the bias and generalization errors are below the specified thresholds. The average of training and test error is an approximation of the error due to bias.
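The two completion criteria can be written as a small predicate. This is a sketch of the rule as described above, not the framework's code; the function name, argument names and threshold values are hypothetical, chosen only so that the example is self-contained.

```python
def training_succeeded(train_err, test_err, max_test_err, max_avg_err, max_err_diff):
    """Return True when either of the two completion criteria is met."""
    avg_err = (train_err + test_err) / 2   # approximates error due to bias
    err_diff = test_err - train_err        # generalization error
    # Criterion 1: the test error alone is low enough.
    if test_err < max_test_err:
        return True
    # Criterion 2: both bias and generalization error are low enough.
    return avg_err < max_avg_err and err_diff < max_err_diff


# Hypothetical thresholds, for illustration only.
thresholds = dict(max_test_err=0.04, max_avg_err=0.06, max_err_diff=0.02)
print(training_succeeded(0.030, 0.035, **thresholds))  # passes on criterion 1
print(training_succeeded(0.040, 0.055, **thresholds))  # passes on criterion 2
print(training_succeeded(0.030, 0.055, **thresholds))  # fails: error difference too large
```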
Four Outcomes of Auto Training
These 4 outcomes refer to the second training completion criterion discussed above, based on bias error and generalization error. Depending on the average error and the error difference you get, there are 4 possible scenarios as below. For each outcome there is a remedial action that may be taken to improve your model.
Average error is the average of training error and test error. For a given complexity and enough training data, this is an estimate of the true converged error. It reflects the bias in the model. The difference between test and training error is the generalization error.
average error = (training error + test error) / 2
error difference = test error – training error
Error difference is the difference between the test and training error i.e., generalization error.
| | Average error below threshold | Average error above threshold |
|---|---|---|
| Error difference below threshold | Case 1: Best scenario. You are done. Model has converged with acceptable error | Case 2: Model has converged but with a high error level. Should use a more complex model |
| Error difference above threshold | Case 3: Model has not converged, although the error level is acceptable. Need more training data | Case 4: Worst scenario. Increase model complexity and use more training data |
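The table translates directly into a lookup on two boolean checks. The sketch below is an illustration under assumed thresholds (0.06 for average error, 0.02 for error difference); the actual threshold values used in my runs are not listed in this post, and the function name is hypothetical.

```python
def diagnose(train_err, test_err, max_avg_err, max_err_diff):
    """Map training and test error to one of the 4 cases and its remedial action."""
    avg_ok = (train_err + test_err) / 2 < max_avg_err
    diff_ok = (test_err - train_err) < max_err_diff
    if avg_ok and diff_ok:
        return 1, "done, model has converged with acceptable error"
    if diff_ok:
        return 2, "use a more complex model"
    if avg_ok:
        return 3, "use more training data"
    return 4, "increase model complexity and use more training data"


# Assumed thresholds, for illustration only.
MAX_AVG_ERR, MAX_ERR_DIFF = 0.06, 0.02
print(diagnose(0.040, 0.063, MAX_AVG_ERR, MAX_ERR_DIFF))
```

With these assumed thresholds, a training error of 0.040 and a test error of 0.063 fall into case 3.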
Lower test error should also be used as a convergence criterion. As we will see later, sometimes a large drop in error due to bias will pull the test error lower while the generalization error remains large. In other words, the smallest generalization error may not correspond to the smallest test error.
Auto Training First Run
You may want to keep these lecture notes open, if you want to verify and interpret the results from auto train. This is what happened when I ran autoTrain. My results corresponded to case 3 as shown below. As is evident, the model needed more training data to reduce the generalization error further.
all parameter search results
0.074 0.073 0.067 0.063
best parameter search result
0.063 0.12 120
...building model
subsample size 4000
...training model
error with training data 0.040
Auto training completed: training error 0.040 test error: 0.063
Average of test and training error: 0.052 test and training error diff: 0.023
High generalization error. Need larger training data set
I ran autoTrain with a training data size of 5000. With 5 fold cross validation, 4000 records were used for training and 1000 for validation.
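The split arithmetic is simple enough to check by hand, but here it is as a tiny helper. The function name is hypothetical and it assumes the record count divides evenly by the number of folds.

```python
def fold_sizes(num_records, num_folds):
    """Return (training size, validation size) for one round of k fold cross validation."""
    validation = num_records // num_folds   # one fold is held out
    training = num_records - validation     # the remaining folds are used for training
    return training, validation


print(fold_sizes(5000, 5))  # (4000, 1000)
```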
Auto Training with Increased Training Data
Since the test and training error didn’t converge, i.e., the generalization error was high, I was asked to retrain with a larger training data set. I increased the training data size to 7000 and ran autoTrain again and voila, we have success.
It’s possible that you don’t have a bigger training data set. In that case you have to settle for the suboptimal solution.
all parameter search results
0.065 0.065 0.060 0.055
best parameter search result
0.055 0.12 120
...building model
subsample size 5600
...training model
error with training data 0.039
Auto training completed: training error 0.039 test error: 0.055
Average of test and training error: 0.047 test and training error diff: 0.016
Successfullt trained. Low generalization error and low error level
...training the final model
...building model
...training model
error with training data 0.041
...saving model
training error in final model 0.041
The generalization error, i.e., the difference between test and training error, dropped from 0.023 to 0.016.
When training is successful, as it is now, the final model is trained and saved. The model may be ready to be used in production deployment.
Auto Training with Increased Model Complexity
Although the test error dropped from 0.063 to 0.055 in the second run, I noticed that I could get an even better result by shifting to a different region of the parameter search space reflecting more complex models.
I am exploring a small region of the parameter search space with the 2 parameters, and unless I happened to be lucky, it’s natural that the best parameter value combination is yet to be found.
Based on the trend of the test error with various parameter combinations, I decided to shift to a different region of the parameter search space as defined below and give it another try.
Here are the results. There is a further drop in test error from 0.055 to 0.048, as we can see below. What’s interesting is that the generalization error has gone up from 0.016 to 0.021. It can be attributed to the fact that we have increased the error due to variance by increasing the model complexity.
Training error has dropped from 0.039 to 0.027, which reflects the fact we have reduced error due to bias by adopting more complex models.
The net effect can be summarized as follows. The reduction in error due to reduced bias has more than offset the increase in error due to higher variance, resulting in a net decrease in test error.
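The trade off can be checked with the numbers from the second and third runs. The snippet below is a small worked calculation; following the reasoning in this post, it uses the change in training error as a proxy for the change in bias and the change in the test minus training gap as a proxy for the change in variance.

```python
# Errors reported by the second run and by the third run with more complex models.
second = {"train": 0.039, "test": 0.055}
third = {"train": 0.027, "test": 0.048}


def gen_error(run):
    """Generalization error: test error minus training error."""
    return run["test"] - run["train"]


bias_change = round(third["train"] - second["train"], 3)          # drop in training error
variance_change = round(gen_error(third) - gen_error(second), 3)  # rise in generalization error
net_change = round(third["test"] - second["test"], 3)             # change in test error
print(bias_change, variance_change, net_change)
```

The 0.012 drop in training error outweighs the 0.005 rise in generalization error, leaving a net 0.007 drop in test error.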
all parameter search results
0.050 0.049 0.049 0.048
best parameter search result
0.048 0.16 160
...building model
subsample size 5600
...training model
error with training data 0.027
Auto training completed: training error 0.027 test error: 0.048
Average of test and training error: 0.038 test and training error diff: 0.021
High generalization error. Need larger training data set
In this case, because the generalization error has exceeded our defined threshold, the final model was not built and saved.
You may be wondering why I had to make so many runs if I am doing auto training. The answer is that I chose a very small region of the parameter search space for instant gratification. You could cast a wider net by choosing a wider parameter search space.
You can widen the parameter search space by including more relevant parameters and using a bigger range of values for each parameter. You might be able to accomplish everything in one run. However, with a wider parameter search space, processing time will increase linearly with the number of parameter value combinations being tried.
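It helps to estimate the size of the grid before launching a run, since the number of combinations is the product of the number of values per parameter. The sketch below uses hypothetical parameter names and value lists purely for illustration.

```python
from math import prod


def grid_size(search_space):
    """Number of combinations a grid search will try."""
    return prod(len(values) for values in search_space.values())


# Hypothetical search spaces, for illustration only.
small = {"learning_rate": [0.10, 0.12], "num_estimators": [100, 120]}
wide = {
    "learning_rate": [0.05, 0.10, 0.15, 0.20],
    "num_estimators": [100, 150, 200, 250],
    "max_depth": [3, 4, 5],
}
print(grid_size(small), grid_size(wide))  # 4 48
```

Each combination also costs k model fits under k fold cross validation, so the wide grid above would need 12 times as many fits as the small one.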
My goal was to build a framework around ScikitLearn for training predictive models with configuration only and without any python code.
The tutorial has been updated with content for auto training. Please follow the steps there if you want to try running in auto train mode.