Predicting CRM Lead Conversion with Gradient Boosting using ScikitLearn

Sales leads are generally managed and nurtured in CRM systems. It would be useful if we could predict the likelihood of a lead converting to an actual deal. Such predictions are beneficial in many ways, e.g. proactively providing special care for weak leads and projecting future revenue.

In this post we will go over a predictive modeling solution built on the Python ScikitLearn machine learning library. We will be using Gradient Boosted Trees (GBT), a powerful and popular supervised learning algorithm.

In the course of this post, we will also see how the abstraction layer I have built around ScikitLearn supervised learning algorithms works. The abstraction layer, along with property file based configuration management, makes model building and model life cycle management significantly easier. The implementation can be found in my OSS project avenir.

Gradient Boosted Trees

Boosting is an ensemble technique in Machine Learning. An ensemble combines multiple simple or weak models to create a more powerful model. Boosting algorithms differ in how the simple models are combined. Here are the main steps in boosting.

  1. Build an initial simple model
  2. Build the next model based on the prediction accuracy of the models built so far, taken together. This step addresses the shortcomings of the simple models built so far
  3. Repeat step 2 until the stopping condition is reached

With Gradient Boosted Trees (GBT), a numerical optimization is performed where the objective is to minimize the loss of the ensemble by adding simple or base learners using a gradient descent procedure. The base learners for GBT are regression trees, which are added to the ensemble in an additive way. Since the gradient is taken w.r.t. the base learner or function, the procedure is also known as Functional Gradient Descent (FGD).

The shortcoming of the existing ensemble, as a new simple model is added, is measured by the gradient of the loss. To be more specific, the parameters for the new base model are chosen such that the loss is reduced, moving in the direction of the negative gradient of the loss function.
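
In compact form, if F_{m-1} denotes the ensemble after m-1 rounds, each round fits a regression tree h_m to the pseudo-residuals, i.e. the negative gradient of the loss, and adds it with a shrinkage factor:

r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}, \qquad F_m(x) = F_{m-1}(x) + \nu \, h_m(x)

where \nu is the learning rate; smaller values require more trees but typically generalize better.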

Sales Lead Data from CRM

We will be using sales lead data from a fictitious CRM system, with the following attributes, as our use case.

  1. id
  2. source of lead e.g. trade show, web download etc.
  3. lead contact type e.g has recommendation authority
  4. lead company size
  5. number of days so far in CRM pipeline
  6. number of meetings so far with lead and others in lead’s company
  7. number of emails exchanged so far with lead and others in lead’s company
  8. number of web site visits so far by the lead and others in lead’s company
  9. number of demos so far
  10. expected revenue from the deal
  11. whether proposal with price quote sent
  12. whether the lead converted

Since most of the feature attributes are time dependent, i.e. they accumulate over the time spent in the sales pipeline, the prediction also depends on the time spent in the pipeline.

The first attribute is an ID, which is ignored. The last attribute is the target or class label. The rest are feature attributes. Here is some sample data.

7E561S3X62,referral,canReccommend,large,50,7,14,4,3,49679,N,1
RH19V26CX5,tradeShow,canDecide,large,58,6,10,8,4,46127,N,0
3GXMOW46MA,referral,canReccommend,large,84,6,8,3,3,30000,N,1
3WYD4CY31A,advertisement,canDecide,medium,42,9,5,2,2,44659,N,0
GUXRCLV835,webDownload,canReccommend,large,43,4,8,1,3,46172,N,0
9XXGCOBGWR,webDownload,canDecide,small,23,5,12,1,4,45789,N,0
SGZS58ESSU,webDownload,canReccommend,medium,67,6,12,6,3,38449,N,0
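
As an aside, here is a minimal sketch, independent of the avenir framework, of how this data could be loaded and the categorical fields encoded for ScikitLearn. The column names are my own, for illustration only.

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# assumed column names, matching the attribute list above
cols = ["id", "source", "contactType", "companySize", "daysInPipeline",
        "numMeetings", "numEmails", "numSiteVisits", "numDemos",
        "expectedRevenue", "proposalSent", "converted"]
df = pd.read_csv("leads_5000.txt", header=None, names=cols)

# drop the ID, one hot encode the categorical fields
X = pd.get_dummies(df.drop(columns=["id", "converted"]),
                   columns=["source", "contactType", "companySize", "proposalSent"])
y = df["converted"]

model = GradientBoostingClassifier(learning_rate=0.10, n_estimators=100,
                                   random_state=100)
model.fit(X, y)
print("error with training data", round(1.0 - model.score(X, y), 3))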

Predictive Modeling Framework

To make model building easier and to keep it in line with the predictive model development life cycle process, I have created an abstraction wrapper class on top of ScikitLearn supervised learning algorithms. The API of the abstraction class consists of these essential methods, each corresponding to a phase in the predictive model development life cycle (a usage sketch follows the list).

  • train(): Builds the predictive model and reports the training error. Used to decide the trade off between model complexity and training data size, by keeping the training error within an acceptable limit.
  • trainValidate(): Builds the predictive model, cross validates and reports test or generalization error. Uses K fold cross validation or comparable techniques. You can do parameter tuning to minimize generalization error by searching the parameter space.
  • predict(): Makes predictions. This is the method that gets called when the predictive model is deployed for use.
  • validate(): Makes predictions using an existing predictive model and newer data, and reports the error. Used to detect predictive model drift.
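
Here is a rough sketch of what driving the wrapper might look like. The class and module names below are assumptions for illustration; the actual driver code and wrapper class live in avenir.

import sys
# hypothetical names; see the avenir project for the real driver and wrapper
from gbt import GradientBoostedTrees

# the property file supplies the algorithm parameters and the execution mode
gbt = GradientBoostedTrees("crm_leads.properties")

mode = sys.argv[1] if len(sys.argv) > 1 else "train"
if mode == "train":
    gbt.train()            # build model, report training error
elif mode == "trainValidate":
    gbt.trainValidate()    # cross validate, report generalization error
elif mode == "predict":
    gbt.predict()          # score new data with the deployed model
elif mode == "validate":
    gbt.validate()         # detect model drift on newer labeled data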

Parameter tuning during training and validation is an optimization problem, where our goal is to find the combination of parameter values that gives us the lowest generalization error.

Depending on the number of parameters and their candidate values, you may be up against a combinatorial explosion, running into millions of possible combinations of parameter values.

Exhaustive grid search through the parameter space is not practical in such scenarios. With Machine Learning libraries, this is generally done with grid search or random search optimization algorithms.
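
For reference, this is roughly how the standard library route looks with ScikitLearn; RandomizedSearchCV samples a fixed number of combinations instead of enumerating all of them. The data preparation follows the earlier loading sketch.

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

df = pd.read_csv("leads_5000.txt", header=None)
X = pd.get_dummies(df.iloc[:, 1:-1])   # drop ID, encode categorical fields
y = df.iloc[:, -1]

# sample 10 of the 27 possible combinations instead of trying all of them
params = {"learning_rate": [0.04, 0.07, 0.12],
          "n_estimators": [40, 70, 120],
          "max_depth": [2, 3, 4]}
search = RandomizedSearchCV(GradientBoostingClassifier(random_state=100),
                            params, n_iter=10, cv=5, random_state=100)
search.fit(X, y)
print(search.best_params_, search.best_score_)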

I am working on various stochastic optimization algorithms for parameter tuning. The user will be able to choose the desired parameter optimization technique with appropriate configuration.

Configuration

With the framework and the provided driver code in avenir, you can use ScikitLearn predictive modeling algorithms without writing any Python code. A comprehensive property file based configuration makes this possible.

The configuration parameters are divided into multiple groups as below. Except for common, each group corresponds directly to one of the framework methods listed above.

  • common: These configuration parameters are algorithm agnostic and are required for all of the framework's methods
  • train: Contains configuration parameters for the train() and trainValidate() methods. Not all parameters under this group get used by both train() and trainValidate()
  • predict: Contains configuration parameters for predict() when the model gets deployed in production
  • validate: Contains configuration parameters for validate(), used to detect model drift after the model has been deployed and is in use

Here is the complete list of configuration parameters with explanations. Each configuration parameter name is prefixed with one of the group names listed above. The values are to be treated as samples; you are free to change them. Gradient Boosting related parameters are indicated along with the corresponding ScikitLearn parameter names.

A default value is indicated by _. You can also use None to indicate that no value is specified for a parameter. If a configuration parameter is mandatory, has no default and is not provided, an exception gets thrown.
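
For example, these two entries from the listing below mean that the ScikitLearn default is used for max_features, while no value at all is specified for max_leaf_nodes:

train.max.features = _
train.max.leaf.nodes = None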

Name and Value    Comment
common.mode = trainValidate    mode of execution
common.model.directory = model    model save directory
common.model.file = crm_gb_model    saved model file name
common.preprocessing = _    preprocessing steps
train.data.file = leads_5000.txt    input data file name
train.data.fields = 0,1,2,3 etc.    comma separated list of column indexes
train.data.feature.fields = 0,1,2 etc.    comma separated list of feature column indexes
train.data.class.field = 17    class field index
train.validation = kfold    cross validation method
train.num.folds = 5    number of folds
train.min.samples.split = 4    GBT specific (min_samples_split)
train.min.samples.leaf = 4    GBT specific (min_samples_leaf)
train.min.weight.fraction.leaf = 0.1    GBT specific (min_weight_fraction_leaf)
train.max.depth = 3    GBT specific (max_depth)
train.max.leaf.nodes = None    GBT specific (max_leaf_nodes)
train.max.features = _    GBT specific (max_features)
train.learning.rate = 0.10    GBT specific (learning_rate)
train.num.estimators = 100    GBT specific (n_estimators)
train.subsample = _    GBT specific (subsample)
train.loss = _    GBT specific (loss)
train.init = _    GBT specific (init)
train.random.state = 100    GBT specific (random_state)
train.verbose = _    GBT specific (verbose)
train.warm.start = _    GBT specific (warm_start)
train.presort = _    GBT specific (presort)
train.criterion = _    GBT specific (criterion)
train.success.criterion = error    whether to output performance metric or its inverse
train.model.save = False    whether to save model
train.score.method = accuracy    performance metric
train.search.param.strategy = guided    parameter tuning optimization strategy
train.search.params = train.learning.rate:float, etc.    parameters to be used for parameter tuning
predict.data.file = leads_1000.txt    input file for prediction
predict.data.fields = 1,2    comma separated list of column indexes
predict.data.feature.fields = 0,1, etc.    comma separated list of feature column indexes
predict.use.saved.model = True    whether saved trained model should be used
validate.data.file = leads_5000.txt    input file for validation
validate.data.fields = 1,2, etc.    comma separated list of column indexes
validate.data.feature.fields = 0,1, etc.    comma separated list of feature column indexes
validate.data.class.field = 17    class field index
validate.use.saved.model = False    whether saved trained model should be used
validate.score.method = confusionMatrix    performance metric

This article provides good guidance and details on configuration parameters for Gradient Boosted Trees in ScikitLearn.

Parameter Space Search for Optimum Tuning

When the mode is trainValidate and the parameter train.search.param.strategy is set, the framework will search through the parameter space to find the optimum combination of parameter values.

The parameters to be included in the search space need to be provided as a comma separated list through the parameter train.search.params. For each of the parameters specified in train.search.params, the corresponding parameter should have a comma separated list of values, instead of one.

Currently only guided search is supported, where the user needs to provide all the values for a parameter to be included in the search. I am working on implementing and supporting a few other stochastic optimization algorithms.
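
For example, a guided search over the two parameters tuned later in this post could be configured as below. The :float type tag follows the train.search.params entry in the configuration listing; the :int tag for integer parameters is my assumption.

train.search.param.strategy = guided
train.search.params = train.learning.rate:float,train.num.estimators:int
train.learning.rate = 0.04,0.07,0.12
train.num.estimators = 40,70,120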

Machine Learning Commandments

In building an optimal predictive model, we have the following two free parameters to play with.

  1. Training data size
  2. Model complexity

Sometimes you are limited to a maximum training data size. In that case you take the largest available training data size and play around with the model complexity parameters.

The relationship between the training data size, model complexity and error rate is complex and is characterized as follows.

  1. For a given model complexity, training error increases with training data size, asymptotically approaching the true error.
  2. For a given model complexity, test or generalization error decreases with training data size, asymptotically approaching the true error. The sketch following this list illustrates these first two points.
  3. If the difference between the training and test error is large even with the largest training data set you have, you may need more training data for the two errors to converge.
  4. If the training error and test error have converged but with a high error value, you have a simple model with not enough complexity. You need to increase the model complexity.
  5. For a given training data size, training error decreases with model complexity.
  6. For a given training data size, test error decreases with model complexity up to a point of optimal complexity and then starts increasing.
  7. The optimal complexity of a model increases with training data size and then reaches a plateau, beyond which additional training data does not make any difference, because the model has achieved sufficient complexity.
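
As noted above, here is a minimal ScikitLearn sketch, independent of the avenir framework, that illustrates the first two points by computing training and cross validation error for increasing training set sizes. The data preparation follows the earlier loading sketch.

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

df = pd.read_csv("leads_5000.txt", header=None)
X = pd.get_dummies(df.iloc[:, 1:-1])   # drop ID, encode categorical fields
y = df.iloc[:, -1]

sizes, train_scores, test_scores = learning_curve(
    GradientBoostingClassifier(random_state=100), X, y,
    train_sizes=[0.25, 0.5, 0.75, 1.0], cv=5, scoring="accuracy")

# training error should rise and test error fall as the data size grows
for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(n, "train error", round(1 - tr, 3), "test error", round(1 - te, 3))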

Predictive Model Training Workflow

Based on our knowledge of the interplay between training data size, model complexity and error rate, we can define the following workflow for building predictive models.

  1. For some model complexity, train models with increasing data size and find the data size where the error rate seems to plateau. In this step you may be limited by the maximum available data size.
  2. If the training error rate is unacceptable, increase the model complexity and repeat step 1. Again you may be limited by the maximum amount of available training data.
  3. For the data size from the previous step, train and validate the model using parameter search. Perturb some key parameters around the fixed set of values used in step 2. Find the optimal parameters.
  4. If there is a large gap between test and training error, go back to step 1 with the model complexity obtained from step 3 and repeat from step 1 onward.

Results from Training a Model

For the training phase, with some initial model complexity parameters, I trained the model with training data sizes of 2500, 5000 and 10000. Here are the results with training error.

2500
running mode: train
...building model
...training model
error with training data 0.043

5000
running mode: train
...building model
...training model
error with training data 0.054

10000
running mode: train
...building model
...training model
error with training data 0.057

The training error rate seems to level off for a data size of 5000. This step corresponds to step 1 above. The two key parameters that we will use for train and validate with parameter search are the learning rate and the number of tree instances. Their values for the training phase are below.

train.learning.rate=0.10
train.num.estimators=100

Next, we will perform train and validate with k fold cross validation using a training data size of 5000, which corresponds to step 3. We will consider 3 possible values for each of the 2 parameters, resulting in 9 combinations as below. I chose these two among many, because they seem to be the most critical parameters.

train.learning.rate=0.04,0.07,0.12
train.num.estimators=40,70,120

Here are the results for the 9 possible combinations of the 2 parameters, along with the parameter value combination corresponding to the smallest error rate.

all parameter search results
train.learning.rate=0.04  train.num.estimators=40   0.126
train.learning.rate=0.04  train.num.estimators=70   0.114
train.learning.rate=0.04  train.num.estimators=120   0.098
train.learning.rate=0.07  train.num.estimators=40   0.114
train.learning.rate=0.07  train.num.estimators=70   0.096
train.learning.rate=0.07  train.num.estimators=120   0.076
train.learning.rate=0.12  train.num.estimators=40   0.093
train.learning.rate=0.12  train.num.estimators=70   0.078
train.learning.rate=0.12  train.num.estimators=120   0.063
best parameter search result
train.learning.rate=0.12  train.num.estimators=120   0.063

The generalization error of 0.063 is acceptable, and it’s 17% more than the training error of 0.054. The optimal values of the 2 parameters for the lowest generalization error are slightly different from what I used for training. The training phase values for the 2 parameters are the ScikitLearn default values.

I ran it in train mode again with the optimal values of the 2 parameters found from train and validate. Here is the result.

running mode: train
...building model
...training model
error with training data 0.043

Interestingly, the gap between the training and generalization error increased. Now the test or generalization error is 46% more than the training error. According to commandment #3 above, we need more training data, e.g. 6000 or 7000, and should start over. I haven’t done it. If it piques your curiosity, you could try.

My parameter search space consisted of only 2 parameters. By no means can I claim that I have the optimal parameter values for the lowest generalization error, because the search space was not exhaustive enough. If you are curious, you could include more parameters and see if you can find better parameter values.

Final Comments

In predictive modeling, there is a complex and nonlinear relationship between model complexity, training data size and generalization error. We need a model complex enough to reflect the complexity of the underlying process that generates the data. For a model with given complexity we need enough training data. Finding the optimal model is an iterative process.

In this post we have focused on training the predictive model. In a future post, I will discuss the other life cycle phases of model development, i.e. production deployment for prediction, model drift and retraining.

The tutorial document has the details on how to generate the data and execute the Python driver code to call the GBT wrapper class methods.
