Predicting CRM Lead Conversion with Gradient Boosting using ScikitLearn
Sales leads are generally managed and nurtured in CRM systems. It would be nice if we could predict the likelihood of a lead converting to an actual deal. Such a prediction could be beneficial in many ways, e.g., proactively providing special care for weak leads and projecting future revenue.
In this post we will go over a predictive modeling solution built on the Python ScikitLearn machine learning library. We will be using Gradient Boosted Trees (GBT), a powerful and popular supervised learning algorithm.
Along the way, we will also see how the abstraction layer I have built around ScikitLearn supervised learning algorithms works. The abstraction layer, along with property file based configuration management, makes model building and model life cycle management significantly easier. The implementation can be found in my OSS project avenir.
Gradient Boosted Trees
Boosting is an ensemble technique in Machine Learning. An ensemble combines multiple simple or weak models to create a more powerful model. Boosting algorithms differ mainly in how the simple models are combined. Here are the main steps in boosting.
- Build an initial simple model
- Build the next model based on the prediction accuracy of the models built so far, taken together. This step addresses the shortcomings of the simple models built so far
- Repeat step 2 until the stopping condition is reached
With Gradient Boosted Trees (GBT), a numerical optimization is performed where the objective is to minimize the loss of the ensemble by adding simple or base learners using a gradient descent procedure. The base learners for GBT are regression trees, which are added to the ensemble in an additive way. Since the gradient is taken w.r.t. the base learner or function, the procedure is also known as Functional Gradient Descent (FGD).
The shortcomings of the existing ensemble are measured by the gradient of the loss when a new simple model is added. To be more specific, the parameters of the new base model are chosen such that the loss is reduced, moving in the direction of the negative gradient of the loss function.
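To make the procedure concrete, here is a minimal sketch of gradient boosting for regression with squared loss, where the negative gradient is simply the residual. This is an illustrative toy, not the avenir or ScikitLearn implementation; real GBT libraries add shrinkage schedules, subsampling and other refinements.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, num_rounds=100, learning_rate=0.1):
    """Additively build an ensemble of regression trees (squared loss)."""
    init = y.mean()                       # step 1: initial simple model
    prediction = np.full(len(y), init)
    trees = []
    for _ in range(num_rounds):           # steps 2 and 3, repeated
        residual = y - prediction         # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residual)             # base learner fits the shortcomings
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return init, trees

def predict(init, trees, X, learning_rate=0.1):
    """Ensemble prediction: initial model plus scaled tree contributions."""
    return init + learning_rate * sum(tree.predict(X) for tree in trees)
```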
Sales Lead Data from CRM
We will be using sales lead data from a fictitious CRM system with the following attributes as our use case.
- id
- source of lead e.g., trade show, web download, etc.
- lead contact type e.g., has recommendation authority
- lead company size
- number of days so far in CRM pipeline
- number of meetings so far with lead and others in lead’s company
- number of emails exchanged so far with lead and others in lead’s company
- number of web site visits so far by the lead and others in lead’s company
- number of demos so far
- expected revenue from the deal
- whether proposal with price quote sent
- whether the lead converted
Since most of the feature attributes are time dependent, i.e., they accumulate over the time spent in the sales pipeline, the prediction also depends on the time spent in the pipeline.
The first attribute is an ID which is ignored. The last attribute is the target or class label. The rest are feature attributes. Here is some sample data.
```
7E561S3X62,referral,canReccommend,large,50,7,14,4,3,49679,N,1
RH19V26CX5,tradeShow,canDecide,large,58,6,10,8,4,46127,N,0
3GXMOW46MA,referral,canReccommend,large,84,6,8,3,3,30000,N,1
3WYD4CY31A,advertisement,canDecide,medium,42,9,5,2,2,44659,N,0
GUXRCLV835,webDownload,canReccommend,large,43,4,8,1,3,46172,N,0
9XXGCOBGWR,webDownload,canDecide,small,23,5,12,1,4,45789,N,0
SGZS58ESSU,webDownload,canReccommend,medium,67,6,12,6,3,38449,N,0
```
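If you want to explore the data outside the framework, here is a small sketch of loading it with pandas. The column names are my own assumptions based on the attribute list above, and the data file has no header row. Note that ScikitLearn’s GBT works on numeric inputs, so categorical features need encoding first.

```python
import pandas as pd

# column names are assumptions based on the attribute list above
cols = ["id", "source", "contactType", "companySize", "daysInPipeline",
        "numMeetings", "numEmails", "numWebVisits", "numDemos",
        "expectedRevenue", "proposalSent", "converted"]
df = pd.read_csv("leads_5000.txt", header=None, names=cols)

# ScikitLearn GBT needs numeric features, so one-hot encode the categorical ones
X = pd.get_dummies(df.drop(columns=["id", "converted"]))
y = df["converted"]
```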
Predictive Modeling Framework
To make model building easier and to keep it in line with the predictive model development life cycle, I have created an abstraction wrapper class on top of ScikitLearn supervised learning algorithms. The API of the abstraction class consists of these essential methods, each corresponding to a phase in the predictive model development life cycle.
- train(): Builds the predictive model and reports the training error. Used to decide the trade-off between model complexity and training data size, by keeping the training error within an acceptable limit.
- trainValidate(): Builds the predictive model, cross validates and reports the test or generalization error, using k fold cross validation or comparable techniques. You can tune parameters to minimize the generalization error by searching the parameter space.
- predict(): Makes predictions. This is the method that gets called when the predictive model is deployed for use.
- validate(): Makes predictions with an existing predictive model on newer data and reports the error. Used to detect predictive model drift.
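Driving the wrapper typically looks something like the sketch below. The class name GradientBoostedTrees, its module and its constructor signature are assumptions on my part for illustration; see the avenir project for the actual driver code.

```python
import sys

# hypothetical import; the actual module lives in avenir
from gbt import GradientBoostedTrees

# mode is one of: train, trainValidate, predict, validate
mode = sys.argv[1]

# the property file supplies all configuration, so no other code is needed
model = GradientBoostedTrees("crm_leads.properties")

if mode == "train":
    model.train()
elif mode == "trainValidate":
    model.trainValidate()
elif mode == "predict":
    print(model.predict())
elif mode == "validate":
    model.validate()
```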
Parameter tuning during training and validation is an optimization problem, where our goal is to find the combination of parameter values that gives us the lowest generalization error.
Depending on the number of parameters and the number of values per parameter, you may be up against a combinatorial explosion, running into millions of possible combinations of parameter values.
Exhaustive search through such a parameter space is not practical. Machine Learning libraries generally offer grid search or random search optimization algorithms for this purpose.
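For comparison, here is what a plain ScikitLearn grid search looks like over the same two GBT parameters that are tuned later in this post; it is workable only while the grid stays small. X and y are assumed to be the prepared feature matrix and labels, as in the loading sketch above.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# the same two parameters and candidate values used later in this post
param_grid = {
    "learning_rate": [0.04, 0.07, 0.12],
    "n_estimators": [40, 70, 120],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5)
search.fit(X, y)  # X, y assumed prepared as in the loading sketch above
print(search.best_params_, search.best_score_)
```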
I am working on various stochastic optimization algorithms for parameter tuning. The user will be able to choose the desired parameter optimization technique with appropriate configuration.
Configuration
With the framework and the provided driver code in avenir, you can use ScikitLearn predictive modeling algorithms without writing any Python code. A comprehensive property file based configuration makes this possible.
The configuration parameters are divided into multiple groups as below. Except for common, each group has a direct correspondence to the framework methods listed above.
- common: These configuration parameters are algorithm agnostic and are required for all of the framework’s methods
- train: Contains configuration parameters for the train() and trainValidate() methods. Not all parameters under this group get used by both train() and trainValidate()
- predict: Contains configuration parameters for predict() when the model gets deployed in production
- validate: Contains configuration parameters for validate(), used to detect model drift after the model has been deployed and is in use
Here is the complete list of configuration parameters with explanations. Each configuration parameter name is prefixed with one of the group names listed above. The values are to be treated as samples; you are free to change them. Gradient Boosting related parameters are indicated along with the corresponding ScikitLearn parameter names.
A default value is indicated by _. You can also use None to indicate that no value is specified for a parameter. If a configuration parameter is mandatory, has no default and is not provided, an exception gets thrown.
| Name and Value | Comment |
| --- | --- |
| common.mode = trainValidate | mode of execution |
| common.model.directory = model | model save directory |
| common.model.file = crm_gb_model | saved model file name |
| common.preprocessing = _ | pre processing steps |
| train.data.file = leads_5000.txt | input data file name |
| train.data.fields = 0,1,2,3 etc. | comma separated list of column indexes |
| train.data.feature.fields = 0,1,2 etc | comma separated list of feature column indexes |
| train.data.class.field = 17 | class field index |
| train.validation = kfold | cross validation method |
| train.num.folds = 5 | number of folds |
| train.min.samples.split = 4 | GBT specific (min_samples_split) |
| train.min.samples.leaf = 4 | GBT specific (min_samples_leaf) |
| train.min.weight.fraction.leaf = 0.1 | GBT specific (min_weight_fraction_leaf) |
| train.max.depth = 3 | GBT specific (max_depth) |
| train.max.leaf.nodes = None | GBT specific (max_leaf_nodes) |
| train.max.features = _ | GBT specific (max_features) |
| train.learning.rate = 0.10 | GBT specific (learning_rate) |
| train.num.estimators = 100 | GBT specific (n_estimators) |
| train.subsample = _ | GBT specific (subsample) |
| train.loss = _ | GBT specific (loss) |
| train.init = _ | GBT specific (init) |
| train.random.state = 100 | GBT specific (random_state) |
| train.verbose = _ | GBT specific (verbose) |
| train.warm.start = _ | GBT specific (warm_start) |
| train.presort = _ | GBT specific (presort) |
| train.criterion = _ | GBT specific (criterion) |
| train.success.criterion = error | whether to output the performance metric or its inverse |
| train.model.save = False | whether to save the model |
| train.score.method = accuracy | performance metric |
| train.search.param.strategy = guided | parameter tuning optimization strategy |
| train.search.params = train.learning.rate:float, etc | parameters to be used for parameter tuning |
| predict.data.file = leads_1000.txt | input file for prediction |
| predict.data.fields = 1,2 etc | comma separated list of column indexes |
| predict.data.feature.fields = 0,1 etc | comma separated list of feature column indexes |
| predict.use.saved.model = True | whether the saved trained model should be used |
| validate.data.file = leads_5000.txt | input file for validation |
| validate.data.fields = 1,2 etc | comma separated list of column indexes |
| validate.data.feature.fields = 0,1 etc | comma separated list of feature column indexes |
| validate.data.class.field = 17 | class field index |
| validate.use.saved.model = False | whether the saved trained model should be used |
| validate.score.method = confusionMatrix | performance metric |
This article provides good guidance and details on configuration parameters for Gradient Boosted Trees in ScikitLearn.
Parameter Space Search for Optimum Tuning
When the mode is trainValidate and the parameter train.search.param.strategy is set, the framework will search through the parameter space to find the optimum combination of parameter values.
The parameters to be included in the search space need to be provided as a comma separated list through the parameter train.search.params. For each parameter specified in train.search.params, the corresponding configuration parameter should have a comma separated list of values, instead of one.
Currently only guided search is supported, where the user provides all the values for each parameter to be included in the search. I am working on implementing and supporting a few other stochastic optimization algorithms.
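Putting the pieces together, a guided search configuration for the two parameters tuned later in this post might look like the snippet below. The :float suffix follows the train.search.params format shown in the table above; the :int suffix for the integer parameter is my assumption, extrapolating from that pattern.

```
common.mode=trainValidate
train.search.param.strategy=guided
train.search.params=train.learning.rate:float,train.num.estimators:int
train.learning.rate=0.04,0.07,0.12
train.num.estimators=40,70,120
```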
Machine Learning Commandments
In building an optimal predictive model, we have the following two free parameters to play with:
- Training data size
- Model complexity
Sometimes you are limited to a maximum training data size. In that case you take the largest training data size available and play around with the model complexity parameters.
The relationship between the training data size, model complexity and error rate is complex and is characterized as follows.
- For a given model complexity, training error increases with training data size, asymptotically approaching the true error.
- For a given model complexity, test or generalization error decreases with training data size, asymptotically approaching the true error.
- If the difference between the training and test error is large even with the largest training data set you have, you may need more training data for the two errors to converge.
- If the training error and test error have converged but with a high error value, you have a simple model with not enough complexity. You need to increase the model complexity.
- For a given training data size, training error decreases with model complexity
- For a given training data size, test error decreases with model complexity up to a point of optimal complexity and then starts increasing.
- The optimal complexity of a model increases with training data size and then reaches a plateau, beyond which additional training data does not make any difference, because the model has achieved sufficient complexity
Predictive Model Training Workflow
Based on our knowledge of the interplay between training data size, model complexity and error rate, we can define the following workflow for building predictive models.
- For some model complexity, train models with increasing data size and find the data size where the error rate seems to plateau, as in the sketch following this list. In this step you may be limited by the maximum available data size.
- If the training error rate is unacceptable, increase model complexity and repeat step 1. Again you may be limited by the maximum amount of available training data.
- For the data size from the previous step, train and validate the model using parameter search. Perturb some key parameters around the fixed set of values used in step 2. Find the optimal parameters.
- If there is a large gap between the test and training errors, take the model complexity obtained from step 3 and repeat from step 1 onward
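Here is a rough sketch of step 1 of this workflow: train on increasing data sizes and watch where the training error levels off. GradientBoostingClassifier stands in for the framework’s train mode, and X, y are assumed to be prepared as in the earlier loading sketch.

```python
from sklearn.ensemble import GradientBoostingClassifier

# train on increasing data sizes with fixed model complexity
for size in (2500, 5000, 10000):
    model = GradientBoostingClassifier(learning_rate=0.10, n_estimators=100)
    model.fit(X[:size], y[:size])
    train_error = 1.0 - model.score(X[:size], y[:size])
    print(f"size {size}: training error {train_error:.3f}")
```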
Results from Training a Model
For the training phase, with some initial model complexity parameters, I trained the model with training data sizes of 2500, 5000 and 10000. Here are the results with training error.
```
2500
running mode: train
...building model
...training model
error with training data 0.043

5000
running mode: train
...building model
...training model
error with training data 0.054

10000
running mode: train
...building model
...training model
error with training data 0.057
```
The training error rate seems to level off at a data size of 5000. This corresponds to step 1 above. The two key parameters that we will use for training and validation with parameter search are the learning rate and the number of tree instances. Their values for the training phase are below.
```
train.learning.rate=0.10
train.num.estimators=100
```
Next, we will perform train and validate with k fold cross validation using a training data size of 5000, which corresponds to step 3. We will consider 3 possible values for each of the 2 parameters, resulting in 9 combinations as below. I chose these two among many, because they seem to be the most critical parameters.
```
train.learning.rate=0.04,0.07,0.12
train.num.estimators=40,70,120
```
Here are the results for the 9 possible combinations of the 2 parameters, along with the parameter value combination corresponding to the smallest error rate.
```
all parameter search results
train.learning.rate=0.04  train.num.estimators=40   0.126
train.learning.rate=0.04  train.num.estimators=70   0.114
train.learning.rate=0.04  train.num.estimators=120  0.098
train.learning.rate=0.07  train.num.estimators=40   0.114
train.learning.rate=0.07  train.num.estimators=70   0.096
train.learning.rate=0.07  train.num.estimators=120  0.076
train.learning.rate=0.12  train.num.estimators=40   0.093
train.learning.rate=0.12  train.num.estimators=70   0.078
train.learning.rate=0.12  train.num.estimators=120  0.063

best parameter search result
train.learning.rate=0.12  train.num.estimators=120  0.063
```
The generalization error of 0.063 is acceptable, and it’s about 17% more than the training error of 0.054. The optimal values of the 2 parameters for the lowest generalization error are slightly different from what I used for training; the training phase values for the 2 parameters were the ScikitLearn defaults.
I ran the train mode again with the optimal values of the 2 parameters found from train and validate. Here is the result.
```
running mode: train
...building model
...training model
error with training data 0.043
```
Interestingly, the gap between the training and generalization errors increased. Now the test or generalization error of 0.063 is about 46% more than the training error of 0.043. According to commandment #3 above, we need more training data, e.g., 6000 or 7000, and should start over. I haven’t done it. If it piques your curiosity, you could try.
My parameter search space consisted of only 2 parameters. By no means can I claim that I have found the optimal parameter values for the lowest generalization error, because the search space was not exhaustive enough. If you are curious, you could include more parameters and see if you can find better parameter values.
Final Comments
In predictive modeling, there is a complex and nonlinear relationship between model complexity, training data size and the generalization error. We need a model complex enough to reflect the complexity of the underlying process that generates the data. For a model of given complexity we need enough training data. Finding the optimal model is an iterative process.
In this post we have focused on training the predictive model. In a future post, I will discuss the other life cycle phases of model development, i.e., production deployment for prediction, model drift detection and retraining.
The tutorial document has the details on how to generate the data and execute the Python driver code to call the GBT wrapper class methods.