The Azure ML sample experiment Binary Classification: Customer relationship prediction shows us how we can use Azure’s binary classification algorithms. In the original Microsoft sample experiment, the models predict a customer’s churn, appetency, and upselling target variables. In this blog, I’m only focusing on the upselling target.
My aim is to give some reflections on the provided sample experiment. I suppose that these kinds of sample experiments are created to inspire people, not to give “perfect” examples, so I would like to share my findings to challenge you to think about the chosen and, in my opinion, omitted steps in the process of creating a prediction model. Both the comparison experiment and the original experiment can be found in the Cortana Analytics Gallery.
The sample experiment
The data consists of two datasets: one containing the variables, and one containing the labels. In this case, we only look at the label for upselling.
These are the steps:
- All the missing values in the set containing the variables have been replaced with 0.
- Add the label for upselling.
- Split the data 50%/50%, stratified on the upselling label.
- Train a model with a Two-Class Boosted Decision Tree.
- Score the model.
- Evaluate the model.
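The steps above can be sketched outside of Azure ML as well. The snippet below is a minimal stand-in using scikit-learn: the random data, the column names, and the `GradientBoostingClassifier` (as a rough analogue of the Two-Class Boosted Decision Tree module) are all assumptions for illustration, not the actual experiment.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Toy stand-ins for the two KDD Cup 2009 files: one with the
# variables, one with the upselling label (-1 / 1).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 5)),
                 columns=[f"Var{i}" for i in range(1, 6)])
y = pd.Series(rng.choice([-1, 1], size=1000, p=[0.9, 0.1]), name="upselling")

# Replace missing values with 0 (the sample experiment's choice).
X = X.fillna(0)

# 50%/50% split, stratified on the upselling label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42)

# Train boosted decision trees, then score and evaluate.
model = GradientBoostingClassifier().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]
print(f"AUC: {roc_auc_score(y_test, scores):.3f}")
```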
First we look at the background of the data. The dataset of this experiment belongs to the KDD Cup 2009; in this case, the small dataset has been used. This set contains 230 variables, of which 190 are numeric and 40 categorical. The corresponding targets belonging to this dataset were “churn”, “appetency”, and “upselling”. Regarding upselling, one could imagine a question like: “Will this customer be susceptible to upselling?”. With this question in mind, we will start our journey.
Initial inspection and transformation of the data
First of all, there are a lot of missing values: 164 of the 230 variables have more than 20% missing values. Moreover, although the first 190 variables are numeric and the last 40 are categorical, no distinction has been made between the numeric and categorical variables in the original sample.
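Checking the share of missing values per variable is straightforward. A minimal sketch with pandas (the toy frame and column names are assumptions; the real KDD set has 230 columns):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the KDD Cup 2009 small set; in the real
# data, 164 of the 230 variables exceed the 20% missing threshold.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["Var1", "Var2", "Var3", "Var4"])
df.loc[df.sample(frac=0.3, random_state=1).index, "Var1"] = np.nan

# Fraction of missing values per column.
missing_frac = df.isna().mean()

# Columns exceeding the 20% threshold.
heavy_missing = missing_frac[missing_frac > 0.20].index.tolist()
print(heavy_missing)  # ['Var1']
```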
REFLECTION 1: CHECK THE DATA
REFLECTION 2: TAKE CARE OF THE DATA TYPES
An interesting aspect of the original sample is that all the missing values have been filled with the value 0. That is quite a decision, and no explanation is given. For the last 40 categorical variables of this dataset, it doesn’t matter so much given the chosen Two-Class Boosted Decision Tree model, as “0” will simply be treated as another category. For the first 190 numeric variables, however, replacing all the missing values with “0” can introduce bias, given that many variables have a huge proportion (over 15%) of missing values. So we decided to substitute these missing values with the mode (Acuña & Rodriguez, 2004). We also eliminated the variables that had a constant value. However, we did not remove the variables that had a near-zero variance, as they could still be informative.
REFLECTION 3: DECIDE WHAT TO DO WITH MISSING DATA
REFLECTION 4: DECIDE WHAT TO DO WITH CONSTANT VALUES
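These two decisions can be sketched in a few lines of pandas. This is a minimal illustration under assumed toy data, not the exact preprocessing of the comparison experiment:

```python
import pandas as pd

df = pd.DataFrame({
    "Var1": [1.0, None, 1.0, 3.0],   # numeric with a missing value
    "Var2": ["a", "b", None, "a"],   # categorical with a missing value
    "const": [7, 7, 7, 7],           # constant: carries no information
})

# Substitute missing values with the mode of each column
# (df.mode().iloc[0] is the most frequent value per column).
df = df.fillna(df.mode().iloc[0])

# Drop columns with a single unique value (constant variables).
df = df.loc[:, df.nunique() > 1]
print(df.columns.tolist())  # ['Var1', 'Var2']
```

Note that near-zero-variance columns survive this filter, matching the choice above to keep them.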
After dealing with the missing values, we binned the numeric variables.
REFLECTION 5: EXPLORE FEATURE ENGINEERING
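One common way to bin a numeric variable is quantile-based binning, sketched below with pandas; the data and the choice of five bins are assumptions for illustration, not necessarily the settings used in the comparison experiment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
values = pd.Series(rng.normal(loc=50, scale=10, size=1000), name="Var1")

# Quantile-based binning: five bins with roughly equal counts,
# labelled 0..4 so the result is an ordinal feature.
binned = pd.qcut(values, q=5, labels=False)
print(binned.value_counts().sort_index())
```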
Training the model
To make the settings comparable, we chose to train a model with the same algorithm as the original experiment. But we also set up a model with a Two-Class Logistic Regression. For the latter, we first converted the categorical variables into dummy variables.
REFLECTION 6: TRY DIFFERENT MODELS
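The dummy-variable conversion needed for logistic regression can be done with one-hot encoding. A minimal sketch with pandas and scikit-learn, on assumed toy columns:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "Var1": [0.5, 1.2, -0.3, 0.8, 1.1, -0.9],
    "channel": ["web", "phone", "web", "mail", "phone", "web"],  # categorical
})
y = [1, -1, 1, -1, 1, -1]

# One-hot encode the categorical column into dummy variables,
# as logistic regression needs numeric inputs.
X = pd.get_dummies(df, columns=["channel"])
print(X.columns.tolist())
# ['Var1', 'channel_mail', 'channel_phone', 'channel_web']

model = LogisticRegression().fit(X, y)
print(model.predict(X[:1]))
```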
The resulting scores of both models are quite similar. But, depending on the question you started your journey with, the newly created models give a better result: the precision score of the new model (75%) is higher than that of the original model (66%).
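Precision answers exactly the upselling question: of all customers the model flags as upselling targets, which fraction actually are? A small illustration (the labels below are made up, not the experiment's outputs):

```python
from sklearn.metrics import precision_score

# Illustrative labels only: 1 = upselling target, -1 = not.
y_true = [1, 1, -1, -1, 1, -1, 1, -1]
y_pred = [1, -1, -1, 1, 1, -1, 1, -1]

# 4 customers predicted positive, 3 of them truly positive.
print(precision_score(y_true, y_pred))  # 0.75
```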
Substituting the missing values with the mode is one option, but there are other ways (e.g. MICE) to deal with missing values.
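A MICE-style alternative is available in scikit-learn as `IterativeImputer`, which models each column as a function of the others instead of filling in a single value. A minimal sketch on an assumed toy matrix:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Small numeric matrix with missing entries.
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [np.nan, 8.0]])

# Iteratively regress each feature on the others to fill the gaps,
# a MICE-style alternative to single-value substitution such as the mode.
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).any())  # False
```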
REFLECTION 7: BE CRITICAL AND LOOK FOR IMPROVEMENTS
Finally, we would appreciate any feedback on our findings and welcome you to send us your comments!