Imagine you are an HR-Manager, and you would like to know which employees are likely to stay, and which might leave your company. Besides you would like to understand which factors contribute to leaving your company. You have gathered data in the past (well, in this case Kaggle simulated a dataset for you, but just imagine), and now you can start with this Hands On Lab – Predict Employee Leave to build your prediction model to see if that can help you.
In this lab, you will learn how to create a machine learning module that predicts whether an employee will stay or leave your company. We are aware of the limitations of the dataset but the objective of this hands on lab is to inspire you to explore the possibilities of using machine learning for your own research, and not to build the next HR-solution.
We created a starting experiment for you on the Azure AI Gallery to give you a smooth start.
You will follow several steps to explore the data and build a machine learning model to predict whether an employee will leave or not, and why.
- Step 1: Get the starting experiment
- Step 2: Get a first understanding of the data
- Step 3: Prepare a training and a test set
- Step 4: Train the model
- Step 5: Score the test set
- Step 6: Evaluate the results
- Step 7: Gain insights on the why
You will build this prediction model with the Azure Machine Learning Studio. The complete model will look like this:
Prerequisites: Get Access to Azure Machine Learning Studio
There are several options to start with Azure ML. The easiest way is to go to https://azure.microsoft.com/en-us/services/machine-learning/ and click on the “Get started now” button.
You will need a Windows LiveID to sign in. If you don’t have one, you can sign up here: https://signup.live.com/
Hereafter, you can select the Free Workspace option:
Step 1: Get the starting experiment
Now you can start with the starting experiment from the Azure AI Gallery. This experiment uses a simulated dataset from Kaggle. You have to open the experiment in your studio by clicking on the green button “Open in Studio”. This will open the Azure Machine Learning Studio in a browsers, and you can copy the experiment to your free workspace.
Step 2: Inspecting the data
In the Starting experiment: Predict Employee Leave experiment, you will find the Employee Leave data on the canvas, together with a Summarize Data module. If you look at the top corner right, you can see that the experiment is “in draft”. This means that is hasn’t been saved, nor that it has been executed before. Therefore, we start with running the experiment, by clicking on the run button in the menu at the bottom. After running the experiment, the top corner message will change into “Finished running”, and we can start inspecting our data.
To get a first impression of the data, you can right-click the output port of the dataset and select “Visualize” from the menu to visualize the data. The output port is the little circle under every module on the canvas:
We can continue inspecting the dataset. The data comprise a wide range of topics which allow to explain employees’ leave behavior in relation with A) organizational factors (department); B) employment relational factors (i.e. tenure, the number of projects participated in; the average working hours per month; objective career development; salary); and C) job-related factors (performance evaluation; involvement in workplace accidents).
We have the following available variables in the dataset:
Employment relational factors
- Time spent at the company
- Number of projects
- Average monthly hours
- Whether they have had a promotion in the last 5 years
- Last evaluation
- Whether they have had a work accident
- Whether the employee has left
You can right-click on the output port of the Summarize Data module and select Visualize. We see that we have 14999 observations, and that we don’t miss any data. We also get an idea about the variance and distribution of the data.
Step 3: Prepare a training and a test set
We split the dataset into a training and a test set, using 70% of the data to train the model with, and 30% of the data to test the model later on. Therefore we drag the Split Data module on the canvas. You can find this module in the menu left, next to the canvas. You can either click throught the various options, or use the search function.
When you have found the Split Data module, you can drag it on the canvas, and connect the output port of the dataset to the inport port of the Split Data module. You can connect the modules by left-clicking on the output port, and keep you mouse button down while draging it to the module you want to connect it to. We set a seed, so we can repeat this experiment.
Step 4: Train the model
Since we have split the data, we can continue to work with the training data set. We first select the Train model module and drag it on the canvas. But when we do so you will a little red exclamation mark. This is because we haven’t selected the variable that we want to predict and we haven’t defined the algorithm that we want to use to train the model with. First we will select the dependent variable. Therefore, we have to click on the Launch column selector.
In order to set the dependent variable, we select the variable “left” (indicating whether an employee has left or not) from AVAILABLE COLUMNS and use the arrow button to get it to the right side, under “SELECTED COLUMNS”.
Furthermore, we have to select the algorithm to train the model with. In this experiment we use the Two-Class Boosted Decision Tree algorithm with the standard parametrization. We do add a seed to make this experiment replicable.
Step 5: Score the test set
After this, we are prepared to score the test set and see how our model performs. Therefore, we use the Score Model module and we connect both the output port of the Train model module, which contains the trained model, as the outcome of the Split Data set, containing the test data.
Step 6: Evaluate the results
Finally, it’s time to evaluate the results of our model. We use the Evaluate Model module which we connect to the results of our prior scoring.
Let’s run the model, and then right click on the Evaluate Model module to visualize the results. We can predict with 98% accuracy and 98% precision.
Step 7: Gain insights on the why
Our final question was why employees were leaving. Therefore, we could add the Permutation Feature Importancy module. We connect the output port of the Train Model module and the output port of the Split Data module. Now we can compute the permutation feature importance scores of feature variables given this trained model and the test dataset. We set a seed to make the experiment replicable, and we focus on accuracy, meaning that we are both interested in selected correctly the people that leave, and the people that will not leave.
If we run the model, and right-click on the output port of the Permutation Feature Importance module, we find that satisfaction was one of the main factors when leaving, according to this dataset. Next to that, the number of project an employee got was important.
Of course there is much information missing. We don’t know anything about the dates of the obtained data, nor do we know anything between the data gathering and the moment that the employee left.
As mentioned before, this hands on lab is created to inspire you. If for whatever reason you were struggling to get the model built, you can also download the complete model from the Azure AI Gallery.
If you want to know more about how to work with the Azure Machine Learning Studio, we would like to invite you to take a look at the Principles of Machine Learning course, offered by DataChangers. And if you want to learn more about data science in general, please check out the Microsoft Professional Program – Data Science Track.
Do you want to become a data scientist yourself? Then check out our Future Skills Lab – Data Science Program!
We hope you enjoyed this tutorial. Please feel free to leave us your comments!