Predict Employee Leave
In this tutorial, you will learn how to use a simulated dataset from Kaggle to build a machine learning model that both predicts whether employees will leave their employer and explains why they may do so. The data cover a wide range of topics, which allows us to explain employees’ leave behavior in relation to A) organizational factors (department); B) employment relational factors (i.e., tenure, the number of projects participated in, the average working hours per month, objective career development, and salary); and C) job-related factors (performance evaluation and involvement in workplace accidents).
This tutorial aims to inspire you to explore the possibilities of using machine learning in your own research.
You will follow several steps to explore the data and build a machine learning model to predict whether an employee will leave or not, and why.
- Step 1: Get a first understanding of the data
- Step 2: Create the Experiment
- Step 3: Prepare a training and a test set
- Step 4: Train the model
- Step 5: Score the test set
- Step 6: Evaluate the results
- Step 7: Gain insights on the why
You will build this prediction model with the Azure Machine Learning Studio. The complete model will look like this:
Prerequisites: Get Access to Azure Machine Learning Studio
There are several options to start with Azure ML. The easiest way is to go to https://azure.microsoft.com/en-us/services/machine-learning/ and click on the Get started now button.
From there, you can select the Free Workspace option. You will need a Windows Live ID to sign in. If you don’t have one, you can sign up here: https://signup.live.com/
Step 1: Get a first understanding of the data
You can download the data at Kaggle https://www.kaggle.com/lnvardanyan/hr-analytics/data and save it as turnover.csv.
Note: If you have trouble obtaining the data, you can also start with the starting experiment from the Cortana Intelligence Gallery. You will have to open the experiment in your studio. After that, you can skip a few instructions (see *).
For those who have downloaded the data, we can continue by inspecting the dataset.
We have the following available variables in the dataset:
Employment relational factors
- Time spent at the company
- Number of projects
- Average monthly hours
- Whether they have had a promotion in the last 5 years

Job-related factors
- Last evaluation
- Whether they have had a work accident

Outcome
- Whether the employee has left
Step 2: Create the Experiment
Open a browser and go to https://studio.azureml.net. Then sign in using the Microsoft account associated with your Azure ML subscription. Create a new blank experiment by clicking on the + NEW button at the lower left of your browser window, then select EXPERIMENT, and subsequently BLANK EXPERIMENT. You can change the generated name into Predict Employee Leave.
The next step is to upload the turnover.csv file to Azure ML and name it Employee Leave data. To do this, click on the + NEW button in the lower left corner of your browser, then select DATASET, and subsequently FROM LOCAL FILE.
In the Predict Employee Leave experiment, go to My Datasets under Saved datasets, and drag the Employee Leave data onto the canvas.
(* if you have started with the starting experiment, you can continue here) To get a first impression of the data, you can right-click the output port of the dataset to visualize the data. You can scroll through the different columns, and by selecting them, you get an overview in the panel on the right.
Another way to get a first impression of the data is to use the Summarize Data module, which provides descriptive statistics about the data.
You have to RUN the model, then right-click on the output port of the Summarize Data module and select Visualize. We see that we have 14999 observations and that there are no missing values. We also get an idea of the variance and distribution of the data.
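Outside Azure ML Studio, the same first inspection can be sketched with pandas. The rows below are made-up stand-ins, and the column names follow the Kaggle dataset's conventions (an assumption, not a copy of the real file):

```python
import pandas as pd

# Hypothetical stand-in for turnover.csv; the values are invented
# purely for illustration.
df = pd.DataFrame({
    "satisfaction_level":   [0.38, 0.80, 0.11, 0.72],
    "last_evaluation":      [0.53, 0.86, 0.88, 0.87],
    "number_project":       [2, 5, 7, 5],
    "average_montly_hours": [157, 262, 272, 223],
    "time_spend_company":   [3, 6, 4, 5],
    "Work_accident":        [0, 0, 0, 0],
    "left":                 [1, 1, 1, 0],
})

print(len(df))                  # number of observations
print(df.isnull().sum().sum())  # 0 -> no missing values
print(df.describe())            # mean, std, quartiles per column
```

On the real turnover.csv, `len(df)` would report the 14999 observations mentioned above, and `df.describe()` plays the role of the Summarize Data module.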
Step 3: Prepare a training and a test set
We split the dataset into a training and a test set, using 70% of the data to train the model and 30% of the data to test the model later on. To do this, we drag the Split Data module onto the canvas and connect the output port of the dataset to the input port of the Split Data module. We set a seed so that we can repeat this experiment.
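The Split Data step can be sketched with scikit-learn's `train_test_split`; the feature matrix below is random stand-in data, and `random_state` plays the role of the seed set in the module:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels standing in for the HR data.
rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = rng.integers(0, 2, size=100)

# 70% training, 30% test; the fixed random_state makes the
# split reproducible, like the seed in the Split Data module.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

print(len(X_train), len(X_test))  # 70 30
```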
Step 4: Train the model
Now that we have split the data, we can continue working with the training set. We first select the Train Model module and drag it onto the canvas. When we do so, you will see a little red exclamation mark. This is because we haven’t selected the variable that we want to predict, and we haven’t defined the algorithm that we want to train the model with. First, we will select the dependent variable. To do this, click on the Launch column selector.
In order to set the dependent variable, we select the variable “left” (indicating whether an employee has left or not) from AVAILABLE COLUMNS and use the arrow button to get it to the right side, under “SELECTED COLUMNS”.
Furthermore, we have to select the algorithm to train the model with. In this experiment we use the Two-Class Boosted Decision Tree algorithm with its default parameters. We do add a seed to make the experiment replicable.
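Azure ML's Two-Class Boosted Decision Tree has no exact scikit-learn counterpart, but gradient-boosted trees are a close analogue. The sketch below trains one on synthetic stand-in data, not the tutorial's actual pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic two-class data standing in for the training set.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Gradient-boosted decision trees as a rough analogue of the
# Two-Class Boosted Decision Tree module; random_state is the seed.
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X, y)
print(clf.score(X, y))  # accuracy on the training data
```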
Step 5: Score the test set
After this, we are ready to score the test set and see how our model performs. We use the Score Model module and connect both the output port of the Train Model module, which contains the trained model, and the second output port of the Split Data module, which contains the test data.
Step 6: Evaluate the results
Finally, it’s time to evaluate the results of our model. We use the Evaluate Model module which we connect to the results of our prior scoring.
Let’s run the model, and then right-click on the Evaluate Model module to visualize the results. We can predict with 98% accuracy and 98% precision.
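As a rough local equivalent of the Score Model and Evaluate Model steps, the sketch below predicts on a held-out test set and computes accuracy and precision. The data are synthetic stand-ins, so the 98% figures from the real HR data will not be reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split 70/30 as in the tutorial.
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)  # "scoring" the held-out test set

print(f"accuracy:  {accuracy_score(y_test, y_pred):.2f}")
print(f"precision: {precision_score(y_test, y_pred):.2f}")
```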
Step 7: Gain insights on the why
Our final question was why employees were leaving. To answer it, we can add the Permutation Feature Importance module. We connect the output port of the Train Model module and the output port of the Split Data module. Now we can compute the permutation feature importance scores of the feature variables given this trained model and the test dataset. We set a seed to make the experiment replicable, and we focus on accuracy, meaning that we are interested in correctly identifying both the people that will leave and the people that will not.
If we run the model and right-click on the output port of the Permutation Feature Importance module, we find that satisfaction was one of the main factors behind leaving, according to this dataset.
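scikit-learn offers a comparable `permutation_importance` utility; the sketch below applies it to a trained model and a test set of synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split 70/30 with a fixed seed.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in accuracy,
# mirroring Azure ML's Permutation Feature Importance module.
result = permutation_importance(
    clf, X_test, y_test, scoring="accuracy", random_state=42)
print(result.importances_mean)  # one mean importance score per feature
```

Features whose shuffling hurts accuracy the most are the model's main drivers; on the real HR data, satisfaction would rank near the top.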
Of course, much information is missing. We don’t know anything about the dates of the obtained data, nor do we know anything about what happened between the data gathering and the moment the employee left.
As mentioned before, this tutorial was created to inspire you. If for whatever reason you struggled to get the model built, you can also download the complete model from the Cortana Intelligence Gallery.
We hope you enjoyed this tutorial. Please feel free to leave us your comments!