In this blog post on human resources analytics, we build a model that predicts whether an employee will leave, and we then look at which factors, according to the data, are associated with leaving. We use a simulated dataset from Kaggle, which can be found here: https://www.kaggle.com/ludobenistant/hr-analytics
Fields in the dataset include:
- Satisfaction level
- Last evaluation
- Number of projects
- Average monthly hours
- Time spent at the company
- Whether they have had a work accident
- Whether they have had a promotion in the last 5 years
- Whether the employee has left
First we build the model that predicts whether an employee will leave. Then we look at which factors, according to the data, contribute most to that prediction.
The complete model will look like this:
Step 1: Get a first understanding of the data
After selecting the dataset, we want a first impression of the data. For this we use the Summarize Data module, which reports descriptive statistics for every column. We see that the dataset has 14999 observations and no missing values. We also get an idea of the variance and distribution of each field.
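Outside Azure ML Studio, a similar first look can be sketched in plain Python. The snippet below is only an illustration: the four rows and the column names are made up for the example and are not read from the Kaggle file, and the statistics are a small subset of what Summarize Data reports.

```python
import statistics

# Hypothetical sample rows for illustration; the real dataset has 14999 observations.
rows = [
    {"satisfaction_level": 0.38, "number_project": 2, "left": 1},
    {"satisfaction_level": 0.80, "number_project": 5, "left": 0},
    {"satisfaction_level": 0.11, "number_project": 7, "left": 1},
    {"satisfaction_level": 0.72, "number_project": 4, "left": 0},
]

def summarize(rows):
    """Return count, missing count, mean, standard deviation, min and max per column."""
    summary = {}
    for column in rows[0]:
        # Treat None as a missing value.
        values = [row[column] for row in rows if row[column] is not None]
        summary[column] = {
            "count": len(values),
            "missing": len(rows) - len(values),
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values),
            "min": min(values),
            "max": max(values),
        }
    return summary

for column, stats in summarize(rows).items():
    print(column, stats)
```

A non-zero "missing" count for any column would be the signal to clean the data before training.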
Step 2: Prepare a training and a test set
We split the dataset with the Split Data module into a training set and a test set, using 70% of the data to train the model and the remaining 30% to test it later on. We set a random seed so that the experiment can be repeated with the same split.
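To show what a seeded 70/30 split does, here is a minimal sketch in plain Python. The function name split_rows and the seed value 42 are our own choices for the example, and the exact rows that land in each set will differ from what Split Data produces in Azure ML Studio.

```python
import random

def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows with a fixed seed, then cut it into two parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the dataset: one integer id per observation.
rows = list(range(14999))
train, test = split_rows(rows)
print(len(train), len(test))

# The same seed gives the same split, which is what makes the experiment repeatable.
train_again, test_again = split_rows(rows)
print(train == train_again and test == test_again)
```

Changing the seed changes which rows end up in each set, but not the sizes of the two sets.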
Step 3: Train the model
In this experiment we use the Two-Class Boosted Decision Tree algorithm with its default parameters. We set a seed to make the experiment replicable.
We train the model with the column “left” as the label to predict.
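To give a feeling for how boosting combines many weak trees into one strong classifier, the sketch below implements AdaBoost with one-split decision stumps in plain Python. This is a simpler boosting variant chosen for illustration, not the implementation behind the Two-Class Boosted Decision Tree module, and the six rows (one feature, with +1 meaning "left" and -1 meaning "stayed") are made up.

```python
import math

def stump_predict(stump, row):
    """A stump predicts `sign` when the feature is <= threshold, else -sign."""
    feature, threshold, sign = stump
    return sign if row[feature] <= threshold else -sign

def fit_stump(X, y, weights):
    """Search all features and thresholds for the lowest weighted error."""
    best_error, best_stump = None, None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for sign in (1, -1):
                stump = (feature, threshold, sign)
                error = sum(w for w, row, label in zip(weights, X, y)
                            if stump_predict(stump, row) != label)
                if best_error is None or error < best_error:
                    best_error, best_stump = error, stump
    return best_error, best_stump

def fit_boosted(X, y, rounds=3):
    """Train `rounds` stumps, re-weighting the rows after each one."""
    weights = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(rounds):
        error, stump = fit_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, stump))
        # Misclassified rows get a larger weight, correctly classified rows a smaller one.
        weights = [w * math.exp(-alpha * label * stump_predict(stump, row))
                   for w, row, label in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, row):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: leavers sit at both ends of the feature range,
# so no single split can separate the two classes.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]

model = fit_boosted(X, y, rounds=3)
print([predict(model, row) for row in X])
```

On this toy data a single stump misclassifies two of the six rows, while the weighted vote of three stumps classifies all of them correctly. That is the core idea of boosting: each new tree concentrates on the rows the previous ones got wrong.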
Step 4: Score the test set
With the trained model in place, we use the Score Model module to score the test set.
Step 5: Evaluate the results
Finally, we use the Evaluate Model module to evaluate our model on the results of the scoring step. On the test set, the model reaches 98% accuracy and 98% precision.
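For readers who want to see how these two metrics are defined, here is a small plain-Python sketch. The labels and predictions are invented for the example (1 means "left", 0 means "stayed") and have nothing to do with the 98% reported above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives and true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    """Share of all predictions that are correct."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)

def precision(y_true, y_pred):
    """Share of predicted leavers who actually left."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical labels and predictions for eight employees.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

print(accuracy(y_true, y_pred))
print(precision(y_true, y_pred))
```

Accuracy counts every correct prediction, while precision only looks at the employees the model flagged as leaving, so the two can differ noticeably when few employees leave.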
Step 6: Gain insights on the why
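Our final question was why employees leave. To answer it, we use the Permutation Feature Importance module, which measures how much the model's performance drops when the values of a single feature are randomly shuffled. We set a seed to make the experiment replicable, and we choose accuracy as the metric.
We find that, according to this dataset, employee satisfaction is one of the most important features for predicting whether an employee leaves.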
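The idea behind permutation importance fits in a few lines of plain Python. The sketch below is an illustration under our own assumptions: toy_model is a hypothetical stand-in for a trained model that only looks at the first column, the six rows are made up, and the number of repeats and the seed are arbitrary.

```python
import random

def accuracy(model, X, y):
    """Share of rows for which the model predicts the true label."""
    return sum(1 for row, label in zip(X, y) if model(row) == label) / len(y)

def permutation_importance(model, X, y, n_repeats=5, seed=42):
    """Average drop in accuracy after shuffling one column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled values.
            X_perm = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(row):
    """Hypothetical model: predicts 'left' (1) when column 0 is below 0.4."""
    return 1 if row[0] < 0.4 else 0

# Hypothetical rows: [satisfaction, last evaluation].
X = [[0.10, 0.9], [0.20, 0.5], [0.35, 0.7], [0.60, 0.8], [0.75, 0.6], [0.90, 0.4]]
y = [toy_model(row) for row in X]

importances = permutation_importance(toy_model, X, y)
print(importances)
```

Shuffling a column the model never uses leaves accuracy unchanged, so its importance is zero. A feature the model relies on loses accuracy when shuffled. Note that this tells us which features matter for the prediction, which is not necessarily the same as what causes an employee to leave.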
We hope you enjoyed this blog. Please feel free to leave us your comments!