
Monday, October 23, 2017

Employee turnover: how to predict individual risks of quitting



There are several approaches to predicting employee turnover (see Analyzing Employee Turnover - Predictive Methods; for Russian-speaking readers, Анализ текучести персонала – Методы прогнозирования).
Survival Analysis is one of the most important of them, but it is not the most popular algorithm for predicting employee turnover.
Analysts tend to use more familiar algorithms such as Logistic Regression, but, for example, Pasha Roberts writes: "Don't use logistic methods to predict attrition!". I think logistic regression is only suitable for short-term, binary questions, such as whether an employee has worked more or less than three months. If our goal is to predict individual quitting risks, then the best method is Survival Analysis.
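To make the contrast concrete, here is a minimal sketch (assuming the turnover.csv columns described below, and assuming "stag" is measured in months): the logistic framing collapses tenure into a single yes/no label, while the survival framing keeps the full time-to-event structure.

```r
library(survival)

data = read.csv("turnover.csv", header = TRUE, sep = ",", na.strings = "")

# Logistic framing: a single binary question --
# did the employee quit within the first three months?
data$left_3m = as.integer(data$event == 1 & data$stag <= 3)
logit_fit = glm(left_3m ~ age + gender, data = data, family = binomial)

# Survival framing: model the whole censored tenure distribution.
cox_fit = coxph(Surv(stag, event) ~ age + gender, data = data)
```

The logistic model can only ever answer the three-month question it was trained on; the Cox model can be queried at any horizon.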
The problem is that, since it is not popular, many people do not understand how to apply to Survival Analysis such standard steps as train-test splitting, cross-validation, and hyperparameter tuning.
I want to show how I do it.
If you want to practice with my code, you can use your own dataset or my turnover.csv (link from Dropbox). Variables:
  • "stag" - tenure (experience);
  • "event" - the event (quitting);
  • "gender";
  • "age";
  • "industry";
  • "profession";
  • "traffic" - the pipeline through which the candidate came to the company;
  • "coach" - presence of a coach during the probation period;
  • "head_gender";
  • "greywage" - whether part of the salary is hidden from the tax authorities ("grey" wage);
  • "way" - how the employee gets to the workplace (on foot, by bus, etc.);
  • "extraversion", "independ", "selfcontrol", "anxiety", "novator" - Big Five test scales.
The dataset is real. I omit some steps, such as scaling, to save space.
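Before modelling, it is worth a quick look at the data. A minimal sketch (assuming the CSV above has been downloaded to the working directory):

```r
data = read.csv("turnover.csv", header = TRUE, sep = ",", na.strings = "")

str(data)                       # variable types
summary(data$stag)              # distribution of tenure
prop.table(table(data$event))   # share of observed quits vs. censored records
```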

The Code
Packages and data loading

library(mlr)
library(survival)
library(pec)
library(survAUC)
library(dplyr)
library(reshape2)
library(ggplot2)
data = read.csv("turnover.csv", header = TRUE, sep = ",", na.strings = "")
As you can see, I use the survival package, but you can adapt this code to any Survival Analysis package (I prefer "randomForestSRC", for example).
Train-test splitting
train = sample(nrow(data), 0.7 * nrow(data))
test = setdiff(seq_len(nrow(data)), train)
train.task = makeSurvTask(data = data[train, ], target = c("stag", "event"))
train.task
test.task = makeSurvTask(data = data[test, ], target = c("stag", "event"))
test.task
Setting the learner
lrn = makeLearner("surv.coxph", id = "cph")
Setting a grid of hyperparameters
surv_param = makeParamSet(
  makeDiscreteParam("ties",  values = c('efron','breslow','exact')),
  makeIntegerParam("iter.max", lower = 1, upper = 150),
  makeIntegerParam("outer.max", lower = 1, upper = 50)
)
We can list all the available hyperparameters with
getParamSet("surv.coxph")
Tuning control and cross-validation parameters
rancontrol = makeTuneControlRandom(maxit = 10L)
set_cv = makeResampleDesc("RepCV", folds = 5L, reps = 5L)
And now the most delicious part: tuning the hyperparameters
surv_tune = tuneParams(learner = lrn, resampling = set_cv, task = train.task,
                       par.set = surv_param, control = rancontrol)
Now we get the most suitable parameters
surv_tune$x
$ties
[1] "efron"
$iter.max
[1] 6
$outer.max
[1] 46
And the performance measure of the model's quality
surv_tune$y
cindex.test.mean 
        0.600787
cindex is the concordance index (an analogue of ROC AUC for survival models), and 0.60 is not so good. Then we refit the tuned model and check it on the test data
surv.tree = setHyperPars(lrn, par.vals = surv_tune$x)
surva = mlr::train(surv.tree, train.task)
getLearnerModel(surva)
model = predict(surva, test.task)
model
performance(predict(surva, newdata = data[test, ]))
cindex
0.653796
Much better. We can also check the performance measure with
library(risksetROC)

w.ROC = risksetROC(Stime = data[test, ]$stag, 
                   status = data[test, ]$event,
                   marker = model$data$response,
                   predict.time = median(data[test, ]$stag),
                   method = "Cox",
                   main = paste("OOB Survival ROC Curve at t=",
                                median(model$data$truth.time)),
                   lwd = 3,
                   col = "red" )

w.ROC$AUC
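For intuition about what the concordance index measures: over all comparable pairs of employees, it is the share of pairs where the model's risk ordering matches the actual order of quitting. A toy base-R sketch with made-up numbers (no censoring, for simplicity):

```r
# Hypothetical tenures and predicted risk scores for four employees.
time = c(2, 5, 7, 10)
risk = c(0.9, 0.7, 0.8, 0.1)

# A pair is concordant when the employee with the higher predicted risk
# actually quit earlier (time and risk move in opposite directions).
pairs = combn(length(time), 2)
concordant = apply(pairs, 2, function(p) {
  (time[p[1]] - time[p[2]]) * (risk[p[1]] - risk[p[2]]) < 0
})
mean(concordant)  # 5 of 6 pairs agree -> a C-index of about 0.83
```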
But our goal is to predict the individual quitting risks of specific applicants.
mod = coxph(Surv(stag, event) ~ ., data = data[train, ], ties = "efron", iter.max = 6,
            outer.max = 46) # we apply the tuned parameters

Now we take "new" applicants and predict their risks
newdudes = data[test, ][c(4,5,12), ]
e = predictSurvProb(mod, newdata = newdudes, times = data[test, "stag"])
quantile(e, na.rm=TRUE)
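The matrix e has one row per applicant and one column per time point passed in times, so individual probabilities can be read off directly. A small continuation of the code above (the 12-month horizon is my own arbitrary example, assuming "stag" is measured in months):

```r
# Survival probability of each applicant at (approximately) 12 months:
t_grid = data[test, "stag"]            # the time points supplied to predictSurvProb
col_12 = which.min(abs(t_grid - 12))   # column closest to the chosen horizon
e[, col_12]                            # one probability per "new dude"
```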

And the visualisation
df = as.data.frame(t(e))
df = rename(df, Johnson = '6', Peterson = '11', Sidorson = '38')
df$time = as.integer(row.names(df))
test_data_long = melt(df, id = "time")  # convert to long format
ggplot(data = test_data_long,
       aes(x = time, y = value, colour = variable)) +
  geom_line(size = 1.5) +
  theme_grey(base_size = 30) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  ylab("Prob")

[Plot: predicted survival probability over time for Johnson, Peterson, and Sidorson]
So Sidorson is the best in terms of staff turnover: he has the lowest risk of quitting.
I would be glad to receive any feedback.

