
Monday, October 23, 2017

Employee turnover: how to predict individual risks of quitting



There are several approaches to predicting employee turnover (see Analyzing Employee Turnover - Predictive Methods; for Russian-speaking readers, Анализ текучести персонала – Методы прогнозирования).
Survival Analysis is one of the most important of them, but it is not the most popular algorithm for predicting employee turnover.
Analysts tend to use more familiar algorithms such as Logistic Regression, but, for example, Pasha Roberts writes: "Don't use logistic methods to predict attrition!". I think logistic regression is only suitable for short-term, binary questions, such as whether an employee has worked more or less than three months. If our goal is to predict individual quitting risks, then the best method is Survival Analysis.
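To make the contrast concrete, here is a minimal sketch (assuming the turnover.csv columns described below, and assuming "stag" is measured in months): the logistic framing collapses tenure into a single yes/no label, while the survival framing keeps the full time-to-event structure.

```r
library(survival)

data = read.csv("turnover.csv", header = TRUE, sep = ",", na.strings = "")

# Logistic framing: a single binary question --
# did the employee quit within the first three months?
data$left_3m = as.integer(data$event == 1 & data$stag <= 3)
logit_fit = glm(left_3m ~ age + gender, data = data, family = binomial)

# Survival framing: model the whole censored tenure distribution.
cox_fit = coxph(Surv(stag, event) ~ age + gender, data = data)
```

The logistic model can only ever answer the three-month question it was trained on; the Cox model can be queried at any horizon.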
The problem is that, since it is not popular, many people do not understand how to apply to Survival Analysis such standard steps as train-test splitting, cross-validation, and hyperparameter tuning.
I want to show how I do it.
If you want to practice with my code, you can use your own dataset or my turnover.csv (link from Dropbox). Variables:
  • "stag" - tenure (experience);
  • "event" - the event (quitting);
  • "gender";
  • "age";
  • "industry";
  • "profession";
  • "traffic" - the pipeline through which the candidate came to the company;
  • "coach" - presence of a coach during the probation period;
  • "head_gender";
  • "greywage" - whether part of the salary is hidden from the tax authorities ("grey" wage);
  • "way" - how the employee gets to the workplace (on foot, by bus, etc.);
  • "extraversion", "independ", "selfcontrol", "anxiety", "novator" - Big Five test scales.
The dataset is real. I omit some steps, such as scaling, to save space.
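Before modelling, it is worth a quick look at the data. A minimal sketch (assuming the CSV above has been downloaded to the working directory):

```r
data = read.csv("turnover.csv", header = TRUE, sep = ",", na.strings = "")

str(data)                       # variable types
summary(data$stag)              # distribution of tenure
prop.table(table(data$event))   # share of observed quits vs. censored records
```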

The Code
Packages and data loading

library(mlr)
library(survival)
library(pec)
library(survAUC)
library(dplyr)
library(reshape2)
library(ggplot2)
data = read.csv("turnover.csv", header = TRUE, sep = ",", na.strings = "")
As you can see, I use the survival package, but you can adapt this code to any Survival Analysis package (I prefer "randomForestSRC", for example).
Train-test splitting
train = sample(nrow(data), 0.7 * nrow(data))
test = setdiff(seq_len(nrow(data)), train)
train.task = makeSurvTask(data = data[train, ], target = c("stag", "event"))
train.task
test.task = makeSurvTask(data = data[test, ], target = c("stag", "event"))
test.task
Setting the learner
lrn = makeLearner("surv.coxph", id = "cph")
Setting a grid of hyperparameters
surv_param = makeParamSet(
  makeDiscreteParam("ties",  values = c('efron','breslow','exact')),
  makeIntegerParam("iter.max", lower = 1, upper = 150),
  makeIntegerParam("outer.max", lower = 1, upper = 50)
)
We can list all the available hyperparameters with
getParamSet("surv.coxph")
Tuning control and cross-validation parameters
rancontrol = makeTuneControlRandom(maxit = 10L)
set_cv = makeResampleDesc("RepCV", folds = 5L, reps = 5L)
And now the most delicious part: tuning the hyperparameters
surv_tune = tuneParams(learner = lrn, resampling = set_cv, task = train.task,
                       par.set = surv_param, control = rancontrol)
Now we get the most suitable parameters
surv_tune$x
$ties
[1] "efron"
$iter.max
[1] 6
$outer.max
[1] 46
And the performance measure of the model's quality
surv_tune$y
cindex.test.mean 
        0.600787
cindex is the concordance index (an analogue of ROC AUC for survival models), and 0.60 is not so good. Then we refit the tuned model and check it on the test data
surv.tree = setHyperPars(lrn, par.vals = surv_tune$x)
surva = mlr::train(surv.tree, train.task)
getLearnerModel(surva)
model = predict(surva, test.task)
model
performance(predict(surva, newdata = data[test, ]))
cindex
0.653796
Much better. We can also check the performance measure with
library(risksetROC)

w.ROC = risksetROC(Stime = data[test, ]$stag, 
                   status = data[test, ]$event,
                   marker = model$data$response,
                   predict.time = median(data[test, ]$stag),
                   method = "Cox",
                   main = paste("OOB Survival ROC Curve at t=",
                                median(model$data$truth.time)),
                   lwd = 3,
                   col = "red" )

w.ROC$AUC
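For intuition about what the concordance index measures: over all comparable pairs of employees, it is the share of pairs where the model's risk ordering matches the actual order of quitting. A toy base-R sketch with made-up numbers (no censoring, for simplicity):

```r
# Hypothetical tenures and predicted risk scores for four employees.
time = c(2, 5, 7, 10)
risk = c(0.9, 0.7, 0.8, 0.1)

# A pair is concordant when the employee with the higher predicted risk
# actually quit earlier (time and risk move in opposite directions).
pairs = combn(length(time), 2)
concordant = apply(pairs, 2, function(p) {
  (time[p[1]] - time[p[2]]) * (risk[p[1]] - risk[p[2]]) < 0
})
mean(concordant)  # 5 of 6 pairs agree -> a C-index of about 0.83
```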
But our goal is to predict the individual quitting risks of specific applicants.
mod = coxph(Surv(stag, event) ~ ., data = data[train, ], ties = "efron", iter.max = 6,
            outer.max = 46) # we apply the tuned parameters

Now we take "new" applicants and predict their risks
newdudes = data[test, ][c(4,5,12), ]
e = predictSurvProb(mod, newdata = newdudes, times = data[test, "stag"])
quantile(e, na.rm=TRUE)
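The matrix e has one row per applicant and one column per time point passed in times, so individual probabilities can be read off directly. A small continuation of the code above (the 12-month horizon is my own arbitrary example, assuming "stag" is measured in months):

```r
# Survival probability of each applicant at (approximately) 12 months:
t_grid = data[test, "stag"]            # the time points supplied to predictSurvProb
col_12 = which.min(abs(t_grid - 12))   # column closest to the chosen horizon
e[, col_12]                            # one probability per "new dude"
```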

And the visualisation
df = as.data.frame(t(e))
df = rename(df, Johnson = '6', Peterson = '11', Sidorson = '38')
df$time = as.integer(row.names(df))
test_data_long = melt(df, id = "time")  # convert to long format
ggplot(data = test_data_long,
       aes(x = time, y = value, colour = variable)) +
  geom_line(size = 1.5) +
  theme_grey(base_size = 30) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  ylab("Prob")

[Plot: predicted survival probability over time for Johnson, Peterson, and Sidorson]
So Sidorson is the best in terms of staff turnover: he has the lowest risk of quitting.
I would be glad to receive any feedback.

