There are several approaches to predicting employee turnover (see Analyzing Employee Turnover - Predictive Methods; for Russian-speaking readers, Анализ текучести персонала - Методы прогнозирования).
Survival Analysis is one of the most important methods, but it is not the most popular algorithm for predicting employee turnover.
Analysts use more familiar algorithms like Logistic Regression, but, for example, Pasha Roberts writes: "Don't use logistic methods to predict attrition!". I think logistic regression can only be applied to a short-term question, such as whether an employee has worked more or less than three months. If our goal is to predict individual quitting risks, then the best method is Survival Analysis.
The problem is that, since Survival Analysis is not popular, many do not understand how to apply such steps as train-test splitting, cross-validation, and hyperparameter tuning to it.
I want to show how I do it.
If you want to practise with my code, you can use your own dataset or mine, turnover.csv (link from Dropbox). Variables:
- "stag" - experience;
- "event" - quit event;
- "gender"
- "age"
- "industry"
- "profession"
- "traffic" - the recruitment pipeline through which the candidate came to the company;
- "coach" - presence of a coach during the probation period;
- "head_gender"
- "greywage" - a "grey" wage, i.e. the salary is not fully visible to the tax authorities;
- "way" - how the employee gets to the workplace (on foot, by bus, etc.);
- "extraversion", "independ", "selfcontrol", "anxiety", "novator" - Big Five test scales.
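For a quick first look at these variables, a minimal sketch (assuming the file has already been read into data, as shown in the next section):
str(data)           # types and sample values of each variable
summary(data$stag)  # distribution of tenure
table(data$event)   # quit events vs. censored observations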
The Code
Packages and uploading
library(mlr)
library(survival)
library(pec)
library(survAUC)
library(dplyr)
library(reshape2)
library(ggplot2)
data = read.csv("turnover.csv", header = TRUE, sep = ",", na.strings = "")
As you can see, I use the survival package here, but you can apply this code with any Survival Analysis package (for example, I prefer "randomForestSRC").
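One of mlr's conveniences is that swapping the underlying package usually means changing a single line. As a hedged sketch (assuming randomForestSRC is installed and exposed in mlr as "surv.randomForestSRC"):
lrn.rf = makeLearner("surv.randomForestSRC", id = "rfsrc")
# the rest of the pipeline (tasks, tuning, training) would stay the same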
Train-test splitting
train = sample(nrow(data), 0.7 * nrow(data))
test = setdiff(seq_len(nrow(data)), train)
train.task = makeSurvTask(data = data[train, ], target = c("stag", "event"))
train.task
test.task = makeSurvTask(data = data[test, ], target = c("stag", "event"))
test.task
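Note that sample() draws a random split, so your numbers will differ from mine between runs. For a reproducible split you can fix the seed first; the value below is an arbitrary choice of mine:
set.seed(42)  # any fixed seed works; 42 is arbitrary
train = sample(nrow(data), 0.7 * nrow(data))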
Set a learner
lrn = makeLearner("surv.coxph", id = "cph")
Set a grid of hyperparameters
surv_param = makeParamSet(
  makeDiscreteParam("ties", values = c("efron", "breslow", "exact")),
  makeIntegerParam("iter.max", lower = 1, upper = 150),
  makeIntegerParam("outer.max", lower = 1, upper = 50)
)
We can look up all of the learner's hyperparameters with
getParamSet("surv.coxph")
Parameters of tune control and cross-validation
rancontrol = makeTuneControlRandom(maxit = 10L)
set_cv = makeResampleDesc("RepCV", folds = 5L, reps = 5L)
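Random search with maxit = 10L tries only ten random parameter combinations. When the grid is small, exhaustive grid search is a possible alternative in mlr; a minimal sketch (the resolution value is my assumption):
gridcontrol = makeTuneControlGrid(resolution = 5L)  # 5 points per numeric parameter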
surv_tune = tuneParams(learner = lrn, resampling = set_cv, task = train.task, par.set = surv_param, control = rancontrol)
Now we get the most suitable parameters:
surv_tune$x
$ties
[1] "efron"
$iter.max
[1] 6
$outer.max
[1] 46
And the performance measure of the model's quality:
surv_tune$y
cindex.test.mean
0.600787
cindex is the concordance index (analogous to ROC AUC), and 0.60 is not so good. Then we train on the test data:
surv.tree = setHyperPars(lrn, par.vals = surv_tune$x)
surva = mlr::train(surv.tree, test.task)
getLearnerModel(surva)
model = predict(surva, test.task)
model
performance(predict(surva, newdata = data[test, ]))
cindex
0.653796
Better. We can also check the performance measure with
library(risksetROC)
w.ROC = risksetROC(Stime = data[test, ]$stag,
                   status = data[test, ]$event,
                   marker = model$data$response,
                   predict.time = median(data[test, ]$stag),
                   method = "Cox",
                   main = paste("OOB Survival ROC Curve at t=", median(model$data$truth.time)),
                   lwd = 3, col = "red")
w.ROC$AUC
But our goal is to predict the individual quitting risks of specific applicants.
# we apply the tuned parameters
mod = coxph(Surv(stag, event) ~ ., data = data[train, ],
            ties = "efron", iter.max = 6, outer.max = 46)
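Since survAUC is already loaded, we can also cross-check this model's discrimination over time with Uno's time-dependent AUC. This is a sketch of mine, not part of the original pipeline; the time grid is an assumption (I treat stag as months):
lp.test = predict(mod, newdata = data[test, ])      # linear predictors on test
surv.train = Surv(data[train, ]$stag, data[train, ]$event)
surv.test = Surv(data[test, ]$stag, data[test, ]$event)
times = seq(6, 60, by = 6)                          # assumption: tenure in months
auc.uno = AUC.uno(surv.train, surv.test, lp.test, times)
auc.uno$iauc                                        # integrated AUC over the chosen grid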
We take "new" applicants and predict their risks:
newdudes = data[test, ][c(4, 5, 12), ]
e = predictSurvProb(mod, newdata = newdudes, times = data[test, "stag"])
quantile(e, na.rm = TRUE)
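If you only need to rank these applicants rather than inspect full curves, coxph can also return a single relative risk score per person (this is standard predict.coxph behaviour, not something specific to my pipeline):
predict(mod, newdata = newdudes, type = "risk")  # exp(linear predictor): higher = riskier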
And the visualisation:
df = as.data.frame(t(e))
df = rename(df, Johnson = "6", Peterson = "11", Sidorson = "38")
df$time = as.integer(row.names(df))
test_data_long = melt(df, id = "time")  # convert to long format
ggplot(data = test_data_long, aes(x = time, y = value, colour = variable)) +
  geom_line(size = 1.5) +
  theme_grey(base_size = 30) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  ylab("Prob")
So Sidorson is the best from the staff-turnover point of view: he has the lowest risk of quitting.
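As a cross-check of the ggplot curves, the same survival curves can be drawn straight from the Cox model with survfit; a minimal sketch, assuming the row order of newdudes matches the names used above:
fit = survfit(mod, newdata = newdudes)
plot(fit, col = c("red", "green", "blue"), lwd = 2, xlab = "time", ylab = "Prob")
legend("bottomleft", legend = c("Johnson", "Peterson", "Sidorson"),
       col = c("red", "green", "blue"), lwd = 2)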
I will be glad to receive any feedback.