
Employee Attrition Prediction

📊 Machine Learning · Python · Scikit-learn · LDA · 2026

Most ML projects stop at "my model has 97% recall." This one asks: OK, but what does that actually cost? Predicting who will leave is only half the problem. Knowing what each wrong prediction costs your business is the other half.

7 algorithms compared
0.857 ROC-AUC (LDA)
82k CHF saved vs default
3 business scenarios

The problem

Employee attrition costs companies between 50% and 200% of an annual salary per lost employee. For a Swiss IT profile, that's roughly CHF 70,000 per person: recruiting, onboarding, lost productivity, and institutional knowledge walking out the door.

The standard approach is to train a model and pick the one with the best accuracy or F1. But that ignores something important: in this problem, a false negative (missing someone who's about to leave) does not cost the same as a false positive (intervening with someone who was staying anyway). The default threshold of 0.5 assumes the two errors are equally expensive. They're not.
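To make the asymmetry concrete, here is a minimal sketch of how two sets of predictions with the same number of mistakes can carry very different costs. The CHF 70,000 figure comes from the text above; treating it as the cost of a false negative, deriving the false-positive cost from a 3.4:1 ratio, and the error counts themselves are illustrative assumptions, not the project's actual cost model.

```python
# Illustrative assumptions: a missed leaver (false negative) costs the full
# CHF 70,000 replacement cost; an unnecessary intervention (false positive)
# costs 1/3.4 of that. Correct predictions are treated as free.
COST_FN = 70_000.0
COST_FP = COST_FN / 3.4


def error_cost(false_negatives: int, false_positives: int) -> float:
    """Total cost in CHF of a given mix of wrong predictions."""
    return false_negatives * COST_FN + false_positives * COST_FP


# Two hypothetical models, each making 12 mistakes on the same test set.
cautious_model = error_cost(false_negatives=10, false_positives=2)
eager_model = error_cost(false_negatives=2, false_positives=10)

print(f"Mostly misses leavers:  CHF {cautious_model:,.0f}")
print(f"Mostly over-intervenes: CHF {eager_model:,.0f}")
```

Both models are wrong equally often, yet the one that mostly misses leavers is far more expensive. Accuracy alone cannot see that difference.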

What makes this different

⚖️

Cost-based threshold tuning

Instead of optimizing for recall or F1, I optimized for total business cost. The threshold that minimizes cost isn't 0.5; it's 0.35 for the Swiss IT market.
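The idea can be sketched as a plain threshold sweep: score every candidate threshold by the cost of the errors it produces and keep the cheapest. The labels, probabilities, and cost values below are hypothetical and only illustrate the mechanics; the function names are mine, not the project's.

```python
def total_cost(y_true, y_score, threshold, cost_fn, cost_fp):
    """Cost of the errors made when flagging every score >= threshold."""
    cost = 0.0
    for label, score in zip(y_true, y_score):
        flagged = score >= threshold
        if label == 1 and not flagged:
            cost += cost_fn  # missed leaver
        elif label == 0 and flagged:
            cost += cost_fp  # unnecessary intervention
    return cost


def best_threshold(y_true, y_score, cost_fn, cost_fp, thresholds):
    """Return (threshold, cost) for the cheapest candidate threshold."""
    return min(
        ((t, total_cost(y_true, y_score, t, cost_fn, cost_fp)) for t in thresholds),
        key=lambda pair: pair[1],
    )


# Hypothetical labels (1 = left) and predicted probabilities of leaving.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.90, 0.60, 0.40, 0.20, 0.45, 0.30, 0.25, 0.10, 0.05, 0.02]

cost_fn = 70_000.0       # assumed cost of a missed leaver
cost_fp = cost_fn / 3.4  # assumed cost of an unnecessary intervention
candidates = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95

threshold, cost = best_threshold(y_true, y_score, cost_fn, cost_fp, candidates)
default_cost = total_cost(y_true, y_score, 0.5, cost_fn, cost_fp)
print(f"Cheapest threshold: {threshold:.2f} (CHF {cost:,.0f})")
print(f"Default 0.50 costs: CHF {default_cost:,.0f}")
```

On this toy data the cheapest threshold sits well below 0.5, because catching every leaver is worth a few extra interventions when misses are the expensive error.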

🇨🇭

Swiss market adaptation

The IBM dataset uses US salaries. I recalibrated all cost assumptions using real Swiss IT market data (jobs.ch, SECO 2025) with a scaling factor of 1.287.

🏒

3 business scenarios

The "best" model depends on your company. Retaining senior talent? Use threshold 0.12. Budget constraints? Use 0.50. Mixed profiles? Use 0.25.

🔬

Honest limitations

The model predicts whether someone will leave, not when. Overtime is the top predictor, but it may be a proxy for job type rather than a direct cause of attrition. Both limitations are documented openly.

The 3 scenarios

One threshold doesn't fit all companies. I defined three scenarios based on different business priorities:

Scenario 1

Retention

1,096k CHF

Threshold 0.12 · Recall 88% · Detects 30/34 employees. Best for senior profiles where losing anyone is expensive.

Scenario 2

Cost Control

1,090k CHF

Threshold 0.50 · Precision 79% · Only 4 unnecessary interventions. Best when budget for retention is limited.

Scenario 3 ⭐ lowest cost

Balanced

1,016k CHF

Threshold 0.25 · Detects 22/34 · Moderate interventions. Best for mixed profiles and general use.
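In code, switching scenarios amounts to changing one number. A minimal sketch, using the three thresholds listed above; the dictionary keys, the helper name, and the risk scores are hypothetical.

```python
# Thresholds taken from the three scenarios described above.
SCENARIO_THRESHOLDS = {
    "retention": 0.12,
    "cost_control": 0.50,
    "balanced": 0.25,
}


def flag_for_intervention(probabilities, scenario):
    """Return indices of employees whose predicted risk meets the scenario threshold."""
    threshold = SCENARIO_THRESHOLDS[scenario]
    return [i for i, p in enumerate(probabilities) if p >= threshold]


# Hypothetical predicted probabilities of leaving for six employees.
risk = [0.08, 0.15, 0.22, 0.31, 0.47, 0.66]

for name in SCENARIO_THRESHOLDS:
    print(name, flag_for_intervention(risk, name))
```

The same model output flags five of the six employees under the retention threshold, three under the balanced one, and only one under cost control.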

Tech stack

Python · Scikit-learn · LDA · Pandas · NumPy · Matplotlib · Seaborn · Jupyter · StandardScaler · Label Encoding · Confusion Matrix

What I learned

The most interesting insight from this project isn't technical. Running GridSearchCV with hyperparameter tuning (LogReg, 97% recall, cost 1,782k CHF) turned out dramatically worse than simple threshold tuning on a base LDA model (88% recall, cost 1,008k CHF). Better metrics don't mean better business decisions.
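For readers who want to try threshold tuning on a base LDA model, here is a rough sketch with scikit-learn. It uses synthetic data from `make_classification` as a stand-in for the IBM dataset, and the pipeline layout and the `recall_at` helper are my assumptions rather than the project's code.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 16% positives); not the IBM dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.84, 0.16], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Untuned LDA on standardized features.
model = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis())
model.fit(X_train, y_train)

# Predicted probability of the positive class ("leaves").
proba = model.predict_proba(X_test)[:, 1]


def recall_at(threshold):
    """Recall when every probability >= threshold is flagged as a leaver."""
    y_pred = (proba >= threshold).astype(int)
    return recall_score(y_test, y_pred)


for t in (0.50, 0.25, 0.12):
    print(f"threshold {t:.2f}: recall {recall_at(t):.2f}")
```

Lowering the threshold can only keep or raise recall, because it flags a superset of the employees flagged before. What it costs in extra interventions is what the cost function has to decide.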

I also learned that adapting a model to a specific market context requires more than changing currency symbols. The cost ratio between false negatives and false positives fundamentally changes the optimal threshold: from 0.12 with SHRM's 6.5:1 ratio to 0.35 with Switzerland's 3.4:1 ratio.
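Why the ratio matters can be seen from the textbook break-even rule. If the predicted probabilities are well calibrated and only wrong predictions carry a cost (both are assumptions here), intervening is cheaper in expectation whenever p × cost_FN exceeds (1 − p) × cost_FP, which rearranges to p > 1 / (1 + ratio). The sketch below evaluates that rule for the two ratios mentioned above. The thresholds reported in this project were found empirically on the test set, so they need not coincide with these theoretical values.

```python
def break_even_threshold(fn_to_fp_cost_ratio: float) -> float:
    """Probability above which intervening is cheaper in expectation.

    Intervene when p * cost_fn > (1 - p) * cost_fp, i.e. p > 1 / (1 + ratio),
    where ratio = cost_fn / cost_fp.
    """
    if fn_to_fp_cost_ratio <= 0:
        raise ValueError("cost ratio must be positive")
    return 1.0 / (1.0 + fn_to_fp_cost_ratio)


for label, ratio in [("SHRM 6.5:1", 6.5), ("Switzerland 3.4:1", 3.4)]:
    print(f"{label}: {break_even_threshold(ratio):.3f}")
# SHRM 6.5:1: 0.133
# Switzerland 3.4:1: 0.227
```

A higher cost ratio pushes the break-even point down, which matches the direction of the shift described above: the more a missed leaver costs relative to an unnecessary intervention, the lower the threshold should go.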