17

Hit 95% accuracy on my hiring algorithm and still got burned by bias

I was training a resume screening model for a mid-size company in Austin and got it to 95% accuracy on the test set. Felt great until an auditor ran it through a fairness toolkit and found it was rejecting women for warehouse roles at twice the rate it rejected men. The accuracy number just hid the problem. Has anyone else had a high-performing model fail a fairness check in a way you didn't expect?
3 comments

martin.riley
Accuracy numbers are a trap if you don't dig into what they're actually measuring. 95% doesn't mean much when the model is basically learning to reject people based on zip codes or names that correlate with gender. The real problem is that most folks stop at overall accuracy and never break the results down by group. You gotta run the data through a fairness tool from day one, not as an afterthought. That's the only way to catch this stuff before it's live and causing damage.
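To make the slicing point concrete, here's a rough sketch in plain Python of reporting overall accuracy next to per-group accuracy and rejection rate. Everything in it is made up for illustration: the `slice_report` helper, the "F"/"M" labels, and the 20-row toy dataset are assumptions, not anything from the original model or the auditor's toolkit.

```python
from collections import defaultdict

def slice_report(y_true, y_pred, groups):
    """Return overall accuracy plus per-group accuracy and rejection rate.

    y_true / y_pred: 1 = advance the applicant, 0 = reject.
    groups: one group label per applicant (hypothetical "F" / "M" here).
    """
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "rejected": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        s = stats[group]
        s["n"] += 1
        s["correct"] += int(truth == pred)
        s["rejected"] += int(pred == 0)

    total = sum(s["n"] for s in stats.values())
    overall = sum(s["correct"] for s in stats.values()) / total
    per_group = {
        g: {
            "accuracy": s["correct"] / s["n"],
            "rejection_rate": s["rejected"] / s["n"],
        }
        for g, s in stats.items()
    }
    return overall, per_group

# Toy data (invented): 10 men and 10 women, labels taken from past hiring.
groups = ["M"] * 10 + ["F"] * 10
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

overall, per_group = slice_report(y_true, y_pred, groups)
print(f"overall accuracy: {overall:.2f}")
for g, m in sorted(per_group.items()):
    print(g, f"accuracy={m['accuracy']:.2f}", f"rejection_rate={m['rejection_rate']:.2f}")
```

The toy data is built so the model gets 19 of 20 predictions right while rejecting women at twice the rate of men. It matches the historical labels almost perfectly, which is exactly why the headline accuracy looks fine. A real fairness toolkit does far more than this, but a per-group table is the minimum check.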
4
elliotm57 12d ago
Throw a fairness tool on that thing and watch the accuracy number tank while the model suddenly stops being racist. Feels like 95% is just code for "we didn't look hard enough," right?
-1
ward.jamie 12d ago
That zip code thing Martin mentioned is real, I've seen it too. A buddy of mine was building a model for a retail chain in Atlanta and it learned to flag certain zip codes as "high risk." Turned out those zip codes were mostly Black neighborhoods. The model was 94% accurate but it was basically redlining applicants. Here's what I keep coming back to, though: when you found that gender bias in the warehouse roles, did you trace it back to a specific feature, or was it something more hidden, like the way the job descriptions were written? I've noticed a lot of these bias issues sneak in through the training data itself, not just the model. If the warehouse historically hired mostly men, the model just learns that pattern and then you're stuck trying to untangle it. Did you ever figure out if it was the data or the model architecture that caused the problem?
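For anyone who wants to test the proxy angle on their own data, here's a minimal sketch of one way to do it: for each value of a feature, check how far its group makeup sits from the overall base rate. The `proxy_report` helper, the `zip_A`/`zip_B` placeholders, the `g1`/`g2` group labels, and the 0.25 threshold are all hypothetical choices for illustration, not a standard.

```python
from collections import Counter, defaultdict

def proxy_report(feature_values, groups, target_group):
    """For each feature value, return the share of rows in target_group."""
    counts = defaultdict(Counter)
    for value, group in zip(feature_values, groups):
        counts[value][group] += 1
    return {
        value: c[target_group] / sum(c.values())
        for value, c in counts.items()
    }

# Toy data (invented): two placeholder zip codes, two placeholder groups.
zips = ["zip_A"] * 5 + ["zip_B"] * 5
groups = ["g1", "g1", "g1", "g1", "g2", "g1", "g2", "g2", "g2", "g2"]

base_rate = groups.count("g1") / len(groups)
shares = proxy_report(zips, groups, target_group="g1")

# Flag feature values whose makeup is far from the base rate.
# The 0.25 cutoff is an arbitrary assumption for this sketch.
flagged = sorted(z for z, share in shares.items() if abs(share - base_rate) > 0.25)

print(f"base rate of g1: {base_rate:.2f}")
for z, share in sorted(shares.items()):
    print(z, f"share of g1: {share:.2f}")
print("possible proxies:", flagged)
```

If a feature's values split the groups this cleanly, dropping the protected attribute from the inputs won't help much, because the model can rebuild it from the proxy. That would point to the data rather than the architecture.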
1