The Water Cooler

A blog of fresh ideas and findings from organizational leaders and researchers on how they’re making work better, shared regularly.

Hiring, honeybees, and human decision making

Filed under: People Analytics, Hiring
Borrowing a trick from how teachers grade a stack of tests, “chunking” job applications has been shown to reduce bias and increase accuracy in hiring.

Dozens of studies have shown that the choices we make about what we eat, how we save, and even how we vote can be affected by how those choices are presented: their choice architecture.

Research has found, for example, that the order of candidates’ names on a ballot can influence election outcomes (hint: you want to be listed first). Studies of what’s called ‘successive contrast’ have found that certain animal species, like honeybees, will abandon otherwise healthy feeding grounds if they’ve just been in plentiful settings: ‘good’ can seem ‘bad’ when contrasted with ‘excellent’. Studies on humans have shown that how attractive someone seems is at least in part a function of how attractive the last person you saw was. In short, the judgements we make can be highly contextual.

So how do you remove as much of the luck, noise, and bias from candidate review as you can? Structured interviews are a great way to consistently ask candidates the same questions, but we wondered whether the way we review candidates’ answers was as consistent as it could be. We thought that instead of grading all of one candidate’s responses and then moving on to the next candidate, we should “chunk” responses, reviewing different candidates’ responses to the same question together (a trick schoolteachers grading tests have used for years). So we ran an experiment to see whether these quirky ordering effects impact hiring decisions, and whether our chunking method actually reduces bias.
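Mechanically, chunking is just a transpose of the review order: instead of iterating candidate by candidate, you iterate question by question. Here is a minimal sketch of the idea; the data shapes and names are illustrative, not taken from any real reviewing platform.

```python
import random

# Hypothetical data: each candidate's answers, in question order.
applications = {
    "candidate_a": ["a's answer to q1", "a's answer to q2"],
    "candidate_b": ["b's answer to q1", "b's answer to q2"],
    "candidate_c": ["c's answer to q1", "c's answer to q2"],
}

def chunk_by_question(applications, rng=random):
    """Group answers by question instead of by candidate,
    shuffling candidate order within each question's batch."""
    n_questions = len(next(iter(applications.values())))
    batches = []
    for q in range(n_questions):
        batch = [(name, answers[q]) for name, answers in applications.items()]
        rng.shuffle(batch)  # a fresh random order within each batch
        batches.append(batch)
    return batches

# A reviewer scores one whole batch (all answers to q1) before moving
# on to the next question's batch.
for batch in chunk_by_question(applications):
    for name, answer in batch:
        pass  # reviewer rates this answer here
```

Calling `chunk_by_question` once per reviewer gives each reviewer an independently shuffled ordering, which is what lets ordering effects wash out in aggregate.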

Using an online experimental platform, we asked 150 reviewers to rate 100 unique responses to four work-related challenges, drawn from real applications to the Behavioural Insights Team. Responses were anonymized, chunked by question, and their order within each question was randomized across every reviewer. So each reviewer scored a batch of randomly ordered responses to question 1 on a scale of 1 (unsatisfactory) to 5 (exceptional), then a batch of randomly ordered responses to question 2, and so on. We then compared each rating to a benchmark score: the average of every reviewer’s score for that response.
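The benchmark described above is simply the mean rating a response receives across all reviewers; an individual rating's distance from that benchmark is then a rough measure of that review's accuracy. A small sketch, with made-up reviewers, response IDs, and scores:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings: (reviewer, response_id) -> score on the 1-5 scale.
scores = {
    ("reviewer_1", "resp_a"): 4, ("reviewer_1", "resp_b"): 2,
    ("reviewer_2", "resp_a"): 5, ("reviewer_2", "resp_b"): 3,
    ("reviewer_3", "resp_a"): 3, ("reviewer_3", "resp_b"): 2,
}

def benchmark_scores(scores):
    """Benchmark for each response: the mean of every reviewer's rating."""
    by_response = defaultdict(list)
    for (_, response_id), rating in scores.items():
        by_response[response_id].append(rating)
    return {rid: mean(ratings) for rid, ratings in by_response.items()}

def deviations(scores, benchmarks):
    """Absolute gap between each rating and its response's benchmark."""
    return {(reviewer, rid): abs(rating - benchmarks[rid])
            for (reviewer, rid), rating in scores.items()}

benchmarks = benchmark_scores(scores)  # resp_a's benchmark is (4+5+3)/3 = 4.0
```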

The results were clear: context matters. Specifically, we found three ordering effects at play in the candidate review process:

  1. Reviews get more accurate over time. We looked at whether the score given to a particular response differed depending on whether the reviewer read it first in the batch, 9th, 17th, or last, by measuring the deviation between the reviewer’s score and the average score across all reviewers. For the first question, the group’s average deviation fell by roughly 17% from the first to the last rated response, and this result was highly statistically significant.

  2. It helps to be first in line. The average rating across all candidates is 3.35, but being reviewed first increases that to 3.52. Taken over multiple questions, this can turn out to be decisive. Interestingly, the positive effect of being first appeared not only for the first question but also for each subsequent one. Combined with finding 1, this suggests that reviewers go through some general calibration when they start off, but also that they recalibrate slightly within each question.

  3. It matters who comes before you. We took the top 10 and bottom 10 candidates on each question (as rated by all reviewers) and looked at the impact they had on scores for the next few candidates. We found strong evidence that scores given to candidates were affected by the strength or weakness of the candidate seen immediately before, even controlling for our earlier findings. An average candidate gets a lower score if they come after a phenomenal candidate, but a higher score if they come after a poor one. These ‘spillover’ effects are more extreme in the latter case, and it turns out they affect more than just the next candidate: candidates two and three places down the line also benefit from having a poor candidate reviewed recently.
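The analysis behind finding 1 can be sketched as grouping each rating's absolute deviation from its benchmark by the position at which it was reviewed; a downward trend in the per-position averages is the calibration effect. The data below is invented purely to illustrate the shape of the computation:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical review log: (position_in_batch, reviewer_score, benchmark).
# Position 1 is the first response the reviewer saw in that batch.
reviews = [
    (1, 2, 3.4), (1, 5, 3.6),
    (2, 3, 3.4), (2, 4, 3.6),
    (3, 3, 3.1), (3, 4, 3.9),
]

def accuracy_by_position(reviews):
    """Mean absolute deviation from the benchmark at each review position.
    Deviations shrinking with position suggest reviewers calibrate as they go."""
    by_pos = defaultdict(list)
    for position, score, benchmark in reviews:
        by_pos[position].append(abs(score - benchmark))
    return {pos: mean(devs) for pos, devs in sorted(by_pos.items())}
```

In the toy data, the average deviation falls from position 1 to position 3, mimicking the roughly 17% improvement reported above (the real analysis also involved significance testing, which is omitted here).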

Kate Glazebrook is the CEO and cofounder of Applied, a venture spun out from the Behavioural Insights Team that works on using science to make recruiting better. Janna Ter Meer is a Research Advisor on the Behavioural Insights Team.

This blog is part of our re:Work series on behavioral economics. Read more: