The Ian Atha Museum of Internet Curiosities
The more confident the machine looks

Authors
  • Ian Atha (@IanAtha)
    Ian Atha is an Athens-based technologist, ex-OpenAI, building at the intersection of craft, code, and civic life.

Talk to anyone who has spent more than a year reviewing flagged content for a large platform and you will hear a version of the same complaint. The model that suggests labels gets steadily better, the reviewers who oversee it get steadily worse, and by the second quarter of the year throughput has improved while the catch rate on subtle errors has fallen off a cliff that nobody can quite date. This is not a training problem and it is not a motivation problem. Put any competent person in front of a mostly-correct automated suggestion and you reproduce the curve, more or less, regardless of what they happen to be reviewing.

The research on the effect predates LLMs by about three decades. Aviation human-factors research calls it automation bias, after Mosier and Skitka's experiments in the mid-nineties showing that pilots will accept a faulty cockpit recommendation even when their own instruments contradict it: commission errors when the aid is wrong, omission errors when the aid misses something. The distributed-cognition school, working from Hutchins's Cognition in the Wild, frames the same phenomenon as cognitive offloading: a cognitive task is never really performed by one brain working alone, and inserting an automated aid into the loop changes the membership of the system performing the task. Responsibility for scrutiny redistributes along with everything else, usually without anyone deciding it should. A reviewer's felt sense of "I own this output" tends to drop as soon as something else in the room appears willing to bear that responsibility, regardless of what the formal accountability diagram still claims.

The version of the argument that I keep coming back to, though, is Lisanne Bainbridge's. Her 1983 paper, Ironies of Automation, is six pages long and remains embarrassingly current forty years on. The setup: in any plant where a control system is being introduced, the easy cases get automated first, because they are the easy cases. What this leaves the human operator with is two classes of work that used to be intermixed and are now separated. The mostly-correct automated process needs monitoring, and the residual hard cases need handling on the rare occasions they arise. Both of these are now harder than they were before automation arrived, in slightly different ways.

Monitoring a mostly-correct process is, on the evidence of a large and consistent sustained-vigilance literature, the single thing humans are reliably worst at. The useful attention window in lab settings tops out somewhere around twenty minutes; after that, attention collapses regardless of how good the operator is or how motivated they are. The hard residual cases, when they arrive, arrive in the hands of an operator whose underlying skill has been eroding the entire time, because the easy cases were where intuition got built and maintained in the first place. Bainbridge's point is that automation does not reduce the importance of the operator's skill so much as concentrate the importance of that skill into precisely the moments when the skill has been allowed to atrophy.

Forty years later, this is also a description of every human-in-the-loop review pipeline I have ever seen in production.

The framing I find most useful for actual interface design comes from somewhere unexpected, which is Erving Goffman's 1981 essay Footing. Goffman's observation, slightly compressed, is that the speaker role in any utterance is not unitary. There is an author, whose position the words represent, and there is an animator, who physically utters them, and the two are very often the same person but are sometimes very pointedly not. A press secretary at a podium is animator only. The same person, telling you about their own weekend, is both at once.

When a model generates a suggestion and a human reviews it, the reviewer's footing has slid from author toward animator. They didn't write the words. They are deciding whether to let the words through. Authorship carries a sort of epistemic obligation that animation simply doesn't, and the shift between them happens invisibly, through interface design, so that the same reviewer in the same job feels meaningfully different about the same piece of work depending only on where the words on the screen originated. This is most of why the common intuition that a human reviewer is a "safety net" under an AI system fails so often in practice. The net is only really taut when the person holding it still feels they are partly the source of what they're letting through.

The thing all of this points at, taken together, is that the location of epistemic responsibility matters more than accuracy. A 95%-accurate aid that puts its reviewer into animator mode produces worse joint outcomes than a 70%-accurate aid that keeps the reviewer in author mode, because the system's effective error rate is never the model's error rate alone but the joint rate of model plus reviewer, and a disengaged reviewer's contribution to that joint rate falls off toward zero. Which means the joint rate falls off toward whatever the model can do unaided. Which is a long way of saying that an organisation can spend a great deal of money building the review-less system that the procurement committee was assured nobody was building.
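The arithmetic behind that claim is simple enough to write down. A minimal sketch, with catch rates that are illustrative assumptions rather than measurements: an error survives the joint system only when the model makes it and the reviewer fails to catch it.

```python
def joint_error_rate(model_accuracy: float, reviewer_catch_rate: float) -> float:
    """Fraction of items on which an error survives both the model and
    the reviewer: the model must err AND the reviewer must miss it."""
    model_error = 1.0 - model_accuracy
    return model_error * (1.0 - reviewer_catch_rate)

# Illustrative numbers (assumptions, not measurements):
# a 95%-accurate aid whose disengaged reviewer catches 5% of its errors
disengaged = joint_error_rate(0.95, 0.05)   # ~4.75% joint error
# versus a 70%-accurate aid whose engaged reviewer catches 90%
engaged = joint_error_rate(0.70, 0.90)      # 3% joint error
assert engaged < disengaged
```

The exact numbers matter less than the shape: as the reviewer's catch rate goes to zero, the joint rate converges on the model's unaided error rate, whatever the org chart says about oversight.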

A second-order effect of this is one that keeps surprising teams I work with: making the model better often makes the joint system worse. Better suggestions accelerate the footing shift faster than the model's error rate drops to compensate for it. While the suggestions are visibly unreliable, the reviewer keeps reviewing. Once the suggestions become routinely correct, review effectively stops, and the system inherits whatever the model fails on, in full, without a second pair of eyes.
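One way to see why is a toy model in which vigilance collapses once the aid crosses a "routinely correct" threshold. The threshold and the catch rates below are assumptions chosen to show the shape of the effect, not values fitted to any data:

```python
def reviewer_catch_rate(model_accuracy: float,
                        trust_threshold: float = 0.90) -> float:
    """Toy vigilance model (illustrative assumption): the reviewer keeps
    reviewing while the aid is visibly unreliable, then effectively
    stops once its suggestions become routinely correct."""
    return 0.90 if model_accuracy < trust_threshold else 0.05

def joint_error(model_accuracy: float) -> float:
    """Joint error rate when vigilance depends on model accuracy."""
    missed = 1.0 - reviewer_catch_rate(model_accuracy)
    return (1.0 - model_accuracy) * missed

# Improving the model from 85% to 92% accuracy makes the joint system worse,
# because crossing the trust threshold costs more than the accuracy gain buys:
assert joint_error(0.85) < joint_error(0.92)   # 1.5% vs 7.6%
```

A step function is a caricature, of course; the real vigilance curve is presumably smoother. But any curve that drops faster than the error rate around some trust threshold produces the same non-monotonic region.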

What follows from this for actual interface work is fairly direct, even if every team I have watched skips most of it. Bulk-accept buttons in particular should not exist; the moment the interface offers an "accept all remaining" affordance the reviewer is fully in animator mode and the design has, for purposes of catching anything subtle, already failed. Per-item friction, irritating as that sounds when one is trying to ship a productivity tool, is what most reliably keeps the authorial attitude alive. Anything generated by the aid should also look pre-attentively different from anything a human authored or confirmed, with the specifics of how (a dashed border, a muted background) mattering much less than whether the eye can sort the screen at a glance into machine-output and human-output. Goffman's footing is invisible if you don't make it visible.

The underlying data being labeled needs to be present in the same view as the label, at a size someone can actually read it at, and this is so frequently violated in shipped review tools that flagging it has stopped feeling productive. The common pattern is a tight scrollable list of suggested actions with the raw material a click away. One click away is already too far away; author mode requires that examining the evidence be cheaper, in attention-cost, than skipping it. Inserting periodic spot-checks into the queue — items the system has classified correctly, presented as fresh, with agreement tracked over time — is one of the few interventions against the Bainbridge vigilance problem that demonstrably works in the field, and it works largely because the reviewer knows it is happening.
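The spot-check mechanism itself is not complicated. A hedged sketch, assuming a `gold` pool of items whose correct labels are already known, mixed into the live queue so they are indistinguishable from fresh work:

```python
import random

def interleave_spot_checks(queue, gold, rate=0.1, rng=None):
    """Insert known-answer items into the queue at random positions.

    Returns (items, gold_positions) so agreement can be scored later.
    `rate` is the fraction of the queue to add as spot-checks.
    """
    rng = rng or random.Random()
    items = list(queue)
    positions = set()
    n_gold = max(1, int(len(items) * rate))
    for g in rng.sample(gold, min(n_gold, len(gold))):
        pos = rng.randrange(len(items) + 1)
        items.insert(pos, g)
        # shift previously recorded positions that the insert pushed down
        positions = {p + 1 if p >= pos else p for p in positions}
        positions.add(pos)
    return items, positions

def agreement(reviewer_labels, items, positions):
    """Fraction of spot-check items the reviewer labeled correctly,
    assuming each gold item carries a known `true_label` field."""
    hits = sum(1 for p in positions
               if reviewer_labels[p] == items[p]["true_label"])
    return hits / len(positions)
```

The field names and the 10% rate are placeholders; the part that does the work is that the reviewer knows checks exist but cannot tell which items they are.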

Confidence scores are the place where I tend to get genuinely nervous. For most reviewers, "97% confidence" reads as permission not to look, which is the opposite of where you want their attention going, since the high-confidence items are by construction the ones a sleepy reviewer will accept fastest. Whatever a confidence score is doing in the interface, it should probably be doing it in a way that increases scrutiny on low-confidence items rather than excusing the high-confidence ones. A UI that lets a reviewer sort by "we're pretty sure" and batch-accept has its collapse designed in from the start.
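If confidence must surface at all, one defensive pattern is to show it only when it argues for more scrutiny, never less. A sketch of that policy, where the threshold value is an assumption to be tuned rather than a recommendation:

```python
def confidence_display(score: float, scrutiny_threshold: float = 0.8) -> dict:
    """Surface model confidence only where it should increase scrutiny.

    Low-confidence items get an explicit flag plus the score;
    high-confidence items get nothing, so the number can never read as
    permission not to look, and no sort-by-sure column can be built on it.
    """
    if score < scrutiny_threshold:
        return {"flag": "needs close review", "score": round(score, 2)}
    return {}  # deliberately silent: the reviewer must still look

assert confidence_display(0.97) == {}
assert confidence_display(0.55)["flag"] == "needs close review"
```

Whether total silence on high-confidence items is the right call will vary by domain; the invariant worth keeping is that the score only ever pushes attention toward an item, never away from one.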

None of this is a complete program and none of it is sufficient on its own. What it does, in combination, is push back slightly against a default gradient that otherwise runs toward offloading as soon as the tool gets good enough to make offloading feel like a reasonable thing to do.

The thing I end up saying to clients, because it is the part they remember, is that the more confident the machine looks the less carefully the human looks, and these two processes are not independent of one another. Every human-in-the-loop system I have watched fail in production has failed because someone designed it on the assumption that a model's accuracy and a reviewer's vigilance were two separate variables to be optimised separately. They are not. The coupling between them is strong enough to consume a non-trivial error budget if it is left unattended, and past some accuracy threshold it may simply be cheaper, and safer, to make an AI's output look less confident than it is than to try to make the human reviewing it more vigilant than they can sustainably be.


References

Bainbridge, L. (1983). Ironies of Automation. Automatica, 19(6), 775–779.

Goffman, E. (1981). Footing. In Forms of Talk. University of Pennsylvania Press.

Hutchins, E. (1995). Cognition in the Wild. MIT Press.

Mosier, K. L., & Skitka, L. J. (1996). Human Decision Makers and Automated Decision Aids: Made for Each Other? In R. Parasuraman & M. Mouloua (Eds.), Automation and Human Performance: Theory and Applications (pp. 201–220). Erlbaum.

Parasuraman, R., & Riley, V. (1997). Humans and Automation: Use, Misuse, Disuse, Abuse. Human Factors, 39(2), 230–253.
