The Ian Atha Museum of Internet Curiosities
The more confident the machine looks

Authors
  • Ian Atha (@IanAtha)
    Ian Atha is an Athens-based technologist, ex-OpenAI, building at the intersection of craft, code, and civic life.

Talk to anyone who has spent more than a year reviewing flagged content for a large platform and you will hear a version of the same complaint. The model that suggests labels gets steadily better, the reviewers who oversee it get steadily worse, and by the second quarter of the year throughput has improved while the catch rate on subtle errors has fallen off a cliff that nobody can quite date. This is not a training problem and it is not a motivation problem. Put any competent person in front of a mostly-correct automated suggestion and you reproduce the curve, more or less, regardless of what they happen to be reviewing.

The research on the effect predates LLMs by about three decades. Aviation human-factors research calls it automation bias, after Mosier and Skitka's experiments in the mid-nineties showing that pilots will accept a faulty cockpit recommendation even when their own instruments contradict it: commission errors when the aid is wrong, omission errors when the aid misses something. The distributed-cognition school, working from Hutchins's Cognition in the Wild, frames the same phenomenon as cognitive offloading: a cognitive task is never really performed by one brain working alone, and inserting an automated aid into the loop changes the membership of the system performing the task. Responsibility for scrutiny redistributes along with everything else, usually without anyone deciding it should. A reviewer's felt sense of "I own this output" tends to drop as soon as something else in the room appears willing to bear that responsibility, regardless of what the formal accountability diagram still claims.

The version of the argument that I keep coming back to, though, is Lisanne Bainbridge's. Her 1983 paper, Ironies of Automation, is six pages long and remains embarrassingly current forty years on. The setup: in any plant where a control system is being introduced, the easy cases get automated first, because they are the easy cases. What this leaves the human operator with is two classes of work that used to be intermixed and are now separated. The mostly-correct automated process needs monitoring, and the residual hard cases need handling on the rare occasions they arise. Both of these are now harder than they were before automation arrived, in slightly different ways.

Monitoring a mostly-correct process is, on the evidence of a large and consistent sustained-vigilance literature, the single thing humans are reliably worst at. The useful attention window in lab settings tops out somewhere around twenty minutes; after that, attention collapses regardless of how good the operator is or how motivated they are. The hard residual cases, when they arrive, arrive in the hands of an operator whose underlying skill has been eroding the entire time, because the easy cases were where intuition got built and maintained in the first place. Bainbridge's point is that automation does not reduce the importance of the operator's skill so much as concentrate the importance of that skill into precisely the moments when the skill has been allowed to atrophy.

Forty years later, this is also a description of every human-in-the-loop review pipeline I have ever seen in production.

The framing I find most useful for actual interface design comes from somewhere unexpected, which is Erving Goffman's 1981 essay Footing. Goffman's observation, slightly compressed, is that the speaker role in any utterance is not unitary. There is an author, whose position the words represent, and there is an animator, who physically utters them, and the two are very often the same person but are sometimes very pointedly not. A press secretary at a podium is animator only. The same person, telling you about their own weekend, is both at once.

When a model generates a suggestion and a human reviews it, the reviewer's footing has slid from author toward animator. They didn't write the words. They are deciding whether to let the words through. Authorship carries a sort of epistemic obligation that animation simply doesn't, and the shift between them happens invisibly, through interface design, so that the same reviewer in the same job feels meaningfully different about the same piece of work depending only on where the words on the screen originated. This is most of why the common intuition that a human reviewer is a "safety net" under an AI system fails so often in practice. The net is only really taut when the person holding it still feels they are partly the source of what they're letting through.

The thing all of this points at, taken together, is that the location of epistemic responsibility matters more than accuracy. A 95%-accurate aid that puts its reviewer into animator mode produces worse joint outcomes than a 70%-accurate aid that keeps the reviewer in author mode, because the system's effective error rate is never the model's error rate alone but the joint rate of model plus reviewer, and a disengaged reviewer's contribution to that joint rate falls off toward zero. Which means the joint rate falls off toward whatever the model can do unaided. Which is a long way of saying that an organisation can spend a great deal of money building the review-less system that the procurement committee was assured nobody was building.
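The arithmetic behind that claim is simple enough to write down. A minimal sketch, with catch rates that are illustrative assumptions rather than measurements: an error survives the joint system only when the model makes it and the reviewer fails to catch it.

```python
def joint_error_rate(model_accuracy: float, reviewer_catch_rate: float) -> float:
    """Fraction of items on which an error survives both the model and
    the reviewer: the model must err AND the reviewer must miss it."""
    model_error = 1.0 - model_accuracy
    return model_error * (1.0 - reviewer_catch_rate)

# Illustrative numbers (assumptions, not measurements):
# a 95%-accurate aid whose disengaged reviewer catches 5% of its errors
disengaged = joint_error_rate(0.95, 0.05)   # ~4.75% joint error
# versus a 70%-accurate aid whose engaged reviewer catches 90%
engaged = joint_error_rate(0.70, 0.90)      # 3% joint error
assert engaged < disengaged
```

The exact numbers matter less than the shape: as the reviewer's catch rate goes to zero, the joint rate converges on the model's unaided error rate, whatever the org chart says about oversight.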

A second-order effect of this is one that keeps surprising teams I work with: making the model better often makes the joint system worse. Better suggestions accelerate the footing shift faster than the model's error rate drops to compensate for it. While the suggestions are visibly unreliable, the reviewer keeps reviewing. Once the suggestions become routinely correct, review effectively stops, and the system inherits whatever the model fails on, in full, without a second pair of eyes.
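One way to see why is a toy model in which vigilance collapses once the aid crosses a "routinely correct" threshold. The threshold and the catch rates below are assumptions chosen to show the shape of the effect, not values fitted to any data:

```python
def reviewer_catch_rate(model_accuracy: float,
                        trust_threshold: float = 0.90) -> float:
    """Toy vigilance model (illustrative assumption): the reviewer keeps
    reviewing while the aid is visibly unreliable, then effectively
    stops once its suggestions become routinely correct."""
    return 0.90 if model_accuracy < trust_threshold else 0.05

def joint_error(model_accuracy: float) -> float:
    """Joint error rate when vigilance depends on model accuracy."""
    missed = 1.0 - reviewer_catch_rate(model_accuracy)
    return (1.0 - model_accuracy) * missed

# Improving the model from 85% to 92% accuracy makes the joint system worse,
# because crossing the trust threshold costs more than the accuracy gain buys:
assert joint_error(0.85) < joint_error(0.92)   # 1.5% vs 7.6%
```

A step function is a caricature, of course; the real vigilance curve is presumably smoother. But any curve that drops faster than the error rate around some trust threshold produces the same non-monotonic region.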

What follows from this for actual interface work is fairly direct, even if every team I have watched skips most of it. Bulk-accept buttons in particular should not exist; the moment the interface offers an "accept all remaining" affordance the reviewer is fully in animator mode and the design has, for purposes of catching anything subtle, already failed. Per-item friction, irritating as that sounds when one is trying to ship a productivity tool, is what most reliably keeps the authorial attitude alive. Anything generated by the aid should also look pre-attentively different from anything a human authored or confirmed, with the specifics of how (a dashed border, a muted background) mattering much less than whether the eye can sort the screen at a glance into machine-output and human-output. Goffman's footing is invisible if you don't make it visible.

The underlying data being labeled needs to be present in the same view as the label, at a size someone can actually read it at, and this is so frequently violated in shipped review tools that flagging it has stopped feeling productive. The common pattern is a tight scrollable list of suggested actions with the raw material a click away. One click away is already too far away; author mode requires that examining the evidence be cheaper, in attention-cost, than skipping it. Inserting periodic spot-checks into the queue — items the system has classified correctly, presented as fresh, with agreement tracked over time — is one of the few interventions against the Bainbridge vigilance problem that demonstrably works in the field, and it works largely because the reviewer knows it is happening.
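The spot-check mechanism itself is not complicated. A hedged sketch, assuming a `gold` pool of items whose correct labels are already known, mixed into the live queue so they are indistinguishable from fresh work:

```python
import random

def interleave_spot_checks(queue, gold, rate=0.1, rng=None):
    """Insert known-answer items into the queue at random positions.

    Returns (items, gold_positions) so agreement can be scored later.
    `rate` is the fraction of the queue to add as spot-checks.
    """
    rng = rng or random.Random()
    items = list(queue)
    positions = set()
    n_gold = max(1, int(len(items) * rate))
    for g in rng.sample(gold, min(n_gold, len(gold))):
        pos = rng.randrange(len(items) + 1)
        items.insert(pos, g)
        # shift previously recorded positions that the insert pushed down
        positions = {p + 1 if p >= pos else p for p in positions}
        positions.add(pos)
    return items, positions

def agreement(reviewer_labels, items, positions):
    """Fraction of spot-check items the reviewer labeled correctly,
    assuming each gold item carries a known `true_label` field."""
    hits = sum(1 for p in positions
               if reviewer_labels[p] == items[p]["true_label"])
    return hits / len(positions)
```

The field names and the 10% rate are placeholders; the part that does the work is that the reviewer knows checks exist but cannot tell which items they are.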

Confidence scores are the place where I tend to get genuinely nervous. For most reviewers, "97% confidence" reads as permission not to look, which is the opposite of where you want their attention going, since the high-confidence items are by construction the ones a sleepy reviewer will accept fastest. Whatever a confidence score is doing in the interface, it should probably be doing it in a way that increases scrutiny on low-confidence items rather than excusing the high-confidence ones. A UI that lets a reviewer sort by "we're pretty sure" and batch-accept has its collapse designed in from the start.
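If confidence must surface at all, one defensive pattern is to show it only when it argues for more scrutiny, never less. A sketch of that policy, where the threshold value is an assumption to be tuned rather than a recommendation:

```python
def confidence_display(score: float, scrutiny_threshold: float = 0.8) -> dict:
    """Surface model confidence only where it should increase scrutiny.

    Low-confidence items get an explicit flag plus the score;
    high-confidence items get nothing, so the number can never read as
    permission not to look, and no sort-by-sure column can be built on it.
    """
    if score < scrutiny_threshold:
        return {"flag": "needs close review", "score": round(score, 2)}
    return {}  # deliberately silent: the reviewer must still look

assert confidence_display(0.97) == {}
assert confidence_display(0.55)["flag"] == "needs close review"
```

Whether total silence on high-confidence items is the right call will vary by domain; the invariant worth keeping is that the score only ever pushes attention toward an item, never away from one.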

None of this is a complete program and none of it is sufficient on its own. What it does, in combination, is push back slightly against a default gradient that otherwise runs toward offloading as soon as the tool gets good enough to make offloading feel like a reasonable thing to do.

The thing I end up saying to clients, because it is the part they remember, is that the more confident the machine looks the less carefully the human looks, and these two processes are not independent of one another. Every human-in-the-loop system I have watched fail in production has failed because someone designed it on the assumption that a model's accuracy and a reviewer's vigilance were two separate variables to be optimised separately. They are not. The coupling between them is strong enough to consume a non-trivial error budget if it is left unattended, and past some accuracy threshold it may simply be cheaper, and safer, to make an AI's output look less confident than it is than to try to make the human reviewing it more vigilant than they can sustainably be.


References

Bainbridge, L. (1983). Ironies of Automation. Automatica, 19(6), 775–779.

Goffman, E. (1981). Footing. In Forms of Talk. University of Pennsylvania Press.

Hutchins, E. (1995). Cognition in the Wild. MIT Press.

Mosier, K. L., & Skitka, L. J. (1996). Human Decision Makers and Automated Decision Aids: Made for Each Other? In R. Parasuraman & M. Mouloua (Eds.), Automation and Human Performance: Theory and Applications (pp. 201–220). Erlbaum.

Parasuraman, R., & Riley, V. (1997). Humans and Automation: Use, Misuse, Disuse, Abuse. Human Factors, 39(2), 230–253.
