The Open Mic Problem

Picture next year's normal. You are making coffee, and without looking up you say, "move my three o'clock to Thursday and pay the electric bill before the late fee." A beat later, it's done. No screen, no taps, no little box asking if you are sure. That is the pitch for voice agents, and honestly, it's a good one. The same sentence that used to be ten taps across three apps is now something you mutter at a kitchen appliance.

The catch is that the thing making it feel magic is exactly the thing that should scare us. The only barrier left between your bank account and the rest of the world is a microphone that never sleeps and was built, on purpose, to do what it is told.

We have spent forty years learning to secure software you look at. Firewalls, permission dialogs, padlock icons, the little URL preview when you hover a link, the "are you sure you want to delete this?" speed bump. Almost every one of those defenses quietly assumes there is a screen, that you are reading it, and that you get a moment to react. Take the screen away and most of our security instincts are standing on air. Voice does not just change the interface. It removes the floor those defenses were standing on.

Let me make the case that this is not three product bugs we can patch one by one, but one property of the medium itself.

A screen makes two promises. Voice keeps neither.

The first promise a screen makes is that you and the machine are looking at the same thing. When a phishing email lands, you and your computer are staring at identical pixels. The attack has to fool you, a human, with human eyes. That shared view is the bedrock of a very old security idea called the trusted path: the assurance that what you perceive is what the system perceives, and that you are really talking to the system you think you are.

Voice quietly tears that up. A microphone hears things you cannot. Researchers showed in 2017, in an attack they named DolphinAttack, that you can hide commands in ultrasound, above the range of human hearing, and a voice assistant will happily obey them while the person standing right there hears nothing at all. You can tuck a trigger inside a song, a podcast, the background hum of an ad. For the first time at consumer scale, we have an interface where the human and the machine receive different signals from the same moment. You hear music. Your agent hears "transfer the money." That doesn't sound right, because it isn't, and the unsettling part is that it doesn't sound like anything to you at all.

The second promise a screen makes is that you can stop. Nearly every dangerous action on a screen has a speed bump in front of it: a confirmation, a preview, a diff, a two-second pause where your gut says wait. That friction is not an accident. It is the security. And friction is precisely what voice was invented to delete. The entire value of saying "pay the bill" instead of tapping through a banking app is that there is no app, no taps, no pause. You cannot drop a confirmation dialog into a conversation without ruining the conversation. Try it: imagine your agent reading aloud the full terms of a wire transfer before every payment. Nobody would use it, so nobody will build it that way. What we will get instead is consent theater: a quick "okay, done!" that races past the one moment you might have said no.

So before we have added a single attacker, voice has already broken the two assumptions that hold up most of our defenses. You no longer see what the machine sees, and you no longer get to pause before it acts.

The Silent Trifecta

There is a well-known idea in AI security called the lethal trifecta: give a model access to private data, expose it to untrusted content, and let it talk to the outside world, and you have built a machine that can be tricked into robbing you. It is usually drawn as three overlapping circles, with disaster sitting in the middle where all three meet.
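For readers who think in code, the rule can be written down in a few lines. This is only an illustrative sketch: the `AgentCapabilities` type and its field names are invented here, not taken from any real framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    """Hypothetical description of what an agent is allowed to touch."""
    reads_private_data: bool        # e.g. mail, calendar, bank balance
    sees_untrusted_content: bool    # e.g. web pages, inbound messages, audio
    communicates_externally: bool   # e.g. can send mail or make requests


def has_lethal_trifecta(caps: AgentCapabilities) -> bool:
    """True only when all three circles overlap."""
    return (
        caps.reads_private_data
        and caps.sees_untrusted_content
        and caps.communicates_externally
    )


# An assistant that reads your inbox, hears the room, and can send money.
kitchen_agent = AgentCapabilities(True, True, True)
# The same assistant with outbound communication removed.
muted_agent = AgentCapabilities(True, True, False)
```

Under these assumptions `has_lethal_trifecta(kitchen_agent)` is true and `has_lethal_trifecta(muted_agent)` is false: removing any one circle takes the agent out of the middle of the diagram.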

Voice deserves its own version, because voice makes each danger worse and, crucially, makes it invisible. I have started calling it the Silent Trifecta: Unheard, Unseen, Unknown.

Unheard is the input you cannot perceive. The mic hears more than you do, so the attack arrives on a channel you are deaf to.

Unseen is the action you cannot preview. There is no screen to show you the wire transfer, and no time to stop it, because the whole point was to be fast and hands-free.

Unknown is the speaker you cannot verify. Voice can be cloned from a few seconds of audio, and anyone within earshot, including your television, can issue a command. In 2017, a Burger King ad deliberately woke up Google Home speakers in living rooms across the country and made them read out a description of the Whopper. It was a stunt. The next one won't be. In a voice world, earshot becomes authorization, which is a terrifying thing to make authorization out of.

[Figure: The Silent Trifecta. Three overlapping circles labeled Unheard, Unseen, and Unknown; disaster sits where all three overlap.]

Put a capable agent, one that can spend money, send mail, unlock doors, in the center where all three circles overlap, and you do not have a vulnerability you can patch. You have a medium that is hostile to the very idea of a checkpoint. It listens on a channel you can't monitor, acts in a form you can't preview, on the word of someone it can't identify, faster than you can object.

Our old tools don't fit the new shape

Walk down the list of things we normally reach for and watch them come up short. Audit logs feel like an answer until you remember that a transcript is not the recording. The attack lives in what the transcriber threw away: the inaudible carrier, the timing, the subtle distortion. The log will show a clean, sensible sentence and tell you nothing about how it really arrived. Multi-factor authentication assumes a second device and a spare moment to tap it, both of which the hands-free dream specifically removes. Rate limiting and deliberate friction are the obvious defenses, and they are also exactly what a voice product cannot ship without becoming the slow, annoying app it was built to replace. Even our oldest mantra, "what you see is what you get," curdles into "what you hear is not what it hears."

And we have not even reached the strange part yet. Soon your agent will not only listen to you. It will talk to other agents. It will call your bank and sit on hold for you, answer your phone, negotiate a refund, book the plumber. Two AIs will hammer out the details of your week over a synthetic phone call, each of them, by design, persuadable, each of them able to be talked into things by a confident voice. We spent the last decade worrying about a dead internet of text bots. Get ready for the dead internet you can hear, conducted entirely in voices that sound completely human and trust each other completely.

So what do we actually do?

I am not arguing that we should not build voice agents. That ship has not just sailed; it is already taking your dinner reservation. I am arguing that we cannot port the screen era's safety net and call it done, because the net was woven for a shape voice does not have.

What we need is a trusted path for the ear, and the irony is that it probably routes through the boring old screen. The high-stakes confirmation should not happen in the conversation at all. It should jump out of band, to your phone, your watch, a device that can show you the actual amount and the actual recipient and make you tap. The agent itself should treat every voice as untrusted by default, the way a well-written web server treats every byte from the network as hostile until proven otherwise. Earshot is not authorization, and we have to design as if anyone, and anything, might be talking. The agent should know the difference between a thing it can do and a thing it should refuse to do on voice alone, no matter how confidently it was asked.
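To make that concrete, here is a minimal sketch of such a gate. Everything in it is an assumption made for illustration: the action names, the `LOW_RISK` and `HIGH_RISK` sets, and the `confirm_on_device` callback, which stands in for whatever phone or watch prompt a real product would use. Note what is missing on purpose: nothing about who was speaking feeds into the decision, because the voice is treated as untrusted either way.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Decision(Enum):
    ALLOW = "allow"                              # fine on voice alone
    CONFIRM_OUT_OF_BAND = "confirm_out_of_band"  # needs a tap on a screen
    REFUSE = "refuse"                            # never on voice alone


@dataclass(frozen=True)
class VoiceRequest:
    action: str           # hypothetical action name, e.g. "pay_bill"
    amount: float = 0.0
    recipient: str = ""


# Hypothetical policy table; a real product would define its own.
LOW_RISK = {"set_timer", "read_calendar"}
HIGH_RISK = {"pay_bill", "send_email", "unlock_door"}


def decide(request: VoiceRequest) -> Decision:
    """Classify a request by what it does, not by who seemed to ask."""
    if request.action in LOW_RISK:
        return Decision.ALLOW
    if request.action in HIGH_RISK:
        return Decision.CONFIRM_OUT_OF_BAND
    return Decision.REFUSE  # unknown actions are refused by default


def handle(
    request: VoiceRequest,
    confirm_on_device: Callable[[VoiceRequest], bool],
) -> str:
    """Run the gate; high-risk actions leave the conversation to confirm."""
    decision = decide(request)
    if decision is Decision.ALLOW:
        return "executed"
    if decision is Decision.CONFIRM_OUT_OF_BAND:
        # The device shows the actual amount and recipient, then asks for a tap.
        return "executed" if confirm_on_device(request) else "declined"
    return "refused"
```

With this sketch, `handle(VoiceRequest("pay_bill", 120.0, "Electric Co"), confirm_on_device=lambda r: False)` returns `"declined"`, however confident the voice was. The payment goes through only if the person holding the second device taps yes.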

None of that is as slick as just saying "pay the bill." Good security rarely is. But the alternative is shipping the most capable software we have ever built behind a permanently open mic, and hoping the only person who ever steps up to it is you.

The most dangerous words in computing used to be "click here." Soon they will be whatever your coffee maker happens to overhear. We should talk about this now, while we still can, and ideally somewhere the agent isn't listening.