It’s almost hard to remember what it was like before the whole world seemed to be powered by AI. Whether you like it or not, the tech industry is betting AI can improve everything from your iPhone to your relationship with animals. And the hype doesn’t stop there.
As 2025 inches closer, there is no shortage of AI-powered product launches. OpenAI just announced Sora, its long-awaited photorealistic video generator, as part of “Shipmas,” a 12-day series of product releases whose name rhymes with Christmas. Reddit launched an AI search tool that helps you tap into its hive mind without using Google. That shouldn’t bother Google, which has a new quantum computing chip called Willow that promises to supercharge what AI can do.
Then there’s Microsoft. After investing $13 billion in OpenAI and becoming an early mover in the generative AI race a couple of years ago, Microsoft has been building out its stable of AI offerings in recent months. In March, the company hired most of the team behind Inflection AI and appointed its co-founder, Mustafa Suleyman, as CEO of Microsoft AI, overseeing its consumer AI products, including Copilot, Bing, and Edge.
Microsoft’s big early-December product launch doesn’t quite match Sora in terms of scale and buzz, but the new AI tool does do something entirely original.
It’s called Copilot Vision. The basic idea is that Vision allows Copilot, Microsoft’s AI-powered chatbot, to see what you’re seeing in an internet browser. Microsoft calls Copilot an “AI companion,” and with Vision, the label makes sense. If you’re shopping for furniture on Wayfair, for instance, you can ask Copilot to find something with a bit of a Memphis design vibe, even if you have no idea what a “Memphis design vibe” means. Copilot scans the entire webpage for images that match what you’re asking for and points you in that direction. In other words, it can see what you’re seeing on Wayfair, and it can answer all your questions about it. It’s unlike any web browsing experience I’ve ever had.
There are a lot of caveats here. Copilot Vision is rolling out in preview for a limited number of Copilot Pro subscribers who have also signed up for Copilot Labs. You have to use the Edge browser, and it only works on certain websites. Microsoft also deletes all of the information from every session after you’re done, which helps protect your privacy.
I’ve spent a few days using Copilot Vision, and I’ll admit, it’s a little weird talking to a robot about the website you’re looking at. That said, the AI is great at summarizing articles on Wikipedia, and it’s fun to explore maps on TripAdvisor. But because it only works on about a dozen websites right now, its abilities feel pretty constrained.
To me, though, the ambition here feels somewhat limitless. What if an AI assistant could see your whole world, know what was what, and help you navigate it? That’s the big idea behind Copilot Vision. Letting an AI watch you browse the web is just a step toward putting an AI companion on your shoulder everywhere you go.
In a recent interview, Suleyman explained why collaborating with AI is the future of computing. This is frankly something I’ve heard tech companies pitch for years. But the way things are heading, I’m finally starting to buy it.
Our conversation has been lightly edited for clarity and length.
How do you feel about where things stand at the moment? Are you surprised at how mainstream AI is in 2024 or did you think it would be further ahead or behind?
Part of me feels like it’s been frustratingly slow and we could be making more progress faster. But part of me is also just overwhelmed with how awesome these models are. For the first time in history, we actually have more science than we know how to apply in technology and products. These large language models that we’ve been developing, we’re only just beginning to sort of understand the limits that they have and what they can’t do. Every week I see people unlocking new capabilities.
It’s probably the most creative time in technology, in the tech industry, that I can think of for actually inventing and creating new experiences, and that’s what I’ve always been passionate about. Like, how do I create a personal AI or personal AI companion? How do I really make people feel like there is a smooth, fluent, conversational companion in their corner that helps them go about their daily lives? And now I actually have the clay to be able to sculpt this new species at my fingertips.
What kind of hang-ups do you think people have about using AI?
Most of the time people are asking themselves, “What do I use this for?” Like any new, general-purpose technology which can do anything, it kind of leaves the user thinking, “Well, if it can do anything, what am I going to do with it?” And so the way that we’ve designed Copilot is to help lead the conversation. It asks great questions. It’s inquiring. It listens actively while you’re in the voice mode. It interrupts at the right time. It has a different intonation and pace depending on the subject matter of the conversation.
We’ve tried to work around some of the limitations of what we would otherwise call the “cold start problem” (knowing how to get the most out of a technology, or where to begin a conversation) by creating a much more fluent and smooth conversational interaction.
Voice assistants, like Siri and Alexa, have been around for a while. I know a lot of people who don’t use them every day. I don’t use them every day. Do you think that there needs to be a breakthrough moment — and maybe this is the moment — to get people to talk to their computers and to have that feel natural?
I mean, I use voice every day now. It’s the first thing I do when I think to search for something. It is so much faster. It’s so much easier. It’s so much more accurate than having to type into your phone or onto your keyboard. And the crazy thing about it is that it can then ask you a question back and keep the dialogue going. And it’s a totally new modality.
For the first time, it actually works. In the past, voice commands were limited to, like, a few fixed phrases, like turn on the lights, or what’s the capital of [France], and they never really worked very well. Today, it is just like having a conversation with a friend, and that unlocks a different type of interaction. You’re no longer limited by having to formulate your thoughts into a search query, read a bunch of links when they arrive on the results page, and then go and look at that web page. You can just actually ask Copilot, as though you’re talking to a friend or a knowledgeable expert or an adviser.
We’re seeing huge growth in voice-first interactions: much longer sessions, a much wider variety of topics being covered, much more frequent use, because you can just whip it out and within one click get into a dialogue on your headphones with a super knowledgeable, supportive expert. So yes, I definitely think it is the future of computing.
We’ve been browsing the web the same way for as long as the web has existed, basically. Why do you think we’ve been stuck?
We’ve been stuck in search and browser for 20 years, in the same paradigm, and it’s because computers haven’t been able to speak our language. We’ve had to learn their language, right? And that interface layer mediated our communication, and that layer is buttons and search queries. Search queries don’t represent the richness of your ideas in conversation. You’re forced to express things in short form, in a very curiously odd, limiting way, right? And you can’t ask any follow-up. Whereas now, computers understand our language, they speak plain English — fast, interactive, kind, and supportive.
Voice is clearly a totally new paradigm, and the interface always changes what you can do with a piece of technology. So now that we have a brand-new interface, it will open up a whole new range of possibilities for the types of questions that you ask and the types of learning experiences that you go on.
Which brings us to Copilot Vision. What does that new range of possibilities look like?
Copilot Vision is a mind-blowing experience. This is a first in the industry. We are going to be the first to launch, at scale, general-purpose visual understanding of your entire web page. And this is transformative because it means that you can say to your Copilot, “What is this?” or, “What is that?” or, “Look over there. What does that thing look like?” And that ambiguous reference — the this or that or there — is a much more human way of thinking about interacting with the world.
You do that when you’re standing next to a friend or a colleague or a family member, and you’re both looking at the same plant or the same sofa, or looking at the same dress. And that is what a companion feels like. That’s what we’ve been designing with Copilot Consumer: these visual, interactive, companion-like experiences, and it is just mind-blowing. It is totally different. When people get a chance to really use it for themselves, they’ll see that it is just a completely different way of interacting with a computer.
It seems like the natural next step here would be to get Copilot Vision to actually do stuff for you on the web instead of just describing things for you.
You’re absolutely right. The first step is that your AI companion should be able to hear what you hear and talk to you in your language. The second step is that it should see what you see and be able to talk to you about what you’re both seeing at the same time. The third step is that it should be able to take actions on your behalf, fill in forms, buy things, book things, plan, navigate, click on drop-down menus, and so on. All of that is coming down the road, in the foreseeable future.
Hallucinations are a big drawback of a lot of generative AI. Does that happen with Vision, or is it different because it’s dealing with a set amount of information?
Humans get stuff wrong. Sometimes when they’re looking into the distance, they mislabel a color, or they get the genre of a piece of furniture wrong. Sometimes that happens, but very rarely. It’s actually more accurate than your average person at identifying those kinds of objects in the images because it’s seen so much more data. It’s had so much more experience. So it’s not like there’s zero hallucinations, but it isn’t a significant limitation of the experience.
Looking ahead to the evolution of hardware, how do you think moving from phones to glasses or headsets will steer our relationship with AI?
I think people are increasingly exhausted with their screens, with having to pull out a screen to type something into it or take a photo with it. It’s an additional layer of friction and an intrusion into being present in the moment. So I definitely think that various kinds of wearables are going to become more and more present in the next few years, and we’re certainly thinking about that very carefully.
AI seems great for the workplace. It’s very helpful at summarizing meetings or making presentations. But I’m having a harder time making it make sense at home. What am I missing?
Think about this: Any time you go to search something, any question that comes to mind, whether it’s a restaurant or the time of day in another country, or where you should buy your next car, or, you know, what the answer is to some general knowledge question, or what happened in the news yesterday, what the sports score was — all of those things are now faster and easier and more interactive and more conversational with Copilot. So I think your first thought should be, let me just ask my AI companion, because the answer is much more digestible. It’s much more succinct, and it’s instantly available for the follow-up question, if you want to go that far.