What it means that new AIs can “reason”

Vox 
In this photo illustration, the sign of OpenAI o1, the first in a planned series of “reasoning” models that have been trained to answer more complex questions, is displayed on a smartphone screen on September 13, 2024, in Suqian, Jiangsu Province of China.

An underappreciated fact about large language models (LLMs) is that they produce “live” answers to prompts. You prompt them and they start talking in response, and they talk until they’re done. The result is like asking a person a question and getting a monologue back in which they improvise their answer sentence by sentence.

This explains several of the ways in which large language models can be so frustrating. The model will sometimes contradict itself even within a paragraph, saying something and then immediately following up with the exact opposite because it’s just “reasoning aloud” and sometimes adjusts its impression on the fly. As a result, AIs need a lot of hand-holding to do any complex reasoning.


One well-known way to solve this is called chain-of-thought prompting, where you ask the large language model to effectively “show its work” by “thinking” out loud about the problem and giving an answer only after it has laid out all of its reasoning, step by step.

Chain-of-thought prompting makes language models behave much more intelligently, which isn’t surprising. Compare how you’d answer a question if someone shoves a microphone in your face and demands that you answer immediately to how you’d answer if you had time to compose a draft, review it, and then hit “publish.”
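To make the idea concrete, here is a minimal sketch of chain-of-thought prompting using the OpenAI Python SDK. The model name, the example question, and the exact prompt wording are illustrative assumptions, not details from this article, and this is manual prompting of an ordinary chat model rather than a description of how o1 works internally.

```python
# Minimal chain-of-thought prompting sketch (illustrative; the model name,
# question, and prompt wording are assumptions, not details from the article).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "A train leaves at 3:40 pm and the trip takes 2 hours 35 minutes. "
    "When does it arrive?"
)

# Plain prompt: the model answers "live", committing to an answer right away.
plain = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name for illustration
    messages=[{"role": "user", "content": question}],
)

# Chain-of-thought prompt: ask the model to lay out its reasoning step by step
# and only state the final answer once the reasoning is written out.
cot = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": (
            question
            + "\n\nThink through this step by step, showing your reasoning, "
              "and only then give the final answer on its own line."
        ),
    }],
)

print("Plain answer:\n", plain.choices[0].message.content)
print("Chain-of-thought answer:\n", cot.choices[0].message.content)
```

The only difference between the two calls is the added instruction to reason before answering; that extra “drafting” space is the same basic idea that o1 builds into the model itself.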

The power of think, then answer

OpenAI’s latest model, o1 (nicknamed Strawberry), is the first major LLM release with this “think, then answer” approach built in. 

Unsurprisingly, the company reports that the method makes the model a lot smarter. In a blog post, OpenAI said o1 “performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13 percent of problems, while the reasoning model scored 83 percent.”

This major improvement in the model’s ability to think also intensifies some of the dangerous capabilities that leading AI researchers have long been on the lookout for. Before release, OpenAI tests its models for capabilities involving chemical, biological, radiological, and nuclear weapons, the abilities that would be most sought after by terrorist groups that lack the expertise to build such weapons with current technology.

As my colleague Sigal Samuel wrote recently, OpenAI o1 is the first model to score “medium” risk in this category. That means that while it’s not capable enough to walk, say, a complete beginner through developing a deadly pathogen, the evaluators found that it “can help experts with the operational planning of reproducing a known biological threat.” 

These capabilities are one of the most clear-cut examples of AI as a dual-use technology: a more intelligent model becomes more capable in a wide array of uses, both benign and malign.

If future AI does get good enough to tutor any college biology major through the steps involved in recreating, say, smallpox in the lab, that could mean catastrophic casualties. At the same time, AIs that can tutor people through complex biology projects will do an enormous amount of good by accelerating lifesaving research. It is intelligence itself, artificial or otherwise, that is the double-edged sword.

The point of doing AI safety work to evaluate these risks is to figure out how to mitigate them with policy so we can get the good without the bad.

How to (and how not to) evaluate an AI

Every time OpenAI or one of its competitors (Meta, Google, Anthropic) releases a new model, we retread the same conversations. Some people find a question on which the AI performs very impressively, and awed screenshots circulate. Others find a question on which the AI bombs — say, “how many ‘r’s are there in ‘strawberry’” or “how do you cross a river with a goat” — and share those as proof that AI is still more hype than product. 

Part of this pattern is driven by the lack of good scientific measures of how capable an AI system is. We used to have benchmarks that were meant to describe AI language and reasoning capabilities, but the rapid pace of AI improvement has gotten ahead of them, with benchmarks often “saturated.” This means AI performs as well as a human on these benchmark tests, and as a result they’re no longer useful for measuring further improvements in skill.

I strongly recommend trying AIs out yourself to get a feel for how well they work. (OpenAI o1 is only available to paid subscribers for now, and even then is very rate-limited, but there are new top model releases all the time.) It’s still too easy to fall into the trap of trying to prove a new release “impressive” or “unimpressive” by selectively mining for tasks where they excel or where they embarrass themselves, instead of looking at the big picture. 

The big picture is that, across nearly all tasks we’ve invented for them, AI systems are continuing to improve rapidly, but the incredible performance on almost every test we can devise hasn’t yet translated into many economic applications. Companies are still struggling to identify how to make money off LLMs. A big obstacle is the inherent unreliability of the models, and in principle an approach like OpenAI o1’s — in which the model gets more of a chance to think before it answers — might be a way to drastically improve reliability without the expense of training a much bigger model. 

Sometimes, big things can come from small improvements 

In all likelihood, there isn’t going to be a silver bullet that suddenly fixes the longstanding limitations of large language models. Instead, I suspect they’ll be gradually eroded over a series of releases, with the unthinkable becoming achievable and then mundane over the course of a few years — which is precisely how AI has proceeded so far. 

But as ChatGPT — which itself was only a moderate improvement over OpenAI’s previous chatbots but which reached hundreds of millions of people overnight — demonstrates, technical progress being incremental doesn’t mean societal impact is incremental. Sometimes the grind of improvements to various parts of how an LLM operates — or improvements to its UI so that more people will try it, like the chatbot itself — pushes us across the threshold from “party trick” to “essential tool.”

And while OpenAI has come under fire recently for ignoring the safety implications of its work and silencing whistleblowers, its o1 release seems to take the policy implications seriously, including collaborating with external organizations to check what the model can do. I’m grateful that they’re making that work possible, and I have a feeling that as models keep improving, we will need such conscientious work more than ever.

A version of this story originally appeared in the Future Perfect newsletter.
