Nobody predicted the AI revolution, except for the 352 experts who were asked to predict it.
In 2016, three years before OpenAI released GPT-2 and the world went crazy, an independent researcher named Katja Grace cold-emailed the world’s leading AI scientists. She had some questions. A lot of questions, actually. When will AI be able to fold laundry? Write high school essays? Beat humans at Angry Birds? Why doesn’t the public understand AI? Will AI be good or bad for the world? Will it kill all humans?
The world’s leading AI scientists are a surprisingly accommodating group. Three hundred fifty-two of them took time out of their busy schedules to answer, producing a unique time capsule of expert opinion on the cusp of the AI revolution.
Last year, AI started writing high school essays (laundry folding and Angry Birds remain unconquered). Media called the sudden rise of ChatGPT “shocking,” “breathtaking,” and “mind-blowing.” I wondered how it looked from inside the field. How did the dazzling reality compare to what experts had predicted on Grace’s survey six years earlier?
Looking at the most zoomed-out summary — whether they underestimated progress, over-hyped it, or got it just right — it’s hard to come to any conclusion other than “just right.”
The survey asked about 32 specific milestones. Experts were asked to predict the milestones in several ways. In what year did they think it was as likely as not that AI would reach the milestone? In what year did they think there was even a 10% chance AI would reach it? A 90% chance? What did they think was the chance AI would reach the milestone by 2026? By 2036? I focus on their median prediction of when AI will reach the milestone.
It’s hard and subjective to figure out exactly when AI first achieved something, but grading the list as best I can:
Another way of framing “50% confidence level” is “you’re about equally likely to get it too early as too late.” The experts got six of these milestones too early and six too late, showing no consistent bias towards optimism or pessimism.
And when they were wrong, they were only wrong by a little bit. Grace asked the experts to give their 90% confidence interval. Here the experts were wrong only once — they were 90% sure AI would have beaten humans at the video game Angry Birds by now, but it hasn’t.
The accuracy here is mind-boggling. In 2016, these people were saying, “Yes, AI will probably be writing high school history essays in 2023.” I certainly didn’t expect that, back in 2016! I don’t think most journalists, tech industry leaders, or for that matter high school history teachers would have told you that. But this panel of 352 experts did!
I would be in awe of these people, if not for the second survey.
Prediction Is Very Difficult, Especially About the Past
The six years between 2016 and 2022 were good ones for AI, forecasting, and Katja. AI got billions of dollars in venture capital investment, spearheaded by fast-growing startup OpenAI and its superstar GPT and DALL-E models. The science of forecasting, which only reached public attention after the publication of Philip Tetlock’s Superforecasting in late 2015, took off, and started being integrated into government decision-making. As for Katja, her one-person AI forecasting project grew into an eight-person team, with its monthly dinners becoming a nexus of the Bay Area AI scene.
In summer 2022, she repeated her survey. The new version used the same definition of “expert” (a researcher who had published at the prestigious NeurIPS or ICML conferences) and got about the same response rate (17% in 2022 compared to 21% in 2016). The new survey asked the same questions with the same wording. Most of the experts were new, but about 6% (45 out of 740) were repeats from the previous round. You can never step in the same river twice, but this survey tried hard to perfectly match its predecessor.
This time, nine events happened earlier than the experts thought, and zero happened later, or on time. In fact, eight of the nine happened outside their 90% confidence interval, meaning the experts thought there was less than a 10% chance they would happen as early as they did!
But actually it’s much worse than that. In 2019, a poker AI called Pluribus beat human players — including a World Series of Poker champion — at Texas hold ’em (the Scientific American article was called “Humans Fold: AI Conquers Poker’s Final Milestone”). All three of the judges agreed that this satisfied milestone 31: “Play well enough to win the World Series of Poker.” Still, Katja wanted to make her survey exactly like the 2016 version, so she included this and several other already-achieved milestones. The experts predicted it wouldn’t happen until 2027. Same with image categorization and Python Quicksort — both happened in 2021; in both cases the 2022 experts predicted it would take until 2025. Yogi Berra supposedly said that “prediction is very difficult, especially about the future.” But in this case the 2016 panel predicted the future just fine. It was the 2022 panel that flubbed predictions about things that had already happened!
Maybe this was an unfair trick question? It wasn’t impossible to answer zero (a few respondents did!), but maybe it was so strange to see already-achieved milestones on a survey like this that the experts started doubting their sanity and assumed they must be misunderstanding the question. By extreme good luck, we have a control group we can use to answer this question. Several of the milestones were first achieved by ChatGPT, which came out just three months after the survey ended. These weren’t trick questions — they hadn’t been achieved as of survey release — but the correct answer would have been “basically immediately.” Did the experts get this correct answer?
No. The judges ruled that ChatGPT satisfied five new milestones. The experts’ predictions for how long it would take an AI to achieve these milestones (remember, the right answer was three months) were five, four, five, 10, and nine years, about the same as they gave any other hard problem.
And there was a truly abysmal correlation (around 0.1-0.2, depending on how you calculate it) between the tasks experts thought would be solved fastest, and the ones that actually got solved. The task experts thought would fall soonest was (once again) Angry Birds. And among the tasks that have remained unconquered, even as AI has made astounding progress in so many other areas of life, is (once again) Angry Birds.
(The transhumanists say that one day superintelligent AIs running on cryogenic brains the size of Jupiter will grant us nanotechnology, interstellar travel, and even immortality. The most trollish outcome — and the outcome toward which we are currently heading — is that those vast, semidivine artifact-minds still won’t be able to beat us at Angry Birds.)
This exceptionally poor round of new predictions looks even worse when viewed beside their past successes. In 2016, respondents predicted AI would be able to write high school essays that would receive high grades in 2023 (i.e., exactly right). In 2022, their median prediction extended out to 2025. How did they get so much worse?
Doubt Creeps In
In retrospect, the seemingly accurate 2016 survey had some red flags.
The survey asked the same questions in multiple different ways. For example, “When do you think there’s a 50% chance AI will be able to classify images?” and “How likely is it that AI can classify images in ten years?” The answers should line up: If experts give a 50% probability of AI classifying images in 10 years, the chance of AI classifying images in 10 years should be 50%. It wasn’t. In this particular case, experts asked when AI would have a 50% chance of classifying images answered 2020; when asked their chance of AI classifying images in 2026, they said 50%.
The survey’s most dramatic question — when AI would reach “human level” — was worst of all. Katja asked the question in two different ways:
1. When AI would achieve high-level machine intelligence, defined as “when unaided machines can accomplish every task better and more cheaply than human workers.”
2. At the end of a list of questions about specific occupations, the survey asked when all occupations could be fully automated, defined as “when for any occupation, machines could be built to carry out the task better and more cheaply than human workers.”
In her write-up, Katja herself described these as different ways of asking the same question, meant to investigate framing effects. But for framing 1, the median answer was 2061. For framing 2, the median answer was 2138.
Most people don’t have clear, well-thought-out answers to most questions. Famously, a 2010 poll found that more respondents supported gays’ right to serve in the military than supported homosexuals’ right to serve in the military. I don’t think people were confused about whether gays were homosexual or not. I think they generated an opinion on the fly, and the use of a slightly friendlier-sounding or scarier-sounding term influenced which opinion they generated. The exact wording wouldn’t shift the mind of a gay rights zealot or an inveterate homophobe, but people on the margin with no clear opinion could be pushed one way or the other.
But this was more than a push: AGI in 45 years vs. 122 years is a big difference!
Gay rights are at least grounded in real people and political or religious principles we’ve probably already considered. But who knows when human-level AI will happen? Many of these experts were people who invented a new computer vision program or helped robot arms assemble cars. They might never have thought about the problem in these exact terms before; certainly they wouldn’t have complex mental models. These are the kinds of conditions where little changes in wording can have big effects.
There’s an energy wonk joke that “fusion power is 30 years in the future and always will be.” The AI version is Platt’s Law, named for Charles Platt, who observed that all forecasts for transformative AI are about 30 years away from the forecasting date. Thirty years away is far enough that nobody’s going to ask you which existing lines of research could produce breakthroughs so quickly, but close enough that it doesn’t sound like you’re positing some specific obstacle that nobody will ever be able to overcome. It’s within the lifetime of the listeners (and therefore interesting), but probably outside the career of the forecaster (so they can’t be called on it). If you don’t have any idea and just want to signal that AI is far but not impossible, 30 years is a great guess!
Katja’s survey didn’t quite hit Platt’s Law — her respondents answered 45 years on one framing, 122 years on another. But I wonder if Platt’s reasoning style — what kind of distance from the present sounds “reasonable,” what numbers will correctly signal support for science and innovation and the human spirit without making you sound like a rosy-eyed optimist who expects miracles — is a more useful framework than the naive model where forecasters simply consult their domain expertise and get the right answer.
Regardless of what particular year it is, saying the same number signals the same thing. If “this problem seems hard, but not impossible, and I support the researchers working on it” is best signaled by providing a six-year timeline, this will be equally true in 2016 and 2022. If you ask someone in 2016, they’ll say it will happen in 2022. If you ask them in 2022, they’ll say it will happen in 2028. If in fact it happens in 2023, the people who you asked in 2016 will look prescient, and the people who you asked in 2022 will look like morons. Is that what happened here?
This table shows the rate at which different predictions advanced from 2016 to 2022. An advance of zero years means the experts’ prediction stayed stable — for example, in 2016, they said it would happen in 2050, and in 2022, they still said it would happen in 2050. An advance of six years means they’re just kicking the can down the road — for example, if in 2016 they said it would happen in 2050, and then in 2022 they said it would happen in 2056.
The mean advance on these milestones was about one year. But this was heavily influenced by three outliers, shown as -29, -24, and -14 above. The median is less sensitive to outliers, and it was three years. That is, over six years, the date that experts predicted we would achieve the milestones advanced three years. So we’re about halfway between the perfect world where everyone predicts the same year regardless of when you ask them (barring actual new information), and the Platt’s Law world where everyone predicts the same distance away no matter what year you ask the question in.
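To make the “advance” arithmetic concrete, here is a minimal Python sketch. The milestone names and years are invented for illustration and are not the survey’s data; only the definition of an advance (the 2022 prediction minus the 2016 prediction) and the contrast between mean and median come from the text.

```python
from statistics import mean, median

# Hypothetical predicted years, NOT the survey's actual figures.
# milestone: (year predicted in 2016, year predicted in 2022)
predictions = {
    "milestone_a": (2022, 2028),  # advance of 6: can kicked down the road
    "milestone_b": (2030, 2033),  # advance of 3: halfway between the two worlds
    "milestone_c": (2050, 2050),  # advance of 0: stable prediction
    "milestone_d": (2060, 2031),  # large negative outlier
}


def advance(pred_2016: int, pred_2022: int) -> int:
    """Years by which the predicted date moved between the two surveys."""
    return pred_2022 - pred_2016


advances = [advance(p16, p22) for p16, p22 in predictions.values()]

# A single large outlier drags the mean far more than the median.
mean_advance = mean(advances)
median_advance = median(advances)
print(f"advances={advances} mean={mean_advance} median={median_advance}")
```

With one big negative outlier in the toy data, the mean goes negative while the median stays positive, the same kind of gap the text describes between a mean of about one year and a median of three.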
In the 2016 survey, this tendency didn’t hurt. Experts predicted the easy-sounding things were about three years away, the medium-sounding things five to 10 years away, and the hard-sounding things about 50 years away. In the 2022 survey, they did the same. Unfortunately for them, in 2022 the medium-sounding things were only months away, or had already been achieved, and their seemingly good performance fell apart.
It seems like most of the AI experts weren’t prepared for difficult prediction questions. What if we asked prediction experts?
Metaculus is a cross between a website and a giant multi-year, several-thousand-question forecasting tournament. You register and make predictions about things. Most of them are simple things that will happen in a month or a year. When a month or a year goes by, the site grades your prediction and grants or fines you points based on how you did compared to other players.
The fun part is the Metaculus Prediction for each question. It’s not just the average forecast of everyone playing that question, it’s the average forecast weighted by how often each forecaster has been right before.
Some Metaculans are “superforecasters,” University of Pennsylvania professor Philip Tetlock’s term for prognosticators with an uncanny knack for making good guesses on questions like these. Superforecasters might not always be experts in the domains they’re making predictions in (though they sometimes are!), but they make up for it by avoiding biases and failure modes like the ones that plagued the experts above. Whatever the weighting algorithm, it will probably disproportionately capture this upper crust of users.
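As a rough illustration of what “weighted by track record” could mean, the Python sketch below compares a plain average with a weighted one. It assumes the simplest possible scheme, a weighted mean. The article does not describe Metaculus’s actual algorithm, and the forecasts and weights here are made up.

```python
def weighted_forecast(forecasts, weights):
    """Weighted mean of probability forecasts.

    forecasts: probabilities in [0, 1], one per forecaster.
    weights:   non-negative track-record scores (hypothetical here).
    """
    if len(forecasts) != len(weights):
        raise ValueError("need exactly one weight per forecast")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(p * w for p, w in zip(forecasts, weights)) / total


# Three hypothetical forecasters; the first has the best track record.
forecasts = [0.9, 0.5, 0.2]
weights = [3.0, 1.0, 1.0]

simple = sum(forecasts) / len(forecasts)
weighted = weighted_forecast(forecasts, weights)
print(f"simple={simple:.3f} weighted={weighted:.3f}")
```

The weighted figure moves toward the forecaster with the strongest record, which is the sense in which any such scheme would “disproportionately capture” the best users.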
Is AI Harder To Forecast Than Other Things? Let’s Find Out!
Metaculus has dozens of questions about AI, including the inevitable Angry Birds forecast.
Because everyone’s scores are tracked, well, meticulously, it has great data on how these forecasts have gone in the past. Forecaster Vasco Grilo has collected data on how Metaculus has done predicting 1,373 different binary yes-or-no questions (like “Will Trump win the election?”). Fifty-six of these questions are about AI (like “Will Google or DeepMind release an API for a large language model before April?”). He found that for both AI categories and all categories, Metaculus’s forecasts did much better than Laplace’s rule of succession (a formula for predicting the likelihood of a specific event in a sequence, based on how frequently that event occurred in the past). But the effect was weaker for AI-related questions (score difference of 0.88) than for all questions (score difference of 1.25).
So Metaculus forecasts are definitely better than nothing (including on AI). But the AI forecasts are less accurate than other forecasts: The score improvement between the guess and the forecast is only about 70% as big (0.88 versus 1.25). Does this mean that forecasting AI is especially hard? Not necessarily. It could be that Metaculus chooses harder questions for AI, or that Metaculus users are experts in other things but not in AI. But the data is definitely consistent with that story.
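For readers unfamiliar with the baseline mentioned above, Laplace’s rule of succession estimates the chance of an event as (s + 1) / (n + 2) after seeing it happen s times in n trials. Here is a minimal Python sketch of that formula. How Grilo applied it to each Metaculus question is not described in this article, so the example inputs are purely illustrative.

```python
from fractions import Fraction


def laplace_rule(successes: int, trials: int) -> Fraction:
    """Rule of succession: estimated probability the next trial succeeds."""
    if trials < 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + 1, trials + 2)


no_data = laplace_rule(0, 0)        # no history at all
some_data = laplace_rule(3, 10)     # event seen 3 times in 10 trials
never_seen = laplace_rule(0, 10)    # event never seen in 10 trials
print(no_data, some_data, never_seen)
```

With no history the rule says 50%, and an event that has never happened still gets a small nonzero probability. That is why it works as a “knows nothing but the base rate” benchmark.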
Okay, But When Will We Have Human-Level AI?
The two most popular AI questions on Metaculus, with thousands of individual forecasts, are on “general AI” (i.e., AI that can perform a wide variety of tasks, just like humans).
The first question (“Easy”) asks about an AI that can pass the SAT, interpret ambiguous sentences, and play video games. The second (“Hard”) asks about an AI that can answer expert-level questions on any subject, pass programming interviews, and assemble a Lego set. Both questions also require the AI to be able to pass a Turing test and explain all its choices to the judges. These are lower bars than Katja’s question about an “AGI that can do all human tasks,” but not by much — in another question, the forecasters predict it will only be one to five years between AIs that beat the first two questions and AIs that can beat humans at everything.
Although Easy is a little older than Hard, since both questions have existed they’ve more or less moved together, suggesting that the movements reflect AI progress in general and not the specific bundle of tasks involved.
Easy starts at 2055, drops to 2033 after GPT-3, then starts rising again. It stays high until early 2021, then has another precipitous drop around April 2022, after which it stays about the same — neither ChatGPT nor GPT-4 affects it very much. So what happened in April 2022?
Most of the commenters blamed Google. In April 2022, the company released a paper describing its new language model PaLM. PaLM wasn’t any higher-tech than GPT-3, but it was trained on more powerful computers and therefore did a better job. The researchers showed that previously theoretical scaling laws (rules governing how much smarter an AI gets on more powerful computers) appeared to hold.
Then in May, DeepMind released a paper describing a “generalist” model called Gato, writing that “the same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.”
Neither of these illuminated deep principles the same way GPT-2 and GPT-3 did, and neither caught the public eye the same way as ChatGPT and GPT-4. But this was when the Metaculus estimate plummeted. Some forecasters defended their decision to change their prediction in the comments. User TryingToPredictFuture:
The PaLM paper indicates that Google is now capable of efficiently converting its vast funds into smarter and smarter AIs in an almost fully automatic manner.
The process is not blocked by theoretical breakthroughs anymore. Google is now in the territory where they can massively improve the performance of their models on any NLP benchmark by simply adding enough TPUs. And there is no performance ceiling in sight, and no slowdown.
My update was based on the fact that GPT-3 and other papers at the time predicted a plausible seeming scaling law, but recent results actually confirm that scaling law continues (plus displays discontinuous improvement on some tasks). Even though these results were predictable, they still remove uncertainty.
Others found the sudden change indefensible, for example top-100 forecaster TemetNosce:
The community was wildly out of line with progress in the field beforehand, and arguably still are. Bluntly I’m more concerned with whether any given AI will do all these tests or get said statement than whether one could in the next decade. My default remains that it’ll happen sometime mid-late this decade.
Reading the comments, one cannot help but be impressed by this group of erudite people, collaborating and competing with each other to wring as much signal as possible from the noise. Some of the smartest people I know compete on Metaculus — and put immense effort into every aspect of the process (especially rules-lawyering the resolution criteria!).
But the result itself isn’t impressive at all. If we believe today’s estimate, then the estimate three years ago was 25 years off. Users appear to have over-updated on GPT-3, having slashed 20 years off their predicted resolution date — then added 15 of those years back for approximately no reason — then gone down even further than before on some papers which just confirmed what everybody was already kind of thinking.
I find OpenAI employee Daniel Kokotajlo’s summary of Metaculus’s AI forecasting more eloquent than anything I could come up with myself:
Sometimes updates happen not because of events, but rather because of thinking through the arguments more carefully and forming better models. Even this kind of update, however, often happens around the same time as splashy events, because the splashy events cause people to revisit their timelines, discuss timelines more with each other, etc.
(Speaking as someone who hasn’t updated as much on recent events due to having already had short timelines, but who hadn’t forecasted on this question for almost a year (EDIT: two years!) and then revisited it in April and May. Also an “event” that caused my shorter timelines was starting a job at OpenAI, but mostly it wasn’t the stuff I learned on the job, it was mostly just that I sat down and tried to build models and think through the question again seriously, and so there were new arguments considered and new phenomena modelled.)
Maybe (some people started thinking around 2020) people’s random guesses about when we’ll get AGI are just random guesses. Maybe this is true even if the people are very smart, or even if we average together many people’s random guesses into one median random guess. Maybe we need to actually think deeply about the specifics of the problem.
One group of people thinking about this was Open Philanthropy, a charitable foundation which (among many other things) tries to steer AI progress in a beneficial direction. They asked their resident expert Ajeya Cotra to prepare a report on the topic, and got “Forecasting Transformative AI With Biological Anchors” (“transformative AI” is AI that can do everything as well as humans).
The report is very complicated, and I explain it at greater length on my blog. The very short version: Suppose that in order to be as smart as humans, AI needs as much computing power as the human brain. In order to train an AI with as much computing power as the human brain, we would need a very, very powerful computer — one with much more computing power than the human brain. No existing computer or cluster of computers is anywhere near that powerful. To build a computer that powerful would take trillions of dollars — more than the entire U.S. GDP.
But every year, computers get better and cheaper, so the amount of money it takes to build the giant-AI-training computer goes down. And every year, the economy grows, and people become more interested in AI, so the amount of money people are willing to spend goes up. So at some point, the giant-AI-training computer will cost some amount that some group is willing to spend, they will build the giant-AI-training computer, it will train an AI with the same computing power as the human brain, and maybe that AI will be as smart as humans.
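The falling-cost, rising-budget argument can be sketched as a toy crossover calculation in Python. Every number below is a made-up placeholder, and the function name and parameters are invented for this illustration. The report’s actual model is far more detailed and works with probability distributions rather than single values.

```python
def crossover_year(start_year, cost, budget, cost_decline, budget_growth,
                   max_years=200):
    """First year in which the budget covers the training cost, else None.

    cost_decline:  fraction by which the cost falls each year (0.20 = 20%).
    budget_growth: fraction by which willingness to spend grows each year.
    """
    for offset in range(max_years + 1):
        if budget >= cost:
            return start_year + offset
        cost *= 1 - cost_decline
        budget *= 1 + budget_growth
    return None  # the curves never cross within max_years


# Hypothetical inputs: a $10 trillion training run against a $1 billion budget.
year = crossover_year(
    start_year=2020,
    cost=1e13,
    budget=1e9,
    cost_decline=0.20,
    budget_growth=0.10,
)
print(year)
```

Small changes to either rate move the crossover by years, which is one way to see why the free parameters matter so much to the final estimate.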
Is this the right way to think about AI? Don’t we need to actually understand what we’re doing in order to get human-level AI, not just build a really big computer? Didn’t the Wright brothers have to grasp the basic principles of flight instead of just building something with the same wingspan as birds? Ajeya isn’t unaware of these objections; the report addresses them at length and tries to argue why computing power will be the dominant consideration. I find her answers convincing. But also, if you’re trying to do a deep specific model instead of making random guesses, these are the kind of assumptions you have to make.
Ajeya goes on to come up with best guesses for the free parameters in her model, including:
How much computing power does the human brain have, anyway?
Are artificial devices about as efficient as natural ones, or should we expect computers to take more/less computing power than brains to reach the same intelligence?
It takes more computing power to train an AI than the AI itself uses, but how much more?
How quickly are computers getting faster and cheaper? Will this continue into the future?
How quickly is the economy growing? Will this continue into the future?
How quickly are people becoming more interested in AI? Will this continue into the future?
… and finds that on average we get human-level AI in 2052:
Ajeya wrote her report in 2020, when the Metaculus questions for AI were reading late 2030s and early 2040s, and when Katja’s experts were predicting the 2060s; all three forecasts were clustered together (and all much earlier than the popular mood, according to which it would never happen, or might take centuries).
In 2022, when Metaculus had updated to the late 2020s or early 2030s, and Katja’s experts had updated to the 2050s (remember, all of these people are predicting slightly different questions), Ajeya posted “Two-Year Update on My Personal AI Timelines,” saying that her own numbers had updated to a median of 2040. She gave four reasons, of which one and a half sort of boiled down to “seeing GPT be more impressive than expected,” one was lowering her bar for transformative AI, and one and a half were fixing other parameters of her model (for example, she had originally overestimated the cost of compute in 2020).
It’s good that she updates when she finds new information. Still, part of what I wanted from an explicit model was a way to not be pushed back and forth by the shifting tides of year-to-year news and “buzz.” If there is a way to avoid that, we will not find it here.
In some sense, since transformative AI has not been invented yet, we cannot grade forecasts about it.
But we can look at whether the same forecasters did a good job forecasting other AI advances, whether their forecasts are internally consistent, and how their forecasts have shifted over time. None of the three forecasting methods look great on these intermediate goals.
Katja’s survey shifted its headline date very little over the course of its six-year existence. But it shows wild inconsistency among different framings of the same data, and gets its intermediate endpoints wrong — sometimes so wrong it fails to notice when things have already happened.
Metaculus’s tournament shifted its headline date by 15 years over the three years it’s been running, and its own commenters often seem confused about why the date is going up or down. Ajeya’s model in some sense did the best, staying self-consistent and shifting its headline date by only 12 years. But this isn’t really a meaningful victory; it’s just a measure of how one forecaster voluntarily graded her own estimates.
In a situation like this, it’s tempting to ask whether forecasting transformative AI gives us any signal at all. Could we profitably replace this whole 5,000-word article with the words WE DON’T KNOW written in really big letters?
I want to tentatively argue no, for three reasons.
First, in the past, these kinds of forecasts have provided more than zero information. Even on Katja’s second survey, the one everyone failed at, there was a correlation of 0.1-0.2 — i.e., higher than zero — on which tasks the experts thought would be solved fastest, and which ones actually were. The Metaculus data show that its forecasts provide much more than literally zero information on binary questions.
Second, because as bad as these forecasts are, “better than literally zero information” is an easy bar to clear. Is it more likely that AI which can beat humans at everything will be invented 20 seconds from now, or 20 years from now? Most people would say 20 years from now; that is, in some sense, an “AI forecast.” Is it more likely 20 years from now, or 20 millennia from now? Again, if you have an opinion on this question, you’re making a forecast. Forecasts like the three in this article aren’t good enough to get a year-by-year resolution. But they all seem to agree that transformative AI is most likely in the period between about 10 and 40 years from now (except arguably the second framing of Katja’s survey). And they all seem to agree that over the past three years, we’ve gotten new information that’s made it look closer than it did before.
And third, because when people see a giant poster saying “WE DON’T KNOW,” they use it as an excuse to cheat. They think things like, “We don’t know that it’s definitely soon, therefore it must be late,” or, “We don’t know that it’s definitely late, therefore, it must be soon.” Nobody says they’re thinking this, but it seems like a hard failure mode for people to avoid.
Forecasts — even forecasts that span decades and swing back and forth more often than we might like — at least get our heads out of the clouds and into the real world where we have to talk about specific date ranges.
I worry that, even with the forecasts, people will cheat. They’ll use real but bidirectional uncertainty as an excuse to have uncertainty only in one direction. For example, they’ll say, “These forecasts suggest a date 10-40 years from now, but the article said these forecasts weren’t very good, and we all know that sometimes bad forecasters fall for hype about new technology, so we can conclude that it will be later than 10-40 years.” Or they’ll say, “These forecasts suggest a date 10-40 years from now, but the article said that these forecasts weren’t very good, and we all know that sometimes bad forecasters have status quo bias and are totally blindsided by new things when they arrive, so we can conclude that it will be sooner than 10-40 years.”
I’m against this because I constantly see both sides (sooner vs. later) assume the other has a bias and their own doesn’t. But also because this is exactly the kind of information forecasters are trying to consider. I know some of the AI experts Katja surveyed, and they’re people who think pretty hard about their biases and the biases of others, and try to account for these biases in their work. I know some of the forecasters on Metaculus, and ditto. Ajeya has talked at length about all the biases she is worried that she could have had and how she adjusted for them. When you throw out these (admittedly bad) forecasts based on your view that they’re “too aggressive” or “too conservative,” you’re replacing hundreds of smart people’s guesses about what errors might be involved in each direction with your spur-of-the-moment guess.
So I claim that our canonical best guess, based on current forecasting methods, is that we will develop “transformative AI” able to do anything humans can do sometime between 10 and 40 years from now. These forecasts aren’t very good, but unless you have more expertise than the experts, are more super than the superforecasters, or have a more detailed model than the modelers, your attempt to invent a different number on the spot to compensate for their supposed biases will be even worse.
We should, as a civilization, operate under the assumption that transformative AI will arrive 10-40 years from now, with a wide range for error in either direction.
Scott Alexander is a writer and psychiatrist based in Oakland, California. He blogs at astralcodexten.substack.com.
Published June 2023