Last month, AI founders and investors told TechCrunch that we’re now in the “second era of scaling laws,” signaling a departure from the traditional methods of improving AI models, which are showing diminishing returns. Among the promising innovations is “test-time scaling,” a technique that seems to underpin the remarkable performance of OpenAI’s latest model, o3. However, as groundbreaking as o3 is, it comes with significant trade-offs.
The unveiling of OpenAI’s o3 model has sparked optimism across the AI community, dispelling fears that AI scaling progress had “hit a wall.” o3 has set new benchmarks, excelling on the ARC-AGI general ability test and achieving an unprecedented 25% on a difficult math benchmark (reported to be EpochAI’s FrontierMath)—a performance that dwarfs other models, none of which exceeded 2%.
While the initial buzz is promising, TechCrunch remains cautious, as very few have had the opportunity to test o3. Even so, the AI world is abuzz with speculation that something transformative is happening.
A New Era of AI Scaling
Noam Brown, co-creator of OpenAI’s o-series models, highlighted the rapid progress in a tweet, pointing out that o3’s impressive gains came just three months after the release of o1. “We have every reason to believe this trajectory will continue,” Brown stated.
Jack Clark, co-founder of Anthropic, echoed this sentiment in a blog post, suggesting that o3’s release signals a faster pace of AI progress in 2025. While Anthropic stands to benefit from this narrative—particularly in terms of raising capital—Clark’s insights align with broader industry trends. He predicts that the AI community will combine test-time scaling with traditional pre-training methods next year, potentially setting the stage for even greater advances.
What is Test-Time Scaling?
Test-time scaling refers to the use of more computational resources during the inference phase—the period after a user inputs a prompt and waits for the AI’s response. Although OpenAI hasn’t disclosed all the technical details, it’s likely leveraging more powerful inference chips, deploying additional chips per task, or extending computation times significantly. Reports suggest that in some cases, o3 takes 10 to 15 minutes to produce an answer, a marked departure from earlier models.
While this approach improves model performance, it also raises costs. Jack Clark notes that o3’s reliance on test-time compute makes the cost of running AI systems less predictable. In previous models, costs were tied to pre-training and inference efficiency. Now, the compute-intensive nature of test-time scaling introduces variability.
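OpenAI hasn’t published how o3 spends its extra inference compute, but a well-known form of test-time scaling is sampling a model many times and taking a majority vote over its answers (“self-consistency”). The sketch below illustrates that idea only—the `model_answer` function is a hypothetical stand-in for a real model call, and the 60% accuracy figure is an arbitrary assumption, not a claim about o3:

```python
import collections
import random

def model_answer(prompt: str, rng: random.Random) -> str:
    # Hypothetical stand-in for one (noisy) model inference pass.
    # Assumption: it returns the correct answer 60% of the time,
    # and a random wrong digit otherwise.
    return "42" if rng.random() < 0.6 else str(rng.randint(0, 9))

def answer_with_test_time_compute(prompt: str, n_samples: int, seed: int = 0) -> str:
    """Sample the model n_samples times and return the majority answer.

    More samples means more inference-time compute per query -- and usually
    higher accuracy. That trade-off is the essence of test-time scaling,
    and also why per-query costs become variable and hard to predict.
    """
    rng = random.Random(seed)
    votes = collections.Counter(model_answer(prompt, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(answer_with_test_time_compute("hard math question", n_samples=1))
print(answer_with_test_time_compute("hard math question", n_samples=101))
```

With 101 samples, the majority vote almost certainly lands on the correct answer even though any single pass is wrong 40% of the time—but the query now costs roughly 101 times as much compute as a single pass, which is the cost-variability problem Clark describes.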
Benchmarking Breakthroughs
One of the most striking indicators of o3’s capabilities is its performance on the ARC-AGI benchmark, a rigorous test designed to evaluate progress toward artificial general intelligence (AGI). Although passing this test doesn’t equate to achieving AGI, it provides a useful metric for measuring advancements. On one attempt, o3 scored 88%, surpassing OpenAI’s previous best—o1—which achieved just 32%.
However, the cost of this performance is staggering. The high-performing version of o3 reportedly used more than $1,000 worth of compute per task, compared to o1’s $5 per task and o1-mini’s mere cents. According to François Chollet, creator of the ARC-AGI benchmark, o3 required approximately 170 times more compute to achieve its top score than a slightly less resource-intensive version, which scored only 12% lower.
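To put those reported figures side by side (the per-task prices are the article’s estimates, not official pricing, and the o1-mini figure of “mere cents” is represented here by an arbitrary placeholder), a quick back-of-envelope comparison:

```python
# Approximate per-task cost estimates reported for the ARC-AGI runs.
cost_per_task = {
    "o1-mini": 0.05,            # "mere cents" -- placeholder value
    "o1": 5.00,
    "o3 (high-compute)": 1000.00,
}

baseline = cost_per_task["o1"]
for model, cost in cost_per_task.items():
    print(f"{model}: ${cost:,.2f}/task ({cost / baseline:,.0f}x the cost of o1)")

# Chollet's comparison: the high-compute o3 configuration used ~170x the
# compute of the low-compute one for roughly a 12-point score gain
# (88% vs. about 76%).
compute_ratio = 170
score_gain_points = 12
print(f"~{compute_ratio}x compute bought ~{score_gain_points} percentage points")
```

At those numbers, the high-compute configuration costs at least 200 times more per task than o1—a ratio that frames the “cost conundrum” discussed below.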
Chollet describes o3 as a “system capable of adapting to tasks it has never encountered before, approaching human-level performance in the ARC-AGI domain.” Yet he acknowledges the steep costs: “You could pay a human to solve ARC-AGI tasks for roughly $5 per task, consuming mere cents in energy.”
The Cost Conundrum
The price of running o3 raises critical questions. For instance, what is the model’s intended use case? And how much more compute will future iterations, like o4 or o5, require? Given the high costs, o3 seems impractical for everyday tasks like answering simple queries. Instead, its value might lie in addressing complex, strategic questions—for example, aiding a sports franchise’s general manager in long-term planning.
Institutions with significant budgets may be the primary users of o3. Ethan Mollick, a professor at Wharton, noted on Twitter that such models are currently beyond the reach of most users. OpenAI has already introduced a $200 tier for high-compute versions of o1, and there are reports of potential subscription plans costing up to $2,000—a price point that aligns with o3’s resource demands.
Challenges and Opportunities
Despite its advances, o3 is far from perfect. It still struggles with simple tasks that humans can solve effortlessly. This is symptomatic of a broader issue in large language models: hallucination. Even with test-time scaling, o3 and similar models are prone to errors, which is why disclaimers accompany their outputs. True AGI, if achieved, would likely overcome these limitations.
One avenue for improvement lies in developing better inference chips. Startups like Groq and Cerebras are working on more powerful chips, while companies like MatX focus on cost efficiency. Andreessen Horowitz’s Anjney Midha has suggested that these innovations could play a pivotal role in the future of test-time scaling.
The Road Ahead
OpenAI’s o3 is undeniably a milestone, demonstrating the potential of test-time compute to push AI models to new heights. However, its high resource requirements and associated costs highlight the challenges of sustaining such progress. While institutions with deep pockets may find o3 invaluable for specific applications, broader adoption remains limited by economics.
As the AI community explores new ways to scale, o3’s performance underscores the need for continued innovation in both technology and cost management. The model’s success adds weight to the argument that test-time compute could be the next frontier in AI scaling—but only time will tell if it’s a sustainable solution.