The AI race is not as close as it looks

A common narrative suggests the global AI race is now nearly balanced, with differences reduced to branding. But a closer look at how models are tested and perform in real-world situations tells a different story. The real question is not who scores highest, but what those scores actually measure.

By: Per Imer, CEO, Homerunner


When the test becomes the curriculum

AI models are often compared using benchmarks: standardized tests designed to measure performance on specific tasks. One of the most widely used benchmarks in recent years has been SWE-bench, which evaluates how well models can resolve software engineering problems drawn from real code repositories.

Results have repeatedly appeared impressive, with each new model achieving higher scores than the last. This has helped fuel the perception of an extremely tight AI race, where the gap between leading laboratories seems almost nonexistent.

However, a fundamental issue arises when benchmark tasks have been publicly available for long periods. When questions exist openly online, they can find their way into training data. Models may then be optimized directly toward solving those exact tasks, intentionally or unintentionally.

It resembles preparing for an exam by studying previous exam papers and then receiving the same questions again. The score may be high, but it does not necessarily reflect deeper understanding.

New tests change the story

To address this challenge, an updated benchmark called SWE-rebench was introduced. The structure and difficulty remain similar, but the problems are drawn from newer repositories and consist of tasks models have not previously encountered.
Suddenly, the rankings changed.

Models from Anthropic, OpenAI, and Google placed at the top, scoring roughly between 51 and 53 percent. Several models that had previously appeared nearly equal on the older benchmark dropped significantly in performance.

One model that previously reported results near 80 percent scored below 40 percent on the new test.
This is not a marginal difference. It represents the gap between recognizing familiar problems and solving genuinely new ones.

What the difference actually reveals

This does not mean that certain AI ecosystems are weak or irrelevant. On the contrary, investment levels are massive, and development speed remains high across regions.

What it does show is that leaderboard numbers can be misleading. When access to advanced chips, GPU capacity, and research infrastructure is limited, it may be rational to optimize models toward the benchmarks everyone watches most closely. Doing so creates strong visible results and the impression of technological parity.

But when tests change and problems become unfamiliar, differences emerge. Frontier models require enormous computational resources, deep research environments, and a culture capable of building new architectures from first principles rather than optimizing existing ones.
Benchmark optimization alone cannot replace that foundation.

Why this matters for companies

For companies building with AI, the key question is not who tops a leaderboard at a given moment. The critical factor is robustness.
Can the model handle situations it has never seen before? Can it generalize across contexts? Can it operate reliably within complex operational systems where data is imperfect and conditions constantly change?

In logistics, new scenarios appear continuously. Delivery patterns evolve, unexpected errors occur, B2B workflows vary, and real-world data always contains exceptions.

Being optimized for a single test offers little advantage in such environments. What matters is the ability to apply reasoning broadly and adapt intelligently to new conditions.

The real distance

When models are evaluated using fresh and unseen tasks, current results suggest that leading American laboratories maintain an advantage of approximately six to twelve months.
This gap is not permanent. Technology evolves rapidly, and shifts can occur unexpectedly. Yet in an industry where six months can reshape product strategies, investment decisions, and architectural choices, such a lead is significant.

It influences which platforms organizations build upon and which capabilities are practically available today rather than theoretically possible tomorrow.

Benchmarks only matter if they measure the right thing

Benchmarks themselves are not the problem. They are essential tools for comparison and progress within a rapidly evolving field.
The problem arises when passing the test becomes the goal rather than a signal of learning. When models are optimized toward known tasks, evaluation risks measuring memorization rather than understanding.

The real value of AI lies not in answering familiar questions correctly but in addressing situations that have never occurred before. Generalization is therefore the true metric.

The simple conclusion

The AI race is far from decided. Innovation is global, and competition is intense. But it is also not as close as certain benchmark numbers have suggested.
There is a crucial difference between producing correct answers and solving new problems.

For companies, this means technology choices should not be based solely on leaderboard rankings but on the ability to create stable, adaptive, and resilient systems over time. Benchmarks are useful only when they test what truly matters.

The ability to generalize.

That ability will ultimately determine who builds lasting systems and who merely scores highly on the wrong test.
