When the cofounders of Gradient Labs met working at the UK neobank Monzo, they couldn’t have foreseen joining forces to build an autonomous AI agent for financial services. But when OpenAI released ChatGPT, built on GPT-3.5, in 2022, the potential was obvious.
At the time, even the most advanced AI customer support agents could only handle the simplest 25% of queries.
“We realised there was a huge amount of efficiency that could be gained,” says Danai Antoniou, cofounder and chief scientist of Gradient Labs. “Traditional language models cannot do fraud investigations, for example. They are too trivial and too basic to be able to do this intuition-based and very complicated, multi-dimensional problem solving, which humans are very good at.”
After leaving Monzo, the trio spent 15 months building their agent Otto. Early feedback has been very encouraging: 90% of customer queries are resolved out of the box, with customer satisfaction (CSAT) scores of 80-90%, on a par with or better than human customer service teams. Banks report their support costs have been cut by up to 75%, with faster response times.
Buoyed by this early success, the founding Gradient Labs team recently closed an €11.08m Series A round and plans to expand into Europe and the US.
But that success has only been possible because they’ve invested in reliable infrastructure.
“That’s been critical for us because if you’re doing support for a bank, you cannot go down,” says Antoniou. “We do not rely on a single provider. That also allows you to work around inference limitations. It’s been an interesting problem that we’ve had to think through from the beginning.”
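In practice, that kind of provider redundancy often comes down to a simple failover loop: try one inference provider, back off briefly on failure, then fall through to the next. The sketch below is purely illustrative, with hypothetical placeholder providers standing in for real API clients; it is not Gradient Labs’ implementation.

```python
# Illustrative multi-provider failover for LLM inference.
# The provider functions are hypothetical stand-ins for real API clients.
import random
import time
from typing import Callable, Optional


def call_provider_a(prompt: str) -> str:
    # Placeholder: a real client would send the prompt to provider A's API.
    if random.random() < 0.2:
        raise TimeoutError("provider A unavailable")
    return f"[provider A] answer to: {prompt}"


def call_provider_b(prompt: str) -> str:
    # Placeholder for a second, independent provider.
    if random.random() < 0.2:
        raise TimeoutError("provider B unavailable")
    return f"[provider B] answer to: {prompt}"


PROVIDERS: list[Callable[[str], str]] = [call_provider_a, call_provider_b]


def complete_with_failover(prompt: str, retries_per_provider: int = 2) -> str:
    """Try each provider in turn, with a short backoff, before giving up."""
    last_error: Optional[Exception] = None
    for provider in PROVIDERS:
        for attempt in range(retries_per_provider):
            try:
                return provider(prompt)
            except TimeoutError as err:
                last_error = err
                time.sleep(0.1 * (attempt + 1))  # simple linear backoff
    raise RuntimeError("all inference providers failed") from last_error


if __name__ == "__main__":
    print(complete_with_failover("Summarise this customer query."))
```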
The challenge
The surge of AI workloads has thrown a new technology challenge into the spotlight — inference. That’s the process by which an AI model applies what it has learned during the training phase to make decisions based on new, real-world data.
Analysts at Morgan Stanley estimate more than 75% of power and computational demand in the US will be for inference in the coming years. Even Nvidia CEO Jensen Huang nodded to this surge in an earnings call in February.
“The amount of inference compute needed is already 100x more” than it was when large language models started out, Huang said. “And that’s just the beginning.”
For those building AI applications, inference matters a lot.
“There are three main demands,” Anton Osika, cofounder of Lovable, says about inference challenges for tech teams. The Swedish “vibe coding” startup became a unicorn eight months after its launch, when it closed a $200m Series A round.
“First and foremost is speed. Both raw throughput but also how much reasoning models are able to actually do. Secondly, caching is super important for agentic flows when there’s lots of back and forth between the model and the wider environment. And finally capacity. Getting access to the latest models and provisioning enough tokens isn’t easy at scale,” he says.
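To illustrate the caching point: in an agentic loop, the same system prompt and context are re-sent to the model on every turn, so avoiding redundant inference calls (or, on the provider side, reusing a cached prompt prefix) saves both latency and tokens. Below is a minimal, assumed sketch using a local memoisation cache; production systems typically lean on provider-side prefix caching instead, and none of this is Lovable’s actual code.

```python
# Minimal sketch of avoiding redundant inference calls in an agentic loop.
# _model_call is a hypothetical placeholder, not a real provider client.
from functools import lru_cache


def _model_call(prompt: str) -> str:
    # Placeholder for a slow, expensive inference request.
    return f"model output for {len(prompt)} characters of context"


@lru_cache(maxsize=1024)
def cached_complete(prompt: str) -> str:
    """Memoise completions so an identical prompt is never sent twice."""
    return _model_call(prompt)


def agent_step(shared_context: str, new_observation: str) -> str:
    # The stable context (system prompt, project summary, tool specs) goes
    # first, so provider-side prefix caching could also reuse it across turns.
    prompt = shared_context + "\n" + new_observation
    return cached_complete(prompt)
```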
Demands are scaling fast
As AI adoption grows, inference workloads must also scale without overwhelming infrastructure or driving up costs. Researchers at Nebius, a global AI infrastructure company with seven data centres across Europe, the US and Israel, place inference efficiency high on their priority list.
“The tasks we’re tackling today are relatively simple,” says Danila Shtan, Nebius’s chief technology officer. “In the future, we’ll see more complex scenarios that will require more computation resources and complex systems to enable them.”
The most advanced capabilities have traditionally been confined to closed ecosystems. However, that’s beginning to shift, and as inference demands scale fast, many startups are finding the cost of those proprietary systems unsustainable.
This has created an opportunity for platforms that offer open-source alternatives. Nebius’s inference-as-a-service platform AI Studio, for example, provides access to open-source models such as the latest releases from Llama, DeepSeek, Qwen and OpenAI, and enables them to be deployed and scaled.
“Most startups in their early stages rely on state-of-the-art models, which are always closed ecosystems with their own inference,” Shtan says. “But that becomes very expensive really fast and founders start to look for alternatives. They’ll train their own models or work with open-source models to alleviate some of that pressure. There are already a lot of capable open-source models out there and in the future, there will be more.”
It’s also a good way to maintain flexibility across providers. Shtan points to a recent controversy in the US, when Anthropic cut Windsurf’s access to its Claude 3.x models amid reports that OpenAI was moving to acquire the AI coding platform for $3bn. The move left the Windsurf team scrambling to boost inference capacity with its other providers.
For Shtan, the episode underlines the need for startup founders to build in contingency with open source.
“You can’t rely on a closed-source ecosystem unless your plan is to be part of that ecosystem forever,” he says. “This isn’t just about cost optimisation but also ensuring business continuity.”
Lovable’s Osika agrees, saying that while the best coding models will likely remain closed source in the near term, open source is crucial for transparency, customisation and avoiding vendor lock-in.
“We aim to work with both,” he says. “The real value isn’t just in the model, but in the entire system we’re building around it that makes AI coding reliable, fast and secure.”
In the future, much of the inference efficiency a startup can access will come down to the capacity of local infrastructure, Nebius’s Shtan adds. In December, the European Commission pledged €750m in funding to establish and maintain AI-optimised supercomputers for startups to train their AI models. Nvidia is also planning a threefold boost to AI hardware capacity in Europe next year, with the region’s AI computing capacity set to increase by a factor of 10.
Locality matters
Nebius recently launched operations in the UK with the deployment of a major GPU cluster, built on Nvidia’s platform. It’s due to be fully operational by the end of this year.
“If you’re talking about inference, the locality matters,” Shtan says. “With these complex systems, it’s going to be more and more expensive to go over the Atlantic. We definitely see the demand for local infrastructure growing, and that is a firm part of our strategy.”
That’s something Gradient Labs has needed to rely on too.
“When you’re dealing with UK banks for example, they may not want their data to go to the US,” Antoniou says. And while she doesn’t see any significant barriers to Europe’s ability to compete in AI on the world stage, she says startups need to be ready to evolve as fast as the technology does.
“There’s a lot of low-level computational optimisation research being done to make these models faster and much more efficient,” Antoniou says. “What we’re focused on is making sure our agent can be evaluated as quickly as possible on the next available model. We have to put ourselves in the best position to take advantage of the latest capabilities as soon as they come out.”
Read the original article: https://sifted.eu/articles/europes-ai-efficiency-brnd/