With the launch of Nvidia’s latest generation of GPUs, liquid cooling has moved from a nice-to-have to a necessity for many firms. Chips that are simply too power-dense to cool with air are driving the adoption of direct-to-chip cooling and rear-door heat exchangers.
As data center operators serving hyperscalers, GPU cloud providers, and leading enterprises look to adapt their offerings to cater to increasingly dense workloads, changes will need to be made around how companies approach resilience, redundancy, and uptime.
Is liquid more or less resilient than air?
Data center operators across the board are rolling out liquid-cooling-enabled data halls, and many hyperscalers are developing bespoke liquid-cooled systems in their own data centers.
The need to adopt liquid for today’s high-end hardware is simply one of physics. Air as a cooling medium has a physical limit and, after a certain point, no amount of cold air can be passed over chips quickly enough to keep the hardware within the required temperature thresholds.
“Data center operators have done everything they can with air-based cooling systems,” says Ben Coughlin, chairman and co-founder of liquid-cooled data center operator Colovore. “It reaches a tipping point where the physics are impossible to ignore, and you just need to have liquid as the cooling medium.”
While moving to liquid might be a necessity, it comes with new difficulties in ensuring the same ‘five-nines’ level of uptime that customers have come to expect.
Vlad-Gabriel Anghel, director of solutions engineering at DCD’s training unit, DCD>Academy, notes that while there will be fewer fans and moving parts to maintain, operators are now relying on a continuously moving liquid to maintain the environmental conditions that the hardware needs in order to function correctly.
“Robustness will not come from one system being used over another but through it being engineered into the wider system,” he says.
Colovore’s Coughlin adds: “There isn't a playbook of what liquid cooling is. There are lots of different ways to deliver it; rear door heat exchangers, direct liquid connections into server chassis, or water delivered by a cooling distribution unit (CDU) either in the cabinet or a floor-mounted CDU.”
How liquid changes resilience and redundancy
NTT GDC has more than 200MW of liquid-cooled capacity (mostly direct-to-chip) rolled out at half a dozen data centers across the US – one of the largest such footprints globally.
Bruno Berti, SVP of global product management at NTT Global Data Centers, tells DCD that, to date, its deployments have not encountered any major failures.
“We have a lot less time to accommodate any type of stoppage of that liquid, so we need to ensure that that liquid continues to flow to continually extract that heat,” he says. “Resiliency in liquid distribution systems is more important than redundancy in an air-based system. So it's definitely something that's been in the forefront of our design and engineering teams' minds.”
DCD>Academy’s Anghel tells us that thermal inertia is considerably smaller in liquid cooling systems than in air cooling, which means even a very brief disruption in water flow can lead to massive temperature spikes, causing hardware to throttle or potentially leading to permanent damage.
“The speed at which heat builds up in a liquid cooling system makes cooling failures much more critical than with air cooling,” he says. “Without adequate redundancy and proactive monitoring, the risk of an outage is significantly greater in this new era of high-performance computing.”
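To put rough numbers on that point, a back-of-envelope estimate is sketched below. The chip power, cold plate mass, and temperature headroom are illustrative assumptions rather than figures from any of the operators quoted, but they show why ride-through is measured in seconds rather than minutes once flow stops.

```python
# Rough estimate of how quickly a direct-to-chip cold plate heats up if coolant
# flow stops. All numbers are illustrative assumptions, not vendor figures.

chip_power_w = 1000.0          # assumed GPU package power (W)
coldplate_mass_kg = 0.8        # assumed copper cold plate mass (kg)
copper_c = 385.0               # specific heat of copper, J/(kg*K)
trapped_coolant_kg = 0.05      # assumed coolant trapped in the plate (kg)
water_c = 4186.0               # specific heat of water, J/(kg*K)

# Combined heat capacity of the metal plus the small volume of stagnant coolant
heat_capacity_j_per_k = coldplate_mass_kg * copper_c + trapped_coolant_kg * water_c

# With flow stopped, nearly all chip power goes into that small thermal mass
temp_rise_per_second = chip_power_w / heat_capacity_j_per_k

allowed_rise_k = 20.0          # assumed headroom before the chip throttles
seconds_to_throttle = allowed_rise_k / temp_rise_per_second

print(f"~{temp_rise_per_second:.1f} K/s rise; ~{seconds_to_throttle:.0f} s to throttle")
```

Under those assumptions the temperature climbs by roughly two degrees a second, leaving only around ten seconds before throttling. An air-cooled hall, by contrast, can draw on a large volume of cool air and room thermal mass, which typically buys minutes rather than seconds of ride-through.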
“While the core goals of reliability and uptime remain the same, liquid cooling necessitates a more integrated approach that combines physical redundancy, advanced monitoring, and automation,” adds Anghel. “While the principles of 2N and N+1 redundancy still apply, they must be adapted to address the unique vulnerabilities of liquid cooling, such as rapid thermal buildup, coolant leaks, and flow disruptions.”
To do this, he says, operators need to prioritize redundant cooling loops, pumps, and heat exchangers, ensuring these systems are fully independent to avoid cascading failures. Real-time monitoring of flow rates, pressure, and temperatures, coupled with automated failover mechanisms, is critical to maintaining uptime in the event of a fault.
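As a simple illustration of that combination of monitoring and automated failover, the sketch below polls hypothetical flow, temperature, and pressure telemetry and switches to a standby pump when a reading drifts out of range. The thresholds and names are invented for illustration; in practice this logic would sit in the BMS or CDU controls rather than a standalone script.

```python
from dataclasses import dataclass

# Illustrative thresholds only; real limits come from the CDU and server vendors.
MIN_FLOW_LPM = 30.0        # minimum acceptable secondary-loop flow (litres/min)
MAX_SUPPLY_TEMP_C = 45.0   # maximum acceptable supply temperature (deg C)
MIN_DELTA_P_KPA = 50.0     # minimum supply/return differential pressure (kPa)

@dataclass
class LoopTelemetry:
    flow_lpm: float
    supply_temp_c: float
    delta_p_kpa: float

def loop_healthy(t: LoopTelemetry) -> bool:
    """Return True if the coolant loop is within its assumed operating envelope."""
    return (t.flow_lpm >= MIN_FLOW_LPM
            and t.supply_temp_c <= MAX_SUPPLY_TEMP_C
            and t.delta_p_kpa >= MIN_DELTA_P_KPA)

def evaluate(primary: LoopTelemetry, standby_available: bool) -> str:
    """Decide on an action for one polling interval."""
    if loop_healthy(primary):
        return "ok"
    if standby_available:
        return "failover_to_standby_pump"   # automated failover path
    return "raise_critical_alarm"           # no redundancy left: escalate

# Example polling cycle with made-up readings
reading = LoopTelemetry(flow_lpm=22.0, supply_temp_c=47.5, delta_p_kpa=41.0)
print(evaluate(reading, standby_available=True))  # -> failover_to_standby_pump
```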
Designing liquid systems for uptime
California-based Colovore is a longstanding advocate of liquid cooling, having launched its first all-liquid facility in Santa Clara more than a decade ago. It is about to launch a second data center next door and is planning more in Reno and Chicago.
On its site, the company claims the original facility has maintained 100 percent uptime throughout its operation, offering densities up to 250kW through rear door heat exchangers and direct-to-chip cooling.
“I don't think there's a different way of thinking about it,” says Colovore’s Coughlin. “You still have the base fundamental design considerations of delivering power and cooling to a cabinet.”
“Traditional air-based systems have to be able to be concurrently maintained, and if you've got an issue with a certain part of the floor you want to isolate that,” he adds. “You still have to think those issues through in a liquid-cooled system too; like the water distribution and having redundant pathways and being able to close off certain valves and taps in certain situations.”
While the fundamental goals might be the same, there are still practical aspects to consider. One of the major decisions around the resiliency of water loops is whether to build a system with high-pressure or low-pressure water flow, and with very large or narrow piping. There are also materials to consider: should a rubber hose or a silver metal braided hose be used?
“Generally, we design our system with larger pipes and lower flow rates, because we believe that's a safer approach,” says Colovore’s Coughlin. “High-pressure water flows, to the extent you ever had a leak or a break, can escalate into bigger issues.”
“We've always erred on the side of using very industrial, very, very thick components, but there are a lot of providers who are using plastic hosing and rubber tubes to distribute water,” he adds.
“Broadly speaking, it does seem like the industry is trending towards warmer inlet temperatures of water and lower flow rates, which lines up well with how we've designed our system.”
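The physics behind that trend is the familiar relationship that heat removed equals mass flow times specific heat times temperature rise, so accepting a larger temperature rise across the rack (a warmer return) cuts the flow rate needed for the same load. A minimal sketch with assumed numbers:

```python
# Q = m_dot * c * dT  (heat removed = mass flow * specific heat * temperature rise)
# Illustrative numbers only.

WATER_C = 4186.0        # specific heat of water, J/(kg*K)
WATER_RHO = 1000.0      # density of water, kg/m^3

def flow_lpm_for_load(load_kw: float, delta_t_k: float) -> float:
    """Litres per minute of water needed to remove load_kw with a delta_t_k rise."""
    kg_per_s = (load_kw * 1000.0) / (WATER_C * delta_t_k)
    return kg_per_s / WATER_RHO * 1000.0 * 60.0   # kg/s -> L/min

rack_load_kw = 100.0   # assumed liquid-cooled rack load
for delta_t in (5.0, 10.0, 15.0):
    print(f"dT = {delta_t:>4} K -> {flow_lpm_for_load(rack_load_kw, delta_t):6.1f} L/min")
```

For an assumed 100kW rack, tripling the allowable temperature rise from 5K to 15K cuts the required flow from roughly 287 L/min to about 96 L/min; pushing that lower flow through larger pipes also means lower velocity and pressure, which is broadly consistent with the "safer approach" Coughlin describes.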
Berti says NTT has a flexible design that can mix and match air and liquid cooling, allowing it to deploy CDUs where fans would otherwise be. The company’s experience with primary water loops for the air-cooled chillers meant NTT was comfortable with the idea of a secondary loop for the liquid-based systems.
“From a resiliency perspective, I think liquid is more complicated, and has more failure scenarios,” he says. “You've got a lot of piping inside the data center; a lot of joints, a lot of moving parts, a lot of potential for leaks. There are also a lot of valves that people are plugging in and out.
“We've got multiple CDUs in an N+1 type configuration. If any pump goes down, there are always going to be more CDUs to handle any failure and fault scenario, and we have a dual-loop system that allows us to isolate that loop at any point.”
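The arithmetic behind that N+1 configuration is simple enough to sketch: after losing the single worst-case CDU, the remaining units must still cover the hall's heat load. The capacities below are invented for illustration, not NTT's figures.

```python
def survives_single_failure(cdu_capacities_kw: list[float], load_kw: float) -> bool:
    """True if the hall's heat load is still covered after losing the largest CDU."""
    if not cdu_capacities_kw:
        return False
    remaining = sum(cdu_capacities_kw) - max(cdu_capacities_kw)
    return remaining >= load_kw

# Four hypothetical 800kW CDUs serving a 2.2MW hall: losing any one still leaves 2.4MW.
print(survives_single_failure([800.0, 800.0, 800.0, 800.0], 2200.0))  # True
```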
Isolation is important. Berti notes some customers and other colo providers might be happy just running a single water loop to the racks, but this means that if there’s a leak, the whole loop has to be shut down.
“If there's a leak somewhere, we have multiple places where we can intercept that loop and then basically put it on the secondary loop,” Berti says. “And because we can isolate that loop, we'll be able to add CDUs later on in the process as well.”
“That's important in the resiliency piece; how do you do maintenance on these inherently less resilient pieces of equipment?”
Liquid SLAs: do the guarantees need to change?
Standards are emerging, but the industry is still settling on best practices for many aspects of liquid cooling.
“There is still quite a bit of work to be done to determine what the industry standards would be,” says Coughlin. “There isn't that playbook of ‘here's how I design a liquid-cooled data center.’ We're working through this in real-time.”
“It is ultimately driven by those server platforms and what their requirements are. Until we get some consistency from the hardware side, there's always going to be some degree of customization involved, because we don't have uniformity in those standards.”
Beyond standards, questions over what guarantees operators need to offer around liquid systems have yet to be fully answered.
NTT’s Berti says the company is updating and redefining a number of SLAs for liquid-cooled deployments and has updated some contracts with clients.
“We're coming up with a whole new set of SLAs,” he explains. “Flow rate and differential pressure between the supply and return of the CDUs are SLAs that are being defined. The water temperatures, both primary and secondary, are being updated. Secondary liquid temperature, secondary flow rate, secondary return and supply differential pressure. Those are the ones that we're working on right now with our clients, redefining, and putting into standard SLAs.”
“We never used to have to SLA the primary water loop temperatures because it was the air that dictated what temperature we could run the liquid at. But now it matters because of the heat extraction requirements that are on the primary side.”
SLAs for liquid-cooled systems “must go beyond the typical guarantees of uptime and equipment reliability seen with air-cooled systems,” says DCD>Academy’s Anghel. These should include, he says, requirements for flow rates, temperature ranges, redundancy, monitoring, and maintenance specific to liquid cooling.
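To make that concrete, a contract built around the parameters Berti and Anghel describe might be checked against telemetry along the lines of the sketch below. The parameter names and limits are hypothetical, not terms from NTT's actual SLAs.

```python
# Hypothetical liquid-cooling SLA terms; none of these values come from real contracts.
SLA_LIMITS = {
    "secondary_flow_lpm_min": 300.0,
    "secondary_supply_temp_c_max": 32.0,
    "secondary_delta_p_kpa_min": 60.0,
    "primary_supply_temp_c_max": 20.0,
}

def sla_breaches(telemetry: dict[str, float]) -> list[str]:
    """Compare measured values against the SLA envelope and list any breaches."""
    breaches = []
    if telemetry["secondary_flow_lpm"] < SLA_LIMITS["secondary_flow_lpm_min"]:
        breaches.append("secondary flow below contracted minimum")
    if telemetry["secondary_supply_temp_c"] > SLA_LIMITS["secondary_supply_temp_c_max"]:
        breaches.append("secondary supply temperature above contracted maximum")
    if telemetry["secondary_delta_p_kpa"] < SLA_LIMITS["secondary_delta_p_kpa_min"]:
        breaches.append("secondary differential pressure below contracted minimum")
    if telemetry["primary_supply_temp_c"] > SLA_LIMITS["primary_supply_temp_c_max"]:
        breaches.append("primary supply temperature above contracted maximum")
    return breaches

# Example reading with made-up values
sample = {
    "secondary_flow_lpm": 310.0,
    "secondary_supply_temp_c": 33.5,
    "secondary_delta_p_kpa": 65.0,
    "primary_supply_temp_c": 19.0,
}
print(sla_breaches(sample))  # -> ['secondary supply temperature above contracted maximum']
```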
For retail colocation provider Colovore, however, things may be different.
“The critical variables are ultimately what’s required by the hardware platforms,” Coughlin says. “To some degree, it's already built into the equation. If you can't meet those [hardware] specs, there's no point having an SLA [because] you can't do it.”
He acknowledges that may change in the future. But for now, he says, the company’s customer base “doesn't ultimately care too much” about the medium.
“We have to deliver power and cooling to our customer environments. Those end metrics and requirements from an SLA perspective, haven't changed. The cooling medium might be slightly different, but, from the customer's perspective, that doesn't really matter.”
There are also questions about who owns that water infrastructure, especially the CDUs. Some customers may want that infrastructure managed by the data center operator (or perhaps the CDU provider), but some of it may be located in areas that facilities staff are not traditionally authorized to enter.
Does it matter in a failover-centric estate?
Most liquid cooling technology has matured in the realm of high-performance computing.
Jacqueline Davis, research analyst at the Uptime Institute, suggests that, because of this, the liquid cooling landscape still speaks the language of uptime and mean time between failures, rather than redundancy.
“At present, liquid cooling is going where needed to solve thermal challenges where operators have few other choices. It's not being chosen primarily for reasons of efficiency,” says Davis. “They can choose very durable pumps, fluid connections, etc, to try and get the best uptime and the best average mean time between failures. But it's harder to build things redundant.”
That so much GPU hardware, and therefore liquid cooling, is still focused on training – not unlike the batch computing mindset of HPC deployments – means a workload outage doesn't usually translate into an immediate loss of service or revenue. Because of that, vendor money isn't going into products offering the highest levels of redundancy.
“I don't think they're targeting trying to meet conventional business IT at the redundancy standards that they've had in the past,” says Davis. “Because liquid cooling is still primarily in the HPC and AI world, it's not yet being asked to meet that need of redundancy.”
She suggests that workloads requiring both high availability and liquid cooling might be better off focusing on resiliency in the software layer rather than in the cooling hardware: better to fail over to another server or rack than to try to fail over to another pump or pipe.
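As a rough illustration of what that software-layer approach looks like, the sketch below reschedules a training job onto a healthy rack from its last checkpoint when a coolant alarm fires. All names, and the naive placement logic, are hypothetical.

```python
# Minimal sketch of resilience in the software layer: when a rack raises a
# coolant alarm, resume its training job on a healthy node from the last
# checkpoint instead of relying on redundant pumps. All names are hypothetical.

def reschedule_on_alarm(job: dict, alarmed_rack: str, nodes_by_rack: dict) -> dict:
    """Move the job off the alarmed rack and resume it from its last checkpoint."""
    if job["rack"] != alarmed_rack:
        return job  # unaffected by this alarm
    healthy = {rack: nodes for rack, nodes in nodes_by_rack.items() if rack != alarmed_rack}
    new_rack, nodes = sorted(healthy.items())[0]      # naive placement choice
    print(f"Resuming {job['name']} on {nodes[0]} ({new_rack}) from {job['checkpoint']}")
    return {**job, "rack": new_rack, "node": nodes[0]}

nodes_by_rack = {"rack-a": ["a-node-1"], "rack-b": ["b-node-1"], "rack-c": ["c-node-1"]}
job = {"name": "train-run-42", "rack": "rack-a", "node": "a-node-1", "checkpoint": "step_91000"}

job = reschedule_on_alarm(job, alarmed_rack="rack-a", nodes_by_rack=nodes_by_rack)
# -> Resuming train-run-42 on b-node-1 (rack-b) from step_91000
```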
Despite the high price of GPUs and the difficulty of sourcing them, Davis suggests that hyperscalers are well-versed in this kind of failover thinking and have the funds to take the hit, because the target is speed to a finished product, i.e. a trained model.
Likewise, she suggests the smaller GPU cloud providers might simply be willing to risk running GPUs hotter and leaner in the near term to achieve similar goals.
Beyond that, what happens when the industry moves on from training models to day-to-day inferencing?
Microsoft recently suggested it is seeing huge demand for its GPUs for inferencing, but the market is yet to fully shake out on what hardware will do much of the day-to-day AI legwork – and therefore what kind of cooling will be required.
“Some people will choose to run their inferencing on CPU hardware,” says Davis. “Some will run it on some more application-specific silicon. There's going to be a diverse set of hardware serving inferencing. Some of that will be liquid-cooled, some of it not.”
What about the rest of the industry?
Beyond the high-end GPUs and AI-specific chips, questions remain about whether other types of compute will require liquid cooling.
Uptime’s Davis predicts the industry will see “significant” segmentation, where extreme densities are reserved for AI training for “quite a while” yet.
For non-AI workloads, the jury is still out. Colovore’s Coughlin suggests there are plenty of data-intensive applications and services that customers are deploying that don't require GPUs or AI.
“We've got a number of healthcare data providers that are using CPU servers, not even GPUs, and a fully-packed cabinet of most modern CPU server systems hits 25kW.”
He notes a rack full of some of the highest-density storage systems could also reach north of 20kW; not quite at the tipping point of absolutely needing liquid, but not far off. Networking, while getting denser, is still some way off needing liquid.
“We're not there yet,” Coughlin says. “Will we get to a point where they need to be liquid-cooled? I don't know, all these pieces of underlying hardware have gotten more and more dense.”
Davis suggests conventional business IT currently lacks the overall incentive to densify. And while some hardware OEMs are offering increasingly dense storage and networking devices, she thinks they’re likely to remain the preserve of the small minority at the top for a time to come.
“It's going to be several years before those super high-powered CPUs and really high-density products really take up in large volume.”
If liquid cooling becomes standard across more facilities, it’s entirely possible that all-liquid halls like Colovore’s might become more common – and drag non-GPU workloads into being liquid-cooled by default.
Other providers, however, remain set on offering flexibility, which means air-cooling will remain viable for a while to come. Resilience and redundancy, however, will remain top of mind, whatever the medium.
Read the original article: https://www.datacenterdynamics.com/en/analysis/how-liquid-cooling-impacts-uptime-thinking/