How Machine Learning Is Used to Manage Data Center Power Today
Here’s how solutions already on the market today are using ML to improve data center uptime and efficiency.
It’s no secret that data centers are getting increasingly complicated. There are more types of hardware and management software, workloads that change more frequently, and public cloud to integrate with. And with edge computing just around the corner, things are about to get even more complicated.
Many in the industry expect machine learning to make data center managers’ lives easier in the face of all this complexity. Several companies already sell data center management software that uses machine learning algorithms. Some are tackling the problem from a holistic, data center-as-a-computer perspective, while others focus only on cooling or only on power. While cooling is where much of the energy is wasted in inefficiently run facilities today, there’s a lot to be gained from applying smart software tools to managing a data center’s electrical infrastructure.
A startup called Virtual Power Systems is using machine learning to fight what is often referred to as “stranded power” in data centers. It’s common for a data center to have an electrical system that’s designed to support more power load than necessary. Sometimes, it’s done by design, to ensure redundancy, and sometimes it happens because the designers couldn’t predict how the facility would be used in the future.
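In its simplest form, stranded power is just the gap between what the electrical system is provisioned to deliver and the most a rack has ever actually drawn. A minimal sketch, with hypothetical numbers (the function and figures are illustrative, not from VPS):

```python
# Illustrative estimate of "stranded power" at rack level: capacity that
# is provisioned but never drawn. All numbers are hypothetical.

def stranded_kw(provisioned_kw, observed_draws_kw):
    """Provisioned capacity minus the highest load ever observed."""
    return provisioned_kw - max(observed_draws_kw)

# A rack wired for 12 kW whose load has never exceeded 7.5 kW
# has 4.5 kW of capacity sitting stranded.
print(stranded_kw(12.0, [5.2, 6.8, 7.5, 7.1]))  # 4.5
```

Multiplied across hundreds of racks, that idle headroom is the capacity software-defined power tries to reclaim.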
VPS’s “software-defined power” solution uses smart electrical hardware (including equipment from partners like Schneider Electric) with built-in batteries to effectively redistribute power more rationally throughout the data center. And it can do it dynamically, as needs change, the company says.
Called ICE, the software uses machine learning to make power-requirement predictions (including battery management and the probability of power spikes) centrally, then sends profiles to an inference engine running in the hardware on the data center floor, which tunes the power load available to each rack to its real needs.
“When you have a redundant infrastructure, you have two lines of power coming into the rack, and you place your load in such a way that in case there is a failure, you can fail over from one to the other,” Karimulla Shaikh, VPS CTO, told us. “That means you’re using at most 50 percent of the capacity of each feed. By using our switch, you can use 100 percent of the load. If there’s a failure, the switch is smart and it’s able to jump in and move all the load onto the battery for a short time and then work with our software to move applications elsewhere or take the workloads offline.”
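Shaikh’s capacity argument can be reduced to simple arithmetic. The sketch below, with hypothetical numbers, shows why a classic 2N design caps each feed at 50 percent (the surviving feed must absorb the whole load on failover), and why a battery-backed switch that rides through the failover window lifts that cap:

```python
# Toy model of the 2N capacity argument Shaikh describes.
# Not VPS's actual logic; numbers and function are illustrative.

def usable_capacity_kw(feed_kw, smart_switch=False):
    """Usable rack load given two redundant feeds of feed_kw each."""
    if smart_switch:
        # Battery carries the load during failover while software
        # migrates applications or sheds load, so both feeds can
        # run fully loaded.
        return feed_kw * 2
    # Without the switch, the entire load must fit on one
    # surviving feed -- i.e. 50 percent of total provisioned power.
    return feed_kw

print(usable_capacity_kw(10))                    # 10 kW usable of 20 provisioned
print(usable_capacity_kw(10, smart_switch=True)) # 20 kW usable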
The machine learning model built by the software can also be used as an emulator, to understand how power delivery will be affected if you add more servers or racks.
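A what-if check in that spirit can be as simple as projecting observed per-server draw against a feed’s budget. This is a hypothetical linear sketch, not VPS’s emulator:

```python
# Hypothetical capacity what-if: does adding servers stay within a
# feed's power budget, assuming draw scales linearly with server count?

def fits_budget(watts_per_server, current_servers, added_servers, budget_w):
    """Return (projected draw in watts, whether it fits the budget)."""
    projected = watts_per_server * (current_servers + added_servers)
    return projected, projected <= budget_w

proj, ok = fits_budget(350, 20, 6, 10_000)
print(proj, ok)  # 9100 True
```

A trained model would replace the flat `watts_per_server` assumption with learned, workload-dependent profiles, but the question it answers is the same.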
But this is just the beginning. VPS is working with some customers on ways to avoid the typical redundant data center infrastructure design altogether, Shaikh said. It’s also looking at dynamic switching between data center energy sources, such as utility, fuel cells, and intermittent renewable sources.
Nlyte Software, whose data center infrastructure management (DCIM) software the company’s chief strategy officer Enzo Greco likened to a “real-time ERP (enterprise resource planning) for the data center,” recently added predictive thermal and power management capabilities to its solution, built on IBM Watson machine learning services. Watson helps it build a model from sensor data, equipment data, and application workload information. In many cases all that data is already fairly easy to collect, Greco said, so why not use more of it to your advantage?
Many facilities already have temperature and humidity sensors, real-time operational server data, and power meters. “The data is readily available from almost any modern piece of equipment, whether it’s a UPS or a PDU,” he said. The machine learning system can find hidden patterns and interactions between different systems and endpoints.
“We’re able to predict, at any point in the future, power anomalies at a server and rack level,” Greco said. In a stable state, a rack may be consuming 10kW, but at some point, it may spike to 15kW. “With enough historical data, you’re able to predict one hour in the future that this particular rack will consume 15kW of electricity.” The spike may be caused by a mechanical issue or an application. “Maybe you’re running SAP in batch mode, maybe your transaction systems are running at peak.”
If you can predict the spike, you can prepare for it by moving workloads, shutting down servers, or doing some preventative maintenance on UPS batteries, he said.
Most Nlyte customers are using the machine learning system to get alerts and understand potential problem areas. The software company is also developing predicted-failure and preventative-maintenance modules. “Power and thermals are extremely good leading indicators for failure prediction,” Greco said. “If you can predict power anomalies, that is a leading indicator that you may have an application issue, or you may have a mechanical issue.”
Beyond detecting anomalies faster than human operators can, machine learning can help operators get a clearer picture of electrical infrastructure redundancy in their facilities. “A room may not be as power redundant as it has been designed to be because of, say, drift in operational practices,” Rhonda Ascierto, research VP at Uptime Institute, told us. “It’s about ensuring that each part of the facility is operating as you expect it to be in terms of the redundancy profile, despite the continually changing nature of these facilities.”
Overall, machine learning has the potential to shift data center availability strategy from reactive to proactive. “A UPS is reactive; it waits for a power failure and then fails over,” Greco explained. “When applications get recovered after a failure, that’s reactive. Becoming proactive means not waiting; it’s saying, I will have a problem in the future, let’s remediate it now.”