Energy and Carbon Aware Computing

Power Resilient NextG Data Centers

NextG data centers will operate under tight and variable power envelopes. The shift to renewable energy as a primary power source and the increase in extreme weather have made power supply more variable. At the same time, fluctuating data center utilization caused by mobile user churn makes power demand more variable. As data centers increasingly oversubscribe power infrastructure to limit cost, power demand and supply have to be kept in balance at increasingly short timescales. NextG applications are often deployed in edge data centers, where large-scale power supply redundancy and power storage are often out of reach, and where operators have little to no control over the shared power distribution infrastructure. Thus, the internal resiliency of data centers to power events is paramount for the availability of nextG applications.

We are building and evaluating a data center power control plane (PCP). By making power demand software-defined, PCP can control it at a fine granularity and over short timescales. The key is to gracefully trade off power against quality of service over time. This allows PCP to shed load, or to consolidate it onto less power-intensive processors, in order to conserve power during a power event. We expect that PCP will reduce recovery latency from power disasters from weeks to seconds, allow data centers to survive long, multi-day power disruptions, and improve the energy efficiency of nextG data centers by at least 10x via power oversubscription.
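The trade-off between power and quality of service can be illustrated with a small sketch. It is not the PCP design: the `Workload` fields, the priority scale, the wattages, and the function name `plan_power_event` are all assumptions made for illustration. The sketch uses a simple greedy policy. It first moves the least critical workloads onto a less power-intensive processor, and sheds workloads only if the power budget is still exceeded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    name: str
    priority: int                    # higher = more critical (hypothetical scale)
    host_watts: float                # draw when running on the host CPU
    lowpower_watts: Optional[float]  # draw on a less power-intensive processor;
                                     # None if the workload cannot be moved


def plan_power_event(workloads, budget_watts):
    """Greedy plan for a power event (illustrative only).

    Returns ({name: "host" | "lowpower" | "shed"}, resulting draw in watts).
    """
    placement = {w.name: "host" for w in workloads}
    draw = sum(w.host_watts for w in workloads)
    by_priority = sorted(workloads, key=lambda w: w.priority)  # least critical first

    # Pass 1: consolidate movable workloads onto the low-power processor.
    # This degrades quality of service but keeps the workload running.
    for w in by_priority:
        if draw <= budget_watts:
            break
        if w.lowpower_watts is not None and w.lowpower_watts < w.host_watts:
            draw -= w.host_watts - w.lowpower_watts
            placement[w.name] = "lowpower"

    # Pass 2: if still over budget, shed the least critical workloads.
    for w in by_priority:
        if draw <= budget_watts:
            break
        draw -= w.lowpower_watts if placement[w.name] == "lowpower" else w.host_watts
        placement[w.name] = "shed"

    return placement, draw
```

With three hypothetical workloads (a file system at 60 W on the host and 15 W offloaded, a cache at 40 W and 20 W, and a non-movable analytics job at 100 W) and a 150 W budget, this plan offloads the cache and the file system and sheds nothing. A greedy policy like this is easy to reason about but is not guaranteed to find the placement with the best quality of service.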

As a foundational component of PCP, we have developed a prototype power-resilient distributed file system that establishes the viability of the idea and leads the way into our proposed research. Our prototype provides low-latency fail-over and recovery for load shedding events, enabling fine-grained load control by PCP. At the same time, our work raises important research questions that must be answered for PCP to be practical, such as how to scale power demand and supply instrumentation and control; how to leverage the diverse power envelopes of the various processing devices available in a data center; how to determine sheddable load and how to control it transparently; and how to further reduce fail-over and recovery latency from power outages.

As an example, we have developed a prototype SmartNIC offload of Assise, a distributed file system (DFS). We evaluate this prototype’s availability with a reboot experiment. We run the Varmail workload of the Filebench suite on a primary DFS replica. The prototype replicates the per-process log data to two replicas, replica-1 and replica-2. While running Varmail, we reboot replica-1’s host and inform its NICFS of the lost host CPUs. The graph demonstrates that a SmartNIC-side NICFS can indeed assume DFS service when the host CPU is powered down, providing enhanced availability for critical nextG applications in power-limited scenarios.
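The hand-over in this experiment can be pictured with a toy model. It does not reflect the actual Assise or NICFS interfaces: the `Replica` class, its methods, and the `replicate` helper are hypothetical names, and the model only tracks which processing unit accepts replicated log entries before and after the host is reported down.

```python
class Replica:
    """Toy replica with a host CPU and a SmartNIC that can both accept log entries."""

    def __init__(self, name):
        self.name = name
        self.log = []        # replicated per-process log entries
        self.host_up = True  # assumption: the SmartNIC stays powered when the host is not

    def serving_unit(self):
        # The SmartNIC-side service takes over once it learns the host CPUs are gone.
        return "host" if self.host_up else "nic"

    def notify_host_down(self):
        # Models informing the SmartNIC-side service of the lost host CPUs.
        self.host_up = False

    def append(self, entry):
        # The entry is stored either way; only the unit doing the work changes.
        self.log.append(entry)
        return self.serving_unit()


def replicate(entry, replicas):
    """Send one log entry to every replica and report which unit handled it."""
    return {r.name: r.append(entry) for r in replicas}


replica_1, replica_2 = Replica("replica-1"), Replica("replica-2")
before = replicate("write-1", [replica_1, replica_2])
replica_1.notify_host_down()  # replica-1's host is rebooted
after = replicate("write-2", [replica_1, replica_2])
```

In this model, replica-1 keeps receiving log entries after its host goes down because the SmartNIC-side unit assumes service. The sketch leaves out fail-over latency, consistency, and everything else the real experiment measures.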

People

Faculty

Tom Anderson
Simon Peter

Postdoc

Jonggyu Park

Acknowledgements

This material is based upon work supported by the National Science Foundation under grant no. 2148209 and is supported in part by funds from federal agency and industry partners as specified in the Resilient & Intelligent NextG Systems (RINGS) program.