Energy and Carbon Aware Computing
Power Resilient NextG Data Centers
NextG data centers will operate under tight and variable power envelopes. The shift to renewable energy as a
primary power source and an increase in extreme weather have made power supply more variable. At the same time,
fluctuating data center utilization caused by mobile user churn increases power demand variability. As data
centers increasingly oversubscribe power infrastructure in order to limit cost, power demand and supply have
to be kept in balance at increasingly short timescales. NextG applications are often deployed in edge data
centers, where large-scale power supply redundancy and power storage are often out of reach, and there is
little to no control over the shared power distribution infrastructure. Thus, the internal resiliency of data
centers to power events is paramount for the availability of nextG applications.
We are building and evaluating a data center power control plane (PCP). PCP can control power demand at a fine
granularity and over short timescales by making it software-defined. The key is to gracefully trade off power
and quality of service over time. This allows PCP to shed load, or to consolidate it onto less power-intensive
processors, to conserve power during a power event. We expect that PCP will reduce recovery latency from power disasters
from weeks to seconds, outlive long, multi-day power disruptions, and improve the energy efficiency of nextG
data centers by at least 10x via power oversubscription.
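To make the power and quality-of-service trade-off concrete, the following is a minimal sketch of one possible load-shedding policy, not PCP's actual algorithm. It assumes each workload has a known power draw and a priority; the names `Workload`, `plan_shedding`, and all wattage figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    power_watts: float  # estimated draw while running (hypothetical figure)
    priority: int       # higher means more important to keep running


def plan_shedding(workloads, budget_watts):
    """Split workloads into (kept, shed) so the kept set fits the power budget.

    Greedy policy: admit workloads in descending priority order and shed
    any workload that would push total draw over the budget.
    """
    kept, shed = [], []
    used = 0.0
    for w in sorted(workloads, key=lambda w: w.priority, reverse=True):
        if used + w.power_watts <= budget_watts:
            kept.append(w)
            used += w.power_watts
        else:
            shed.append(w)
    return kept, shed


if __name__ == "__main__":
    demo = [
        Workload("dfs", 120.0, priority=3),
        Workload("analytics", 200.0, priority=1),
        Workload("ran", 150.0, priority=2),
    ]
    kept, shed = plan_shedding(demo, budget_watts=300.0)
    print("kept:", [w.name for w in kept])
    print("shed:", [w.name for w in shed])
```

A greedy pass by priority is only one option. A real controller would also have to account for consolidation onto lower-power processors and for how the power budget changes over time.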
As a foundational component of PCP, we have developed a prototype power-resilient distributed file system (DFS) that
establishes the viability of the idea, leading the way into our proposed research. Our prototype provides
low-latency fail-over and recovery for load-shedding events, enabling fine-grained load control by PCP. At the
same time, our work raises important research questions that must be answered for PCP to be practical, such as
how to scale power demand and supply instrumentation and control; how to leverage the diverse power envelopes
of various processing devices available in a data center; how to determine sheddable load and how to control it
transparently; and how to further reduce fail-over and recovery latency from power outages.
As an example, we have developed a prototype SmartNIC offload of Assise. We evaluate this prototype’s availability with a reboot experiment. We run Varmail from the Filebench suite on a primary DFS replica. The prototype replicates the per-process log data to two replicas, replica-1 and replica-2. While running Varmail, we reboot replica-1’s host and inform its NICFS of the lost host CPUs. The graph demonstrates that a SmartNIC-side NICFS can indeed assume DFS service when the host CPU is powered down and provide enhanced availability for critical nextG applications in power-limited scenarios.
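The fail-over behavior in this experiment can be illustrated with a toy model. This is a sketch under stated assumptions, not Assise's replication protocol or the NICFS implementation: the classes `Replica` and `Server`, the function `replicate`, and the rule that the host serves whenever it is up are all hypothetical.

```python
from enum import Enum


class Server(Enum):
    HOST = "host"
    NIC = "nic"
    NONE = "none"


class Replica:
    """Toy replica whose log can be served by the host CPU or by the SmartNIC."""

    def __init__(self, name):
        self.name = name
        self.host_up = True
        self.nic_up = True
        self.log = []

    def serving(self):
        # Assumption: the host serves when powered; otherwise the NIC takes over.
        if self.host_up:
            return Server.HOST
        if self.nic_up:
            return Server.NIC
        return Server.NONE

    def append(self, entry):
        # An entry is accepted only if some component can still serve it.
        if self.serving() is Server.NONE:
            return False
        self.log.append(entry)
        return True


def replicate(entry, replicas):
    """Send one log entry to every replica; return names of those that accepted it."""
    return [r.name for r in replicas if r.append(entry)]


if __name__ == "__main__":
    r1, r2 = Replica("replica-1"), Replica("replica-2")
    replicate("op-1", [r1, r2])
    r1.host_up = False  # models the host reboot in the experiment
    print(r1.serving(), replicate("op-2", [r1, r2]))
```

In this model, replica-1 keeps accepting log entries after its host goes down because the NIC-side component remains powered, which is the property the reboot experiment is designed to demonstrate.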
People
Faculty
Tom Anderson
Simon Peter
Postdoc
Jonggyu Park
Acknowledgements
This material is based upon work supported by the National Science Foundation under grant no. 2148209 and is supported in part by funds from federal agency and industry partners as specified in the Resilient & Intelligent NextG Systems (RINGS) program.